Theses in Data Science
Theses in Data Science are assigned twice a year by the examination board in a central procedure. Outside this process, we can assign topics only in rare exceptional cases.
We supervise external theses only in exceptional cases, when the task fits well with the research topics of the chair. If you have a proposal for a master's thesis topic that you would like to work on outside the university, please ask Prof. Schenkel directly.
Examples of master's theses in Data Science completed in our research group
- no abstract available -
Rankings and ranking systems are now ubiquitous and are used in every conceivable situation. After decades in which ranking algorithms considered the usefulness of documents as the sole factor, a growing number of recent approaches take other factors into account in order to produce fairer and ultimately potentially better rankings. This thesis presents two fairness-aware ranking algorithms for publications from dblp that consider the gender and origin of the authors in order to produce rankings that are as useful and fair as possible with respect to these factors. The first algorithm is an implementation of an existing algorithm from the TREC 2020 FAIRNESS RANKING TRACK; the second is an extension based on it that further improves its results over several iterations.
The release of harmful pollutants such as ozone and nitrogen dioxide into the atmosphere is a serious concern. These gases harm the health of humans and other species and damage the environment, so it has become necessary to monitor the air quality around us. With technological advances, chemical gas sensors combined with machine learning or deep learning algorithms can be used to detect these gases and their concentrations. However, such algorithms bring the challenge of understanding, in human terms, why they made certain predictions. This thesis addresses this concern by applying different explainable artificial intelligence (XAI) approaches to gas sensors. These approaches help explain the reasons behind the models' predictions, which improves understanding not only of the models but also of sensor behaviour.
- no abstract available -
Hard disk drive failures are rare events. Nonetheless, such failures, especially in modern data storage centres, can result in catastrophic data loss and large monetary costs. To tackle this problem, companies rely on SMART (Self-Monitoring, Analysis and Reporting Technology) attributes, which monitor the state of drives and report upcoming failures. This thesis uses the Backblaze public dataset and aims at forecasting hard drive failures that would happen in 1, 10, 20 and 30 days. SMART features were modeled as lag windows to include past values, with one of three window sizes: current day, last five days or last ten days. For forecasting, the machine learning algorithms Random Forest, Linear Support Vector Machine and Multi-layer Perceptron were used. These models were also compared and evaluated using raw, normalized and standardized SMART features in order to observe their forecasting abilities. Training and testing of the models were set up in Azure Machine Learning, using various Jupyter Notebooks and blob storage for storing the data. The results showed that the models forecast failures further in the future better than failures in the near future. In contrast, including past feature values through lag windows had no significant impact on forecasting performance. The best results were obtained by the Linear Support Vector Machine when looking 30 days into the future with a lag window containing the SMART features of the current day only, with an F1-score of 51%. The other two models, Random Forest and Multi-layer Perceptron, also reported performance increases when looking 10 and 20 days into the future. All models performed worst when looking 1 day into the future.
Moreover, with regard to data normalization and standardization, the Linear Support Vector Machine showed only a slight increase in performance, whereas Random Forest showed no visible increase at all. For the Multi-layer Perceptron, training on raw SMART features resulted in an F1-score of 0%, while using standardized data improved forecasting performance. Overall, the Linear Support Vector Machine produced the best hard drive failure forecasts compared to Random Forest and Multi-layer Perceptron and is therefore considered the best forecasting model in this thesis.
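The lag-window idea in this abstract can be illustrated with a minimal Python sketch. It builds, for each day, a feature vector from the current and preceding readings of a single SMART attribute and attaches a label stating whether the drive fails within a given horizon. The function name `build_lag_samples`, the exact labelling rule, and the toy readings are assumptions for illustration; they are not taken from the thesis.

```python
from typing import List, Tuple


def build_lag_samples(
    values: List[float],
    failure_day: int,
    window: int,
    horizon: int,
) -> List[Tuple[List[float], int]]:
    """Turn one drive's daily readings into (features, label) pairs.

    values      -- one reading per day for a single SMART attribute
    failure_day -- index of the day on which the drive failed (assumed known)
    window      -- number of days in the lag window (1 = current day only)
    horizon     -- label is 1 if the drive fails within this many days
    """
    samples = []
    for day in range(window - 1, len(values)):
        # Features: readings of the last `window` days, oldest first.
        features = values[day - window + 1 : day + 1]
        # Label: 1 if the failure lies 1..horizon days after `day`.
        label = 1 if 0 < failure_day - day <= horizon else 0
        samples.append((features, label))
    return samples


# Toy data (made up): six daily readings, drive fails on day index 5.
readings = [0.0, 0.0, 1.0, 3.0, 7.0, 12.0]
samples = build_lag_samples(readings, failure_day=5, window=3, horizon=2)
for features, label in samples:
    print(features, label)
```

With `window=1` the features reduce to the current day's reading only, which corresponds to the smallest window size named in the abstract.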
- no abstract available -
In today's digital world, there are many large-scale text databases. It is therefore important to have ways to retrieve the data a user is interested in. One such way is offered by topic modeling algorithms, which automatically generate topics over a dataset and can then represent documents as mixtures of these topics. This way, a user can filter publications based on the predominant topics they contain. However, this approach works at the document level. The guiding question of this thesis is how these models can be used to yield topical compositions of collections consisting of different documents. Specifically, we look at scientific conferences from the field of computer science. This thesis presents a way to model those conferences as topic vectors. We then evaluate whether these topic vectors are similar when the corresponding conferences belong to the same subfield of computer science. To do so, we use clustering techniques to find groups of similar conferences based on our topic modeling and compare the resulting clustering with a gold-standard dataset that groups conferences into subfields. This comparison is done using the Rand index. Our results show a strong similarity between the gold-standard clustering and the one obtained by our approach.
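For readers unfamiliar with the Rand index, the following minimal Python sketch shows how it compares two clusterings: it counts the share of item pairs on which both clusterings agree (both put the pair in the same cluster, or both separate it). The conference labels are made up for illustration, and this is the plain Rand index, not an adjusted variant; the abstract does not say which variant the thesis uses.

```python
from itertools import combinations
from typing import Hashable, Sequence


def rand_index(labels_a: Sequence[Hashable], labels_b: Sequence[Hashable]) -> float:
    """Fraction of item pairs on which two clusterings agree."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both clusterings must label the same items")
    pairs = list(combinations(range(len(labels_a)), 2))
    if not pairs:
        raise ValueError("need at least two items")
    agreements = 0
    for i, j in pairs:
        same_in_a = labels_a[i] == labels_a[j]
        same_in_b = labels_b[i] == labels_b[j]
        # A pair counts as an agreement if both clusterings group it
        # together or both keep it apart.
        if same_in_a == same_in_b:
            agreements += 1
    return agreements / len(pairs)


# Hypothetical example: five conferences, gold subfields vs. found clusters.
gold = ["DB", "DB", "IR", "IR", "ML"]
found = [0, 0, 1, 2, 2]
score = rand_index(gold, found)
print(score)
```

A value of 1.0 means the two clusterings are identical up to renaming of the cluster labels.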
Stock movement prediction is a challenging task due to the characteristics of the stock market. However, it is a field where people can gain high returns with patience and a basic understanding of the stock market. Many previous studies have predicted stock price movements using statistical techniques such as ARMA and ARIMA. In the era of social media, recent research on stock price movements has mostly focused on tweets, financial news, and company earnings calls. In this study, we concentrate on stock market prediction using news headlines. We construct models to predict the movement of the Dow Jones Industrial Average index using a single day's top 25 news headlines. Our target variable is binary, represented by 0 and 1, and is derived from the Dow Jones Industrial Average Adjusted Close Price: if the Adjusted Close Price increases or remains the same, we label it 1; otherwise, we label it 0. In our baseline model, we first concatenate all 25 news headlines into a single text. Then we preprocess the news text with steps such as removing punctuation, lemmatizing, and identifying named entities. After that, we apply conventional vectorizers such as CountVectorizer and the TF-IDF vectorizer to extract numerical values from the text. We then use standard algorithms on the training and testing data. In our next model, we replace CountVectorizer and TF-IDF with word embedding models such as GloVe and Word2Vec. Our third model uses a state-of-the-art BERT embedding layer instead of GloVe and Word2Vec. In our final model, we adopt a novel approach that combines the BERT embedding layer with various text stylistic features and sentiment scores such as positivity, negativity, and compound to predict stock price movements. In this thesis, we achieved an accuracy of approximately 59% in predicting stock price movements.
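The labelling rule for the target variable can be sketched in a few lines of Python. The sketch assumes that "increases or remains the same" is measured against the previous trading day's adjusted close, which the abstract does not state explicitly; the function name `label_movements` and the prices are made up for illustration.

```python
from typing import List


def label_movements(adj_close: List[float]) -> List[int]:
    """Label each day (from the second one on) with 1 or 0.

    1 -- adjusted close rose or stayed the same versus the previous day
    0 -- adjusted close fell
    """
    return [
        1 if today >= yesterday else 0
        for yesterday, today in zip(adj_close, adj_close[1:])
    ]


# Toy adjusted close prices for five consecutive trading days (made up).
prices = [100.0, 101.5, 101.5, 99.0, 100.2]
labels = label_movements(prices)
print(labels)
```

The first day receives no label because it has no predecessor to compare against, so the label list is one element shorter than the price list.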
The integration of statistical, machine learning (ML), and deep learning (DL) techniques is quickly gaining popularity in business areas such as inventory management, marketing, and financial planning. Businesses use these techniques to gain a competitive edge by directing their resources effectively and identifying the opportunities and challenges in their processes. This study applies various statistical, ML, and DL models to predict both a company's monthly turnover and individual project turnovers. Using all years of non-truncated data, models such as SARIMAX, Prophet, SimpleFeedForward, and DeepAR were trained, evaluated, and backtested. Results showcased SARIMAX's higher predictive accuracy, with the SimpleFeedForward model training. For project-level forecasting, the data was transformed into lagged datasets and combined with unique project features, and the models Decision Tree, Random Forest, Gradient Boosting Regressor, and XGBoost were applied. While initial trials with 5 lags were underwhelming, extending to 10 and 15 lags progressively improved performance, culminating in an average MAPE of approximately 5% at 20 lags. Furthermore, the adaptive, perfective and explainable aspects of the developed forecasting tool make it simple for anyone to replicate the results or to repeat the process with a different dataset.
This thesis is conducted in collaboration with Ramboll, a renowned global firm specializing in architecture, engineering, and consultancy services. The Department of Energy in Hamburg has actively engaged in various projects related to offshore wind energy. Aligned with these endeavors, the present research topic emerged to address the ongoing need for estimating structural health by predicting fatigue using Supervisory Control and Data Acquisition (SCADA) data. The primary focus is to leverage data collected from structures equipped with both SCADA sensors and strain gauges and employ models to estimate fatigue on other structures with no strain gauges.
The subsequent chapters delve into in-depth discussions on data, preprocessing, feature selection, and machine learning, shedding light on their operational mechanisms. The rationale behind utilizing specific machine learning models, such as Long Short-Term Memory (LSTM), Bidirectional Long Short-Term Memory (BiLSTM), Extreme Gradient Boosting (XGBoost), and AutoRegressive Integrated Moving Average (ARIMA), is explored. The evaluation of these models provides an assessment of their performance, efficiency, and accuracy, offering an understanding of why certain models are better suited for fatigue prediction.
Argumentation can be understood as the activity of using arguments to convince people of a point of view, or to agree or disagree with them about it. In our daily lives, argumentation is one of the most common uses of natural language. For example, social media users respond to controversial topics with their stances and opinions. Collecting and analyzing user ideas is critical to studying social phenomena and trends. However, it is hard to analyze all collected arguments, since processing such enormous amounts of data requires a great deal of time and human effort. More efficient methods are therefore needed. A possible solution is research in computational argumentation, because computers can handle large amounts of data efficiently. Besides the analysis of social phenomena, other areas such as business and linguistics also benefit from studying argumentation.
Computational argumentation is a growing research field that yields many new methods. This work is inspired by a study on transforming natural language texts into argument graphs. In this thesis, we build on previous studies, examine each step in depth, including classifying major claims, inferring relations between statements, and constructing argument graphs, and investigate approaches for improvement. We propose a new method for major claim classification, which aims to find the statement describing the core idea of the discussion, and obtain a substantial improvement. Moreover, we introduce state-of-the-art methods to estimate the relations between arguments. We suggest six methods for the argument graph construction step, which also give satisfactory results. Our research has some limitations; we discuss them and explore possible further improvements for future studies.
An information retrieval system’s purpose is to return results that are relevant to the user’s query. Information relevant to the user’s request may not exist in the user’s native language in some instances. It’s also possible that the user can read papers in languages other than his or her native tongue but has trouble forming inquiries in them. The primary goal of Multilingual Information Extraction is to locate the most relevant information accessible, regardless of the query language.
Artificial intelligence (AI) has become an increasingly popular research field in recent years. Similarly, Natural Language Processing (NLP) has become an important point of discussion. Neural networks do exceptionally well in this field. The speed and performance of neural networks on diverse NLP tasks have been greatly enhanced by a variety of effective learning methods and technologies.
The recent advances in NLP transfer learning have resulted in powerful models, mostly from tech giants such as Google, Facebook, and Microsoft, which perform well on NLP tasks in the general domain. In this thesis, we fine-tune multilingual transformer models for the domain of engineering data in both English and German. Hence, we need a language-independent model that is able to learn its parameters (weights and biases) from the language-specific features of any language. First, we describe how multilingual transfer is implemented, with a focus on state-of-the-art transformer models. Then, in the methodology part, we use our English-German engineering domain data to fine-tune multilingual transformer models.
- no abstract available -
Abstract: Argumentation is considered to be a foundational discipline. Its objectives are to foster critical thinking and logical reasoning, to reach a resolution when people disagree, and to persuade or convince others of a particular viewpoint or position; it can also be a tool for knowledge exchange.
Individuals can explore arguments that either support or attack their own opinions by drawing on their personal knowledge and life experience, but they can also use search engines (e.g., Google) via the Internet. In this work, we focus on arguments taken from the Web. The user can submit a particular question to the search engine as a query, e.g., “Should I own a dog?”, and will expect to receive an answer in the form of a list of Web pages (sorted by relevance), textual information, images, videos, news articles, and social media posts.
Usually, arguments for a specific question are found in the text that is part of a Web page (also called a “document”). The document may contain argumentative and non-argumentative text spans. The aim is to retrieve documents whose argumentative parts are relevant to the query and of high argumentative quality. However, the retrieved documents may contain arguments that have low relevance to the query, are of low quality, or are falsified, and there is usually no clear stance. Such documents will not satisfy the user's expectations, or the user will rely on wrong, fake, or biased arguments to support a position.
The problem with search engines like Google is that users looking for reasonable arguments within a short time are required to do a significant amount of work after submitting their query. This work includes tasks such as reading pages, identifying arguments, filtering duplicates, and manually ranking them. In contrast, argument search engines aim to alleviate this burden by handling these tasks for users and presenting them with the best arguments. This proves advantageous in debates, interviews, and political discussions, as it ensures the availability of the strongest arguments for making informed decisions.
Our work was inspired by the Touché Lab Task 1 named “Argument Retrieval for Controversial Questions”, whose objective is to retrieve and rank documents by relevance to the topic and by argumentativeness of the documents (quality), and to detect their stance towards the topic. In this work, we investigate various methods and techniques for argument mining (i.e., automatic extraction of arguments from the document) and preprocessing in order to work with individual arguments from the document rather than the entire text as a whole. We applied stance classification (i.e., determining whether the premise supports or attacks the specific claim) and quality prediction to obtain high-quality arguments. To expand the search for the re-ranking model, we use query augmentation, which is performed with the assistance of ChatGPT. The primary objective is to combine these approaches optimally to retrieve highly relevant results with high-quality arguments and to demonstrate that working with individual arguments produces better results than working with the entire text.
For our experiments and evaluation, we use several datasets and resources. The “ClueWeb22-B” corpus and the controversial questions provided by the Touché Lab served as the basis for our analysis. The SNLI dataset is used to establish relations between claims and premises, while the “args.me” dataset is employed specifically for stance classification. To predict argument quality, we rely on the “Webis-ArgQuality-20” and “IBM-ArgQ-Rank-30kArgs” datasets.
To evaluate the effectiveness of our approach, we compare our results with the baseline of Touché Task 1. To ensure fair comparisons, we use manually annotated judgments as a benchmark for both our results and the baselines. Our approach achieves a higher nDCG than the baseline of Touché Lab Task 1 and an accuracy of 0.54 for stance classification. This highlights the effectiveness and competitiveness of our approach in retrieving and ranking arguments by relevance and quality, as well as classifying them by stance.
Abstract: This thesis offers an approach to detecting duplicate bookings by calculating sentence similarity, as an application of Natural Language Processing. The bookings are exports from an accounting software. Among much other information, each booking has a booking note, which is a short text written by the person who created the booking in the accounting software. The presented approach is part of a larger project in which all booking information is analyzed, but in this thesis only the textual information of the notes is used to determine the similarity of two bookings. Several models are used to calculate the similarity of booking pairs, and their results are compared. One important research objective is the comparison of TF-IDF, as an application of the vector space model, with language models such as BERT and sentenceBERT, which use word and sentence embedding vectors. The best models achieve an F1-score of 0.6004 and an AUC score of 0.555. A thorough analysis of true positives, false positives and false negatives shows that embedding vectors offer advantages but also introduce new challenges when short texts are analyzed.
Keywords: Natural Language Processing - Duplicate Detection - Accounting - Short Texts
- no abstract available -
- no abstract available -
- no abstract available -
- no abstract available -
Abstract: Argumentation Mining aims at automatically extracting structured arguments from unstructured textual documents. This work addresses a cross-lingual argumentation mining task, the detection of argumentative discourse units (ADUs). Our contribution is two-fold: first, we extract a German and French ADU-annotated parallel corpus for further research; second, we use it to compare five state-of-the-art language models (LMs). Following the CRISP-DM framework for data mining, we prepare the data from the popular Europarl corpus by applying topic modeling to trim the corpus size semantically. The French and German subcorpora are annotated with the labels “non-argumentative”, “claim” and “premise”. Given the human baseline, in the modeling phase the five LMs German BERT, German DistilBERT, CamemBERT, mBERT and mDistilBERT are compared on the sentence classification task. The LMs perform the task with moderate success. There is a performance difference between the German and French models, leading to the insight that it is crucial to consider the input language as a feature and not only as a parameter. In addition, the beneficial influence of multilingual pretraining is discussed, which points to a need for further research.
Abstract: Due to the growing flood of publications, quality assurance of scientific work plays an increasingly important role. One of the most important methods for this is the peer review process. In this context, the process of selecting a suitable reviewer to assess a submitted manuscript is of great importance. However, this process is laborious and, if carried out incorrectly, leads to poor reviews. The goal of this thesis is therefore to make this assignment process more efficient and at the same time more objective by automating it. To this end, a reviewer recommendation system was developed and a classification system was provided. The reviewer recommendation system receives a query in the form of a publication and suggests a given number of suitable reviewers. The classification system, in contrast, receives a reviewer and a manuscript as input and predicts whether the given reviewer is relevant for that manuscript. In building these systems, the effects of different combinations of document representations, similarity measures, levers, and voting techniques were also analyzed. The results show that both systems can support the assignment process in peer review within their respective use cases. Furthermore, the evaluation of the reviewer recommendation (RR) system shows that tf·idf in combination with cosine similarity delivers the best results. CombSUM TOP 5, CombSUM TOP 10, and Reciprocal Rank were identified as the best-performing voting techniques. The evaluation of the classifiers showed that the SciBERT classifier achieves a classification accuracy of 80.2% and thus performs best.
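To illustrate what voting techniques such as CombSUM and Reciprocal Rank do, the following Python sketch aggregates a ranked list of retrieved publications into scores for candidate reviewers. It uses one common formulation from the expert-finding literature: CombSUM sums the scores of a candidate's documents, and Reciprocal Rank sums the reciprocal ranks of those documents. Whether the thesis uses exactly these definitions, and whether "TOP 5" means cutting the document ranking at five entries as `top_k` does here, are assumptions. All names and data are made up.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple

# One entry per retrieved document: (document id, similarity score, authors),
# sorted by descending score. Hypothetical data structure for this sketch.
Ranking = List[Tuple[str, float, List[str]]]


def comb_sum(ranking: Ranking, top_k: Optional[int] = None) -> Dict[str, float]:
    """Sum the scores of each author's documents among the top_k documents."""
    scores: Dict[str, float] = defaultdict(float)
    for _doc_id, score, authors in ranking[:top_k]:
        for author in authors:
            scores[author] += score
    return dict(scores)


def reciprocal_rank(ranking: Ranking) -> Dict[str, float]:
    """Sum 1/rank over each author's documents (ranks start at 1)."""
    scores: Dict[str, float] = defaultdict(float)
    for rank, (_doc_id, _score, authors) in enumerate(ranking, start=1):
        for author in authors:
            scores[author] += 1.0 / rank
    return dict(scores)


# Toy ranking for one query manuscript (made-up ids, scores, and authors).
ranking: Ranking = [
    ("d1", 0.9, ["alice"]),
    ("d2", 0.8, ["bob", "alice"]),
    ("d3", 0.5, ["bob"]),
    ("d4", 0.4, ["carol"]),
]
top2 = comb_sum(ranking, top_k=2)
rr = reciprocal_rank(ranking)
best = max(rr, key=rr.get)
print(top2, rr, best)
```

Candidates would then be sorted by their aggregated score, and the highest-scoring ones suggested as reviewers.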
Abstract: This thesis investigates systems for recognizing named entities (NE) and references in German. Identifying NEs matters for several reasons. First, it allows additional information from an external source to be embedded in a text, for example the office held by a politician. Second, NEs play a role in recognizing references, for example in resolving personal pronouns. Resolving references is helpful when only an excerpt of a text is ultimately available to a system. To increase the system's performance, it is advantageous if all references in this excerpt have been resolved correctly. One example is the ReCAP project, which processes queries about a claim and returns isolated nodes containing theses for or against this claim.
Therefore, this thesis first creates a corpus of twelve German texts on education policy, annotated with the NEs and references they contain. Subsequently, three NE systems and two coreference resolution systems are evaluated on these twelve texts. Evaluating these systems is a laborious process that can only be partly automated. This is mainly because the gold standard was annotated such that each entity carries the maximum information content. Systems, however, often recognize only a substring; in such cases, manual evaluation is unavoidable.
Accordingly, the final comparison of the systems is not trivial either. For NE recognition, a distinction was made between exact matches and partial matches between a candidate system and the gold standard. For exact matches, the Stanford Named Entity Recognizer (NER) leads with an F1-score of 57.67% or 54.44%, depending on how the results for the individual texts are averaged. When partial matches are taken into account, FLAIR ranks first with an F1-score of 72.63% or 67.44%. It would be too simple, however, to rely on the F1-score alone: the systems have different strengths and weaknesses, for example in recognizing persons. In this category, the Stanford NER performs worst.
In contrast to named entity recognition, the coreference resolution results are weak. CorZu achieves a maximum F1-score of 27.4% and IMS HotCoref DE a value of 29.1%. The systems produce many references that add no value, for example {die Schüler - die Schüler}. If these are ignored, precision rises in the best case from 22.86% to 41.86%.
A final investigation on isolated text passages in the ReCAP project, in which references were resolved manually over the course of the project, shows that these values are insufficient for practical use.
- no abstract available -
Abstract: There are many systems for the exploration of bibliographic metadata. However, retrieving and filtering information that is actually relevant often requires complicated search interfaces and long search paths, especially for complex information needs. In this work, a web interface for the exploration and visualization of bibliographic metadata is proposed. The core idea is based on a Domain Specific Query Language (DSQL) called SchenQL, which aims to be easy to learn and intuitive for domain experts as well as casual users who want to retrieve information on bibliographic metadata efficiently. This is achieved by using natural-sounding keywords and functions designed specifically for this domain. In addition, the web interface implements useful visualizations of citations and references or co-author relationships. The interface also offers keyword suggestions and an auto-completion feature that makes it easy to create SchenQL queries without having to learn all the keywords of the language beforehand. A three-part user study with 10 students and employees from the field of computer science was conducted, in which the effectiveness and usability of the SchenQL web interface were evaluated.
- no abstract available -