Bachelor's and Master's Theses
Bachelor's and Master's theses can be written in German or, by arrangement, in English.
Topics
In general, we offer topics from the fields of databases, information retrieval and semantic information systems. More specifically, most of our topics fall into one or more of the following areas: searching on semistructured data, integration of heterogeneous information sources, efficiency of large-scale search engines, conversational information retrieval, natural language processing, human-computer interaction, data integration, query processing, semantic web, computational argumentation (ranking, clustering, validating and extracting arguments from natural language texts), scholarly recommendation systems, domain-specific query languages and scientometrics.
The topic of a thesis determines who supervises it. Each advisor's thematic focus is listed on their personal page under Team.
If you are interested in a topic suggested by the chair or if you have your own topic suggestion for a Bachelor's or Master's thesis, please contact Prof. Dr. Ralf Schenkel. If you have already spoken with staff of the chair about a possible topic, please mention this in your email.
Requirements
Please send us a list of your successfully completed modules with your request for a thesis. This overview helps us to assess which possible topic might fit your skills.
For a Bachelor's thesis, we expect you to have successfully completed the following modules before you apply for a topic with us, provided they are compulsory in your module plan. Their content is very helpful for completing a Bachelor's thesis on our topics: Database Systems (Datenbanksysteme), Non-Relational Information Systems (Nichtrelationale Informationssysteme), CS-Project (Informatik-Projekt or Großes Studienprojekt), Advanced Programming (Fortgeschrittene Programmierung or Programmierung II).
Completed Bachelor's theses
- no abstract available -
- no abstract available -
- no abstract available -
- no abstract available -
- no abstract available -
Completed Master's theses
Abstract: This thesis presents an approach to detecting duplicate bookings by computing sentence similarity, as an application of Natural Language Processing. The bookings are exports from accounting software. Among much other information, each booking has a booking note, a short text written by the person who created the booking in the accounting software. The presented approach is part of a larger project in which all booking information is analyzed, but in this thesis only the textual information of the notes is used to determine the similarity of two bookings. Several models are used to calculate the similarity of booking pairs, and their results are compared. One important research objective is the comparison of TF-IDF, as an application of the vector space model, with language models such as BERT and Sentence-BERT, which use word and sentence embedding vectors. The best models achieve an F1 score of 0.6004 and an AUC score of 0.555. A thorough analysis of true positives, false positives and false negatives shows that embedding vectors offer advantages but also introduce new challenges when short texts are analyzed.
Keywords: Natural Language Processing - Duplicate Detection - Accounting - Short Texts
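To illustrate the vector space baseline mentioned in the abstract above, the following is a minimal sketch of TF-IDF weighting with cosine similarity on short notes. It is not the thesis implementation: the whitespace tokenizer, the smoothed idf formula, the function names and the sample notes are all assumptions made for illustration.

```python
import math
from collections import Counter


def tfidf_vectors(texts):
    """Build one sparse TF-IDF vector (dict: term -> weight) per text."""
    # Assumption: simple lowercase whitespace tokenization.
    tokenized = [text.lower().split() for text in texts]
    n_docs = len(tokenized)
    # Document frequency: number of notes that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Assumption: smoothed idf, so every weight stays positive.
        vectors.append(
            {t: c * (math.log((1 + n_docs) / (1 + df[t])) + 1) for t, c in tf.items()}
        )
    return vectors


def cosine(u, v):
    """Cosine similarity of two sparse vectors; 0.0 if either is empty."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical booking notes, invented for this example.
notes = [
    "invoice 4711 office supplies",
    "office supplies invoice 4711",
    "travel expenses berlin",
]
vecs = tfidf_vectors(notes)
sim_dup = cosine(vecs[0], vecs[1])    # same words in a different order
sim_other = cosine(vecs[0], vecs[2])  # no shared words
```

Because TF-IDF ignores word order, the first two notes receive a similarity of (numerically) 1, while notes without shared terms receive 0. Notes that describe the same booking with different wording would also score 0, which is one reason to compare this baseline against embedding-based models.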
- no abstract available -
- no abstract available -
- no abstract available -
- no abstract available -
Abstract: Argumentation Mining aims at automatically extracting structured arguments from unstructured textual documents. This work addresses a cross-lingual argumentation mining task, the detection of argumentative discourse units (ADUs). Our contribution is two-fold: first, we extract a German and French ADU-annotated parallel corpus for further research; second, we use it to compare five state-of-the-art language models (LMs). Following the CRISP-DM framework for data mining, we prepare the data from the popular Europarl corpus by applying topic modeling to reduce the corpus size semantically. The French and German subcorpora are annotated with the labels “non-argumentative”, “claim” and “premise”. Given the human baseline, in the modeling phase the five LMs German BERT, German DistilBERT, CamemBERT, mBERT and mDistilBERT are compared on the sentence classification task. The LMs perform the task with moderate success. There is a performance difference between German and French models, leading to the insight that it is crucial to consider the input language as a feature and not only as a parameter. In addition, the beneficial influence of multilingual pretraining is discussed, which calls for further research.
Abstract: Due to the increasing flood of publications, quality assurance of scientific work is playing an increasingly important role. One of the most important methods for quality assurance of scientific work is the peer review process. In this context, selecting a suitable reviewer for a submitted manuscript is of great importance. However, this process is time-consuming and, if carried out incorrectly, leads to poor reviews. Therefore, the aim of this work is to make this assignment process more efficient and at the same time more objective by automating it. For this purpose, a reviewer recommendation (RR) system was developed and a classification system was provided. The RR system receives a request in the form of a publication as input and suggests a certain number of suitable reviewers. In contrast, the classification system receives a reviewer and a manuscript as input and predicts whether the given reviewer is relevant to the manuscript in question or not. In creating these systems, the effects of different combinations of document representations, similarity measures, levers and voting techniques were also analysed. The results of this work show that both systems can support the assignment process in peer review within their use cases. Furthermore, the evaluation of the RR system shows that the tf-idf method in combination with the cosine measure provides the best results. CombSUM TOP 5, CombSUM TOP 10 and Reciprocal Rank were identified as the best performing voting techniques. The evaluation of the classifiers shows that the SciBERT classifier performs best, with a classification accuracy of 80.2 %.
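The voting techniques named in the abstract above can be sketched as follows. This is a hedged illustration based on the common definitions from expert search, not the thesis code: CombSUM adds up the retrieval scores of a candidate's documents, and Reciprocal Rank adds up the inverse ranks. The reading of "TOP k" as "only the k highest-ranked documents vote" is an assumption, as are the function names and the toy data.

```python
from collections import defaultdict


def comb_sum(ranking, authors_of, top_k=None):
    """Sum the retrieval scores of each candidate's documents.

    ranking: list of (doc_id, score), best first.
    top_k: assumed meaning of "TOP k": only the first k documents vote.
    """
    scores = defaultdict(float)
    for doc_id, score in ranking[:top_k]:  # [:None] keeps the full list
        for author in authors_of[doc_id]:
            scores[author] += score
    return dict(scores)


def reciprocal_rank(ranking, authors_of):
    """Sum 1/rank over each candidate's documents in the ranking."""
    scores = defaultdict(float)
    for rank, (doc_id, _score) in enumerate(ranking, start=1):
        for author in authors_of[doc_id]:
            scores[author] += 1.0 / rank
    return dict(scores)


# Hypothetical similarity ranking for one manuscript and its candidate reviewers.
ranking = [("d1", 0.9), ("d2", 0.8), ("d3", 0.5)]
authors_of = {"d1": ["alice"], "d2": ["bob"], "d3": ["alice", "bob"]}

all_votes = comb_sum(ranking, authors_of)
top2_votes = comb_sum(ranking, authors_of, top_k=2)
rr_votes = reciprocal_rank(ranking, authors_of)
best_reviewer = max(all_votes, key=all_votes.get)
```

In this toy ranking, cutting off at the top two documents removes the co-authored document d3 from the vote, which narrows the gap between the two candidates. Effects of this kind are what a comparison of voting techniques measures.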
Abstract: This paper investigates systems that recognise Named Entities (NEs) and references in German text. The identification of NEs is important in several respects. First, NEs can be used to embed additional information from an external source into a text, for example the office held by a politician. Second, they play a role in recognising references, such as the resolution of personal pronouns. The resolution of references is helpful when only a section of a text is ultimately available to a system. To increase the system's performance, it is advantageous if all references in this section have been resolved correctly. An example of this is the ReCAP project, which processes queries about an assertion and returns isolated nodes containing theses for or against this assertion.
Therefore, in this paper, a corpus of twelve German texts on educational policy is first annotated with regard to the NEs and references they contain. Subsequently, three NE systems as well as two coreference resolution systems are evaluated on these twelve texts. The evaluation of these systems is a time-consuming process that can only be automated to a certain extent. This is mainly because the gold standard has been annotated in such a way that an entity has the maximum information content. However, systems often only recognise a partial string; in such cases, manual evaluation is unavoidable.
Accordingly, the final comparison between the systems is also not trivial. In the recognition of NEs, a distinction was made between exact hits and partial hits between a candidate system and the gold standard. For exact hits, the Stanford Named Entity Recognizer (NER) comes out ahead with an F1 score of 57.67 % and 54.44 %, respectively, depending on how the results are averaged across the texts. When partial hits are taken into account, FLAIR comes first with an F1 score of 72.63 % and 67.44 %, respectively. However, it would be too simplistic to reduce the results to the F1 score alone; the systems have different strengths and weaknesses, for example in the recognition of persons. In fact, the Stanford NER performs worst in this category.
In contrast to Named Entity Recognition, the results of Coreference Resolution are weak. CorZu achieves a maximum F1 score of 27.4 % and IMS HotCoref DE a value of 29.1 %. The systems produce many references that add no information, for example {the students - the students}. When these are ignored, the precision increases from 22.86 % to 41.86 % in the best case.
A final examination on isolated text passages in the ReCAP project, in which a resolution of references was carried out manually in the course of the project, shows that these values are insufficient for use in practice.
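The abstract above reports two F1 values per system, depending on how the per-text results are averaged. One plausible reading, which the abstract does not confirm, is the difference between macro-averaging (mean of the per-text F1 scores) and micro-averaging (F1 over the pooled counts). The sketch below shows both; the counts are invented and do not come from the thesis.

```python
def f1(tp, fp, fn):
    """F1 score from true positives, false positives and false negatives."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def macro_f1(counts):
    """Average the F1 scores computed separately for each text."""
    return sum(f1(tp, fp, fn) for tp, fp, fn in counts) / len(counts)


def micro_f1(counts):
    """Pool the counts of all texts first, then compute a single F1."""
    tp = sum(c[0] for c in counts)
    fp = sum(c[1] for c in counts)
    fn = sum(c[2] for c in counts)
    return f1(tp, fp, fn)


# Hypothetical (tp, fp, fn) counts for two texts of different difficulty.
counts = [(8, 2, 2), (1, 3, 3)]
macro = macro_f1(counts)  # mean of 0.8 and 0.25
micro = micro_f1(counts)  # F1 of the pooled counts (9, 5, 5)
```

Macro-averaging weights every text equally, while micro-averaging weights every entity equally, so texts with many entities dominate. The two values therefore differ whenever the texts vary in size and difficulty.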
- no abstract available -
Abstract: There are many systems for the exploration of bibliographic metadata. However, retrieving and filtering information that is actually relevant often requires complicated search interfaces and long search paths, especially for complex information needs. In this work, a web interface for the exploration and visualization of bibliographic metadata is proposed. The core idea is based on a Domain Specific Query Language (DSQL) called SchenQL, which aims to be easy to learn and intuitive for domain experts as well as casual users, so that they can efficiently retrieve information from bibliographic metadata. This is achieved by using natural-sounding keywords and functions designed specifically for this domain. In addition, the web interface implements useful visualizations of citations and references as well as co-author relationships. The interface also offers keyword suggestions and an auto-completion feature that makes it easy to create SchenQL queries without having to learn all the keywords of the language beforehand. A three-part user study with 10 students and employees from the field of computer science was conducted to evaluate the effectiveness and usability of the SchenQL web interface.
- no abstract available -