Master's Theses in Data Science

Theses in Data Science are assigned twice a year by the Examination Board in a central process. Outside of this process, we can only assign topics in rare exceptional cases.

We supervise external theses only in exceptional cases, and only if the task fits well with the research topics of the professorship. Please contact Prof. Schenkel directly if you would like to propose a Master's thesis topic to work on outside the university.

Examples of recently completed Master's theses

[MT] Generation of recommendations for reviewers of scientific publications

Abstract: Due to the growing flood of publications, quality assurance of scientific work is playing an increasingly important role. One of the most important methods of quality assurance is the peer review process. In this context, selecting a suitable reviewer for a submitted manuscript is of great importance. However, this selection is time-consuming and, if done poorly, leads to poor reviews. The aim of this work is therefore to make the assignment process more efficient and at the same time more objective by automating it. For this purpose, a Reviewer Recommendation (RR) system was developed on the one hand and a classification system was provided on the other. The RR system receives a publication as its query and suggests a given number of suitable reviewers. The classification system, in contrast, receives a reviewer and a manuscript as input and predicts whether the reviewer is relevant to the manuscript. In building these systems, the effects of different combinations of document representations, similarity measures, levers and voting techniques were also analysed. The results show that both systems can support reviewer assignment within their respective use cases. Furthermore, the evaluation of the RR system shows that tf-idf in combination with cosine similarity provides the best results. CombSUM TOP 5, CombSUM TOP 10 and Reciprocal Rank were identified as the best-performing voting techniques. In the evaluation of the classifiers, the SciBERT classifier performed best with a classification accuracy of 80.2 %.
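
The combination of tf-idf, cosine similarity and a CombSUM-style vote named in the abstract can be illustrated with a minimal sketch. It is not the thesis implementation: the function names, the whitespace tokeniser, the smoothed idf variant and the reading of "CombSUM TOP k" as "sum the similarity scores of each reviewer's papers among the k most similar papers" are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Naive whitespace tokeniser (assumption; a real system would normalise more).
    return text.lower().split()


def tfidf_vectors(docs):
    """Return one sparse tf-idf vector (term -> weight) per tokenised document."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    idf = {term: math.log(n / df[term]) + 1.0 for term in df}
    return [{t: c * idf[t] for t, c in Counter(doc).items()} for doc in docs]


def cosine(a, b):
    """Cosine similarity of two sparse vectors; 0.0 if either is empty."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def recommend(manuscript, papers, top_k=5):
    """Rank reviewers for a manuscript.

    papers: list of (reviewer, text) pairs, one per previously published paper.
    Each of the top_k most similar papers votes for its reviewer with its score.
    """
    docs = [tokenize(text) for _, text in papers] + [tokenize(manuscript)]
    vectors = tfidf_vectors(docs)
    query = vectors[-1]
    scored = sorted(
        ((cosine(query, vec), reviewer) for vec, (reviewer, _) in zip(vectors, papers)),
        reverse=True,
    )
    votes = defaultdict(float)
    for score, reviewer in scored[:top_k]:
        votes[reviewer] += score
    return sorted(votes.items(), key=lambda item: item[1], reverse=True)
```

Calling `recommend("neural ranking for search", papers)` with a handful of `(reviewer, text)` pairs returns the reviewers ordered by their summed similarity; a smaller `top_k` lets fewer papers vote.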

[MT] Methods for resolving references in argument structures in the German language

Abstract: This thesis investigates systems for recognising Named Entities (NEs) and references in German text. Identifying NEs is important in two respects. First, they can be used to embed additional information from an external source into a text, for example the office held by a politician. Second, they play a role in recognising references, such as resolving personal pronouns. Resolving references is helpful when a system ultimately has only a section of a text available. To improve its performance, all references in that section should have been resolved correctly. An example is the ReCAP project, which processes queries about an assertion and returns isolated nodes containing theses for or against this assertion.

Therefore, a corpus of twelve German texts on education policy is first annotated with the NEs and references they contain. Subsequently, three NE systems and two coreference resolution systems are evaluated on these twelve texts. Evaluating these systems is a time-consuming process that can only be automated to a certain extent. This is mainly because the gold standard has been annotated in such a way that each entity carries the maximum information content. However, systems often recognise only a partial string; in such cases, manual evaluation is unavoidable.

Accordingly, the final comparison between the systems is not trivial either. In the recognition of NEs, a distinction was made between exact hits and partial hits between a candidate system and the gold standard. For exact hits, the Stanford Named Entity Recognizer (NER) comes out ahead with an F1 score of 57.67 % or 54.44 %, depending on how the results are averaged over the individual texts. When partial hits are taken into account, FLAIR comes first with an F1 score of 72.63 % or 67.44 %, respectively. However, it would be too simplistic to reduce the results to the F1 score alone; the systems have different strengths and weaknesses, for example in the recognition of persons. In fact, the Stanford NER performs worst in this category.
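
The abstract does not say which two averaging schemes lie behind the pairs of F1 values. One common reading, assumed here purely for illustration, is macro averaging (compute F1 per text, then take the mean) versus micro averaging (pool the counts of all texts, then compute one F1). The sketch below shows how the two can differ; the counts are made up and are not taken from the thesis.

```python
def f1(tp, fp, fn):
    """F1 score from true positives, false positives and false negatives."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def macro_f1(counts):
    """Mean of the per-text F1 scores; every text has the same weight."""
    return sum(f1(tp, fp, fn) for tp, fp, fn in counts) / len(counts)


def micro_f1(counts):
    """F1 over the pooled counts; texts with more entities weigh more."""
    tp = sum(c[0] for c in counts)
    fp = sum(c[1] for c in counts)
    fn = sum(c[2] for c in counts)
    return f1(tp, fp, fn)


# Hypothetical (tp, fp, fn) counts for two texts, for illustration only.
example_counts = [(8, 2, 2), (1, 3, 9)]
print(f"macro F1: {macro_f1(example_counts):.4f}")
print(f"micro F1: {micro_f1(example_counts):.4f}")
```

Whether a hit counts as a true positive depends on the matching rule: under exact matching a partially recognised string is a miss, under partial matching it is a hit, which is why the abstract reports separate figures for the two cases.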

In contrast to Named Entity Recognition, the results of Coreference Resolution are weak. CorZu achieves a maximum F1 score of 27.4 % and IMS HotCoref DE a value of 29.1 %. The systems produce many references that add no information, for example {the students - the students}. When these are ignored, the precision increases from 22.86 % to 41.86 % in the best case.

A final examination on isolated text passages in the ReCAP project, in which a resolution of references was carried out manually in the course of the project, shows that these values are insufficient for use in practice.

[BT] Predicting Paper Impact based on Citation Networks

 - no abstract available -

[BT] Connecting Linked Data and Web APIs using SPARQL

Abstract: Databases are used to store information, and it is therefore essential that they are complete. In reality, however, databases have gaps, and methods are needed to supplement the missing information. Existing Linked Data systems use interfaces (SPARQL endpoints) for this purpose, which are not provided by all data providers. The common solution in practice is to provide a Web API so that information can still be requested. In order to supplement missing information via Web APIs, a programme is implemented in this thesis that connects Linked Data systems and Web APIs. The programme developed in this thesis, ExtendedSPARQL, can thus completely answer a query to the local knowledge base by filling in missing information on-the-fly with the help of external Web APIs. In doing so, the programme decides which external Web APIs are relevant for the missing information and how to query them. It also decides how to extract the information it is looking for from the Web API responses and how to add it to the results of the query. Furthermore, ExtendedSPARQL executes as few Web API requests as possible, so that missing information is added with the least effort and redundant information is avoided. It is also easy to use, so that even users with only basic SPARQL knowledge can successfully perform ExtendedSPARQL queries. ExtendedSPARQL also provides a graphical user interface, which makes it even easier to use. A subsequent evaluation showed that missing information can be successfully added using external Web APIs and that redundant results rarely occur.
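
The idea of filling gaps in local query results with as few Web API requests as possible can be sketched as follows. This is not how ExtendedSPARQL is implemented: the function `fill_missing`, its parameters and the `fetch` callable standing in for a Web API client are hypothetical, and the selection of relevant APIs and the extraction of values from responses described in the abstract are left out.

```python
def fill_missing(rows, key, field, fetch):
    """Return copies of rows with missing `field` values filled in.

    rows:  query results as dicts; a missing value is represented by None.
    key:   name of the field that identifies the entity (e.g. a DOI).
    fetch: callable standing in for a Web API lookup (hypothetical); it is
           called at most once per distinct key, and only for rows with gaps.
    """
    cache = {}
    filled = []
    for row in rows:
        row = dict(row)  # leave the caller's data untouched
        if row.get(field) is None:
            identifier = row[key]
            if identifier not in cache:
                cache[identifier] = fetch(identifier)
            row[field] = cache[identifier]
        filled.append(row)
    return filled
```

Rows that are already complete never trigger a request, and the cache ensures that the same entity is not requested twice within one query.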

[BT] Appropriate Journal Search for Publications

Abstract: Researchers are normally not familiar with the thematic orientation of all journals and conferences in their field of research. As soon as researchers want to publish their work, they face the problem of finding a suitable journal or conference to which to submit the paper. The aim of this thesis is the development of a recommender system that finds suitable journals and conferences for a given publication title. The system is based on data from dblp and Semantic Scholar, which contain titles of publications as well as their abstracts and keywords. Different methods for determining the similarity and relevance of papers were investigated. These include tf-idf, BM25 and cosine similarity in conjunction with Doc2Vec. Various techniques were analysed in order to find and rank the journals and conferences associated with the corresponding papers. In addition, methods were developed to improve the results of the recommender system, such as taking into account the number of citations of journals and conferences. The methods were evaluated automatically and manually. It turned out that cosine similarity with Doc2Vec did not achieve good results, in contrast to the other two methods. To improve the usability of the recommender system, a visualisation in the form of a web service was implemented.
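
As an illustration of one of the methods named in the abstract, the following sketch scores papers against a title with the standard Okapi BM25 formula and then ranks venues. The parameter values `k1=1.5` and `b=0.75` are common defaults, not values from the thesis, and ranking each venue by its best-scoring paper is only one possible aggregation, chosen here as an assumption.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Okapi BM25 score of every tokenised document for a tokenised query."""
    n = len(docs)
    avg_len = sum(len(doc) for doc in docs) / n
    df = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            idf = math.log((n - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def rank_venues(title, papers):
    """papers: list of (venue, text). Rank venues by their best paper score."""
    docs = [text.lower().split() for _, text in papers]
    scores = bm25_scores(title.lower().split(), docs)
    best = {}
    for (venue, _), score in zip(papers, scores):
        best[venue] = max(best.get(venue, 0.0), score)
    return sorted(best.items(), key=lambda item: item[1], reverse=True)
```

Other aggregations, such as summing the scores of a venue's top papers or weighting by citation counts as mentioned in the abstract, would replace the `max` step.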

[BT] A visual query language for SPARQL

Since the development of the Semantic Web by Tim Berners-Lee, more and more information is being published on the internet as Linked Open Data. These are specially designed to be analysed by machines. All elements are given unique identifiers. The elements can then be linked to each other via relations and form ever larger networks. The result is a "Giant Global Graph" in which all things of interest can be referenced. 

But while the amount of data in the Semantic Web is constantly growing, only a few people can use it. Searching for information is difficult because users need prior knowledge. On the one hand, they need to know how the data in the graph are connected and how they are labelled. On the other hand, they need to know the query language SPARQL, which is used to query data sources in the Semantic Web. The visual query language developed in this work makes it easier for users to get started and thus enables even non-experts to search the Semantic Web for information. Instead of writing a query, the user constructs it graphically from prefabricated elements. For this purpose, the Visual Query Builder programme, which implements such a visual query language, was developed in this work. Once a schema is specified for the respective data endpoint, the user is shown the elements available for use and can thus see which elements exist and which attributes they have. The programme and the underlying visual query language were then evaluated by a group of test users. The evaluation showed that Visual Query Builder enables both beginners and advanced users to successfully search a data source in the Semantic Web for desired information. Particular attention was paid to the usability of the application, which achieved good results in both test procedures used.
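
What "constructing a query from prefabricated elements" can amount to under the hood is sketched below: structured elements (variables and triple patterns) are assembled into the text of a SPARQL SELECT query. The function `build_select` and its parameters are hypothetical and do not describe the internals of Visual Query Builder; the FOAF vocabulary is used only as a familiar example.

```python
def build_select(variables, patterns, prefixes=None, limit=None):
    """Assemble a SPARQL SELECT query from structured elements.

    variables: variable names without the leading '?'.
    patterns:  (subject, predicate, object) triples, already in SPARQL syntax.
    prefixes:  optional mapping of prefix -> namespace IRI.
    limit:     optional maximum number of results.
    """
    lines = [f"PREFIX {name}: <{iri}>" for name, iri in (prefixes or {}).items()]
    lines.append("SELECT " + " ".join(f"?{v}" for v in variables))
    lines.append("WHERE {")
    lines.extend(f"  {s} {p} {o} ." for s, p, o in patterns)
    lines.append("}")
    if limit is not None:
        lines.append(f"LIMIT {limit}")
    return "\n".join(lines)


# Example: names of all persons, using the FOAF vocabulary.
print(build_select(
    variables=["name"],
    patterns=[("?person", "a", "foaf:Person"),
              ("?person", "foaf:name", "?name")],
    prefixes={"foaf": "http://xmlns.com/foaf/0.1/"},
    limit=10,
))
```

A graphical front end would create the `patterns` list from the boxes and edges the user draws, so the user never has to write the query text.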

[BT] Hybrid SPARQL queries via Linked Data and Web APIs

Digital libraries, such as dblp or the German National Library (DNB), aim to bring knowledge together online and make it available via the internet. Unfortunately, incomplete data sets are part of everyday life for a digital library. Missing information, such as titles or author names, could be added using external Web APIs. The main problem is integrating the external data into the local database, since a common schema describing the structure of the data must first be found. This is the main task of schema integration, a subfield of information integration and data migration. The ActiveSPARQL programme designed in this thesis exploits schema integration to use data from Web APIs to answer queries on-the-fly. When a user submits a query to the application, both the data from the local database and the external Web APIs should be used to answer it satisfactorily. A query that uses both sources is called a hybrid request. The design is based on the existing framework ANGIE. In contrast to ANGIE, it does not generate a wrapper to answer the query but an extended SPARQL query. In addition, ANGIE requires the access methods of the Web APIs to be declared manually. This step can be automated by the AID4SPARQL programme, which finds linkage points between the local and external data and thus ensures that external information is compatible with the local data. The results from AID4SPARQL are prepared in such a way that they can be used as a configuration for communication with Web APIs. In addition to ActiveSPARQL, a Web interface was designed to enable non-experts to create and execute hybrid queries without prior knowledge. Finally, a concept for evaluating the framework is presented, which can be used to compare ANGIE and ActiveSPARQL.