Text Mining and Retrieval Leiden


Big data in archaeology: harnessing the hidden knowledge in the “graveyard” of Malta reports

PhD project Alex Brandsen

This project will investigate the analysis and indexing of the full corpus of archaeological reports produced over the last 20 years of Malta research, which numbers more than 60,000 reports and is growing quickly. The goal is to establish a visual search and querying service that allows researchers to quickly retrieve the most valuable digital resources, so that they can integrate and synthesise the results into a coherent narrative of the past.

The current focus of the project is to implement Named Entity Recognition to automatically detect archaeological entities (such as artefacts and time periods), and to integrate these into a search engine. A proof of concept has been built and is currently being used as a starting point for discussion and user requirements elicitation with a representative group of end users.

Supervisors: Karsten Lambers (Archaeology), Milco Wansleeben (Archaeology), Suzan Verberne (LIACS)

Knowledge Discovery and Data Mining from patient experience repositories

PhD Project Anne Dirkson

This PhD project, funded by the Dutch SIDN fonds, is part of the Patient Forum Miner (PFM) research programme. Patients often share experiences on internet forums. These experiences often contain valuable information for patients, medical specialists and researchers, but this information is hidden among an abundance of messages offering emotional support. The aim of the PFM programme is to extract the information that is of real value and to formulate hypotheses that can serve as input for further clinical research.

Supervisors: Suzan Verberne (LIACS), Hans Gelderblom (LUMC), Wessel Kraaij (LIACS)

Measuring relevance and relations of Dutch legal publications

PhD project Gineke Wiggers

Legal scholars and professionals are confronted with a rapidly increasing volume of legal publications. Only some of these publications are relevant enough to be cited. This project aims to determine which documents those are, and whether alternative metrics are a reliable way to predict whether documents will be cited, in order to be able to present the most relevant publications to the user first.

Supervisors: Gerrit-Jan Zwenne (Law), Suzan Verberne (LIACS)

Automated text analysis of policy-related documentation

PhD Project Hugo de Vos

The aim of this project is to investigate methods for automatically extracting information from policy documents. Political institutions (like the Council of the European Union and the European Parliament) generate large bodies of text. Such volumes of text cannot possibly be read by a researcher. Unlike human researchers, computers are able to read thousands of documents a day. In this research project we look for ways to utilize this ability of computers for the benefit of political research. Enlarging the number of documents that can be studied in a project allows new questions to be investigated that were impossible to answer before.

Using techniques from Text Mining and Natural Language Processing, we try to search for patterns in the large collections of text created by institutions of the European Union.

Supervisors: Bernard Steunenberg (FGGA), Rik de Ruiter (FGGA), Suzan Verberne (LIACS), Willem Heiser (Statistical Science)

Detecting cross-linguistic syntactic differences automatically

PhD Project Martin Kroon

The main goal of comparative syntactic research is to discover the syntactic principles that all natural languages have in common, but so far it has been impossible to compare large sets of syntactic constructions in large sets of languages systematically and automatically. The online availability of parallel text corpora and software tools to align, enrich, search and analyse them has the potential to make automatic massive systematic cross-linguistic syntactic comparison possible for the first time.

Supervisors: Sjef Barbiers (Humanities), Jan Odijk (Utrecht University), Stéphanie van der Pas (Statistical Science)

Understanding scientific progress by analysing the context of scholarly citations

PhD Project Wout Lamers

The objective of this project is to fundamentally improve our understanding of the ways in which science progresses. Empirical studies have used bibliographic metadata to provide relevant insights, but these studies have failed to tell us how science progresses. Supported by computational advances and improved data access, we propose a large-scale data-driven approach in which scientific progress is studied based on the full text of scientific documents.

Supervisors: Ludo Waltman (CWTS), Nees-Jan van Eck (CWTS), Holger Hoos (LIACS)

Minimal structure modeling

PhD Project Prajit Dhar

Existing work in probabilistic language modeling can mostly be divided into two categories: (i) purely sequential, string-level approaches, which ensure fluency at the local level without any notion of grammaticality and seek improvements through the use of massive training corpora; and (ii) fully structural, tree-based approaches, which model text as the realization of latent tree structures that encode complex grammatical dependencies. This project explores a third way, in which only the structural relations required to produce grammatical sentences in a specific language and task are modeled.

Supervisors: Arianna Bisazza (LIACS), Wessel Kraaij (LIACS)

Digital tools for knowledge extraction for (rare) cancers

Voucher project funded by the Ministry of Health. With 4 cancer patient communities, in collaboration with TNO.
In this project we open up the archives of patient discussion groups, in order to provide access to the valuable experiential knowledge that is contained in those groups. Patients share experiences, and provide informational and emotional support. Making the archives available through a search interface allows patients, researchers, and medical doctors to find information, make connections, verify suspicions from anecdotal evidence, and generate hypotheses for future research.

SmartFile: from keyboard to patient

Funded by RAAK-SIA, in collaboration with Hogeschool Codarts, the startup company 'SmartFile', 10 sports physiotherapy practices and the Dutch Association for Physical Therapy in Sport Healthcare (NVFS).

Sports physiotherapy practices depend on health insurers when determining treatment rates. For those rates, extensive documentation is required, which costs time and money. Every day, therapists spend at least 1 to 2 hours on administrative tasks.

The project aims to "Improve the management of sports physiotherapy practices in the Netherlands by investigating how documentation can be performed faster and more meaningfully". The project addresses three research questions: (1) How can language processing technology be used in sports physiotherapy practices to automatically extract a structured patient record from written text? (2) How can information from the automatically extracted patient records be used as feedback to physical therapy practices? (3) How should the software for faster and more meaningful documentation be designed to meet the wishes, requirements and activities of physical therapists?

The follow-up project, 'Learning from registration', also funded by RAAK-SIA, will focus on the feedback of treatment data to improve the quality of physical therapy care. Text mining, visualization and human-computer interaction play a central role in both projects. The end product of these projects is a new software application for sports physiotherapists that enables meaningful and fast documentation of patient records.

The reach of junk news on Facebook

In this project, a collaboration with Nieuwscheckers, we study the reach of junk news on Facebook. Junk news is the money-driven, low-quality, highly shareable kind of content that is typically distributed on social media as clickbait. We compare the reach and development of commercially motivated Dutch junk news on Facebook to the reach and development of Dutch mainstream news on Facebook.