Text Mining and Retrieval Leiden


Big data in archaeology: harnessing the hidden knowledge in the “graveyard” of Malta reports

PhD project Alex Brandsen

This project will investigate the analysis and indexing of the full corpus of archaeological reports produced over the last 20 years of Malta research, which numbers more than 60,000 reports and is growing quickly. The goal is to establish a visual search and querying service that lets researchers quickly retrieve the most valuable digital resources, so that they can integrate and synthesise the results into a coherent narrative of the past.

The current focus of the project is to implement Named Entity Recognition to automatically detect archaeological entities (such as artefacts and time periods) and to integrate these into a search engine. A proof of concept has been built and is currently being used as a starting point for discussion and user requirements elicitation with a representative group of end users.

Supervisors: Karsten Lambers (Archaeology), Milco Wansleeben (Archaeology), Suzan Verberne (LIACS)

Knowledge Discovery and Data Mining from patient experience repositories

PhD Project Anne Dirkson

This PhD project, funded by the Dutch SIDN fonds, is part of the Patient Forum Miner (PFM) research programme. Patients often share experiences on internet forums, and these experiences can contain valuable information for patients, medical specialists and researchers. This information is, however, hidden among an abundance of messages offering emotional support. The aim of the PFM programme is to extract the information that is of real value and to formulate hypotheses that can serve as input for further clinical research.

Supervisors: Suzan Verberne (LIACS), Wessel Kraaij (LIACS)

Measuring relevance and relations of Dutch legal publications

PhD project Gineke Wiggers

Legal scholars and professionals are confronted with a rapidly increasing volume of legal publications. Only some of these publications are relevant enough to be cited. This project aims to determine which documents these are, and whether alternative metrics are a reliable way to predict whether documents will be cited, so that the most relevant publications can be presented to the user first.

Supervisors: Gerrit-Jan Zwenne (Law), Suzan Verberne (LIACS)

Automated text analysis of policy-related documentation

PhD Project Hugo de Vos

The aim of this project is to investigate methods for automatically extracting information from policy documents. Political institutions (such as the Council of the European Union and the European Parliament) generate large bodies of text. A researcher cannot possibly read such large amounts of text. Unlike human researchers, computers are able to read thousands of documents a day. In this research project we look for ways to use this ability of computers for the benefit of political research. Enlarging the number of documents that can be studied in a project allows new questions to be investigated that were impossible to answer before.

Using techniques from Text Mining and Natural Language Processing, we search for patterns in the large collections of text created by institutions of the European Union.

Supervisors: Bernard Steunenberg (FGGA), Rik de Ruiter (FGGA), Suzan Verberne (LIACS), Willem Heiser (Statistical Science)

Detecting cross-linguistic syntactic differences automatically

PhD Project Martin Kroon

The main goal of comparative syntactic research is to discover the syntactic principles that all natural languages have in common, but so far it has been impossible to compare large sets of syntactic constructions in large sets of languages systematically and automatically. The online availability of parallel text corpora and software tools to align, enrich, search and analyse them has the potential to make automatic massive systematic cross-linguistic syntactic comparison possible for the first time.

Supervisors: Sjef Barbiers (Humanities), Jan Odijk (Utrecht University), Stéphanie van der Pas (Statistical Science)

Understanding scientific progress by analysing the context of scholarly citations

PhD Project Wout Lamers

The objective of this project is to fundamentally improve our understanding of the ways in which science progresses. Empirical studies have used bibliographic metadata to provide relevant insights, but these studies have failed to tell us how science progresses. Supported by computational advances and improved data access, we propose a large-scale data-driven approach in which scientific progress is studied based on the full text of scientific documents.

Supervisors: Ludo Waltman (CWTS), Nees-Jan van Eck (CWTS), Holger Hoos (LIACS)

Other projects

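'Digital tools for knowledge access and policy development concerning (rare) cancers' (Dutch title: 'Digitale instrumenten voor kennisontsluiting en beleidsontwikkeling rond (zeldzame) kankers')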

Voucher project with 4 cancer patient communities, in collaboration with TNO.
In this project we open up the archives of patient discussion groups, in order to provide access to the valuable experiential knowledge that is contained in those groups. Patients share experiences, and provide informational and emotional support. Opening up the archives through a search interface allows patients, researchers, and medical doctors to find information, make connections, verify suspicions from anecdotal evidence, and generate hypotheses for future research.

'From keyboard to patient' (Dutch title: 'Van toetsenbord naar patiënt')

Meaningful and fast record keeping in sports physiotherapy, in collaboration with Codarts, 10 sports physiotherapy practices, and the Nederlandse Vereniging voor Fysiotherapie in de Sportgezondheidszorg (NVFS). Funded by RAAK-SIA.

The reach of junk news on Facebook

In collaboration with Nieuwscheckers.