Text Mining 2021-2022

Teacher: Suzan Verberne
Teaching assistants: Michiel van der Meer, Cheyenne Health, Hainan Yu, Juan Bascur Cifuentes
Contact address: tmcourse@liacs.leidenuniv.nl







Course schedule

Each course week consists of a lecture, literature to read, and either a practical exercise (tutorial style) or a hand-in assignment. The lectures are on Wednesdays, 9.15-11.00, and are scheduled to be on-campus/hybrid.

Location: Sitterzaal, Huygens building (Oort entrance, across from Snellius). The maximum number of students in the room is 75. In practice, this most likely means that everyone who wants to and is able to attend the lecture in person can do so. The lectures are livestreamed on the university's Mediasite. For interaction (without sound) we use a Zoom room (shared on Brightspace).

The literature will be distributed on Brightspace. Most of the chapters come from this book, abbreviated as J&M in the course schedule below.


| Week | Lecture | Literature | Exercise / assignment |
|---|---|---|---|
| 1 (8 Sept) | Introduction (slides) | | |
| 2 (15 Sept) | Text processing (slides) | J&M chapter 2. Regular Expressions, Text Normalization, Edit Distance | Exercise: Chapter 1 of "Advanced NLP with Spacy" |
| 3 (22 Sept) | Vector Semantics | J&M chapter 6. Vector Semantics | Exercise: Word Embedding Tutorial: Word2vec with Gensim |
| 4 (29 Sept) | Text categorization | J&M chapter 4.1-4.3. Naive Bayes Classification | Exercise: Text classification tutorial (sklearn) |
| 5 (6 Oct) | Data collection and annotation | Finin (2010). Annotating Named Entities in Twitter Data with Crowdsourcing; McHugh (2012). Interrater reliability: the kappa statistic | Assignment 1. Text classification (deadline 18 Oct) |
| (13 Oct) | No lecture | | |
| 6 (20 Oct) | Information Extraction | J&M chapter 18. Information Extraction | Exercise: Sequence labelling tutorial (crfsuite) |
| (26 Oct) | No lecture | | |
| 7 (3 Nov) | Neural NLP and transfer learning | J&M chapter 7. Neural Nets and Neural Language Models | Exercise: BERT Fine-Tuning with Huggingface |
| 8 (10 Nov) | Text summarization | Kryściński et al. (2019). Neural Text Summarization: A Critical Evaluation | Assignment 2. Information Extraction (deadline 15 Nov) |
| 9 (17 Nov) | Sentiment analysis | | Exercise: Sentiment analysis with BERT |
| 10 (24 Nov) | Biomedical text mining | Lee et al. (2020). BioBERT: a pre-trained biomedical language representation model for biomedical text mining | |
| 11 (1 Dec) | Industrial Text Mining: guest lecture | Paper reading for the final assignment | |
| 12 (8 Dec) | Conclusions | | Final assignment: multiple topics to choose from (deadline 16 Jan) |
| (13 Jan) | Exam | | |
| (4 Feb) | Re-sit | | |


The assessment of the course consists of a written exam (50% of the course grade) and practical assignments (50% of the course grade). The practical assignments comprise two smaller assignments (10% each) and one more substantial final assignment (30%). To complete the course, both the grade for the written exam and the weighted average grade for the practical assignments must be 5.5 or higher. If one of the tasks is not submitted, the grade for that task is 0.
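
The grading rules above can be sketched as a small calculation. This is only an illustration, not an official grade calculator: the function and variable names are hypothetical, the weighted average of the practical assignments is assumed to use the stated 10/10/30 weights, and no rounding is applied because the course page does not specify any.

```python
from typing import Dict, Optional, Tuple

# Weights in percent of the course grade, as stated on this page.
EXAM_WEIGHT = 50
ASSIGNMENT_WEIGHTS = {
    "assignment_1": 10,
    "assignment_2": 10,
    "final_assignment": 30,
}
PASS_THRESHOLD = 5.5


def practical_average(grades: Dict[str, Optional[float]]) -> float:
    """Weighted average of the practical assignments.

    A task that is missing (absent or None) counts as grade 0.
    Assumption: the average is weighted by the 10/10/30 course weights.
    """
    total_weight = sum(ASSIGNMENT_WEIGHTS.values())
    weighted_sum = sum(
        weight * (grades.get(name) or 0.0)
        for name, weight in ASSIGNMENT_WEIGHTS.items()
    )
    return weighted_sum / total_weight


def course_result(
    exam: float, grades: Dict[str, Optional[float]]
) -> Tuple[float, bool]:
    """Return (unrounded course grade, whether both 5.5 minimums are met)."""
    practical = practical_average(grades)
    practical_weight = 100 - EXAM_WEIGHT
    course_grade = (EXAM_WEIGHT * exam + practical_weight * practical) / 100
    completed = exam >= PASS_THRESHOLD and practical >= PASS_THRESHOLD
    return course_grade, completed


if __name__ == "__main__":
    example = {"assignment_1": 8.0, "assignment_2": 6.0, "final_assignment": 7.0}
    print(course_result(7.0, example))
```

For example, with an exam grade of 7.0 and assignment grades of 8.0, 6.0, and 7.0, the practical average is 7.0 and the course grade is 7.0. A high practical average does not compensate for an exam grade below 5.5, because both minimums are checked separately.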


Earlier editions of this course

Link to the course page for this course in 2020-2021
Link to the course page for this course in 2019-2020
Link to the course page for this course in 2018-2019