Text Mining 2023-2024
Course schedule
The course weeks consist of: a lecture, literature to read, and either a practical exercise (tutorial style) or a hand-in assignment.
The lectures are on Wednesday, 9.00-10.45.
Location: GORL / 01 (Gorlaeus building)
The literature will be distributed on Brightspace. The majority of the chapters come from this book, abbreviated as J&M in the course schedule below.
- J&M: Dan Jurafsky and James H. Martin, Speech and Language Processing (3rd ed), 2023
Week | Lecture | Literature | Exercise / assignment |
---|---|---|---|
1 (6 Sept) | Introduction | ||
2 (13 Sept) | Text processing | J&M chapter 2. Regular Expressions, Text Normalization, Edit Distance | Exercise: Chapter 1 of "Advanced NLP with Spacy" |
3 (20 Sept) | Vector Semantics | J&M chapter 6. Vector Semantics; (optional) J&M sections 7.1-7.3. Neural Networks | Exercise: Word Embedding Tutorial: Word2vec with Gensim |
4 (27 Sept) | Text categorization | J&M chapter 4. Naive Bayes Classification | Exercise: Text classification tutorial (sklearn) |
5 (4 Oct) | Data collection and annotation | Finin (2010). Annotating Named Entities in Twitter Data with Crowdsourcing; McHugh (2012). Interrater reliability: the kappa statistic | Assignment 1. Text classification (deadline 10 October) |
6 (11 Oct) | Neural NLP and transfer learning | J&M chapter 10. Transformers and Pretrained Language Models | Exercise: Chapters 2 and 3 of the Huggingface NLP course |
(18 Oct) | No lecture | ||
7 (25 Oct) | Information Extraction | J&M chapter 8. Sequence Labeling for Parts of Speech and Named Entities | Exercise: Token classification tutorial in the Huggingface NLP course |
8 (1 Nov) | Sentiment analysis & Stance detection | J&M chapter 11. Fine-Tuning and Masked Language Models | Assignment 2. Information Extraction (deadline 7 November) |
9 (8 Nov) | Topic Modelling & Text summarization | Zhang et al. (2020). PEGASUS (Abstractive Summarization) | Exercise: Summarization tutorial in the Huggingface NLP course |
10 (15 Nov) | Generative large language models | Brown et al (2020). Language Models are Few-Shot Learners | |
11 (22 Nov) | Industrial Text Mining | Guest lecture by Marzieh Fadaee (Cohere): "How multilingualism shapes LLMs and where to go next" | Paper reading for the final assignment |
12 (29 Nov) | Exam preparation session | ||
13 (6 Dec) | Online lab session | | Final assignment (deadline 5 January) |
(22 Dec) | Exam | ||
(2 Feb) | Re-sit | ||
The assessment of the course consists of a written exam (50% of the course grade) and practical assignments (50% of the course grade). The practical assignments comprise two smaller assignments (10% each) and one more substantial final assignment (30%). To complete the course, both the grade for the written exam and the weighted average grade for the practical assignments must be 5.5 or higher. If a task is not submitted, its grade is 0. Each assignment has a re-sit opportunity (a later submission); the maximum grade for a re-sit assignment is 6.
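The weighting above can be sketched as a small Python function (the weights and the 5.5 threshold come from the text; the function and variable names are just for illustration):

```python
def course_grade(exam, assignment1, assignment2, final_assignment):
    """Return (final grade, passed) under the course's grading scheme."""
    # Weighted average of the practical assignments: 10% + 10% + 30% of the
    # course grade, i.e. weights 0.2, 0.2, 0.6 within the 50% practical part.
    practical = 0.2 * assignment1 + 0.2 * assignment2 + 0.6 * final_assignment
    # Both the exam and the practical average must be 5.5 or higher to pass.
    passed = exam >= 5.5 and practical >= 5.5
    grade = 0.5 * exam + 0.5 * practical
    return grade, passed

# Example: exam 7.0; assignments 6.0 and 8.0; final assignment 7.5.
grade, passed = course_grade(7.0, 6.0, 8.0, 7.5)
print(round(grade, 2), passed)
```

An unsubmitted task would simply be entered as 0, which (per the rules above) makes passing the practical component unlikely without strong grades on the other assignments.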
Earlier editions of this course
Link to the course page for this course in 2022-2023
Link to the course page for this course in 2021-2022
Link to the course page for this course in 2020-2021
Link to the course page for this course in 2019-2020
Link to the course page for this course in 2018-2019