Text Mining 2020-2021

Teacher: Suzan Verberne
Teaching assistants: Michiel van der Meer, Jeroen Rook, Jan van Staalduinen, Mitch Angenent
Contact address: tmcourse@liacs.leidenuniv.nl







Course schedule

Each course week consists of a lecture, literature to read, and either a practical exercise (tutorial style) or a hand-in assignment. The lectures are on Wednesdays, 9.15-11.00, online in Kaltura (the URL has been published in Brightspace).

The literature will be distributed on Brightspace. Most of the chapters come from Jurafsky & Martin, Speech and Language Processing (3rd edition draft), abbreviated as J&M in the course schedule below.


Week | Lecture | Literature | Exercise / assignment
1 (2 Sept) | Introduction | | Informal break-out
2 (9 Sept) | Text processing | J&M chapter 2. Regular Expressions, Text Normalization, Edit Distance | Exercise: Chapter 1 of "Advanced NLP with Spacy"
3 (16 Sept) | Vector Semantics | J&M chapter 6. Vector Semantics | Exercise: Tutorial about the semantics of word embeddings
4 (23 Sept) | Text categorization | J&M chapter 4.1-4.3. Naive Bayes Classification; Zhai & Massung chapter 15. Text categorization | Exercise: Text classification tutorial (sklearn)
5 (30 Sept) | Data collection and annotation | Finin (2010). Annotating Named Entities in Twitter Data with Crowdsourcing; McHugh (2012). Interrater reliability: the kappa statistic | Assignment 1. Text classification (deadline 12 Oct)
6 (7 Oct) | Neural NLP and transfer learning | J&M chapter 7. Neural Nets and Neural Language Models | Exercise: BERT Fine-Tuning with Huggingface
7 (14 Oct) | Information Extraction | J&M chapter 18. Information Extraction | Exercise: Sequence labelling tutorial (crfsuite)
(21 Oct) | No lecture | |
8 (28 Oct) | Text summarization | Kryściński et al (2019). Neural Text Summarization: A Critical Evaluation | Assignment 2. Information Extraction (deadline 2 Nov)
9 (4 Nov) | Sentiment analysis | | Select paper for literature assignment (see below)
10 (11 Nov) | Biomedical text mining | Fleuren & Alkema (2015). Application of text mining in the biomedical domain | Exercise: Sentiment analysis with BERT
11 (18 Nov) | Paper presentations (groups) | Selection of recent literature | Assignment 3: literature (presentation 18 Nov)
12 (25 Nov) | Industrial Text Mining: guest lecture by Mihai Rotaru, TextKernel | Papers for the final assignment (3 papers, one per benchmark task) |
13 (2 Dec) | Conclusions | | Final assignment: multiple tasks to choose from (deadline 11 Jan)


The assessment of the course consists of a written exam (50% of the course grade) and practical assignments (50% of the course grade). The practical assignments comprise three smaller assignments (10% each) and one more substantial final assignment (20%). To complete the course, the grade for the written exam and the weighted average grade for the practical assignments must each be 5.5 or higher. If a task is not submitted, the grade for that task is 0.
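
As an illustration of the weighting above, the following Python sketch computes the weighted practical average and the overall course grade. It is only a reading of the rules as stated here, not an official calculator: the function names are made up for this example, grades are assumed to be on a 0-10 scale, and no rounding is applied because the rounding policy is not specified on this page.

```python
from typing import Optional, Sequence, Tuple

# Weights and pass mark as stated in the assessment paragraph.
EXAM_WEIGHT = 0.5
SMALL_ASSIGNMENT_WEIGHT = 0.1  # each of the three smaller assignments
FINAL_ASSIGNMENT_WEIGHT = 0.2
PASS_MARK = 5.5


def practical_average(small: Sequence[Optional[float]],
                      final: Optional[float]) -> float:
    """Weighted average of the practical assignments (hypothetical helper).

    A task that was not submitted is passed as None and counts as 0.
    """
    if len(small) != 3:
        raise ValueError("expected exactly three smaller assignments")
    small_scores = [0.0 if g is None else g for g in small]
    final_score = 0.0 if final is None else final
    weighted = (SMALL_ASSIGNMENT_WEIGHT * sum(small_scores)
                + FINAL_ASSIGNMENT_WEIGHT * final_score)
    # The practical part carries the remaining 50% of the course grade.
    return weighted / (1 - EXAM_WEIGHT)


def course_result(exam: float,
                  small: Sequence[Optional[float]],
                  final: Optional[float]) -> Tuple[float, bool]:
    """Return (unrounded course grade, whether both 5.5 thresholds are met)."""
    practical = practical_average(small, final)
    grade = EXAM_WEIGHT * exam + (1 - EXAM_WEIGHT) * practical
    passed = exam >= PASS_MARK and practical >= PASS_MARK
    return grade, passed


if __name__ == "__main__":
    # Example: one smaller assignment was not submitted (None counts as 0).
    grade, passed = course_result(7.0, [8.0, 6.0, None], 7.5)
    print(f"course grade: {grade:.2f}, requirements met: {passed}")
```

Note that the two thresholds are checked separately, so a high exam grade cannot compensate for a practical average below 5.5, and vice versa.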


Earlier editions of this course

Link to the course page for this course in 2019-2020
Link to the course page for this course in 2018-2019



Papers for the literature assignment

(choose 1 per group from the list)
  • Fundamental: Tenney, I., Das, D., & Pavlick, E. (2019, July). BERT Rediscovers the Classical NLP Pipeline. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (pp. 4593-4601). https://www.aclweb.org/anthology/P19-1452.pdf
  • Evaluation: Marco Tulio Ribeiro, Tongshuang Wu, Carlos Guestrin, Sameer Singh (2020, July). Beyond Accuracy: Behavioral Testing of NLP Models with CheckList. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (pp. 4902–4912). https://www.aclweb.org/anthology/2020.acl-main.442.pdf
  • Data collection and annotation: Alex Brandsen, Suzan Verberne, Milco Wansleeben, Karsten Lambers (2020). Creating a Dataset for Named Entity Recognition in the Archaeology Domain. In Proceedings of the 12th Conference on Language Resources and Evaluation (LREC 2020) (pp. 4573–4577). https://www.aclweb.org/anthology/2020.lrec-1.562.pdf
  • Chinese Named Entity Recognition: Peng, N., & Dredze, M. (2016, August). Improving Named Entity Recognition for Chinese Social Media with Word Segmentation Representation Learning. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers) (pp. 149-155). https://www.aclweb.org/anthology/P16-2025.pdf
  • Sentiment analysis: Ruder, S., Ghaffari, P., & Breslin, J. G. (2016, November). A Hierarchical Model of Reviews for Aspect-based Sentiment Analysis. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing (pp. 999-1005). https://www.aclweb.org/anthology/D16-1103.pdf
  • Summarization: Liu, Y., & Lapata, M. (2019, November). Text Summarization with Pretrained Encoders. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP) (pp. 3721-3731). https://arxiv.org/pdf/1908.08345v2.pdf
  • Bio-informatics: Lee, J., Yoon, W., Kim, S., Kim, D., Kim, S., So, C. H., & Kang, J. (2020). BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics, 36(4), 1234-1240. https://academic.oup.com/bioinformatics/article/36/4/1234/5566506