Text Mining master course 2020-2021

Teacher: Suzan Verberne
Teaching assistants: Michiel van der Meer, Jeroen Rook, Jan van Staalduinen, Mitch Angenent
Contact address: tmcourse@liacs.leidenuniv.nl

Course schedule

Each course week consists of a lecture, literature to read, and either a practical exercise (tutorial style) or a hand-in assignment. The lectures take place on Wednesdays, 9.15-11.00, online in Kaltura (the URL has been published in Brightspace).

The literature will be distributed on Brightspace. Most of the chapters come from the following book, abbreviated as J&M in the course schedule below.

J&M: Dan Jurafsky and James H. Martin, Speech and Language Processing (3rd ed), 2019

Week	Lecture	Literature	Exercise / assignment
1 (2 Sept)	Introduction		Informal break-out
2 (9 Sept)	Text processing	J&M chapter 2. Regular Expressions, Text Normalization, Edit Distance	Exercise: Chapter 1 of "Advanced NLP with spaCy"
3 (16 Sept)	Vector Semantics	J&M chapter 6. Vector Semantics	Exercise: Tutorial about the semantics of word embeddings
4 (23 Sept)	Text categorization	J&M chapter 4.1-4.3. Naive Bayes Classification; Zhai & Massung chapter 15. Text categorization	Exercise: Text classification tutorial (sklearn)
5 (30 Sept)	Data collection and annotation	Finin (2010). Annotating Named Entities in Twitter Data with Crowdsourcing; McHugh (2012). Interrater reliability: the kappa statistic	Assignment 1. Text classification (deadline 12 Oct)
6 (7 Oct)	Neural NLP and transfer learning	J&M chapter 7. Neural Nets and Neural Language Models	Exercise: BERT Fine-Tuning with Huggingface
7 (14 Oct)	Information Extraction	J&M chapter 18. Information Extraction	Exercise: Sequence labelling tutorial (crfsuite)
(21 Oct)	No lecture
8 (28 Oct)	Text summarization	Kryściński et al (2019). Neural Text Summarization: A Critical Evaluation	Assignment 2. Information Extraction (deadline 2 Nov)
9 (4 Nov)	Sentiment analysis	Select paper for literature assignment (see below)
10 (11 Nov)	Biomedical text mining	Fleuren & Alkema (2015). Application of text mining in the biomedical domain	Exercise: Sentiment analysis with BERT
11 (18 Nov)	Paper presentations (groups)	Selection of recent literature	Assignment 3: literature (presentation 18 Nov)
12 (25 Nov)	Industrial Text Mining: guest lecture by Mihai Rotaru, TextKernel	Papers for the final assignment (3 papers, one per benchmark task)
13 (2 Dec)	Conclusions		Final assignment: multiple tasks to choose from (deadline 11 Jan)

The assessment of the course consists of a written exam (50% of the course grade) and practical assignments (50% of the course grade). The practical assignments comprise three smaller assignments (10% each) and one more substantial final assignment (20%). To complete the course, both the grade for the written exam and the weighted average grade for the practical assignments must be 5.5 or higher. If one of the tasks is not submitted, the grade for that task is 0.
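
The weighting above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not an official grade calculator: the function names are hypothetical, grades are assumed to be on a 10-point scale, and no rounding is applied because the rounding rules are not stated here.

```python
from typing import Optional, Tuple

# Relative weights of the practical assignments (10%, 10%, 10%, 20% of the
# course grade), written as integers so the weighted average is exact.
PRACTICAL_WEIGHTS = (1, 1, 1, 2)
PASS_THRESHOLD = 5.5


def practical_average(a1: Optional[float], a2: Optional[float],
                      a3: Optional[float], final: Optional[float]) -> float:
    """Weighted average of the practical assignments.

    A task that was not submitted (None) counts as 0.
    """
    grades = [g if g is not None else 0.0 for g in (a1, a2, a3, final)]
    total = sum(w * g for w, g in zip(PRACTICAL_WEIGHTS, grades))
    return total / sum(PRACTICAL_WEIGHTS)


def course_result(exam: float, a1: Optional[float], a2: Optional[float],
                  a3: Optional[float], final: Optional[float]) -> Tuple[float, bool]:
    """Return (course grade, completed?) under the stated weighting.

    Assumption: the course grade is the unrounded 50/50 combination of the
    exam grade and the practical average.
    """
    practical = practical_average(a1, a2, a3, final)
    grade = 0.5 * exam + 0.5 * practical
    completed = exam >= PASS_THRESHOLD and practical >= PASS_THRESHOLD
    return grade, completed


if __name__ == "__main__":
    # Third assignment not submitted, so it counts as 0.
    print(course_result(exam=7.0, a1=8.0, a2=6.0, a3=None, final=7.0))
```

In this example the practical average is (8 + 6 + 0 + 2 × 7) / 5 = 5.6, which is still above the 5.5 threshold, so a single missing small assignment does not by itself prevent completing the course.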

Earlier editions of this course

Link to the course page for this course in 2019-2020
Link to the course page for this course in 2018-2019

Papers for the literature assignment

(choose 1 per group from the list)

Fundamental: Tenney, I., Das, D., & Pavlick, E. (2019, July). BERT Rediscovers the Classical NLP Pipeline. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (pp. 4593-4601). https://www.aclweb.org/anthology/P19-1452.pdf
Evaluation: Ribeiro, M. T., Wu, T., Guestrin, C., & Singh, S. (2020, July). Beyond Accuracy: Behavioral Testing of NLP Models with CheckList. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (pp. 4902–4912). https://www.aclweb.org/anthology/2020.acl-main.442.pdf
Data collection and annotation: Brandsen, A., Verberne, S., Wansleeben, M., & Lambers, K. (2020). Creating a Dataset for Named Entity Recognition in the Archaeology Domain. In Proceedings of the 12th Conference on Language Resources and Evaluation (LREC 2020) (pp. 4573–4577). https://www.aclweb.org/anthology/2020.lrec-1.562.pdf
Chinese Named Entity Recognition: Peng, N., & Dredze, M. (2016, August). Improving Named Entity Recognition for Chinese Social Media with Word Segmentation Representation Learning. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers) (pp. 149-155). https://www.aclweb.org/anthology/P16-2025.pdf
Sentiment analysis: Ruder, S., Ghaffari, P., & Breslin, J. G. (2016, November). A Hierarchical Model of Reviews for Aspect-based Sentiment Analysis. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing (pp. 999-1005). https://www.aclweb.org/anthology/D16-1103.pdf
Summarization: Liu, Y., & Lapata, M. (2019, November). Text Summarization with Pretrained Encoders. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP) (pp. 3721-3731). https://arxiv.org/pdf/1908.08345v2.pdf
Bio-informatics: Lee, J., Yoon, W., Kim, S., Kim, D., Kim, S., So, C. H., & Kang, J. (2020). BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics, 36(4), 1234-1240. https://academic.oup.com/bioinformatics/article/36/4/1234/5566506