Text Mining 2020-2021
Course schedule
Each course week consists of a lecture, literature to read, and either a practical exercise (tutorial style) or a hand-in assignment. The lectures are on Wednesdays, 9.15-11.00, online in Kaltura (the URL has been published in Brightspace).
The literature will be distributed on Brightspace. Most of the chapters come from the following book, abbreviated as J&M in the course schedule below.
- J&M: Dan Jurafsky and James H. Martin, Speech and Language Processing (3rd ed), 2019
Week | Lecture | Literature | Exercise / assignment |
---|---|---|---|
1 (2 Sept) | Introduction | Informal break-out | |
2 (9 Sept) | Text processing | J&M chapter 2. Regular Expressions, Text Normalization, Edit Distance | Exercise: Chapter 1 of "Advanced NLP with spaCy" |
3 (16 Sept) | Vector Semantics | J&M chapter 6. Vector Semantics | Exercise: Tutorial about the semantics of word embeddings |
4 (23 Sept) | Text categorization | J&M chapter 4.1-4.3. Naive Bayes Classification; Zhai & Massung chapter 15. Text categorization | Exercise: Text classification tutorial (sklearn) |
5 (30 Sept) | Data collection and annotation | Finin et al. (2010). Annotating Named Entities in Twitter Data with Crowdsourcing; McHugh (2012). Interrater reliability: the kappa statistic | Assignment 1. Text classification (deadline 12 Oct) |
6 (7 Oct) | Neural NLP and transfer learning | J&M chapter 7. Neural Nets and Neural Language Models | Exercise: BERT Fine-Tuning with Huggingface |
7 (14 Oct) | Information Extraction | J&M chapter 18. Information Extraction | Exercise: Sequence labelling tutorial (crfsuite) |
(21 Oct) | No lecture | | |
8 (28 Oct) | Text summarization | Kryściński et al. (2019). Neural Text Summarization: A Critical Evaluation | Assignment 2. Information Extraction (deadline 2 Nov) |
9 (4 Nov) | Sentiment analysis | Select paper for literature assignment (see below) | |
10 (11 Nov) | Biomedical text mining | Fleuren & Alkema (2015). Application of text mining in the biomedical domain | Exercise: Sentiment analysis with BERT |
11 (18 Nov) | Paper presentations (groups) | Selection of recent literature | Assignment 3: literature (presentation 18 Nov) |
12 (25 Nov) | Industrial Text Mining: guest lecture by Mihai Rotaru, TextKernel | Papers for the final assignment (3 papers, one per benchmark task) | |
13 (2 Dec) | Conclusions | | Final assignment: multiple tasks to choose from (deadline 11 Jan) |
The assessment of the course consists of a written exam (50% of the course grade) and practical assignments (50% of the course grade). The practical assignments comprise three smaller assignments (10% each) and one more substantial final assignment (20%). To complete the course, both the grade for the written exam and the weighted average grade for the practical assignments must be 5.5 or higher. If one of the tasks is not submitted, the grade for that task is 0.
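The weighting above can be illustrated with a small sketch. It is not an official grade calculator: it assumes grades on a 1-10 scale, applies no rounding (the rounding rules are not stated here), and the function names `practical_average` and `course_result` are made up for this example.

```python
from typing import Optional, Sequence, Tuple

# Weights in percent of the course grade, taken from the assessment rules above.
EXAM_WEIGHT = 50
SMALL_ASSIGNMENT_WEIGHT = 10  # each of the three smaller assignments
FINAL_ASSIGNMENT_WEIGHT = 20
PASS_THRESHOLD = 5.5


def _or_zero(grade: Optional[float]) -> float:
    """A task that was not submitted (None) is graded 0."""
    return 0.0 if grade is None else grade


def practical_average(small: Sequence[Optional[float]],
                      final: Optional[float]) -> float:
    """Weighted average of the three smaller assignments and the final one."""
    if len(small) != 3:
        raise ValueError("expected exactly three smaller assignments")
    weighted_sum = (SMALL_ASSIGNMENT_WEIGHT * sum(_or_zero(g) for g in small)
                    + FINAL_ASSIGNMENT_WEIGHT * _or_zero(final))
    total_weight = 3 * SMALL_ASSIGNMENT_WEIGHT + FINAL_ASSIGNMENT_WEIGHT
    return weighted_sum / total_weight


def course_result(exam: float,
                  small: Sequence[Optional[float]],
                  final: Optional[float]) -> Tuple[float, bool]:
    """Return (unrounded course grade, whether both 5.5 thresholds are met)."""
    if len(small) != 3:
        raise ValueError("expected exactly three smaller assignments")
    grade = (EXAM_WEIGHT * exam
             + SMALL_ASSIGNMENT_WEIGHT * sum(_or_zero(g) for g in small)
             + FINAL_ASSIGNMENT_WEIGHT * _or_zero(final)) / 100
    passed = (exam >= PASS_THRESHOLD
              and practical_average(small, final) >= PASS_THRESHOLD)
    return grade, passed
```

Under these assumptions, `course_result(6.0, [7, 8, 6], 7.5)` gives a course grade of about 6.6 with both thresholds met, while an exam grade of 5.0 with the same assignment grades does not complete the course even though the weighted course grade would be above 5.5.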
Earlier editions of this course
Link to the course page for this course in 2019-2020
Link to the course page for this course in 2018-2019
Papers for the literature assignment
(choose 1 per group from the list)
- Fundamental: Tenney, I., Das, D., & Pavlick, E. (2019, July). BERT Rediscovers the Classical NLP Pipeline. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (pp. 4593-4601). https://www.aclweb.org/anthology/P19-1452.pdf
- Evaluation: Marco Tulio Ribeiro, Tongshuang Wu, Carlos Guestrin, Sameer Singh (2020, July). Beyond Accuracy: Behavioral Testing of NLP Models with CheckList. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (pp. 4902–4912). https://www.aclweb.org/anthology/2020.acl-main.442.pdf
- Data collection and annotation: Alex Brandsen, Suzan Verberne, Milco Wansleeben, Karsten Lambers (2020). Creating a Dataset for Named Entity Recognition in the Archaeology Domain. In Proceedings of the 12th Conference on Language Resources and Evaluation (LREC 2020), pages 4573–4577. https://www.aclweb.org/anthology/2020.lrec-1.562.pdf
- Chinese Named Entity Recognition: Peng, N., & Dredze, M. (2016, August). Improving Named Entity Recognition for Chinese Social Media with Word Segmentation Representation Learning. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers) (pp. 149-155). https://www.aclweb.org/anthology/P16-2025.pdf
- Sentiment analysis: Ruder, S., Ghaffari, P., & Breslin, J. G. (2016, November). A Hierarchical Model of Reviews for Aspect-based Sentiment Analysis. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing (pp. 999-1005). https://www.aclweb.org/anthology/D16-1103.pdf
- Summarization: Liu, Y., & Lapata, M. (2019, November). Text Summarization with Pretrained Encoders. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP) (pp. 3721-3731). https://arxiv.org/pdf/1908.08345v2.pdf
- Bio-informatics: Lee, J., Yoon, W., Kim, S., Kim, D., Kim, S., So, C. H., & Kang, J. (2020). BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics, 36(4), 1234-1240. https://academic.oup.com/bioinformatics/article/36/4/1234/5566506