Text Mining 2019-2020
Course schedule
Each course week consists of a lecture, literature to read, and either a practical exercise (tutorial style) or a hand-in assignment. Lectures take place on Wednesdays, 9.15-11.00, in Snellius 407-409.
We use two textbooks in this course:
- Z&M: ChengXiang Zhai and Sean Massung, Text Data Management and Analysis (First edition), 2016
- J&M: Dan Jurafsky and James H. Martin, Speech and Language Processing (3rd ed. draft), 2018
Week | Lecture | Literature | Exercise / assignment |
---|---|---|---|
1 (4 Sept) | Introduction | Z&M chapter 1. Introduction | |
2 (11 Sept) | Text processing | J&M chapter 2. Regular Expressions, Text Normalization, Edit Distance | Exercise: Pre-processing tutorial |
3 (18 Sept) | Vector Semantics | J&M chapter 6. Vector Semantics | Exercise: Word embeddings tutorial |
4 (25 Sept) | Text categorization | J&M chapter 4.1-4.3. Naive Bayes Classification; Z&M chapter 15. Text categorization | Exercise: Text classification tutorial (sklearn) |
5 (2 Oct) | Data collection and annotation | Finin (2010). Annotating Named Entities in Twitter Data with Crowdsourcing; McHugh (2012). Interrater reliability: the kappa statistic | Assignment 1. Text classification |
6 (9 Oct) | Neural NLP and transfer learning (slides) | J&M chapter 7. Neural Nets and Neural Language Models | Exercise: BERT Fine-Tuning with PyTorch |
(16 Oct) | No lecture | | |
7 (23 Oct) | Information Extraction | J&M chapter 17. Information Extraction | Exercise: Sequence labelling tutorial (crfsuite) |
8 (30 Oct) | Text summarization | Z&M chapter 16. Summarization; Kryściński et al. (2019). Neural Text Summarization: A Critical Evaluation | Assignment 2. Information Extraction |
9 (6 Nov) | Sentiment analysis | Z&M chapter 18. Opinion Mining and Sentiment Analysis | |
10 (13 Nov) | Biomedical text mining | Fleuren & Alkema (2015). Application of text mining in the biomedical domain | Exercise: Sentiment analysis tutorial |
11 (20 Nov) | Authorship attribution | Literature for final assignment (3 topics to choose from) | Assignment 3. Sentiment Analysis |
12 (27 Nov) | Industrial Text Mining: guest lecture by TextKernel | Dahlmeier (2017). On the Challenges of Translating NLP Research into Commercial Products | |
13 (4 Dec) | Conclusions | | Final assignment |
The course is assessed by a written exam (50% of the course grade) and practical assignments (50% of the course grade). The practical assignments comprise three smaller assignments (10% each) and one more substantial final assignment (20%). To complete the course, both the grade for the written exam and the average grade for the practical assignments must be 5.5 or higher. A task that is not submitted receives a grade of 0.
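The weighting above can be sketched as a small calculation. This is only an illustration, not an official grading tool: the function name `course_result` is hypothetical, the practical average is assumed to be weighted by the listed percentages (the text does not say whether the average is weighted), and no rounding is applied because none is specified.

```python
from typing import Optional, Sequence, Tuple

# Weights in percent, taken from the assessment description.
EXAM_WEIGHT = 50
SMALL_WEIGHT = 10   # each of the three smaller assignments
FINAL_WEIGHT = 20   # the final assignment
PASS_MARK = 5.5


def course_result(
    exam: float,
    small: Sequence[Optional[float]],
    final: Optional[float],
) -> Tuple[float, float, bool]:
    """Return (course grade, practical average, completed?).

    Use None for a task that was not submitted; it is graded as 0.
    Assumption: the practical average is weighted 10/10/10/20.
    """
    if len(small) != 3:
        raise ValueError("expected exactly three smaller assignments")

    small_grades = [g if g is not None else 0.0 for g in small]
    final_grade = final if final is not None else 0.0

    # Weighted points earned on the practical part (out of 50 * 10).
    practical_points = SMALL_WEIGHT * sum(small_grades) + FINAL_WEIGHT * final_grade
    practical_avg = practical_points / (3 * SMALL_WEIGHT + FINAL_WEIGHT)

    course_grade = (EXAM_WEIGHT * exam + practical_points) / 100

    # Both parts must reach the pass mark separately.
    completed = exam >= PASS_MARK and practical_avg >= PASS_MARK
    return course_grade, practical_avg, completed


if __name__ == "__main__":
    # Example: exam 7.0, one smaller assignment not submitted.
    print(course_result(7.0, [8.0, 6.0, None], 7.0))
```

Note that under these assumptions a high overall grade does not compensate for an exam grade below 5.5, since the two thresholds are checked independently.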
Earlier editions of this course
Link to the course page for this course in 2018-2019