Text Mining 2018-2019
Course schedule
Each course week consists of a lecture, assigned literature, and either a practical exercise (tutorial style) or a hand-in assignment.
We use two textbooks in this course:
- Z&M: ChengXiang Zhai and Sean Massung, Text Data Management and Analysis (First edition), 2016
- J&M: Dan Jurafsky and James H. Martin, Speech and Language Processing (3rd ed. draft), 2018
Week | Lecture | Literature | Exercise / assignment |
---|---|---|---|
1 | Introduction | Z&M chapter 1. Introduction | |
2 | Text preprocessing | J&M chapter 2. Regular Expressions, Text Normalization, Edit Distance | Exercise: pre-processing noisy OCR'ed data |
3 | Data collection, annotation and evaluation (slides) | Finin et al. (2010). Annotating Named Entities in Twitter Data with Crowdsourcing | Assignment 1. Pre-processing |
4 | Text categorization | J&M chapter 4.1-4.3. Naive Bayes Classification; Z&M chapter 15. Text categorization | Exercise: Text classification tutorial (sklearn) |
5 | Information Retrieval | Z&M chapter 5. Overview of text data access | Assignment 2. Text classification |
6 | Information Extraction | J&M chapter 17. Information Extraction | Exercise: Sequence labelling tutorial (crfsuite) |
7 | Summarization | Z&M chapter 16. Summarization | Assignment 3. Sequence labelling |
8 | Vector semantics | J&M chapter 6. Vector Semantics | Exercise: Word embeddings tutorial |
9 | Sentiment analysis | Z&M chapter 18. Opinion Mining and Sentiment Analysis | Exercise: Sentiment analysis tutorial |
10 | Biomedical text mining | Fleuren & Alkema (2015). Application of text mining in the biomedical domain | Assignment 4. Sentiment Analysis |
11 | Authorship attribution | Literature for final assignment (3 topics to choose from) | |
12 | Industrial Text Mining | Dahlmeier (2017). On the Challenges of Translating NLP Research into Commercial Products | Final assignment |
The course is assessed through a written exam (60% of the course grade) and practical assignments (40% of the course grade). The practical assignments comprise four small tasks (5% each) and one more substantial report (20%). To complete the course, both the grade for the written exam and the average grade for the practical assignments must be 5.5 or higher. A task that is not submitted receives a grade of 0.
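The weighting above can be sketched as a small calculation. This is only an illustration, not an official grading tool: the function name `course_result` is hypothetical, grades are assumed to be on a 10-point scale, and the "average grade for the practical assignments" is assumed to be weighted by each component's share of the course grade (5% per task, 20% for the report), which the rules above do not state explicitly.

```python
from typing import Optional, Sequence, Tuple

EXAM_WEIGHT = 0.60        # written exam: 60% of the course grade
SMALL_TASK_WEIGHT = 0.05  # each of the four small tasks: 5%
REPORT_WEIGHT = 0.20      # final report: 20%
PASS_MARK = 5.5           # minimum for both the exam and the practical average


def course_result(
    exam: float,
    small_tasks: Sequence[Optional[float]],
    report: float,
) -> Tuple[float, float, bool]:
    """Return (course grade, practical average, completed).

    `small_tasks` holds the four small-task grades; None marks a task
    that was not submitted, which counts as 0.
    """
    if len(small_tasks) != 4:
        raise ValueError("expected exactly four small-task grades")

    # A task that is not submitted is graded 0.
    tasks = [0.0 if grade is None else grade for grade in small_tasks]

    # Contribution of the practical work to the course grade (out of 10).
    practical_points = SMALL_TASK_WEIGHT * sum(tasks) + REPORT_WEIGHT * report

    # ASSUMPTION: the practical average is weighted by the same shares.
    practical_weight = 4 * SMALL_TASK_WEIGHT + REPORT_WEIGHT
    practical_average = practical_points / practical_weight

    course_grade = EXAM_WEIGHT * exam + practical_points
    completed = exam >= PASS_MARK and practical_average >= PASS_MARK
    return course_grade, practical_average, completed


if __name__ == "__main__":
    # Example: the third small task was not submitted.
    grade, practical, ok = course_result(7.0, [8.0, 6.0, None, 7.0], 7.5)
    print(f"course grade {grade:.2f}, practical average {practical:.2f}, completed: {ok}")
```

Under these assumptions a single missing small task lowers the course grade by at most 0.5 points, while an exam grade below 5.5 prevents completion regardless of the practical grades.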