Lsci 109: Linguistic Data Science
1 Course Information
Lecture times | TR 2-3:20pm |
Lecture Location | DBH 1431 |
Syllabus | http://socsci.uci.edu/~rfutrell/teaching/lsci109-w2020 |
Canvas site | https://canvas.eee.uci.edu/courses/23462 |
2 Instructor Information
Instructor | Richard Futrell (rfutrell@uci.edu) |
Instructor's office | SSPB 2215 |
Instructor's office hours | By appointment |
3 Course Description
This course teaches how to perform basic quantitative analysis of text data using Python and how to communicate the results of such analyses. Text data includes corpora of books, transcripts of spoken language, online reviews, and social media feeds. Students will complete exercises in Jupyter notebooks in which they will implement basic manipulation of text data, implement simple machine learning algorithms (classifiers and language models), deploy them to analyze bodies of text, and learn how to visualize and communicate the results of these analyses using Python visualization tools such as matplotlib and seaborn. The goal of the class is to develop students' skills in two areas: (1) the practical programming required to analyze text data in Python, and (2) the communication skills needed to draw conclusions from algorithmic results and present them in a way that is both precise and intuitive. A secondary goal is to develop students' knowledge of what kinds of analyses are possible given the current landscape of Natural Language Processing (NLP) technologies.
4 Course Format
Class time will be spent on lectures and guided exercises. Homework will consist of mini-projects involving different bodies of text: students will be asked to perform their own analyses and draw their own conclusions.
Students are required to bring laptops to class in order to work together on pair-programming exercises.
Students are encouraged to collaborate on homework.
Readings are optional but strongly recommended. They are short and you will find them immediately useful.
There are no exams or tests.
5 Intended Audience
This course is intended for advanced undergraduates who already know the basics of Python programming, as demonstrated by passing ICS 033 or similar. The ability to program in Python is a prerequisite for enrollment in the course. I do not assume or require any background in statistics or math beyond high school algebra.
6 Schedule (subject to modification)
Day | Topic | Deadlines | Readings | Notebook |
---|---|---|---|---|
1/7 | Introduction | | | |
1/9 | Text Wrangling | | Tokenization | Notebook 1 |
1/14 | Bags of Words | | Generating n-grams | Notebook 2 |
1/16 | UNKs and Stop Words | | | Notebook 3 |
1/21 | N-Grams and Association Measures | | Tf-idf | Notebook 4 |
1/23 | Tf-idf and Topic models | | Topic Modeling with Gensim | Notebook 5 |
1/28 | Project Time | | | |
1/30 | Linear Classifiers | | | Notebook 6 |
2/4 | Naive Bayes | Mini-project 1 due | Naive Bayes | Notebook 7 |
2/6 | Evaluation metrics | | Precision and recall | Notebook 8 |
2/11 | More Classifiers | | Logistic regression | Notebook 9 |
2/13 | Feature Engineering | | | Notebook 10 |
2/18 | Visualization and Dimensionality Reduction | | Visualizing high-dimensional datasets | Notebook 11 |
2/20 | Project Time | | | |
2/25 | Word embeddings I | | Word embeddings | Notebook 12 |
2/27 | Word embeddings II | Mini-project 2 due | Historical word embeddings | Notebook 13 |
3/3 | Contextual embeddings | | | Notebook 14 |
3/5 | Multiclass Classifiers and Modern NLP | | Text classification with Transformers | Notebook 15 |
3/10 | Project Time | | | |
3/12 | Project Time (Prof. Futrell out of town) | | | |
3/17 | - | Mini-project 3 due | | |
7 Requirements & Grading
Grade breakdown
Work | Grade percentage |
---|---|
Mini-projects | 80% |
Participation | 20% |

All mini-projects are equally weighted.
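As a rough illustration of how the breakdown combines into a single class score, here is a minimal Python sketch. The function name `class_score`, the 0-100 scale for each component, and the example numbers are assumptions made for illustration; they are not part of the official grading procedure.

```python
def class_score(mini_project_scores, participation_score):
    """Combine component scores (each assumed to be on a 0-100 scale).

    Mini-projects count for 80% and participation for 20%.
    Mini-projects are equally weighted, so a plain mean is used.
    """
    mini_project_mean = sum(mini_project_scores) / len(mini_project_scores)
    return 0.80 * mini_project_mean + 0.20 * participation_score


# Hypothetical scores for the three mini-projects and for participation.
score = class_score([92, 85, 88], participation_score=100)
print(f"Class score: {score:.2f}%")
```

Because the mini-projects are equally weighted, each of the three mini-projects in the schedule contributes about 26.7% of the class score under this sketch.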
Mini-projects
In each mini-project, you will be given a dataset and some analytical tools. Your job is to apply those tools to discover something about the dataset and to produce a writeup that explains what you found and how you found it. The writeup will be in the form of a Jupyter or Colab notebook.
You are highly encouraged to work in groups of up to 3 on mini-projects. Groups will turn in joint writeups. In a joint writeup, there should be a brief statement saying roughly who contributed what to the end product.
Each mini-project is due at the beginning of class on the day indicated in the schedule. The final mini-project is due at 2pm on 3/17.
Participation
To receive full credit for participation, you must attend every class session and participate actively in the group exercises. If you cannot make it to class, inform the instructor to find out what make-up work you can do to recover the participation credit.
Mapping of class score to letter grade
I guarantee minimum grades based on these thresholds:
Threshold | Guaranteed minimum grade |
---|---|
>= 90% | A |
>= 80% | B |
>= 70% | C |
>= 60% | D |
< 60% | F |

So for example a score of 90.0001% guarantees you an A-. It is unlikely that I will grade the course on a curve, but if I do, you could end up with a higher grade due to the curve.
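The threshold lookup can be sketched in a few lines of Python. This is only an illustration: the function name `guaranteed_minimum_grade` is hypothetical, the sketch returns the base letter for each threshold without modeling plus or minus modifiers, and it ignores any curve.

```python
def guaranteed_minimum_grade(score_percent):
    """Return the base letter guaranteed by a class score in percent.

    Plus/minus modifiers and any curve are not modeled here.
    """
    # Thresholds are checked from highest to lowest; the first match wins.
    thresholds = [(90, "A"), (80, "B"), (70, "C"), (60, "D")]
    for cutoff, letter in thresholds:
        if score_percent >= cutoff:
            return letter
    return "F"


# Hypothetical class scores.
for example_score in (90.0001, 79.9, 59.5):
    print(example_score, guaranteed_minimum_grade(example_score))
```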
8 Academic Integrity
We will adhere fully to the standards and practices set out in UCI's policy on academic integrity. Any attempt at academic misconduct or plagiarism will be met with consequences as per university regulations.
9 Disability
Any student requesting academic accommodations based on a disability is required to apply to the Disability Services Center at UCI. For more information, please visit http://disability.uci.edu/.