Ling 109: Linguistic Data Science
1 Course Information
Lecture times | 12:30 - 1:50 pm |
Lecture Location | SSL 155 |
Syllabus | http://socsci.uci.edu/~rfutrell/teaching/ling109-2019 |
Canvas site | https://canvas.eee.uci.edu/courses/17501 |
2 Instructor Information
Instructor | Richard Futrell (rfutrell@uci.edu) |
Instructor's office | SSPB 2215 |
Instructor's office hours | Mondays 4-5pm |
3 Course Description
This course teaches how to perform basic quantitative analysis of text data using Python and how to communicate the results of such analyses. Text data includes corpora of books, transcripts of spoken language, online reviews, and social media feeds. Students will complete exercises in Jupyter notebooks in which they will implement basic manipulation of text data, implement simple machine learning algorithms (classifiers and language models), and deploy them to analyze bodies of text. They will also learn how to visualize and communicate the results of these analyses using Python visualization tools such as matplotlib and seaborn. The goal of the class is to develop students' skills in two areas: (1) the practical programming required to analyze text data in Python, and (2) the communication skills needed to draw conclusions from algorithmic results and present them in a way that is both precise and intuitive. A secondary goal is to develop students' knowledge of what kinds of analyses are possible given the current landscape of Natural Language Processing (NLP) technologies.
4 Course Format
Class time will be spent on lectures and guided exercises. Homework will consist of mini-projects involving different bodies of text: students will be asked to perform their own analyses and draw their own conclusions.
Students are required to bring laptops to class in order to work together on programming exercises.
Students are encouraged to collaborate on homework.
Readings are optional but strongly recommended. They are short and you will find them immediately useful.
There are no exams or tests.
5 Intended Audience
This course is intended for advanced undergraduates who already know the basics of Python programming, as demonstrated by passing ICS 033 or similar. The ability to program in Python is a prerequisite for enrollment in the course. I do not assume or require any background in statistics or math beyond high school algebra.
6 Schedule (subject to modification)
Day | Topic | Deadlines | Readings | Notebook |
---|---|---|---|---|
4/2 | Getting started with | |||
4/4 | Text Wrangling | | Tokenization | Notebook 1 |
4/9 | Bags of Words | | Generating n-grams | Notebook 2 |
4/11 | N-grams | | Collocations | Notebook 3 |
4/16 | Association Measures | | Tf-idf | Notebook 4 |
4/18 | Document Similarity | | Text similarity metrics | Notebook 5 |
4/23 | Project Time | | - | |
4/25 | Linear Classifiers | Mini-project 1 due | Naive Bayes | Notebook 6 |
4/30 | (No class) | | Precision and recall | |
5/2 | Naive Bayes and Evaluation | | - | Notebook 7 |
5/7 | More Classifiers | | Logistic regression | Notebook 8 |
5/9 | Feature Engineering | | | Notebook 9 |
5/14 | Visualization and Dimensionality Reduction | | Visualizing high-dimensional datasets | |
5/16 | Project Time | | | |
5/21 | Word embeddings I | Mini-project 2 due | Word embeddings | Notebook 10 |
5/23 | Word embeddings II | | Historical word embeddings | Notebook 11 |
5/28 | Sentence embeddings | | | Notebook 12 |
5/30 | Topic models | | Topic Modeling with Gensim | Notebook 13 |
6/4 | Project Time (Prof. Futrell out of town) | | | |
6/6 | Project Time | | | |
6/11 | - | Mini-project 3 due | | |
7 Requirements & Grading
Grade breakdown
Work | Grade percentage |
---|---|
Mini-projects | 80% |
Participation | 20% |

All mini-projects are equally weighted.
Mini-projects
In each mini-project, you will be provided with a dataset and some analytical tools. Your job will be to apply those tools to discover something about the dataset and to produce a writeup that explains what you found and how you found it. The writeup will be in the form of a Jupyter notebook.
You are highly encouraged to work in groups of up to 3 on mini-projects. Groups will turn in joint writeups. In a joint writeup, there should be a brief statement saying roughly who contributed what to the end product.
Each mini-project is due at the beginning of class on the day indicated in the schedule.
Participation
To receive full credit for participation, you must attend every class session and participate actively in the group exercises. If you cannot make it to class, contact the instructor to find out what make-up work you can do to recover the participation credit.
Mapping of class score to letter grade
I guarantee minimum grades based on these thresholds:
Threshold | Guaranteed minimum grade |
---|---|
>= 90% | A |
>= 80% | B |
>= 70% | C |
>= 60% | D |
< 60% | F |

So, for example, a score of 90.0001% guarantees you at least an A-. It is unlikely that I will grade the course on a curve, but if I do, you could end up with a higher grade due to the curve.
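As a rough illustration of how the grade breakdown and the thresholds combine, here is a minimal Python sketch. The function names `class_score` and `guaranteed_minimum_grade` are hypothetical and not part of any course material. The sketch assumes every score is expressed as a percentage from 0 to 100, and it reports only the letter range, since the table does not say how plus and minus grades are assigned.

```python
# Illustrative sketch only: function names and the 0-100 percent scale
# are assumptions, not official course tooling.

def class_score(mini_project_scores, participation):
    """Combine scores using the 80% / 20% weighting.

    mini_project_scores: list of percentages, one per mini-project
                         (all mini-projects are equally weighted).
    participation:       participation percentage.
    """
    mini_project_average = sum(mini_project_scores) / len(mini_project_scores)
    return 0.8 * mini_project_average + 0.2 * participation


def guaranteed_minimum_grade(score):
    """Map a class score to the guaranteed minimum letter range."""
    thresholds = [(90, "A"), (80, "B"), (70, "C"), (60, "D")]
    for cutoff, letter in thresholds:
        if score >= cutoff:
            return letter
    return "F"


if __name__ == "__main__":
    score = class_score([90, 80, 100], participation=100)
    print(round(score, 2), guaranteed_minimum_grade(score))
```

A curve, if one is applied, could only raise the result above what this mapping returns.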
8 Academic Integrity
We will adhere fully to the standards and practices set out in UCI's policy on academic integrity. Any attempt at academic misconduct or plagiarism will be met with consequences as per university regulations.
9 Disability
Any student requesting academic accommodations based on a disability is required to apply with the Disability Service Center at UCI. For more information, please visit http://disability.uci.edu/.