Ling 109: Linguistic Data Science

1 Course Information

Lecture times 12:30 - 1:50 pm
Lecture Location SSL 155
Canvas site

2 Instructor Information

Instructor Richard Futrell
Instructor's office SSPB 2215
Instructor's office hours Mondays 4-5pm

3 Course Description

This course teaches how to perform basic quantitative analysis of text data using Python and how to communicate the results of such analyses. Text data includes corpora of books, transcripts of spoken language, online reviews, and social media feeds. Students will complete exercises in Jupyter notebooks in which they will implement basic manipulation of text data, implement simple machine learning algorithms (classifiers and language models), apply them to bodies of text, and learn how to visualize and communicate the results using Python visualization tools such as matplotlib and seaborn. The goal of the class is to develop students' skills in two areas: (1) the practical programming required to analyze text data in Python, and (2) the communication skills needed to draw conclusions from algorithmic results and present them in a way that is both precise and intuitive. A secondary goal is to develop students' knowledge of what kinds of analyses are possible given the current landscape of Natural Language Processing (NLP) technologies.
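To give a flavor of the "basic manipulation of text data" mentioned above, here is a minimal sketch that uses only the Python standard library. It is an illustration and is not taken from the course notebooks: the function names `tokenize`, `bag_of_words`, and `bigrams`, the crude tokenization rule, and the sample sentence are all assumptions made for this example.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and pull out runs of letters/apostrophes as tokens.

    This is a deliberately crude rule chosen for illustration only.
    """
    return re.findall(r"[a-z']+", text.lower())


def bag_of_words(text):
    """Count how often each token occurs, ignoring word order."""
    return Counter(tokenize(text))


def bigrams(tokens):
    """Return adjacent token pairs (n-grams with n = 2)."""
    return list(zip(tokens, tokens[1:]))


# Hypothetical sample text, not a course dataset.
sample = "The cat sat on the mat. The mat was flat."
counts = bag_of_words(sample)
print(counts.most_common(2))        # [('the', 3), ('mat', 2)]
print(bigrams(tokenize(sample))[:2])  # [('the', 'cat'), ('cat', 'sat')]
```

Counts like these are the raw material for the kinds of analyses listed in the schedule, such as bags of words and n-grams.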

4 Course Format

Class time will be spent on lectures and guided exercises. Homework will consist of mini-projects involving different bodies of text: students will be asked to perform their own analyses and draw their own conclusions.

Students are required to bring laptops to class in order to work together on programming exercises.

Students are encouraged to collaborate on homework.

Readings are optional but strongly recommended. They are short and you will find them immediately useful.

There are no exams or tests.

5 Intended Audience

This course is intended for advanced undergraduates who already know the basics of Python programming, as demonstrated by passing ICS 033 or a similar course. The ability to program in Python is a prerequisite for enrollment in the course. I do not assume or require any background in statistics or math beyond high school algebra.

Introductory survey.

Information on UCI's Language Science major and minor.

6 Schedule (subject to modification)

Day   Topic                                       Deadlines           Readings                               Notebook
4/2   Getting started with
4/4   Text Wrangling                                                  Tokenization                           Notebook 1
4/9   Bags of Words                                                   Generating n-grams                     Notebook 2
4/11  N-grams                                                         Collocations                           Notebook 3
4/16  Association Measures                                            Tf-idf                                 Notebook 4
4/18  Document Similarity                                             Text similarity metrics                Notebook 5
4/23  Project Time                                                    -
4/25  Linear Classifiers                          Mini-project 1 due  Naive Bayes                            Notebook 6
4/30  (No class)                                                      Precision and recall
5/2   Naive Bayes and Evaluation                                      -                                      Notebook 7
5/7   More Classifiers                                                Logistic regression                    Notebook 8
5/9   Feature Engineering                                                                                    Notebook 9
5/14  Visualization and Dimensionality Reduction                      Visualizing high-dimensional datasets
5/16  Project Time
5/21  Word embeddings I                           Mini-project 2 due  Word embeddings                        Notebook 10
5/23  Word embeddings II                                              Historical word embeddings             Notebook 11
5/28  Sentence embeddings                                                                                    Notebook 12
5/30  Topic models                                                    Topic Modeling with Gensim             Notebook 13
6/4   Project Time (Prof. Futrell out of town)
6/6   Project Time
6/11  -                                           Mini-project 3 due

7 Requirements & Grading

  • Grade breakdown

    Work Grade percentage
    Mini-projects 80%
    Participation 20%

    All mini-projects are equally weighted.

  • Mini-projects

    In each mini-project, you will be given a dataset and some analytical tools. Your job will be to apply those tools to discover something about the dataset and to produce a writeup that explains what you found and how you found it. The writeup will take the form of a Jupyter notebook.

    You are highly encouraged to work in groups of up to 3 on mini-projects. Groups will turn in joint writeups. In a joint writeup, there should be a brief statement saying roughly who contributed what to the end product.

    Each mini-project is due at the beginning of class on the day indicated in the schedule.

    Mini-project 1

    Prof. Futrell's solution to Mini-project 1

    Mini-project 2

  • Participation

    To receive full credit for participation, you must attend every class session and participate actively in the group exercises. If you cannot make it to class, inform the instructor to find out what make-up work you can do to recover the participation credit.

  • Mapping of class score to letter grade

    I guarantee minimum grades based on these thresholds:

    Threshold Guaranteed minimum grade
    >= 90% A
    >= 80% B
    >= 70% C
    >= 60% D
    < 60% F

    So, for example, a score of 90.0001% guarantees you at least an A-. It is unlikely that I will grade the course on a curve, but if I do, you could end up with a higher grade because of it.
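The grade arithmetic described in this section can be sketched in a few lines of Python. This is an illustration under stated assumptions, not an official grade calculator: the function names `course_score` and `guaranteed_minimum_grade` are made up for this example, every score is assumed to be a percentage between 0 and 100, and plus/minus modifiers and any curve are not modeled.

```python
def course_score(mini_project_scores, participation):
    """Combine scores using the 80% / 20% breakdown given above.

    Assumes each mini-project score and the participation score are
    percentages (0-100). Mini-projects are equally weighted, so they
    are simply averaged.
    """
    mini_average = sum(mini_project_scores) / len(mini_project_scores)
    return 0.8 * mini_average + 0.2 * participation


def guaranteed_minimum_grade(score):
    """Map a course score to the guaranteed minimum letter from the table.

    Returns only the letter range (A, B, C, D, F); plus/minus modifiers
    are not modeled here.
    """
    for threshold, letter in [(90, "A"), (80, "B"), (70, "C"), (60, "D")]:
        if score >= threshold:
            return letter
    return "F"


# Hypothetical scores, for illustration only.
score = course_score([90, 80, 100], participation=100)
print(round(score, 2), guaranteed_minimum_grade(score))
```

With these hypothetical inputs, the mini-project average is 90, so the course score is 0.8 × 90 + 0.2 × 100 = 92, which falls in the A range.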

8 Academic Integrity

We will be adhering fully to the standards and practices set out in UCI's policy on academic integrity. Any attempted academic misconduct or plagiarism will be met with consequences as per the university regulations.

9 Disability

Any student requesting academic accommodations based on a disability is required to apply with the Disability Service Center at UCI. For more information, please visit

Author: Richard Futrell

Created: 2019-05-30 Thu 09:04