Computational Linguistics, Spring 2021

(CAS LX 394, GRS LX 694)

Meeting time TR 12:30 - 1:45 (Zoom)
Instructor Paul Hagstrom
Email hagstrom@bu.edu
Office Location Kind of irrelevant.
Office Hours M 3-4, T 11-12, R 5-6

Course Description: Introduction to computational techniques to explore linguistic models and test empirical claims. Serves as an introduction to algorithms, data structures, and tool libraries. Topics include tagging and classification, parsing models, meaning representation, corpus creation, and information extraction.

Spring 2021 note on “tracks”: This course is usually taught as a low-level course, which serves both as an introduction to programming in Python and as an introduction to Natural Language Processing and Computational Linguistics. Due to various circumstances, it will also be possible this semester to opt for a more advanced version of the course that assumes a programming background. The overall topics will largely be the same, but for people who already have facility with programming, we will approach some topics in more detail and depth. Each student can opt into whichever “track” they feel comfortable with, and it is possible to switch later and to monitor what is happening in both tracks along the way. To accommodate this structure, the plan is to cover common topics and concepts during the first class of each week, and then to split the second class of the week so that the first half hour is generally a review of programming concepts for people who do not have a background in programming, and the remainder is either common to both groups or, on occasion, mainly aimed at the group with more programming background. Again, everyone is perfectly welcome the whole time, regardless of which group they opt into. Later in the semester, depending on how well this works, we may merge back together once everyone has some basic programming/Python familiarity.

Prerequisites:

  • CAS LX 250 (Introduction to Linguistics)
  • Those with CAS CS 112 (Introduction to Computer Science 2) or equivalent likely should consider themselves to be in the more advanced track.

Antiprerequisites:

  • If you have taken CAS LX 496/GRS LX 796/MET LX 596 (Computational Linguistics), that is redundant with this course.

Course Synopsis: The quantity of language data available for natural language analysis has greatly increased in recent years, as have computing power and tool development. Doing large-scale language analysis to address theoretical issues in Linguistics is now within reach, and interest in natural language processing is similarly increasing for use in human-computer interfaces in industry. The purpose of this course is to gain facility with some of the powerful language analysis tools that are available for doing these kinds of large-scale analyses and to become familiar with the types of problems they are best suited to address. The course introduces the concepts and interfaces to some natural language toolkits that allow for characterizing and classifying texts, parsing syntactic structure, extracting and modeling information, processing basic logical relations, and basic natural language understanding. By the end of the course, students should have the background, and confidence, to use these techniques in addressing further linguistic questions that may arise beyond the class. The projects in the class will consist of some basic programming exercises, some mini-projects (generating poetry, determining authorship, analyzing the development of child English), and a more extensive final project proposing a problem to investigate, a method for studying it, and a written paper reporting the results and implications.

Instructional Format

The course meets twice a week (on Tuesdays and Thursdays). The Tuesday meetings will largely introduce general concepts, and the Thursday meetings will tend to be more interactive/lab-type work. Weekly homework will be assigned.

Course materials

The primary textbook for the course is Steven Bird, Ewan Klein, and Edward Loper (2016). Natural Language Processing with Python (Python 3, NLTK 3 version). Other readings or lecture notes may be assigned from time to time.

Course web site

The primary web site for the course is at https://bucomplx.github.io/lx394s21/ – this is where the current schedule, handouts, readings, assignments, and announcements will be posted.

Grading and Discussion

We will use Slack for class discussion and questions, and Gradescope for homework submission and grading. You will receive an email with details about creating an account and logging on.

Assignments and grading criteria

This course can be taken either at the undergraduate or graduate level. For students taking the course at the graduate level, the homework assignments will contain some extra components, the topic of the final project will be proposed by the student, and the resulting project will be larger. The contributions to the final grade are summarized in the table below.

The course grade is based on three main things: weekly homework assignments, the small midterm group project, and the overall course project. The homework assignments will be exercises to build familiarity with the programming environment, or multi-step mini-projects building from a question to an analysis and answer and discussion. The homework assignments will mostly be concentrated in the first two-thirds of the course, with the remaining time focused more on projects. The midterm project will be done in small groups, and serves as a kind of “dry run” for the larger independent course project. The course project will involve proposing (graduate students) or selecting (undergraduate students) a project topic, working through the coding and data analysis, writing up a paper, and briefly presenting (graduate students) the results at the end of the semester. There are no exams (midterm, final) for this course.

                              Undergrad (LX 394)   Grad (LX 694)
Homework                      50%                  45%
Small group project           20%                  15%
Final project proposal        n/a                  10%
Final project: results        10%                  10%
Final project: paper          15%                  15%
Final project: presentation   n/a                  5%

Final course project

The final course project accounts for 30% (undergraduate) or 40% (graduate) of the grade overall, divided into several components. The proposal (graduate students) outlines the question to answer and the anticipated plan of approach. The “results” component is largely a status report, a sketch of what you’ve tried, what you’ve found, what’s left to do. Ideally this would be the data you will present in your paper, but without the prose you will be adding for the paper. The paper is a write-up of the project, which describes the problem, the approach, the results, the implications, and possible further directions the project could go. The paper should be approximately 10 pages (undergraduate) or 20 pages (graduate) of prose (not counting code, charts, tables, etc.). The presentation (graduate students) is a short (10 minute) presentation of the project to the class on one of the last Tuesdays of the semester.

Resources

If you have questions about course material or homework, take advantage of office hours (listed at the top of the syllabus). If you are a student with a disability or believe you might have a disability that requires accommodations, please contact the Office for Disability Services (ODS) at (617) 353-3658 to coordinate any reasonable accommodation requests. ODS is located at 19 Deerfield Street on the second floor. Generally, doing the exercises and homework, participating in class, and asking about questions that arise in the material will be a reliable path to succeeding in this course.

Community of learning: Class and University policies

Participants in the class are all responsible for ensuring a positive learning environment, respecting other participants, and avoiding disruptive activities. Attendance is expected at all class meetings, and repeated failure to attend class will (in addition to making the out-of-class work more difficult to accomplish) decrease the participation portion of the course grade. Absence for religious reasons is allowed as outlined in the BU policy: https://www.bu.edu/academics/policies/absence-for-religious-reasons/ – if it is known in advance that you will be unable to attend one of the class meetings, the instructor should be notified so that alternative arrangements can be made if needed. In general, homework is not accepted late unless this is arranged in advance of the due date.

Academic Conduct

It is imperative that the CAS Academic Conduct Code is adhered to, along with any applicable graduate policies. More specifically, it is allowed (even encouraged) to work in study groups or to talk through problems, but each student must write up and hand in assignments individually. No group documents should be created or circulated (and certainly not handed in). When you have worked with a group, it is encouraged to name the members in the group on your assignment, but this is not a requirement. The basic rule is that work you hand in as your own must be your own, not derived from the work of others – and where any work of others is involved, it is properly attributed. If there are any questions about the policy, we will be happy to answer them.

Hub Learning Outcomes and course-specific objectives

This course can be used to satisfy Quantitative Reasoning II, Digital Media Expression, and Toolkit/Research and Information Literacy units for the BU Hub. As a result of having taken this course, students will…

…frame and solve complex problems using quantitative tools, such as analytical, statistical, or computational methods. This outcome is central to the course objectives. Essentially all of the problems we address in this course concern the quantitative characterization, classification, and analysis of natural language corpora using the computational tools provided by the Natural Language Toolkit and related packages.

…apply quantitative tools in diverse settings to answer discipline-specific questions or to engage societal questions and debates. There are several Linguistics-specific questions and issues that can be studied through the use of large-scale corpus analysis. We will address questions about the relative rates of development of morphology in child language acquisition across languages (using the CHILDES corpora), approaches to automated information extraction and categorization from natural language texts, and problems faced by attempts to model natural language understanding in automated systems. While the processing and evaluation of natural language corpora is highly quantitative in nature, it connects to theoretical issues in Linguistics as well.

…formulate and test an argument by marshaling and analyzing quantitative evidence. In several cases, the projects undertaken will involve deciding between hypotheses on the basis of the results of corpus analysis. One such example is the use of the CHILDES corpus to test hypotheses concerning correlations in development between language features such as verbal morphology and the presence and marking of subjects. For that project, students will be presented with the basic theoretical concepts, specify the predictions that follow, and determine a means of testing whether they are borne out in the corpora at their disposal.

…communicate quantitative information symbolically, visually, numerically, or verbally. As part of the characterization of texts, it will regularly be necessary to condense and summarize results into tables and graphs that represent the results comprehensibly and concisely. This comes up in nearly every activity in the course, and students will learn a number of ways to generate graphical and tabular representations from their Python programs, as well as through the use of external specialized programs.

…recognize and articulate the capacity and limitations of quantitative methods and the risk of using them improperly. A recurring theme in the analysis of large corpora is the need to restrict attention to comparable and relevant subsets, to recognize and avoid analyzing uninformative parts of the data that could still skew results. There are some clear predictions that theoretical analyses make, for example in the connection between context and grammatical form, that are not well suited to testing using any kind of automated corpus analysis presently within reach, due to a need for modeling fine-grained evolution of the developing discourse.

…be able to search for, select, and use a range of publicly available and discipline-specific information sources ethically and strategically to address research questions. This course is largely about methods for analyzing large corpora. There is a small set of corpora that we start with in common, but there are many others available on the internet, and methods of locating and processing these texts will be one of the main topics of the course. The CHILDES database has a well-defined set of policies on the corpora it contains, which serves as a model for the broader questions of use, attribution, and re-distribution of data.

…demonstrate understanding of the overall research process and its component parts, and be able to formulate good research questions or hypotheses, gather and analyze information, and critique, interpret, and communicate findings. The themes of conducting a research project are developed in pieces throughout the course, with the final project providing students the opportunity to pull the pieces together into a larger and coherent research project. The final project occurs in a small number of stages to provide feedback as it is built up from concept to research question, to selection of method and corpus, and culminating in a written paper with a discussion of how the findings bear on the initial hypotheses.
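
To give a rough sense of what the quantitative characterization of a text and the tabular summaries mentioned in the outcomes above can look like in Python, here is a minimal sketch. It is not taken from the course materials: the sample text is invented, the function names (tokenize, type_token_ratio, frequency_table, format_table) are hypothetical, and it uses only the Python standard library. NLTK's FreqDist class offers similar counting functionality on real corpora.

```python
import re
from collections import Counter

# Invented sample text, for illustration only (not course data).
SAMPLE = (
    "the cat sat on the mat . the dog sat on the log . "
    "a cat and a dog sat together ."
)


def tokenize(text):
    """Lowercase the text and return its alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def type_token_ratio(tokens):
    """Number of distinct words divided by total number of words."""
    return len(set(tokens)) / len(tokens)


def frequency_table(tokens, n=5):
    """Return the n most frequent words as (word, count, share) rows."""
    counts = Counter(tokens)
    total = len(tokens)
    return [(word, count, count / total) for word, count in counts.most_common(n)]


def format_table(rows):
    """Render the rows as a plain-text table with aligned columns."""
    lines = [f"{'word':<10}{'count':>6}{'share':>8}"]
    for word, count, share in rows:
        lines.append(f"{word:<10}{count:>6}{share:>8.1%}")
    return "\n".join(lines)


tokens = tokenize(SAMPLE)
rows = frequency_table(tokens)
print(f"type-token ratio: {type_token_ratio(tokens):.2f}")
print(format_table(rows))
```

The same pattern (tokenize, count, summarize in a table) scales from a toy string to a full corpus; only the source of the tokens changes.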