Course information

Computational Linguistics, Spring 2020

(CAS LX 496, GRS LX 796, MET LX 596)

Meeting time: T 3:30 - 6:15 (meets in EPIC 205)

Instructor: Paul Hagstrom
Email: hagstrom@bu.edu
Office location: 621 Comm. Ave., Rm. 105
Office hours: MRF 12-1

Instructor: Wayne Snyder
Email: snyder@bu.edu
Office location: 111 Cummington Mall, Room MCS 290
Office hours: M 5 - 7 (MCS 210); T 7:30 - 10:30 (Rich Hall Cinema Room); R 3 - 6 (MCS 210)

Course Description: Introduction to computational techniques to explore linguistic models and test empirical claims. Serves as an introduction to algorithms, data structures, and tool libraries. Topics include tagging and classification, parsing models, meaning representation, corpus creation, and information extraction.

Prerequisites:

  • CAS LX 250 (Introduction to Linguistics)
  • CAS CS 112 (Introduction to Computer Science 2)

Antiprerequisites:

  • CAS LX 394/GRS LX 694 (Introduction to Programming for Computational Linguistics). (Students who have taken the more basic version of the course are not eligible for this one.)

Course Synopsis: The quantity of language data available for natural language analysis has greatly increased in recent years, as have computing power and tool development. Doing large-scale language analysis to address theoretical issues in Linguistics is now within reach, and interest in natural language processing is similarly increasing for use in human-computer interfaces in industry. The purpose of this course is to gain facility with some of the powerful language analysis tools that are available for doing these kinds of large-scale analyses and to become familiar with the types of problems they are best suited to address. The course introduces the concepts behind, and interfaces to, some natural language toolkits that allow for characterizing and classifying texts, parsing syntactic structure, extracting and modeling information, processing basic logical relations, and performing basic natural language understanding. By the end of the course, students should have the background and confidence to use these techniques in addressing further linguistic questions that may arise beyond the class. The projects in the class will consist of some basic programming exercises, some mini-projects (generating poetry, determining authorship, development of child English), and a more extensive final project proposing a problem to investigate, a method for studying it, and a written paper reporting the results and implications.

Instructional Format

The course meets once a week on Tuesdays; the class meetings will be a mixture of introduction of the week’s concepts and ungraded exercises and readings. Weekly homework will be assigned.

Course materials

The primary textbook for the course is Steven Bird, Ewan Klein, and Edward Loper (2016). Natural Language Processing with Python (Python 3, NLTK 3 version). Other readings or lecture notes may be assigned from time to time.

Course web site

The primary web site for the course is at https://bucomplx.github.io/lx496s20/ – this is where the current schedule, handouts, readings, assignments, and announcements will be posted.

Grading and Discussion

We will use Piazza for class discussion and questions, and Gradescope for homework submission and grading. You will receive an email with details about creating an account and logging on.

Assignments and grading criteria

This course can be taken either at the undergraduate or graduate level. For students taking the course at the graduate level, the homework assignments will contain some extra components, the topic of the final project will be proposed by the student, and the resulting project will be larger. The contributions to the final grade are summarized in the table below.

The course grade is based on four things.

The largest of these is the score for the weekly homework assignments, which accounts for 50% (undergraduate) or 40% (graduate) of the course grade. The homework assignments vary; some are exercises to build familiarity with the programming environment, while others involve multi-step projects building from a question to an analysis, an answer, and a discussion. In the final evaluation, the lowest-scoring homework assignment will be dropped from the computation. Homework will generally be weekly for the first two thirds of the course; in the latter third the final project will occupy this time.

The second category is class participation, which accounts for 15% of the course grade. Participation includes attendance, being prepared to provide answers to exercises in discussion, and general engagement.

The third category is the midterm exam (the same exam for graduate students and undergraduate students), which accounts for 15% of the course grade. It will largely be a test of facility with Python and with the natural language framework, and will take place at about the middle of the semester.

The final category is the final project, which accounts for 20% (undergraduate) or 30% (graduate) of the grade overall. Students taking the undergraduate course will have a project topic outlined for them to work with, while students taking the graduate course will be responsible for finding and proposing a suitable topic. A progress report in the form of a written methodology section will be due prior to the finished project. The finished project will take the form of a research paper, 10 pages (undergraduate) to 20 pages (graduate) in length.

                              Undergraduate (LX 496)   Graduate (LX 796)
Homework                      50%                      40%
Participation                 15%                      15%
Midterm                       15%                      15%
Final project: Proposal       --                       10%
Final project: Methodology    10%                      5%
Final project: Final paper    10%                      15%
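
As an illustration only, the weighting in the table can be sketched in Python. The weights are copied from the table; everything else is an assumption made for the sketch and not an official grading formula. In particular, the function names are hypothetical, every component is assumed to be scored from 0 to 100, and the homework assignments that remain after dropping the lowest are assumed to count equally.

```python
# Weights copied from the grade table above (percent of the course grade).
WEIGHTS = {
    "undergraduate": {"homework": 50, "participation": 15, "midterm": 15,
                      "proposal": 0, "methodology": 10, "final_paper": 10},
    "graduate": {"homework": 40, "participation": 15, "midterm": 15,
                 "proposal": 10, "methodology": 5, "final_paper": 15},
}


def homework_score(scores):
    """Average homework scores (0-100) after dropping the single lowest one.

    Assumption: the remaining assignments are weighted equally.
    """
    if len(scores) < 2:
        raise ValueError("need at least two homework scores to drop one")
    kept = sorted(scores)[1:]  # drop the lowest-scoring assignment
    return sum(kept) / len(kept)


def course_grade(level, homework, components):
    """Combine component scores (each 0-100) into a weighted percentage.

    `components` maps the non-homework category names to scores.
    A category missing from `components` is counted as a score of 0.
    """
    weights = WEIGHTS[level]
    scores = dict(components, homework=homework_score(homework))
    return sum(w * scores.get(name, 0) for name, w in weights.items()) / 100
```

For example, an undergraduate with homework scores of 60, 80, and 100 has the 60 dropped, giving a homework average of 90 that then counts for half of the course grade under these assumptions.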

Resources

If you have questions about course material or homework, take advantage of office hours (listed at the top of the syllabus). If you are a student with a disability or believe you might have a disability that requires accommodations, please contact the Office for Disability Services (ODS) at (617) 353-3658 to coordinate any reasonable accommodation requests. ODS is located at 19 Deerfield Street on the second floor. Generally, doing the exercises and homework, participating in class, and asking about questions that arise in the material will be a reliable path to succeeding in this course.

Community of learning: Class and University policies

Participants in the class are all responsible for ensuring a positive learning environment, respecting other participants, and avoiding disruptive activities. Attendance is expected at all class meetings, and repeated failure to attend class will (in addition to making the out-of-class work more difficult to accomplish) decrease the participation portion of the course grade. Absence for religious reasons is allowed as outlined in the BU policy: https://www.bu.edu/academics/policies/absence-for-religious-reasons/. If it is known in advance that you will be unable to attend one of the class meetings, the instructor should be notified so that alternative arrangements can be made if needed. In general, homework is not accepted late unless this is arranged in advance of the due date.

Academic Conduct

It is imperative that the CAS Academic Conduct Code is adhered to, along with any applicable graduate policies. More specifically, it is allowed (even encouraged) to work in study groups or to talk through problems, but each student must write up and hand in assignments individually. No group documents should be created or circulated (and certainly not handed in). When you have worked with a group, you are encouraged to name the members of the group on your assignment, but this is not a requirement. The basic rule is that work you hand in as your own must be your own, not derived from the work of others, and where any work of others is involved, it must be properly attributed. If there are any questions about the policy, we will be happy to answer them.

Hub Learning Outcomes and course-specific objectives

This course can be used to satisfy Quantitative Reasoning II, Digital Media Expression, and Toolkit/Research and Information Literacy units for the BU Hub. As a result of having taken this course, students will…

…frame and solve complex problems using quantitative tools, such as analytical, statistical, or computational methods. This outcome is central to the course objectives. Essentially all of the problems we address in this course concern the quantitative characterization, classification, and analysis of natural language corpora using the computational tools provided by the Natural Language Toolkit and related packages.

…apply quantitative tools in diverse settings to answer discipline-specific questions or to engage societal questions and debates. There are several Linguistics-specific questions and issues that can be studied through the use of large-scale corpus analysis. We will address questions about the relative rates of development of morphology in child language acquisition across languages (using the CHILDES corpora), approaches to automated information extraction and categorization from natural language texts, and problems faced by attempts to model natural language understanding in automated systems. While the processing and evaluation of natural language corpora is highly quantitative in nature, it connects to theoretical issues in Linguistics as well.

…formulate and test an argument by marshaling and analyzing quantitative evidence. In several cases, the projects undertaken will involve deciding between hypotheses on the basis of the results of corpus analysis. One such example is the use of the CHILDES corpus to test hypotheses concerning correlations in development between language features such as verbal morphology and the presence and marking of subjects. For that project, students will be presented with the basic theoretical concepts, specify the predictions that follow, and determine a means of testing whether they are borne out in the corpora at their disposal.

…communicate quantitative information symbolically, visually, numerically, or verbally. As part of the characterization of texts, it will regularly be necessary to condense and summarize results into tables and graphs that represent the results comprehensibly and concisely. This comes up in nearly every activity in the course, and students will learn a number of ways to generate graphical and tabular representations from their Python programs, as well as through the use of external specialized programs.
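
As a rough illustration of what a tabular summary produced from a Python program might look like, the sketch below counts word frequencies and prints them as an aligned plain-text table. It is not taken from the course materials: it uses only the Python standard library rather than the Natural Language Toolkit, the function name frequency_table is hypothetical, and the input is a toy sentence rather than a real corpus.

```python
from collections import Counter


def frequency_table(tokens, top=5):
    """Return a plain-text table of the `top` most frequent tokens.

    Each row shows the token, its raw count, and its share of all tokens.
    """
    counts = Counter(tokens)
    total = sum(counts.values())
    rows = counts.most_common(top)
    # Make the first column wide enough for the longest token and the header.
    width = max([len(tok) for tok, _ in rows] + [len("token")])
    lines = [f"{'token':<{width}}  {'count':>5}  {'share':>6}"]
    for tok, n in rows:
        lines.append(f"{tok:<{width}}  {n:>5}  {n / total:>6.1%}")
    return "\n".join(lines)


# Toy input; in practice the tokens would come from a corpus or a tokenizer.
sample = "the cat sat on the mat and the dog sat too".split()
print(frequency_table(sample, top=3))
```

Reporting a share alongside the raw count is one simple way to keep texts of different lengths comparable.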

…recognize and articulate the capacity and limitations of quantitative methods and the risk of using them improperly. A recurring theme in the analysis of large corpora is the need to restrict attention to comparable and relevant subsets, and to recognize and avoid analyzing uninformative parts of the data that could still skew results. There are some clear predictions that theoretical analyses make, for example in the connection between context and grammatical form, that are not well suited to testing using any kind of automated corpus analysis presently within reach, due to a need for modeling fine-grained evolution of the developing discourse.

…be able to search for, select, and use a range of publicly available and discipline-specific information sources ethically and strategically to address research questions. This course is largely about methods for analyzing large corpora. There is a small set of corpora that we start with in common, but there are many others available on the internet, and methods of locating and processing these texts will be one of the main topics of the course. The CHILDES database has a well-defined set of policies on the corpora it contains, which serves as a model for the broader questions of use, attribution, and re-distribution of data.

…demonstrate understanding of the overall research process and its component parts, and be able to formulate good research questions or hypotheses, gather and analyze information, and critique, interpret, and communicate findings. The themes of conducting a research project are developed in pieces throughout the course, with the final project providing students the opportunity to pull the pieces together into a larger and coherent research project. The final project occurs in a small number of stages to provide feedback as it is built up from concept to research question, to selection of method and corpus, and culminating in a written paper with a discussion of how the findings bear on the initial hypotheses.