Syllabus and Setup


This class is an introduction to data cleaning, analysis, and visualization. We will walk you through the process as we analyze real-world datasets. Each day, we will spend the first 30 minutes introducing the day's concepts, and the rest of the class will be exercises. We have written a daily walkthrough that you will read and program through in class, and we will be available to help.

This is our first time teaching this course, and we'll be learning as much as you are. Don't hesitate to ask us to change or improve something; we'll be grateful.


We assume you have a working knowledge of Python (6.01) and are willing to write code. Most of the code you interact with will come with an example that you can modify. You should need to write little code from scratch, except for programs you're inspired to write!

What we will teach

We will teach the basics of data analysis through concrete examples. All of your programming will be done in Python. The schedule is as follows:

What we will not teach

Programming Environment (Important!)

Before the class, please set up the environment. You will need to install some software and packages and download some datasets to get started.

We assume that you are developing in a Unix-like environment and are familiar with the common commands (e.g., less, man). If you are a Windows user, we assume you are using Cygwin, but you are otherwise on your own.

Tools and Libraries

In this class, you will need to install a number of tools. The major ones are:

We will also require a number of Python modules:

For convenience, Enthought provides numpy, scipy, and matplotlib in a single installable package. Many students who had trouble installing these modules separately were able to install the Enthought package instead.
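
Once the modules are installed, a quick way to confirm that Python can find them is to try importing each one. The script below is a minimal sketch, not part of the course materials: the helper name check_modules and the REQUIRED_MODULES list are our own, and the list covers only the three modules named above, so add any others the course asks for.

```python
import importlib

# Only the modules named in this section; extend this list as needed.
REQUIRED_MODULES = ["numpy", "scipy", "matplotlib"]


def check_modules(names):
    """Map each module name to its version string, or None if it cannot be imported."""
    status = {}
    for name in names:
        try:
            module = importlib.import_module(name)
        except ImportError:
            # The module is not installed (or not on this interpreter's path).
            status[name] = None
        else:
            # Not every module exposes __version__, so fall back to a placeholder.
            status[name] = getattr(module, "__version__", "unknown")
    return status


if __name__ == "__main__":
    for name, version in check_modules(REQUIRED_MODULES).items():
        if version is None:
            print(name + ": MISSING")
        else:
            print(name + ": OK (version " + version + ")")
```

If any module is reported as MISSING, install it (or the Enthought package) and run the script again with the same python interpreter you plan to use in class.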

dataiap/ Directory Structure

The repository contains the contents of the full course. We will be using


We will be working with several datasets in this course. Most of them have been added to the git repository.

The presidential contributions dataset is fairly large. We will use it on the first day, so please download it from

The datasets we will use are