Welcome!
This class is an introduction to data cleaning, analysis, and visualization. We will walk you through the process as we analyze real-world datasets. Each day, we will spend the first 30 minutes introducing the day's concepts, and the rest of the class will be exercises. We have written a daily walkthrough that you will read and program through in class, and we will be available to help.
This is our first time teaching this course, and we'll be learning as much as you. Don't hesitate to ask us to change something or improve on something. We'll be grateful.
We assume you have a working knowledge of Python (6.01) and are willing to write code. Most of the code you interact with will come with an example that you can modify. Ideally, you will write little code from scratch, except for programs you're inspired to write!
We will teach the basics of data analysis through concrete examples. All of your programs will be written in Python. The schedule is as follows:
Day 0 (today): setup
Day 1: An end-to-end example getting you from a dataset found online to several plots of campaign contributions.
Day 2: Lots of visualization examples, and practice going from data to chart.
Day 3: Statistics basics, including t-tests, linear regression, and statistical significance. We'll use campaign finance data and per-county health rankings.
Day 4: Text processing on a large text corpus (the Enron email dataset) using tf-idf and cosine similarity.
Day 5: Scaling up to process large datasets using Hadoop/MapReduce on a larger copy of the Enron dataset.
Day 6: You tell us! Get into groups or work on your own to analyze a dataset of your choosing, and tell us a story!
We will not cover the following:
R: R is a wonderful data analysis, statistics, and plotting framework. We will not be using it because we can achieve all of our objectives in Python, and more MIT undergraduates know Python.
Visualization using browser technology (canvas, svg, d3, etc.) or in non-Python languages (Processing): These tools are very interesting, and lots of visualizations on the web use them (e.g., nytimes visualizations); however, they are outside the scope of this class. We'll teach you how to visualize data in static charts. If this is an area of interest for you, the next step will be to build interactive visualizations that the world can explore, and we can point you in the right direction.
Before the class, please set up the environment. You will need to install some software and packages and download some datasets to get started.
We assume that you are developing in a Unix-like environment and are familiar with the common commands (e.g., less, man). If you are a Windows user, we assume you are using Cygwin, but you are on your own.
In this class, you will need to install a number of tools. The major ones are:
Python: to make sure it is the right version, run
python --version
pip: run
sudo easy_install pip
or download the tar.gz file at the link above, untar it, go into the newly created directory, and type
sudo python setup.py install
dataiap: download the repository using
git clone git://github.com/dataiap/dataiap.git dataiap
To update it later, go into the dataiap directory and type
git pull
We will also require a number of Python modules:
sudo pip install numpy
scipy 0.10: scientific computing module. Any one of the following installs it:
sudo apt-get install python-scipy
or
sudo pip install scipy
or, building from source:
git clone https://github.com/scipy/scipy.git
cd scipy
python setup.py build
python setup.py install
sudo pip install matplotlib
sudo pip install python-dateutil
sudo pip install pyparsing
sudo pip install mrjob
sudo pip install boto
For convenience, Enthought provides numpy, scipy, and matplotlib in a single installable package. Many students who had trouble installing these modules separately were able to install Enthought.
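Once the installs finish, one quick way to confirm they worked is to try importing each module. The following is a minimal sketch and not part of the course materials: the helper name check_modules is hypothetical, and the import names are assumed from the pip package names above (python-dateutil is imported as dateutil).

```python
import importlib
import sys

# Import names assumed from the packages listed above.
REQUIRED_MODULES = [
    "numpy", "scipy", "matplotlib", "dateutil", "pyparsing", "mrjob", "boto",
]


def check_modules(names):
    """Map each module name to its version string, or None if the import fails.

    Modules that import fine but expose no __version__ are reported as "unknown".
    """
    report = {}
    for name in names:
        try:
            module = importlib.import_module(name)
        except ImportError:
            report[name] = None
        else:
            report[name] = getattr(module, "__version__", "unknown")
    return report


if __name__ == "__main__":
    print("Python %s" % sys.version.split()[0])
    for name, version in sorted(check_modules(REQUIRED_MODULES).items()):
        status = "MISSING" if version is None else "ok (version %s)" % version
        print("%-12s %s" % (name, status))
```

Any module reported as MISSING can be reinstalled with the matching command above, or through the Enthought package.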
dataiap/ Directory Structure
The repository contains the contents of the full course. We will be using:
dayX/: files containing the lecture for day X
datasets/: the datasets we will be using should live here
resources/: contains Python scripts that you will eventually run
util/: contains Python modules we have written that you will use in this course
inst/: instructor Python files. Used to set up and test the labs. Please don't view during the course.
We will be working with several datasets in this course. Most of them have been added to the git repository.
The presidential contributions dataset is fairly large. We will use it on the first day, so please download it from ftp://ftp.fec.gov/FEC/Presidential_Map/2008/P00000001/P00000001-ALL.zip.
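After downloading, the archive needs to be unpacked. Below is a minimal sketch using Python's standard zipfile module. It assumes the zip has already been saved locally; the function name extract_archive is hypothetical, and the destination directory in the usage note is our guess from the dataset list, not something the course specifies.

```python
import os
import zipfile


def extract_archive(zip_path, dest_dir):
    """Unpack zip_path into dest_dir and return the sorted member names."""
    # Create the destination (and any missing parent directories) first.
    if not os.path.isdir(dest_dir):
        os.makedirs(dest_dir)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest_dir)
        return sorted(archive.namelist())
```

For example, extract_archive("P00000001-ALL.zip", "dataiap/datasets/pres_campaign") would unpack the download into the campaign dataset directory, assuming that is where the day 1 walkthrough expects it.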
The datasets we will use are:
dataiap/datasets/pres_campaign/
dataiap/datasets/county_health_rankings/additional_measures_cleaned.csv
dataiap/datasets/county_health_rankings/ypll.csv
dataiap/datasets/emails/kenneth.zip: contains a subset of Kenneth Lay's emails that you will analyze on day 4.
dataiap/datasets/emails/kenneth_json.zip: contains a JSON-encoded subset of Kenneth Lay's emails that you will analyze on day 5.
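To confirm that everything is in place before day 1, a short script can check for the paths listed above. This is a sketch under assumptions: the helper missing_datasets is hypothetical, paths are taken relative to the root of the dataiap checkout, and it only tests that each path exists, not that its contents are complete.

```python
import os

# Paths from the dataset list, relative to the root of the dataiap checkout.
EXPECTED_DATASETS = [
    "datasets/pres_campaign",
    "datasets/county_health_rankings/additional_measures_cleaned.csv",
    "datasets/county_health_rankings/ypll.csv",
    "datasets/emails/kenneth.zip",
    "datasets/emails/kenneth_json.zip",
]


def missing_datasets(repo_root, expected=EXPECTED_DATASETS):
    """Return the expected paths that do not exist under repo_root."""
    return [path for path in expected
            if not os.path.exists(os.path.join(repo_root, path))]


if __name__ == "__main__":
    # Assumes the script is run from the directory that contains dataiap/.
    missing = missing_datasets("dataiap")
    if missing:
        print("Missing datasets:")
        for path in missing:
            print("  " + path)
    else:
        print("All expected datasets are present.")
```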