Learn to scrape, parse, analyze, and visualize data for exploratory analysis and quantitative research. No previous programming knowledge is assumed.
The course aims to unshackle academic researchers from the constraints of commercial, general-purpose statistics/GIS software and to free them from the limitations of working with pre-existing, pre-formatted data sets. Students will learn to write programs in the interpreted programming language Python and in the open-source statistical language and environment R, to use databases, and to interact with a wide variety of existing software. Teaching two very different principal programming languages side by side is intentional: the course's secondary goals are to demonstrate that data can be created, analyzed, and visualized by a diversity of methods, and to encourage students not to be intimidated by unfamiliar programming dialects and interfaces. The course will introduce the methods required to parse text files, scrape data from other sources, write structured programs for statistical analysis, create and query databases, visualize datasets, and conduct network analysis.
This year we will be using Piazza for class discussion. Rather than emailing questions to the teaching staff, we encourage you to post them on Piazza.
|Week 1 (4/4/2014)||Operating Systems and Computer Science Basics|
Introduction to basic concepts of programming
languages and operating systems. Introduction to the interpreted programming languages Python
and R, with comparison to SPSS/Stata. Introduction to command-line
Assignment 1, due 4/10/2014.
|Week 2 (4/11/2014)||Data types and data structures|
Lists, arrays, dictionaries. Vectors and matrices. Operations on these data types in Python and R. Introduction to the scipy data structure library for Python.
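As a rough illustration of the data types named above, the sketch below uses only built-in Python types. The variable names and values are invented for this example, and a nested list stands in for a matrix rather than any scipy structure.

```python
# Hypothetical example data; none of these values come from the course materials.
scores = [3, 1, 4, 1, 5]        # list: an ordered, mutable sequence
scores.append(9)                # lists can grow in place

ages = {"ana": 31, "ben": 27}   # dictionary: maps keys to values
ages["cho"] = 45                # add a new key/value pair

matrix = [[1, 2], [3, 4]]       # a small matrix stored as a list of rows
# zip(*matrix) pairs up the i-th element of every row, i.e. the columns
transposed = [list(column) for column in zip(*matrix)]

mean_score = sum(scores) / len(scores)
oldest = max(ages, key=ages.get)   # key with the largest value
```

The same ideas carry over to R, where vectors and matrices are built-in types rather than nested lists.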
Readings to be done before Week 2 meeting:
|Week 3 (4/18/2014)||Structured programming and code management|
Conditional statements, loops, functions,
modules. Using a text editor and organizing code in separate files.
Introduction to programming styles—functional, imperative,
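A minimal sketch of the Week 3 building blocks, assuming a made-up task of labeling numbers against a threshold. The function names `classify` and `count_labels` are hypothetical and chosen only for illustration. The same job is done once in an imperative style (an explicit loop that updates state) and once in a functional style (mapping a function over a list).

```python
def classify(value, threshold=10):
    """Return 'high' or 'low' using a conditional statement."""
    if value >= threshold:
        return "high"
    return "low"


def count_labels(values, threshold=10):
    """Imperative style: loop over the values and update a running tally."""
    counts = {"high": 0, "low": 0}
    for value in values:
        counts[classify(value, threshold)] += 1
    return counts


observations = [4, 12, 10]   # hypothetical data

# Functional style: apply classify to every element without an explicit loop.
labels = list(map(classify, observations))
tally = count_labels(observations)
```

Saving the two functions in their own file would let other scripts reuse them with an `import` statement, which is the sense of "modules" in this week's topic.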
Readings to be done before Week 3 meeting:
|Week 4 (4/25/2014)||Input and Output|
Storing analyzed data for later reuse (Python's pickle and cPickle modules). Reading and writing text files for other programs to use.
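The two kinds of output described for this week can be sketched as follows. The file names and the `results` dictionary are assumptions made for this example. Note that `cPickle` exists only in Python 2; the sketch uses the `pickle` module, which is available in both Python 2 and Python 3.

```python
import os
import pickle
import tempfile

# Hypothetical analysis results to be stored.
results = {"n": 3, "mean": 2.5, "labels": ["a", "b", "c"]}

workdir = tempfile.mkdtemp()                       # scratch directory for the example
pickle_path = os.path.join(workdir, "results.pkl")
text_path = os.path.join(workdir, "results.tsv")

# Pickle: a binary format that restores Python objects exactly,
# but that only Python can read.
with open(pickle_path, "wb") as handle:
    pickle.dump(results, handle)
with open(pickle_path, "rb") as handle:
    restored = pickle.load(handle)

# Plain text: a tab-separated file that R, a spreadsheet,
# or another program can read.
with open(text_path, "w") as handle:
    handle.write("label\tindex\n")
    for index, label in enumerate(results["labels"]):
        handle.write("{}\t{}\n".format(label, index))

with open(text_path) as handle:
    rows = [line.rstrip("\n").split("\t") for line in handle]
```

Pickle files should only be loaded from sources you trust, since unpickling can execute arbitrary code.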
|Week 5 (5/2/2014)||Web scraping and Content analysis|
Automating data collection from structured web pages. Rudimentary natural language processing.
For lecture this week you will need the tokenizer you wrote the previous week in class (or you can download week5_template.py or week5_template.R). You will also need the same list of stopwords and collection of H. P. Lovecraft stories.
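For orientation, here is a minimal sketch of what a tokenizer with stopword filtering might look like. It is not the course's template file: the regular expression, the tiny stopword set, and the sample sentence are all assumptions made for this example.

```python
import re
from collections import Counter

# Hypothetical stopword list; the course supplies its own, longer list.
STOPWORDS = {"the", "of", "and", "a", "in", "to"}


def tokenize(text):
    """Lowercase the text and return runs of letters or apostrophes as tokens."""
    return re.findall(r"[a-z']+", text.lower())


def word_counts(text, stopwords=STOPWORDS):
    """Count tokens that are not in the stopword set."""
    return Counter(token for token in tokenize(text) if token not in stopwords)


sample = "The sea and the city: a dream of the sea."   # made-up sentence
counts = word_counts(sample)
```

Word counts of this kind are a common starting point for rudimentary content analysis, for example comparing the most frequent terms across several stories.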
Week 5 slides as PDF.
|Week 6 (5/9/2014)||Data management and Databases|
Introduction to relational databases, which provide stable and
network-accessible storage of medium-to-large (but not very
large) datasets. Comparison of spreadsheet interfaces (Excel, SPSS) with
relational databases like MySQL. Introduction to the MySQLdb
library for Python.
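The create-and-query cycle can be sketched without a database server. The example below deliberately swaps in Python's built-in `sqlite3` module in place of MySQLdb so that it runs anywhere; the table name, columns, and rows are invented. Both libraries follow the same connect, cursor, execute, fetch pattern, though MySQLdb uses `%s` rather than `?` as the query placeholder.

```python
import sqlite3

# An in-memory SQLite database stands in for a MySQL server here.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Hypothetical table of survey respondents.
cur.execute(
    "CREATE TABLE respondents (id INTEGER PRIMARY KEY, name TEXT, age INTEGER)"
)
cur.executemany(
    "INSERT INTO respondents (name, age) VALUES (?, ?)",
    [("ana", 31), ("ben", 27), ("cho", 45)],
)
conn.commit()

# Placeholders keep data values separate from the SQL text.
cur.execute("SELECT name FROM respondents WHERE age > ? ORDER BY age", (30,))
older = [row[0] for row in cur.fetchall()]

conn.close()
```

Unlike a spreadsheet, the query states what is wanted (names of respondents over 30) and leaves the lookup strategy to the database.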
Assignment 4, due 5/2/2014.
SQL books available on ProQuest
|Week 7 (5/16/2014)||Plotting, Graphing, Visualization|
Visualizing descriptive statistics in R (histograms, box plots, scatter plots, etc.). Introduction to matplotlib for Python.
|Week 8 (5/23/2014)||Web development + network analysis|
Week 8 slides on web development and Flask.
|Week 9 (5/30/2014)||Network Analysis|
Basics of network representation, manipulation, and visualization in Python's networkx library and R's sna, network, and igraph packages.
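To show what "network representation" means before any library is involved, the sketch below stores a small undirected network as a dictionary of neighbor sets and computes each node's degree. This is plain Python, not networkx, and the edge list and the function name `build_adjacency` are made up for the example.

```python
# Hypothetical undirected edges between four nodes.
edges = [("a", "b"), ("b", "c"), ("c", "a"), ("c", "d")]


def build_adjacency(edge_list):
    """Map every node to the set of nodes it is connected to."""
    adjacency = {}
    for u, v in edge_list:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)   # undirected: store both directions
    return adjacency


graph = build_adjacency(edges)

# Degree: the number of neighbors each node has.
degree = {node: len(neighbors) for node, neighbors in graph.items()}
```

Libraries such as networkx wrap a representation along these lines and add algorithms and drawing routines on top of it.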
|Week 10 (6/6/2014)||TBA|
1 Students will be required to install Python and R distributions on their own computers.