
Learn to scrape, parse, analyze, and visualize data for exploratory analysis and quantitative research. No previous programming knowledge is assumed.
The workshop's goals
are to free
academic researchers from the constraints of commercial,
general-purpose statistics/GIS software and from the
limitations of working with pre-existing, pre-formatted datasets.
Students in the workshop will learn to write programs in the
interpreted programming
language Python and in R, the open-source statistical language and environment. They will
also learn to use databases and to interact with a wide variety of existing
software. Teaching two very different
programming languages side by side is intentional: the workshop's secondary goals are
to demonstrate that data can be created, analyzed, and visualized by a variety
of methods, and to encourage students not to be intimidated by
unfamiliar programming languages and interfaces. The workshop will
introduce methods required to parse text files, scrape
data from other sources, write structured programs for statistical
analysis, create and query databases, visualize datasets, and conduct
network analysis.
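
As a rough illustration of how a few of these steps fit together (parsing a text file, writing small structured functions, and creating and querying a database), here is a minimal Python sketch. It is not course material: the sample data, the table name `scholars`, and the function names are all hypothetical, and it uses Python's built-in `sqlite3` module as a stand-in for a MySQL server so that it runs without any setup.

```python
import csv
import io
import sqlite3

# Hypothetical tab-delimited data standing in for a text file on disk.
RAW_TEXT = """name\tdepartment\tcitations
Ada\tSociology\t12
Ben\tEconomics\t7
Cy\tSociology\t5
"""


def parse_rows(text):
    """Parse tab-delimited text into a list of dictionaries."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return [
        {
            "name": row["name"],
            "department": row["department"],
            "citations": int(row["citations"]),  # convert text to a number
        }
        for row in reader
    ]


def load_into_database(rows):
    """Create an in-memory SQLite database and insert the parsed rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE scholars (name TEXT, department TEXT, citations INTEGER)"
    )
    conn.executemany(
        "INSERT INTO scholars VALUES (:name, :department, :citations)", rows
    )
    conn.commit()
    return conn


def citations_by_department(conn):
    """Query the database for total citations per department."""
    cursor = conn.execute(
        "SELECT department, SUM(citations) FROM scholars "
        "GROUP BY department ORDER BY department"
    )
    return dict(cursor.fetchall())


rows = parse_rows(RAW_TEXT)
conn = load_into_database(rows)
totals = citations_by_department(conn)
print(totals)
```

Each stage is a separate function, so the parsing step could be swapped for a scraper, or the SQLite connection for a MySQL one, without touching the rest.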
| Week 1 (4/4/2013) | Operating Systems and Computer Science Basics |
|
Introduction to basic concepts of operating systems and programming
languages. Introduction to the interpreted programming languages Python
and R, with a comparison to SPSS/Stata. Introduction to command-line
interfaces.
Assignment 1 is available here | |
| Week 2 (4/11/2013) | Data types and data structures |
|
Lists, arrays, dictionaries. Vectors and matrices. Operations on these data types in Python and R. Introduction to the scipy data structure library for Python.
Readings to be done before Week 2 meeting: |
|
| Week 3 (4/18/2013) | Structured programming and code management |
| Conditional statements, loops, functions, modules. Using a text editor and organizing code in separate files. Introduction to programming styles—functional, imperative, object-oriented. | |
| Week 4 (Friday, 4/26/2013) | Input and Output NOTE: changed meeting date |
|
Storing analyzed data for later reuse (Python's pickle and
cPickle modules).
| |
| Week 5 (5/2/2013 in Searle 240B) | Web scraping and Content analysis NOTE: changed location |
| Week 6 (5/9/2013) | Plotting, Graphing, Visualization |
| Visualizing descriptive statistics in R (histograms, box plots, scatter plots, etc.). Introduction to matplotlib for Python. | |
| Week 7 (5/16/2013) | Data management and Databases |
|
Introduction to relational databases, which provide stable and
network-accessible storage of medium-to-large (but not very
large) datasets. Comparison of spreadsheet interfaces (Excel, SPSS) with
relational databases like MySQL. Introduction to the MySQLdb
library for Python.
SQL books available on ProQuest
| |
| Week 8 (5/23/2013) | Data Analysis & Data Mining |
| Building structured, multi-stage programs that systematically evaluate hypotheses and explore patterns in data. | |
| Week 9 (5/30/2013) | Network Analysis |
| Basics of network representation, manipulation, and visualization in Python's networkx library and R's sna, network, and igraph packages. | |
| Week 10 (6/6/2013) | TBA |
1 Students will be required to install Python and R distributions on their own computers.