Due before class April 30th.

Fork the hw05 repository

Go here to fork the repo for homework 05.

What is my objective?

At this half-way point in the term, I want to check and make sure everyone is up to speed on the major skills learned so far:

I also want to demonstrate the value of these skills for research that interests you. Therefore, in this assignment, I want you to write a short report on a research question of your own interest. Frame it as you would if you were submitting it for a substantive seminar in your research field, though much shorter and less comprehensive than a term paper. It should be approximately 750-1000 words in length and showcase the major skills identified above. It does not need to be an advanced statistical analysis involving complex statistical modeling or skills we have not yet learned. The actual analysis can be relatively simple - again, think exploratory. Analyzing the distribution of variables and how they are related to one another at a bivariate level is more than adequate.

What data should I use?

Whatever you want! The important thing is that the entire analysis is reproducible. That is, I will clone your repository on my computer and attempt to reproduce your results. This means you should provide an informative README.md file that:

I’m not creative and I can’t think of anything to analyze!

Okay, then analyze one of the datasets we’ve used before.

How can I automatically download the data?

There are functions in R and programs for the shell that allow you to do this. For example, if I wanted to download gapminder from the original GitHub repo:

  • Option 1: via an R script using downloader::download or RCurl::getURL.

    cat(file = "gapminder.tsv",
        RCurl::getURL("https://raw.githubusercontent.com/jennybc/gapminder/master/inst/gapminder.tsv"))
  • Option 2: in a shell script using curl or wget.

    curl -O https://raw.githubusercontent.com/jennybc/gapminder/master/inst/gapminder.tsv
    wget https://raw.githubusercontent.com/jennybc/gapminder/master/inst/gapminder.tsv
  • Option 3: manually download and save a copy of the data file(s) in your repo. Make sure to commit and push them to GitHub.
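
If you go with option 2, it can help to wrap the download in a small function so that re-running your script does not fetch the file again. The following is only a sketch: fetch_data is a hypothetical helper name, and the data/ destination folder in the usage note is an assumption about how you lay out your repo.

```shell
#!/usr/bin/env bash
# Sketch: download a data file only when a non-empty copy is not already present.
# fetch_data is a hypothetical helper name, not part of the course materials.
fetch_data() {
  local url="$1" dest="$2"
  if [ -s "$dest" ]; then
    echo "Already have $dest, skipping download"
    return 0
  fi
  # Create the destination folder if it does not exist yet
  mkdir -p "$(dirname "$dest")"
  # -f: fail on HTTP errors, -sS: quiet but still report errors, -L: follow redirects
  curl -fsSL -o "$dest" "$url"
}
```

For the gapminder example, you would call it as fetch_data https://raw.githubusercontent.com/jennybc/gapminder/master/inst/gapminder.tsv data/gapminder.tsv near the top of your shell script.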

What if my data file is large?

Because of how Git tracks changes in files, GitHub will not accept a push that contains a file larger than 100 MB. If you try, you will get an error message and the push will fail. Worse yet, you now have to find a way to strip all trace of the data file from the Git repo (including previous commits) before you can sync up your fork. This is a pain in the ass. Avoid it as much as possible. If you follow option 1 or 2, then you do not need to store the data file in the repo because it is automatically downloaded by your script/R Markdown document.
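
One way to catch an oversized file before it ends up in a commit is to scan the working directory first. This is a rough sketch, not a required step: find_large_files is a hypothetical helper name, and the M size suffix is supported by GNU and BSD find but is not part of POSIX.

```shell
# Sketch: list files above a size threshold so you can spot them before committing.
# find_large_files is a hypothetical helper name.
find_large_files() {
  local dir="${1:-.}" limit="${2:-100M}"
  # Skip Git's own internal files, which are not part of your working copy
  find "$dir" -type f -size +"$limit" ! -path '*/.git/*'
}
```

Running find_large_files with no arguments checks the current directory against a 100M threshold. Any path it prints is a candidate for your .gitignore file, provided your script can download that file again.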

If you have to store a large data file in your repo, use Git Large File Storage (Git LFS). It is a separate program you need to install via the shell, but the instructions are straightforward. It integrates smoothly into GitHub, and makes version tracking of large files far easier. If you include it in a course-related repo (i.e. a fork of the homework repos), then there is no cost. If you want to use Git LFS for your own work, there are separate fees charged by GitHub for storage and bandwidth usage.
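
Once Git LFS is installed, telling it which files to manage takes only a few commands. The sketch below assumes the git-lfs program is already on your machine and that you run it from inside your repo; track_with_lfs is a hypothetical helper name, and the file pattern is whatever matches your own data.

```shell
# Sketch: put files matching a pattern under Git LFS in the current repo.
# Assumes git-lfs is already installed; track_with_lfs is a hypothetical helper name.
track_with_lfs() {
  local pattern="$1"
  git lfs install            # one-time setup of the Git LFS hooks and filters
  git lfs track "$pattern"   # record the pattern in .gitattributes
  git add .gitattributes     # stage the rule so it is committed with the repo
}
```

For example, track_with_lfs 'data/*.csv' would cover CSV files in a data/ folder. Do this before you add and commit the large files themselves, otherwise they are stored as ordinary Git objects.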

Perform exploratory analysis

  • Import the data
  • Tidy it as necessary to get it into a tidy data structure
  • Generate some descriptive plots of the data
  • Summarize the relationships you discover with a written summary. Conjecture as to why they occur and/or why they may be spurious.

The final output should be a github_document, but feel free to use R scripts in your initial work or create a pipeline that executes and renders all your scripts/R Markdown files at once.
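
A pipeline like the one mentioned above could be as small as a shell function that runs each stage in order and stops at the first failure. Treat this as a sketch under assumptions: the file names 01_download.R, 02_tidy.R, and report.Rmd are hypothetical placeholders for your own files, and run_pipeline is a hypothetical name.

```shell
# Sketch: run every stage in order and stop as soon as one of them fails.
# The file names below are hypothetical placeholders for your own scripts.
run_pipeline() {
  Rscript 01_download.R || return 1
  Rscript 02_tidy.R || return 1
  # render() uses the output format declared in the document's YAML header
  Rscript -e 'rmarkdown::render("report.Rmd")'
}
```

If report.Rmd declares github_document as its output format, calling run_pipeline from the repo's root directory regenerates everything from scratch, which is what I will be doing when I clone your repo.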

Aim higher!

Submit the assignment

Your assignment should be submitted as a set of R scripts, R Markdown documents, data files, figures, etc. Follow instructions on homework workflow. As part of the pull request, you’re encouraged to reflect on what was hard/easy, problems you solved, helpful tutorials you read, etc.


In your reflection, make special note of any significant problems that required debugging. Try to be specific about your process. Did you receive any helpful error or warning message? Did you use traceback() to hunt down the source of the bug? How did you resolve it? You don’t need to do this for every bug, but keep track of at least one or two major errors you had to resolve.


Check minus: Cannot reproduce your results. Scripts require interactive coding to fix. Markdown documents are not generated. Graphs and tables don’t have appropriate labels or formatting. There is no consistency to your code’s style.

Check: Solid effort. Hits all the elements. No clear mistakes. Easy to follow (both the code and the output). Nothing spectacular, either bad or good.

Check plus: Repository contains a detailed README.md explaining how the files in the repo should be executed. Displays innovative data analysis or coding skills. Graphs and tables are well labeled. Excellent implementation of a consistent style guide. Analysis is insightful. I walk away feeling I learned something.


This work is licensed under the CC BY-NC 4.0 Creative Commons License.