Overview

Due before class Wednesday February 15th.

Fork the hw06 repository

Go here to fork the repo for homework 06.

Part 1: Sexy Joe Biden

Using statistical learning and data from the 2008 American National Election Studies survey, evaluate whether or not Leslie Knope’s attitudes towards Joe Biden are part of a broader trend within the American public. Specifically, do women display higher feeling thermometer ratings for Joe Biden than men?1 biden.csv contains a selection of variables from the larger survey that also allow you to test competing factors that may influence attitudes towards Joe Biden.

• biden - ranges from 0-100
• female - 1 if individual is female, 0 if individual is male
• pid - party identification
• 0 - Democrat
• 1 - Independent
• 2 - Republican
• age - age of respondent in years
• educ - number of years of formal education completed
• 17 - 17+ years (aka first year of graduate school and up)
1. Estimate a basic linear regression model of the relationship between gender and feelings towards Joe Biden. Calculate predicted values, graph the relationship between the two variables using the predicted values, and determine whether there appears to be a significant relationship.
2. Calculate the residuals for the training set based on the single variable model and compare the distribution of relationships by party ID. Does this model perform similarly for each party type?
3. Estimate separate models for each party ID type and compare the estimated coefficients. Does this confirm or refute your conclusions about the residuals?
• If you want to be super cool, turn this comparison into a visualization.

Part 2: Revisiting the Titanic

We’ve looked a lot at the Titanic data set. Now I want you to make your own predictions about who lived and who died.

1. Load the Titanic data from library(titanic). Use the titanic_train data frame. Split the data into a 70/30 training/test set.
• Be sure to set.seed() to ensure reproducibility of results.
2. Estimate three different logistic regression models using the training set with Survived as the response variable. You may use any combination of the predictors to estimate these models. Don’t just reuse the models from the notes.
3. Calculate the test set accuracy rate for each logistic regression model. Which performs the best?
4. Now estimate three random forest models using the same model specifications as the logistic regression models. That is, if your first logistic regression model uses Sex + Age as the predictors, your first random forest model should also use Sex + Age.
• Ordinarily you would not need to split your data into training and test sets for random forest models. However for the purposes of this assignment, estimate the random forest models using only the training set.
• Generate random forests with 500 trees apiece.
5. Generate variable importance plots for each random forest model. Which variables seem the most important?
6. Calculate the test set accuracy rate for each random forest model. Which performs the best?
7. Compare the test set accuracy rates between the logistic regression and random forest models. Which method (logistic regression or random forest) performed better?

Submit the assignment

Your assignment should be submitted as a set of R scripts, R Markdown documents, data files, figures, etc. Follow instructions on homework workflow. As part of the pull request, you’re encouraged to reflect on what was hard/easy, problems you solved, helpful tutorials you read, etc.

Rubric

Check minus: Cannot get code to run or is poorly documented. No documentation in the README file. Severe misinterpretations of the results. Overall a shoddy or incomplete assignment.

Check: Solid effort. Hits all the elements. No clear mistakes. Easy to follow (both the code and the output). Nothing spectacular, either bad or good.

Check plus: Interpretation is clear and in-depth. Accurately interprets the results, with appropriate caveats for what the technique can and cannot do. Code is reproducible. Writes a user-friendly README file. Discusses the benefits and drawbacks of a specific method. Compares multiple models fitted to the same underlying dataset.

1. Feeling thermometers are a common metric in survey research used to gauge attitudes or feelings of warmth towards individuals and institutions. They range from 0-100, with 0 indicating extreme coldness and 100 indicating extreme warmth.