Overview

Due before class Wednesday February 22nd.

Fork the hw07 repository

Go here to fork the repo for homework 07.

Sexy Joe Biden and Titanic (redux)

Last week you estimated statistical learning models of favorability toward Joe Biden and of survival on the Titanic, but you did not yet know much about resampling methods. You will now use them to reevaluate your results.

  1. Estimate a single linear regression model for Joe Biden and evaluate the model fit using mean squared error (MSE). You may choose whichever combination of predictors you feel is appropriate.
    1. Estimate the model using the entire set of observations and calculate the MSE.
    2. Estimate the model using the validation set approach (70/30 training/test split) and calculate the MSE.
    3. Estimate the model using the LOOCV approach and calculate the MSE.
    4. Estimate the model using the 10-fold CV approach and calculate the MSE.
    5. Report on any discrepancies or differences between the estimated MSEs and briefly explain why they may differ from one another.
  2. Estimate three different logistic regression models for survival on the Titanic. You may reuse the same models as last week, or estimate new models.
    1. Estimate the models using the entire set of observations and calculate the error rate of each model.
    2. Estimate the models using 10-fold CV and calculate the error rate of each model. How do these values compare to the original estimates using the full dataset? Which model performs the best?
    3. Take the model that performs the best and estimate bootstrap standard errors for the parameters. Are there significant differences between the standard errors from the original model trained on all the data vs. the bootstrap estimates?
  3. Take the model specification for your best-performing model from the Titanic problem, and estimate it using the random forest, decision tree, and logistic regression machine learning algorithms in sparklyr. Calculate the accuracy and AUC metrics for each model. Which algorithm performs the best?
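
The resampling estimates named above (validation set, LOOCV, 10-fold CV, and bootstrap standard errors) can be sketched on toy data. This is only an illustration under stated assumptions, not a solution to the assignment: it is written in plain Python rather than R, it uses synthetic data with a single predictor instead of the Biden or Titanic data, and every function name (`fit_line`, `mse`, `validation_mse`, `kfold_mse`, `bootstrap_slope_se`) is hypothetical.

```python
import random
import statistics

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b

def mse(model, xs, ys):
    """Mean squared error of a fitted (a, b) line on the given data."""
    a, b = model
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys)) / len(xs)

def validation_mse(xs, ys, train_frac=0.7, seed=0):
    """Fit on a random training split, report MSE on the held-out rest."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    cut = int(train_frac * len(idx))
    train, test = idx[:cut], idx[cut:]
    model = fit_line([xs[i] for i in train], [ys[i] for i in train])
    return mse(model, [xs[i] for i in test], [ys[i] for i in test])

def kfold_mse(xs, ys, k, seed=0):
    """k-fold CV MSE, pooled over all held-out points; k = n gives LOOCV."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    sq_err = 0.0
    for fold in folds:
        held = set(fold)
        train = [i for i in idx if i not in held]
        a, b = fit_line([xs[i] for i in train], [ys[i] for i in train])
        sq_err += sum((ys[i] - (a + b * xs[i])) ** 2 for i in fold)
    return sq_err / len(xs)

def bootstrap_slope_se(xs, ys, reps=500, seed=0):
    """Standard deviation of the slope across resamples drawn with replacement."""
    rng = random.Random(seed)
    n = len(xs)
    slopes = []
    for _ in range(reps):
        sample = [rng.randrange(n) for _ in range(n)]
        slopes.append(fit_line([xs[i] for i in sample], [ys[i] for i in sample])[1])
    return statistics.stdev(slopes)

# Synthetic data (an assumption for illustration): y = 2 + 3x + noise.
rng = random.Random(1)
xs = [rng.uniform(0, 10) for _ in range(100)]
ys = [2 + 3 * x + rng.gauss(0, 1) for x in xs]

results = {
    "full data": mse(fit_line(xs, ys), xs, ys),
    "validation (70/30)": validation_mse(xs, ys),
    "LOOCV": kfold_mse(xs, ys, k=len(xs)),
    "10-fold CV": kfold_mse(xs, ys, k=10),
}
slope_se = bootstrap_slope_se(xs, ys)

for name, value in results.items():
    print(f"{name}: MSE = {value:.3f}")
print(f"bootstrap SE of slope = {slope_se:.4f}")
```

The full-data MSE is computed on the same observations used to fit the model, so it tends to understate out-of-sample error. The validation-set estimate depends on which observations land in the test split, while LOOCV does not depend on the random shuffle at all.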

Submit the assignment

Your assignment should be submitted as a set of R scripts, R Markdown documents, data files, figures, etc. Follow the instructions on homework workflow. As part of the pull request, you’re encouraged to reflect on what was hard/easy, problems you solved, helpful tutorials you read, etc.

Rubric

Check minus: Code does not run or is poorly documented. No documentation in the README file. Severe misinterpretations of the results. Overall a shoddy or incomplete assignment.

Check: Solid effort. Hits all the elements. No clear mistakes. Easy to follow (both the code and the output). Nothing spectacular, either bad or good.

Check plus: Interpretation is clear and in-depth. Accurately interprets the results, with appropriate caveats for what the technique can and cannot do. Code is reproducible. Writes a user-friendly README file. Discusses the benefits and drawbacks of a specific method. Compares multiple models fitted to the same underlying dataset.

This work is licensed under the CC BY-NC 4.0 Creative Commons License.