# Practicing tidytext with Hamilton

library(tidyverse)
library(tidytext)
library(ggtext)
library(here)

set.seed(123)
theme_set(theme_minimal())


About seven months ago, my wife and I became addicted to Hamilton.

I admit, we were quite late to the party. I promise we did like it, but I wanted to wait and see the musical in person before listening to the soundtrack. Alas, having three small children limits your free time to go out to the theater for an entire evening. So I finally caved and started listening to the soundtrack on Spotify. And it’s amazing! My son’s favorite song (he’s four, by the way) is “My Shot.”

One of the nice things about the musical is that it is sung-through, so the lyrics contain essentially all of the dialogue. This provides an interesting opportunity to use the tidytext package to analyze the lyrics. Here, I use the geniusr package to obtain the complete lyrics from Genius (see footnote 1).

hamilton <- read_csv(file = here("static", "data", "hamilton.csv")) %>%
  mutate(song_name = parse_factor(song_name))

##
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   song_number = col_double(),
##   song_name = col_character(),
##   line_num = col_double(),
##   line = col_character(),
##   speaker = col_character()
## )

glimpse(hamilton)

## Rows: 3,532
## Columns: 5
## $ song_number <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ song_name   <fct> Alexander Hamilton, Alexander Hamilton, Alexander Hamilto…
## $ line_num    <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17…
## $ line        <chr> "How does a bastard, orphan, son of a whore and a", "Scot…
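
With one lyric line per row, the usual next step in a tidytext workflow is to break each line into words. The sketch below is a minimal illustration on a one-row stand-in: `toy_lyrics` is a hypothetical object built from the first line shown above, not the full data set. It tokenizes with `unnest_tokens()` and then drops stop words with an `anti_join()` against the SMART list, which is one common approach rather than necessarily the exact pipeline used in this post.

```r
library(dplyr)
library(tidytext)

# hypothetical one-row stand-in for the `hamilton` data frame
toy_lyrics <- tibble(
  song_name = "Alexander Hamilton",
  line_num = 1,
  line = "How does a bastard, orphan, son of a whore and a"
)

# one word per row; unnest_tokens() lowercases and strips punctuation by default
toy_words <- toy_lyrics %>%
  unnest_tokens(output = word, input = line)

# filtering join: keep only words that are NOT in the SMART stop word list
toy_words_clean <- toy_words %>%
  anti_join(get_stopwords(source = "smart"), by = "word")

# word frequencies, most common first
count(toy_words_clean, word, sort = TRUE)
```

The same two steps should apply unchanged to the full `hamilton` data frame, since it also stores the text in a `line` column.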
    !word2 %in% get_stopwords(source = "smart")$word
  ) %>%
  drop_na(word1, word2) %>%
  count(word1, word2, sort = TRUE)

# filter for only relatively common combinations
bigram_graph <- hamilton_pair %>%
  filter(n > 3) %>%
  igraph::graph_from_data_frame()

# draw a network graph
set.seed(1776) # New York City
ggraph(bigram_graph, layout = "fr") +
  geom_edge_link(aes(edge_alpha = n, edge_width = n), show.legend = FALSE, alpha = .5) +
  geom_node_point(color = "#0052A5", size = 3, alpha = .5) +
  geom_node_text(aes(label = name), vjust = 1.5) +
  ggtitle("Word Network in Lin-Manuel Miranda's *Hamilton*") +
  theme_void() +
  theme(plot.title = element_markdown())

Finally, we can examine the collocation of word pairs to look for common usage. Several major themes emerge from this approach, including the Hamilton/Jefferson relationship, “Aaron Burr, sir”, Philip’s song with his mother (un, deux, trois, quatre, …), the rising up of the colonies, and those young, scrappy, and hungry men.
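
The `word1`/`word2` counts behind the graph come from bigrams. As a rough sketch of one common way to build such a table (an assumption about the general technique, not necessarily the exact code used here), the example below tokenizes a toy data frame into bigrams, splits each bigram into two columns, and removes pairs that contain a SMART stop word. The `toy_lyrics` and `toy_pairs` names and the two input lines are illustrative only.

```r
library(dplyr)
library(tidyr)
library(tidytext)

# toy input, one lyric line per row (illustrative text, not the real data set)
toy_lyrics <- tibble(
  line_num = 1:2,
  line = c("Aaron Burr, sir", "Pardon me, are you Aaron Burr, sir")
)

smart_words <- get_stopwords(source = "smart")$word

toy_pairs <- toy_lyrics %>%
  # consecutive word pairs within each line
  unnest_tokens(output = bigram, input = line, token = "ngrams", n = 2) %>%
  # split "aaron burr" into word1 = "aaron", word2 = "burr"
  separate(bigram, into = c("word1", "word2"), sep = " ") %>%
  # drop any pair where either word is a stop word
  filter(
    !word1 %in% smart_words,
    !word2 %in% smart_words
  ) %>%
  # lines with fewer than two words produce NA bigrams
  drop_na(word1, word2) %>%
  count(word1, word2, sort = TRUE)

toy_pairs
```

A table in this shape (two node columns followed by a count) is what `igraph::graph_from_data_frame()` expects: the first two columns become the edge endpoints and `n` becomes an edge attribute.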
## Acknowledgments

## Session Info

devtools::session_info()

## ─ Session info ───────────────────────────────────────────────────────────────
##  setting  value
##  version  R version 4.1.0 (2021-05-18)
##  os       macOS Big Sur 10.16
##  system   x86_64, darwin17.0
##  ui       X11
##  language (EN)
##  collate  en_US.UTF-8
##  ctype    en_US.UTF-8
##  tz       America/Chicago
##  date     2021-09-01
##
## ─ Packages ───────────────────────────────────────────────────────────────────
##  package      * version date       lib source
##  assertthat     0.2.1   2019-03-21 [1] CRAN (R 4.1.0)
##  backports      1.2.1   2020-12-09 [1] CRAN (R 4.1.0)
##  bit            4.0.4   2020-08-04 [1] CRAN (R 4.1.0)
##  bit64          4.0.5   2020-08-30 [1] CRAN (R 4.1.0)
##  blogdown       1.4     2021-07-23 [1] CRAN (R 4.1.0)
##  bookdown       0.23    2021-08-13 [1] CRAN (R 4.1.0)
##  broom          0.7.9   2021-07-27 [1] CRAN (R 4.1.0)
##  bslib          0.2.5.1 2021-05-18 [1] CRAN (R 4.1.0)
##  cachem         1.0.6   2021-08-19 [1] CRAN (R 4.1.0)
##  callr          3.7.0   2021-04-20 [1] CRAN (R 4.1.0)
##  cellranger     1.1.0   2016-07-27 [1] CRAN (R 4.1.0)
##  cli            3.0.1   2021-07-17 [1] CRAN (R 4.1.0)
##  codetools      0.2-18  2020-11-04 [1] CRAN (R 4.1.0)
##  colorspace     2.0-2   2021-06-24 [1] CRAN (R 4.1.0)
##  crayon         1.4.1   2021-02-08 [1] CRAN (R 4.1.0)
##  curl           4.3.2   2021-06-23 [1] CRAN (R 4.1.0)
##  DBI            1.1.1   2021-01-15 [1] CRAN (R 4.1.0)
##  dbplyr         2.1.1   2021-04-06 [1] CRAN (R 4.1.0)
##  desc           1.3.0   2021-03-05 [1] CRAN (R 4.1.0)
##  devtools       2.4.2   2021-06-07 [1] CRAN (R 4.1.0)
##  digest         0.6.27  2020-10-24 [1] CRAN (R 4.1.0)
##  dplyr        * 1.0.7   2021-06-18 [1] CRAN (R 4.1.0)
##  ellipsis       0.3.2   2021-04-29 [1] CRAN (R 4.1.0)
##  evaluate       0.14    2019-05-28 [1] CRAN (R 4.1.0)
##  fansi          0.5.0   2021-05-25 [1] CRAN (R 4.1.0)
##  farver         2.1.0   2021-02-28 [1] CRAN (R 4.1.0)
##  fastmap        1.1.0   2021-01-25 [1] CRAN (R 4.1.0)
##  forcats      * 0.5.1   2021-01-27 [1] CRAN (R 4.1.0)
##  fs             1.5.0   2020-07-31 [1] CRAN (R 4.1.0)
##  generics       0.1.0   2020-10-31 [1] CRAN (R 4.1.0)
##  geniusr      * 1.2.0   2020-04-13 [1] CRAN (R 4.1.0)
##  ggforce        0.3.3   2021-03-05 [1] CRAN (R 4.1.0)
##  ggplot2      * 3.3.5   2021-06-25 [1] CRAN (R 4.1.0)
##  ggraph       * 2.0.5   2021-02-23 [1] CRAN (R 4.1.0)
##  ggrepel        0.9.1   2021-01-15 [1] CRAN (R 4.1.0)
##  ggtext       * 0.1.1   2020-12-17 [1] CRAN (R 4.1.0)
##  glue           1.4.2   2020-08-27 [1] CRAN (R 4.1.0)
##  graphlayouts   0.7.1   2020-10-26 [1] CRAN (R 4.1.0)
##  gridExtra      2.3     2017-09-09 [1] CRAN (R 4.1.0)
##  gridtext       0.1.4   2020-12-10 [1] CRAN (R 4.1.0)
##  gtable         0.3.0   2019-03-25 [1] CRAN (R 4.1.0)
##  haven          2.4.3   2021-08-04 [1] CRAN (R 4.1.0)
##  here         * 1.0.1   2020-12-13 [1] CRAN (R 4.1.0)
##  highr          0.9     2021-04-16 [1] CRAN (R 4.1.0)
##  hms            1.1.0   2021-05-17 [1] CRAN (R 4.1.0)
##  htmltools      0.5.1.1 2021-01-22 [1] CRAN (R 4.1.0)
##  httr           1.4.2   2020-07-20 [1] CRAN (R 4.1.0)
##  igraph         1.2.6   2020-10-06 [1] CRAN (R 4.1.0)
##  janeaustenr    0.1.5   2017-06-10 [1] CRAN (R 4.1.0)
##  jquerylib      0.1.4   2021-04-26 [1] CRAN (R 4.1.0)
##  jsonlite       1.7.2   2020-12-09 [1] CRAN (R 4.1.0)
##  knitr          1.33    2021-04-24 [1] CRAN (R 4.1.0)
##  labeling       0.4.2   2020-10-20 [1] CRAN (R 4.1.0)
##  lattice        0.20-44 2021-05-02 [1] CRAN (R 4.1.0)
##  lifecycle      1.0.0   2021-02-15 [1] CRAN (R 4.1.0)
##  lubridate      1.7.10  2021-02-26 [1] CRAN (R 4.1.0)
##  magrittr       2.0.1   2020-11-17 [1] CRAN (R 4.1.0)
##  markdown       1.1     2019-08-07 [1] CRAN (R 4.1.0)
##  MASS           7.3-54  2021-05-03 [1] CRAN (R 4.1.0)
##  Matrix         1.3-4   2021-06-01 [1] CRAN (R 4.1.0)
##  memoise        2.0.0   2021-01-26 [1] CRAN (R 4.1.0)
##  modelr         0.1.8   2020-05-19 [1] CRAN (R 4.1.0)
##  munsell        0.5.0   2018-06-12 [1] CRAN (R 4.1.0)
##  pillar         1.6.2   2021-07-29 [1] CRAN (R 4.1.0)
##  pkgbuild       1.2.0   2020-12-15 [1] CRAN (R 4.1.0)
##  pkgconfig      2.0.3   2019-09-22 [1] CRAN (R 4.1.0)
##  pkgload        1.2.1   2021-04-06 [1] CRAN (R 4.1.0)
##  polyclip       1.10-0  2019-03-14 [1] CRAN (R 4.1.0)
##  prettyunits    1.1.1   2020-01-24 [1] CRAN (R 4.1.0)
##  processx       3.5.2   2021-04-30 [1] CRAN (R 4.1.0)
##  ps             1.6.0   2021-02-28 [1] CRAN (R 4.1.0)
##  purrr        * 0.3.4   2020-04-17 [1] CRAN (R 4.1.0)
##  R6             2.5.1   2021-08-19 [1] CRAN (R 4.1.0)
##  rappdirs       0.3.3   2021-01-31 [1] CRAN (R 4.1.0)
##  Rcpp           1.0.7   2021-07-07 [1] CRAN (R 4.1.0)
##  readr        * 2.0.1   2021-08-10 [1] CRAN (R 4.1.0)
##  readxl         1.3.1   2019-03-13 [1] CRAN (R 4.1.0)
##  remotes        2.4.0   2021-06-02 [1] CRAN (R 4.1.0)
##  reprex         2.0.1   2021-08-05 [1] CRAN (R 4.1.0)
##  rlang          0.4.11  2021-04-30 [1] CRAN (R 4.1.0)
##  rmarkdown      2.10    2021-08-06 [1] CRAN (R 4.1.0)
##  rprojroot      2.0.2   2020-11-15 [1] CRAN (R 4.1.0)
##  rstudioapi     0.13    2020-11-12 [1] CRAN (R 4.1.0)
##  rvest          1.0.1   2021-07-26 [1] CRAN (R 4.1.0)
##  sass           0.4.0   2021-05-12 [1] CRAN (R 4.1.0)
##  scales         1.1.1   2020-05-11 [1] CRAN (R 4.1.0)
##  selectr        0.4-2   2019-11-20 [1] CRAN (R 4.1.0)
##  sessioninfo    1.1.1   2018-11-05 [1] CRAN (R 4.1.0)
##  SnowballC      0.7.0   2020-04-01 [1] CRAN (R 4.1.0)
##  stringi        1.7.3   2021-07-16 [1] CRAN (R 4.1.0)
##  stringr      * 1.4.0   2019-02-10 [1] CRAN (R 4.1.0)
##  testthat       3.0.4   2021-07-01 [1] CRAN (R 4.1.0)
##  textdata       0.4.1   2020-05-04 [1] CRAN (R 4.1.0)
##  tibble       * 3.1.3   2021-07-23 [1] CRAN (R 4.1.0)
##  tidygraph      1.2.0   2020-05-12 [1] CRAN (R 4.1.0)
##  tidyr        * 1.1.3   2021-03-03 [1] CRAN (R 4.1.0)
##  tidyselect     1.1.1   2021-04-30 [1] CRAN (R 4.1.0)
##  tidytext     * 0.3.1   2021-04-10 [1] CRAN (R 4.1.0)
##  tidyverse    * 1.3.1   2021-04-15 [1] CRAN (R 4.1.0)
##  tokenizers     0.2.1   2018-03-29 [1] CRAN (R 4.1.0)
##  tweenr         1.0.2   2021-03-23 [1] CRAN (R 4.1.0)
##  tzdb           0.1.2   2021-07-20 [1] CRAN (R 4.1.0)
##  usethis        2.0.1   2021-02-10 [1] CRAN (R 4.1.0)
##  utf8           1.2.2   2021-07-24 [1] CRAN (R 4.1.0)
##  vctrs          0.3.8   2021-04-29 [1] CRAN (R 4.1.0)
##  viridis        0.6.1   2021-05-11 [1] CRAN (R 4.1.0)
##  viridisLite    0.4.0   2021-04-13 [1] CRAN (R 4.1.0)
##  vroom          1.5.4   2021-08-05 [1] CRAN (R 4.1.0)
##  widyr        * 0.1.4   2021-08-12 [1] CRAN (R 4.1.0)
##  withr          2.4.2   2021-04-18 [1] CRAN (R 4.1.0)
##  xfun           0.25    2021-08-06 [1] CRAN (R 4.1.0)
##  xml2           1.3.2   2020-04-23 [1] CRAN (R 4.1.0)
##  yaml           2.2.1   2020-02-01 [1] CRAN (R 4.1.0)
##
## [1] /Library/Frameworks/R.framework/Versions/4.1/Resources/library

1. There are a number of ways to obtain the lyrics for the entire soundtrack.
One approach is to use rvest and web scraping to extract the lyrics from sources online. However, here I used the Genius API and geniusr to systematically collect the lyrics from an authoritative (and legal) source. The code below was used to obtain the lyrics for all the songs. Note that you need to authenticate using an API token in order to use this code.

library(geniusr)

# Genius album ID number
hamilton_id <- 131575

# retrieve track list
hamilton_tracks <- get_album_tracklist_id(album_id = hamilton_id)

# retrieve song lyrics
hamilton_lyrics <- hamilton_tracks %>%
  mutate(lyrics = map(.x = song_lyrics_url, get_lyrics_url))

# unnest and clean-up
hamilton <- hamilton_lyrics %>%
  unnest(cols = lyrics, names_repair = "universal") %>%
  select(song_number, line, section_name, song_name) %>%
  group_by(song_number) %>%
  # add line number
  mutate(line_num = row_number()) %>%
  # reorder columns and convert speaker to title case
  select(song_number, song_name, line_num, line, speaker = section_name) %>%
  mutate(
    speaker = str_to_title(speaker),
    line = str_replace_all(line, "’", "'")
  ) %>%
  # write to disk (the path argument is deprecated as of readr 1.4.0; use file)
  write_csv(file = here("static", "data", "hamilton.csv"))

## New names:
## * song_lyrics_url -> song_lyrics_url...3
## * artist_name -> artist_name...7
## * artist_name -> artist_name...13
## * song_lyrics_url -> song_lyrics_url...14

glimpse(hamilton)

## Rows: 0
## Columns: 5
## Groups: song_number [0]
## $ song_number <dbl>
## $ song_name   <chr>
## $ line_num    <int>
## $ line        <chr>
## $ speaker     <chr>

2. Though lyrics’ length is not always a good measure of a musical’s pacing.
3. I told you filtering joins would be useful one day, but you didn’t believe me!