Hugo Future Imperfect Slim

Quantifying my life

Trying to make sense of all of the data

Andrew Klein

14 minutes read

Introduction

This post will focus on answering the question “Who’s talking the most?”

Alright, now that we’ve gotten all the boring pre-processing done, let’s start to explore the data. The original question I tried to answer was “who’s talking the most?” There are a few ways to look at that. This post will focus on how I built up a particular graph and my thought process, as opposed to immediately presenting the final graph. It also became an excuse for me to play with some packages I haven’t had a chance to mess around with before, such as gganimate, rayshader, & the tidyverts universe of packages.

Note some of the graphs can be hard to read, so click to expand them.

Step 1: Load the data

Let’s refresh ourselves on what the data looks like:

library(tidyverse) # CRAN v1.3.0
library(scales) # CRAN v1.1.0
library(ggrepel) # CRAN v0.8.2
library(gganimate) # CRAN v1.0.5
library(sugrrants) # CRAN v0.2.6
library(patchwork) # CRAN v1.0.0
library(lubridate) # CRAN v1.7.4
library(hrbrthemes) # CRAN v0.8.0 
library(rayshader) # CRAN v0.13.7
library(tsibble) # CRAN v0.8.6
library(feasts) # CRAN v0.1.3

theme_set(theme_ipsum())
dat_loc <- '/data/facebook_data/analysis/r/'
messages <- read_rds(paste0(dat_loc, "all_data.RDS")) %>%
  filter(!is.na(content)) %>%  # Removing some rows that aren't related to content
  filter(message_month != max(message_month)) %>% # Removing a partial month 
  filter(sender_name != "Eddard Stark") # Removing someone who 'died' in Season 1

messages %>% head()
## # A tibble: 6 x 17
##   sender_name timestamp_ms content reactions type  photos gifs  videos
##   <chr>              <dbl> <chr>   <list>    <chr> <list> <lis> <list>
## 1 Petyr Bael…      1.59e12 Sweet … <df[,2] … Gene… <NULL> <NUL… <NULL>
## 2 Theon Grey…      1.59e12 Sleep … <NULL>    Gene… <NULL> <NUL… <NULL>
## 3 Theon Grey…      1.59e12 Lolol   <NULL>    Gene… <NULL> <NUL… <NULL>
## 4 Petyr Bael…      1.59e12 I have… <df[,2] … Gene… <NULL> <NUL… <NULL>
## 5 Petyr Bael…      1.59e12 We’re … <NULL>    Gene… <NULL> <NUL… <NULL>
## 6 Theon Grey…      1.59e12 Nice. … <df[,2] … Gene… <NULL> <NUL… <NULL>
## # … with 10 more variables: share$link <chr>, $share_text <chr>,
## #   sticker$uri <chr>, users <list>, files <list>, audio_files <list>,
## #   message_timestamp <dttm>, message_date <date>, message_month <date>,
## #   message_year <dbl>

The columns we’ll care most about for this analysis are sender_name, content, and the message_timestamp related columns.

Step 2: Total counts

The most straightforward way to answer this question is to just look at total count of messages. We see Theon Greyjoy has sent the most amount of messages, almost 10,000 then the next chattiest.

messages %>% 
  count(sender_name) %>%
  ggplot(aes(x = fct_reorder(sender_name,n) , y = n)) + 
  geom_bar(stat = 'identity') +
  scale_x_discrete("Sender") +
  scale_y_continuous("Total # of Messages Sent", label = comma) + 
  coord_flip() +
  labs(title = "# Of Messages by Sender")

Since I’m using ggplot2 3.3.0, I can actually take advantage of the new bi-directional geoms (as discussed here) and clean up some other code as well. I’m not sure where I picked up the habit, but I always would use geom_bar(stat = 'identity') instead of the simpler geom_col. The below gets us the same graph, but much cleaner code.

messages %>% 
  count(sender_name) %>%
  ggplot(aes(x = n, y = fct_reorder(sender_name,n))) + 
  geom_col() +
  scale_x_continuous("Total # of Messages Sent", label = comma) + 
  scale_y_discrete("Sender") +
  labs(title = "# Of Messages by Sender")

Step 3: Words and characters

Knowing the types of messages Theon Greyjoy sends, I know they tend to be lots of small bursts. Things might change if we count total number of words, or characters. Let’s add a few helper columns to the dataset first.

messages <- messages %>%
  mutate(n_words = str_count(content, '\\w+'),
         n_chars = nchar(content))

Quickly doing a sanity check to make sure the code works as expected.

messages %>% 
  select(content, n_words, n_chars) %>% 
  head()
## # A tibble: 6 x 3
##   content                                       n_words n_chars
##   <chr>                                           <int>   <int>
## 1 Sweet dreams Benjamin                               3      21
## 2 Sleep time                                          2      10
## 3 Lolol                                               1       5
## 4 I have no words                                     4      15
## 5 We’re gambling on people playing a video game       9      45
## 6 Nice. Man our.lives went downhill                   6      33

Next, let’s see if there is anyone that tends to write a lot more words / characters per message than everyone else.

counts <- messages %>% 
  group_by(sender_name) %>%
  summarize(n_messages = n(),
            n_words = sum(n_words),
            n_chars = sum(n_chars))

Finally, we can plot each of these new columns against the # of messages to see if things actually change.

words_plt <- ggplot(counts, aes(x = n_messages, y= n_words, label = sender_name)) +
  stat_smooth(method = 'lm', se = FALSE, color = 'lightgrey',
              linetype = 'dashed', formula = 'y ~ x') + 
  geom_point() +
  geom_label_repel(aes(color = sender_name)) +
  scale_x_continuous("# of Messages Sent", label = comma) +
  scale_y_continuous("# of Words Used", label =comma) +
  labs(title = "# of Messages vs. # of Words") +
  scale_color_discrete(guide = FALSE) 

chars_plt <- ggplot(counts, aes(x = n_messages, y= n_chars, label = sender_name)) +
  stat_smooth(method = 'lm', se = FALSE, color = 'lightgrey',
              linetype = 'dashed', formula = 'y ~ x') + 
  geom_point() +
  geom_label_repel(aes(color = sender_name)) +
  scale_x_continuous("# of Messages Sent", label = comma) +
  scale_y_continuous("# of Characters Used", label =comma) + 
  labs(title = "# of Messages vs. # of Characters") +
  scale_color_discrete(guide = FALSE) 
  
words_plt + chars_plt

Although Theon Greyjoy tends to use fewer words and characters than expected (the diagonal line), they are still 2nd most in both of these metrics

We see that Petyr Baelish is the most verbose then they talk; they consistently uses more words and characters given the # of messages they send than everyone else.

Step 6: Using tidyverts

Since this data is essentially a time-series dataset, we can use some of the tools from tidverts universe, which is the successor to the popular forecast package.

First, we need to convert the data from a tibble to a tsibble. Then, we can use a few functions from the feasts package to create some visuals pretty easily.

messages_by_month_sender_ts <-  messages_by_month_sender %>% 
  mutate(message_month = yearmonth(message_month)) %>% 
  as_tsibble(index = message_month, key = sender_name) %>%
  fill_gaps(n = 0)

messages_by_month_ts <-  messages_by_month %>% 
  mutate(message_month = yearmonth(message_month)) %>% 
  as_tsibble(index = message_month) %>%
  fill_gaps(n = 0)

Now that our data is stored as a tsibble, we can use gg_season to check for seasonality. First, let’s look at overall counts.

gg_season(messages_by_month_ts, y = n,  labels = 'right') +
  scale_y_continuous(name = "# Of Messages", label = comma) +
  xlab("Month") +
  ggtitle("Seasonality of Messages Sent")

We can clearly see that what we saw in the calendar view holds true most other years as well; August - November are the most active months. Interestingly, we see March of this year to be much more active than previous years. This is likely due to COVID-19. It only takes an global pandemic to get us all talking more than usual.

Next let’s see if there’s anyone in particular that bucks the overall seasonality trends.

gg_season(messages_by_month_sender_ts, y = n, labels = 'right') +
  scale_y_continuous(name = "# Of Messages", label = comma) +
  facet_wrap(~sender_name, scales = 'free_y', ncol = 3) +
  xlab("Month") +
  ggtitle("Seasonality of Messages Sent by Sender")

Overall, we see in general, people are getting more talkative over time (as shown by the purple line typically being the highest). What jumps out at me is Robb Stark & Samwell Tarley have huge upticks in counts in 2020. Although they’re talking the least in general, as noted by looking at the y-axis scales, they are very chatty this year. These two also are the two members that are not in the fantasy football league. Again, this is almost certainly related to the introduction of a non-fantasy football topic; COVID-19.

Step 7: Improve the original

Although I think the graph above tells a lot of information, it can be overwhelming to the reader and can be hard to read.

First, let’s focus just on recent years and cut out anything prior to 2018. We’re also going to add a column that will help us with labels. It will be the sender_name for the last month in the data, and NA for all other months.

messages_by_month_sender_recent <- messages_by_month_sender %>%
  filter(year(message_month) >= 2018) %>% 
  mutate(lbl = if_else(message_month == max(message_month), sender_name, NA_character_)) 

Next, let’s also start to play with the labels for names. My first attempt below is to use geom_text_repel from the ggrepel package to add labels directly on the graph.

ggplot(messages_by_month_sender_recent, aes(x = message_month, y = n, color = sender_name)) +
  geom_line() + 
  scale_x_date(name = "Month", date_breaks = '3 months', date_labels = "%b '%y") +
  scale_y_continuous(name = "# Of Messages", label = comma)  +
  scale_color_discrete(guide = FALSE) +
  geom_text_repel(aes(label = lbl)) 
## Warning: Removed 302 rows containing missing values (geom_text_repel).

This makes it easier to directly tie the name to the line, but it’s still too crowded. Adding a bit of hjust and expanding the limits out helps out here.

ggplot(messages_by_month_sender_recent, aes(x = message_month, y = n, color = sender_name)) +
  geom_line() + 
  scale_x_date(name = "Month", date_breaks = '3 months', date_labels = "%b '%y") +
  scale_y_continuous(name = "# Of Messages", label = comma)  +
  scale_color_discrete(guide = FALSE) +
  geom_text_repel(aes(label = lbl), hjust = -.1)  + 
  expand_limits(x = mdy("7/1/2020"))
## Warning: Removed 302 rows containing missing values (geom_text_repel).

I also found this blog post by Simon Jackson which takes advantage of the sec.axis to add names there.

last_month <- messages_by_month_sender_recent %>% 
  group_by(sender_name) %>% 
  arrange(desc(n)) %>% 
  top_n(1, message_month)


ggplot(messages_by_month_sender_recent, aes(x = message_month, y = n, color = sender_name)) +
  geom_line() + 
  scale_x_date(name = "Month", date_breaks = '3 months', date_labels = "%b '%y") +
  scale_y_continuous(name = "# Of Messages", label = comma,
                     sec.axis = sec_axis(~ ., 
                                         breaks = last_month$n, 
                                         labels = last_month$sender_name))  +
  scale_color_discrete(guide = FALSE)

I think this is an interesting way to solve this type of issue, but for this particular graph, it’s still too crowded and not color matched. I’m sure there’s a way to get the text to be the correct color, I just couldn’t easily figure it out.

I then tried the same thing, but logging the y-axis. There’s a lot of debate regarding log-scaling axes, especially these days with all of the COVID-19 graphs. I’m going to skip the debate for now and see what it looks like. Remember, the purpose of this analysis is to see who’s talking the most, so we care more about which lines are above each other more than the absolute difference between them.

ggplot(messages_by_month_sender_recent, aes(x = message_month, y = n, color = sender_name)) +
  geom_line() + 
  scale_x_date(name = "Month", date_breaks = '3 months', date_labels = "%b '%y") +
  scale_y_log10(name = "# Of Messages", label = comma)  +
  scale_color_discrete(guide = FALSE) +
  geom_text_repel(aes(label = lbl), hjust = -.1) +
  expand_limits(x = mdy("8/1/2020"))
## Warning: Removed 302 rows containing missing values (geom_text_repel).

This helps things, but is still a little crowded. The last improvement to the graph will use gganimate to help easily highlight how the values are changing over time

Step 8: Animation

g1 <- ggplot(messages_by_month_sender_recent,
             aes(x = message_month, y = n, color = sender_name)) +
  geom_line() + 
  geom_text_repel(aes(label = sender_name)) + 
  theme_ipsum() + 
  scale_x_date(name = "Month", date_breaks = '3 months', date_labels = "%b '%y") +
  scale_y_log10(name = "# Of Messages", label = comma)  +
  scale_color_discrete(guide = FALSE) +
  transition_reveal(message_month) +
  ease_aes('linear', interval  = .001)

animate(g1, renderer = gifski_renderer(), end_pause = 5, nframes = 400)

Overall, we can see that Theon Greyjoy & Walder Frey are consistently the top two posters.

Step 9: Bar Chart Race

Finally, because they seem to be trendy these days, I wanted to take a crack at animating those bar chart races you see everywhere these days. gganimate makes it pretty easy. I borrowed some tips from this blog post by https://twitter.com/coolbutuseless. I’d like to eventually replace their use of flags with images of the characters, but I’m not there yet.

First we need to get the data into shape. Note that since there are a few months where people don’t talk, we need to use the complete function to fill out the missing months and fill in values of 0.

race_dat <- messages %>%
  group_by(sender_name, message_month, .drop = FALSE) %>% 
  summarize(n = n()) %>% 
  ungroup() %>%
  complete(message_month,  sender_name, fill = list(n = 0)) %>% 
  group_by(sender_name) %>% 
  arrange(message_month) %>%
  mutate(cume_n = cumsum(n)) %>%
  ungroup() %>%
  group_by(message_month) %>%
  arrange(desc(cume_n)) %>%
  mutate(rank = row_number()) %>%
  ungroup()

Finally, we’re off to the races:

race_plt <- ggplot(race_dat, aes(group = sender_name, fill = sender_name)) + 
  geom_rect(
    aes(
      xmin = 0,
      xmax = cume_n,
      ymin = rank - 0.45,
      ymax = rank + 0.45
    )
  ) +

  geom_text(aes(y = rank, label = sender_name), x = -3000, 
            col = "gray13",  hjust = "right", size = 3) +
  scale_y_reverse() + 
  labs(x = '# Messages', fill = NULL) +  
  theme_bw() + 
  theme(
    axis.text.y     = element_blank(),
    axis.title.y    = element_blank(),
    axis.ticks.y    = element_blank(),
    legend.key.size = unit(2, 'cm'),
    axis.text       = element_text(size = 20),
    axis.title      = element_text(size = 20)
  ) +   
  scale_x_continuous(  
    limits = c(-10000, 110000),  
    breaks = c(seq(0, 110000, 10000)),
    label = comma
  ) +  
  scale_fill_discrete(guide = FALSE) +
  gganimate::transition_time(message_month) +
  ggtitle('Message Month: {month.name[month(frame_time)]} {year(frame_time)} ') 


animate(race_plt, width = 1000, height = 600, fps  = 10, nframes = 200)

Conclusion

That concludes the first real deep dive into the data. I hope you found some useful tricks for visualizing this type of data and hopefully learned about a new package or two. Next post we’ll dive deeper into the actual content of the messages being sent.

Postscript

Right before publishing this post, I remembered the flipbookr package and how it might be appropriate for this post. The package’s author, , recently posted this article on how to embed them into html pages. I was able to use this as a starting point to at least play around with the package. I think this is a great package for teaching, and hope to implement it next time I teach data visualization.

knitr::opts_chunk$set(echo = TRUE, warning = F, message = F, comment = "", fig.width = 12, fig.height = 12)
library(flipbookr)
d1<- messages %>% 
  group_by(message_month, sender_name) %>%
  count() %>% 
  ungroup() %>% 
  mutate(lbl = if_else(message_month == max(message_month), sender_name, NA_character_))

Say something

Comments

Nothing yet.

Recent posts

Categories

About

test