
Quantifying my life

Trying to make sense of all of the data

Andrew Klein



This is the first of a handful of posts where I will analyze data from a Facebook Messenger group chat.

Many of you probably have a group thread with your friends. My friends happen to use Facebook Messenger, and we’ve been using it for years, so there’s a lot to mine here.

This first post will focus on wrangling the data, while the follow ups will explore and visualize the data.

Note: to protect my friends' identities, I replaced all instances of their names with characters from Game of Thrones.

Step 1: Request the Data:

Facebook lets you request lots of information they keep on you. In order to download it, you can go to:

Settings –> Your Facebook Information –> Download Your Information

Select JSON as the format and make sure to select Messages.

Optionally, you can filter to specific date ranges to cut down on the size of the data. Unfortunately I can’t seem to find an option to download specific message threads, so I need to download them all, and then later extract just the one I care about.

Step 2: Inspect the Download:

After unzipping and digging through the file structure, it looks like each chat has its own folder, and within the folder, there are multiple json files containing the actual content.

ls /data/facebook_data/analysis/raw_json | head -n 5
## message_10.json
## message_11.json
## message_12.json
## message_13.json
## message_14.json

Step 3: Inspect an Individual File:

Now let’s see what exactly is contained in the json files and how much work it will be to get this data into a usable format.

import json

data_path = '/data/facebook_data/analysis/'
with open(data_path + 'raw_json/message_1.json') as f:
    json_file = json.load(f)
print(json_file.keys())

## dict_keys(['participants', 'messages', 'title', 'is_still_participant', 'thread_type', 'thread_path'])

Okay, so it looks like the messages key is what we want to focus on. What does that look like?

print(json.dumps(json_file['messages'][1], indent=2))
## {
##   "sender_name": "Stannis Baratheon",
##   "timestamp_ms": 1586813265532,
##   "content": "Guess he doesn\u00e2\u0080\u0099t",
##   "reactions": [
##     {
##       "reaction": "\u00f0\u009f\u0098\u0086",
##       "actor": "Theon Greyjoy"
##     }
##   ],
##   "type": "Generic"
## }

This looks pretty well structured. We have who sent the message, some sort of timestamp, the actual content, and additional information such as any reactions. One thing we'll need to keep in mind is that some keys at the message level have nested data inside of them, such as the reactions.
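To make that nesting concrete, here's a quick sketch (using a hypothetical inline message mirroring the one above) that flags which top-level keys hold nested values:

```python
# A hypothetical message mirroring the structure shown above
message = {
    "sender_name": "Stannis Baratheon",
    "timestamp_ms": 1586813265532,
    "content": "Guess he doesn't",
    "reactions": [{"reaction": "X", "actor": "Theon Greyjoy"}],
    "type": "Generic",
}

# Keys whose values are lists or dicts need special handling later
nested_keys = [k for k, v in message.items() if isinstance(v, (list, dict))]
print(nested_keys)  # ['reactions']
```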

Step 4: Understand Encoding Issues:

Unfortunately, the encoding seems to be a bit funky. Both the content and the reaction have some strange encoding issues going on: lots of \u's all around. I'm guessing the content should actually contain a ', but maybe it's a special quote that has a weird encoding?

After some googling, I came across this stackoverflow post. I was surprised Facebook would output improperly encoded results, but then again, this likely doesn’t drive them much (if any) revenue…

The key line here seems to be json.loads(data).encode('latin1').decode('utf8')

Now I just need to run this correction.
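As a sanity check, here's that round trip applied to the mangled string from the message above (just a sketch of the fix before wiring it into the full script):

```python
# The export stores UTF-8 bytes as if they were Latin-1 characters.
# Encoding back to Latin-1 recovers the raw bytes; decoding those
# bytes as UTF-8 yields the intended character.
mangled = "Guess he doesn\u00e2\u0080\u0099t"
fixed = mangled.encode('latin1').decode('utf8')
print(fixed)  # Guess he doesn’t
```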

Step 5: Fix Encoding Issues:

Unfortunately this errors out on numeric data, so I need to traverse the nested dictionary structure and only apply to the fields I care about. I’m sure there is a much more elegant solution here, but I took the brute force approach for now since I shouldn’t need to be doing this over and over (famous last words…).

import json
import glob

# Setup directories for files
input_file_loc = data_path + "raw_json/"
output_file_loc = data_path + "corrected_json/"
json_input_files = glob.glob(input_file_loc + "*.json")
json_output_files = [s.replace(input_file_loc, output_file_loc) for s in json_input_files]

# Create a function that cleans a json file
def clean(file_location):
  with open(file_location) as f:
    data = json.load(f)
  messages = data['messages']
  for index in range(len(messages)):
    for key in messages[index]:
      x = messages[index][key]
      if type(x) == str:
        x = x.encode('latin1').decode('utf8')
      if key == 'reactions':
        for r in x:
          r['reaction'] = r['reaction'].encode('latin1').decode('utf8')
      messages[index][key] = x
  return messages

for input_file, output_file in zip(json_input_files, json_output_files):
  clean_file = clean(input_file)
  with open(output_file, 'w') as json_file:
    json.dump(clean_file, json_file)
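For what it's worth, one such more elegant approach (not what I used above, just a sketch) is a recursive function that walks the whole structure and only re-encodes strings, sidestepping the errors on numeric data:

```python
def fix_encoding(obj):
    """Recursively re-encode every string in a nested JSON structure."""
    if isinstance(obj, str):
        return obj.encode('latin1').decode('utf8')
    if isinstance(obj, list):
        return [fix_encoding(item) for item in obj]
    if isinstance(obj, dict):
        return {key: fix_encoding(value) for key, value in obj.items()}
    return obj  # numbers, booleans, None pass through untouched

print(fix_encoding({"content": "doesn\u00e2\u0080\u0099t", "timestamp_ms": 1}))
```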

Now let’s load in that message and see what it looks like:

with open(data_path + 'corrected_json/message_1.json') as f:
    json_file = json.load(f)

print(json.dumps(json_file[1], indent=2))
## {
##   "sender_name": "Stannis Baratheon",
##   "timestamp_ms": 1586813265532,
##   "content": "Guess he doesn\u2019t",
##   "reactions": [
##     {
##       "reaction": "\ud83d\ude06",
##       "actor": "Theon Greyjoy"
##     }
##   ],
##   "type": "Generic"
## }

This may still look funky, but R handles it gracefully when reading it in, which is what we care about.

Step 6: Load and combine in R

Let’s look at how R rendered the original content. This is also a good chance to show off a great feature of RMarkdown notebooks: Python and R sharing a common environment. Since we already defined data_path above when inspecting the json files, we can access that path using py$data_path in the R chunk. This is a pretty trivial example, but it shows how you can use both languages pretty seamlessly in an analysis.

library(tidyverse) # CRAN v1.3.0
library(lubridate) # CRAN v1.7.4
library(jsonlite) # CRAN v1.6.1

fromJSON(paste0(py$data_path, '/raw_json/message_1.json'))$messages[2,"content"]
## [1] "Guess he doesnâ\u0080\u0099t"

Versus how it renders the corrected content:

fromJSON(paste0(py$data_path, '/corrected_json/message_1.json'))[2, 'content']
## [1] "Guess he doesn’t"

Finally, let’s load all json files into a single dataframe and store it off. I’m also creating a few helper date-related columns that will come in handy in the future.

all_data <- dir(paste0(py$data_path, "corrected_json/"), full.names = TRUE) %>% 
  map(fromJSON) %>% 
  tibble(json = .) %>% 
  unnest(cols = c(json)) %>% 
  mutate(message_timestamp  = .POSIXct(timestamp_ms/1000),
         message_date = date(message_timestamp),
         message_month = (floor_date(message_date, "month")),
         message_year = year(message_timestamp))
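For anyone more comfortable in Python, the millisecond-epoch conversion that .POSIXct(timestamp_ms/1000) performs looks like this (a standalone sketch using the timestamp from the sample message):

```python
from datetime import datetime, timezone

# timestamp_ms is milliseconds since the Unix epoch, so divide by 1000
# to get seconds before converting to a datetime
ts = datetime.fromtimestamp(1586813265532 / 1000, tz=timezone.utc)
print(ts.date())  # 2020-04-13
```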

all_data %>% head() 

write_rds(all_data, paste0(py$data_path, "r/all_data.RDS"))

# Also save off just the recent data from 2018 to present
all_data %>%
  filter(message_year >= 2018) %>% 
  write_rds(paste0(py$data_path, "r/recent_data.RDS"))


This concludes the inaugural post in this series (and blog!). I hope it was a useful demonstration of how to clean and load one’s Facebook data. In the next post, we’ll get into the fun stuff of actually creating visuals.
