Extract information from raw text (emails) by applying R, friendly regular expressions {rex} and tidy concepts

Kristijan Bakaric

2019-08-25

Motivation
Let's begin… by introducing data
Grand finale via ggplot

Motivation

Since my main focus here is to learn the basics of friendly regular expressions with {rex} and to process raw email text in a tidy fashion, I will completely ignore packages that already deal with email data processing, such as REmail.

Secondly, when I first opened a tutorial that introduces Python regex for data scientists on the same dataset, the Fraudulent Email Corpus from Kaggle, I found the amount of looping, control flow and unfriendly regular expressions in it a bit intimidating. It was also a bit of a long read for a Sunday's attention span :)

…therefore, I present my newbie approach in R:

Let's begin… by introducing data

We will use the Fraudulent Email Corpus from Kaggle. It contains thousands of phishing emails sent between 1998 and 2007. They’re pretty entertaining to read. You can find the corpus here.

Load libraries

#install.packages("dplyr")
#install.packages("rex")
#install.packages("tidytext")
#install.packages("printr")

library(dplyr) # for data wrangling
library(stringr) # stringr: Simple, Consistent Wrappers for Common String Operations
library(rex) # Generate a regular expression.
rex_mode() # While within rex mode, functions used within the rex function are attached, so one can get e.g. auto-completion within editors.
library(tidytext) # This package implements tidy data principles to make many text mining tasks easier, more effective, and consistent with tools already in wide use.
library(printr) # Print R objects in knitr documents nicely
library(lubridate) # for parsing date strings

# DT::datatable() # will be called explicitly for nice html tables

Import the downloaded data by reading all text lines

# devtools::install_github("mkearney/kaggler")
# get api token from kaggle
# "https://www.kaggle.com/rtatman/fraudulent-email-corpus/downloads/fradulent_emails.txt/"
# kaggler::kgl_auth(username = "username", key = "key")
#kaggler::kgl_datasets_list(owner_dataset = "rtatman/fraudulent-email-corpus")

fradulent_emails <- readLines("dl/fradulent_emails.txt")

Now we will use the rex package to build our first simple text match, From r, a pattern that acts as an “email splitter”. The unnest_tokens function from {tidytext}, which relies on the tokenizers package, then splits the text column into tokens (in our case, separate emails) so that the table has one token per row. Note that unnest_tokens converts tokens to lowercase by default, so everything extracted below is lowercase.

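# Convert to a one-row table. The lines are collapsed into a single string first,
# because recent tidytext releases no longer collapse rows before tokenizing
# (an assumption about the installed version; harmless with older releases)
df <- tibble(text = paste(fradulent_emails, collapse = "\n"))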

# our regex token as email splitter
re <- rex(
   "From r" 
 )
d_split <- df %>%
  unnest_tokens(emails, text, token = "regex", pattern = re)

We managed to get 3977 emails.
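To see what the splitting step does, here is a minimal sketch on a made-up two-message string (the addresses are invented for illustration). The text is passed as a single string so that the result does not depend on how the installed tidytext version collapses rows.

```r
library(dplyr)
library(rex)
library(tidytext)

# Toy mailbox: two messages, each introduced by the "From r" marker
toy <- tibble(text = paste(
  "From r  Wed Oct 30 21:41:56 2002",
  "From: jane@example.com",
  "From r  Thu Oct 31 08:11:39 2002",
  "From: joe@example.org",
  sep = "\n"
))

# Split on the marker: each message becomes its own row
toy_split <- toy %>%
  unnest_tokens(emails, text, token = "regex", pattern = rex("From r"))

nrow(toy_split)   # one row per message
toy_split$emails  # the text comes back in lowercase (to_lower = TRUE by default)
```

Passing to_lower = FALSE to unnest_tokens would keep the original case, at the cost of having to match both "From" and "from" style variants later on.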

Now that we have one email per row of an email column, we can start extracting chunks of information from each email and store it as related columns.

What are the name and email of the sender?

Task: Find and extract the line beginning with “From:”.

regular expression via rex: (?:From|from).*

#Define the regular expression:
re <- rex(
  or("From", "from"),
  shortcuts$anything,
  shortcuts$newline
)
# store in a tibble as a new column "from_line"
d_split <- d_split %>% 
  mutate(from_line = str_extract(string = emails, pattern = re))

DT::datatable(d_split[1:5,] %>% select(from_line))
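One thing to keep in mind: the pattern above matches the first "from" anywhere in the email, not only at the start of a line. The sketch below, on a made-up email, contrasts it with an anchored alternative written with stringr's regex() helper. The anchored pattern is my own assumption about what "beginning with From:" should mean, not part of the original approach.

```r
library(stringr)
library(rex)

# Toy email (lowercase, as produced by the splitting step); addresses are made up
email <- paste(
  "received: from mail.example.com by mx.example.org",
  "from: \"jane doe\" <jane@example.com>",
  "to: bob@example.org",
  "",
  sep = "\n"
)

# The loose pattern from the text stops at the first "from" it meets
loose <- rex(or("From", "from"), shortcuts$anything, shortcuts$newline)
str_extract(email, loose)

# Anchored alternative: "from:" must start a line (multiline mode)
anchored <- regex("^from:.*$", multiline = TRUE)
str_extract(email, anchored)
```

Whether the loose pattern is good enough depends on the header order in the corpus, so it is worth spot-checking a few rows of from_line.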

Task: Find and extract sender email.

regular expression via rex: [[:alnum:]][^[:blank:]]*@.*[[:alnum:]]

# define rex
re_email <- rex(
  shortcuts$alnum,
  shortcuts$any_non_blanks,
  "@",
  shortcuts$anything,
  shortcuts$alnum)

# test
str_view(string = d_split[1, ]$from_line , pattern = re_email)

# store extracted email in "s_email" column
d_split <- d_split %>% 
  mutate(s_email = str_extract(string = from_line, pattern = re_email))

DT::datatable(d_split[1:5,] %>% select(from_line, s_email))
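Because anything after the @ is greedy, the email pattern can run past the address when more text follows on the same line. The sketch below shows this on a made-up line and compares it with a tighter variant (re_email_tight is a hypothetical name of mine) that forbids blanks after the @ as well.

```r
library(stringr)
library(rex)

# Pattern from the text: anything (greedy) between "@" and the last alnum
re_email_loose <- rex(shortcuts$alnum, shortcuts$any_non_blanks, "@",
                      shortcuts$anything, shortcuts$alnum)

# Tighter variant: only non-blank characters are allowed after "@"
re_email_tight <- rex(shortcuts$alnum, shortcuts$any_non_blanks, "@",
                      shortcuts$any_non_blanks, shortcuts$alnum)

from_line <- "from: jane@example.com (jane doe)\n"

str_extract(from_line, re_email_loose)  # runs on into the trailing comment
str_extract(from_line, re_email_tight)  # stops at the end of the address

# The tighter variant also copes with the angle-bracket form
str_extract("from: \"jane doe\" <jane@example.com>\n", re_email_tight)
```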

Task: Find and extract the sender name, removing :, < and " from the string.

regular expression via rex: :.*<

# define rex
re_name <- rex(
  ":",
  shortcuts$anything,
  "<"
)
#test
str_view(string = d_split[1, ]$from_line , pattern = re_name)

# store extracted sender name in "s_name" column
d_split <- d_split %>% 
  mutate(s_name = str_extract(string = from_line , pattern = re_name),
    s_name = str_remove_all(string = s_name, pattern = regex(":|<|\""))) # plain regex with the or (|) operator, for the sake of simplicity

DT::datatable(d_split[1:5,] %>%  select(from_line, s_name))
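Removing the three characters still leaves the blanks that surrounded them. A small sketch on a made-up line, with str_squish() added as an optional extra step that is not part of the code above:

```r
library(stringr)

# Toy "from" line; the name and address are invented
from_line <- "from: \"jane doe\" <jane@example.com>\n"

raw_name <- str_extract(from_line, ":.*<")       # everything from ":" to "<"
s_name <- str_remove_all(raw_name, ":|<|\"")     # drop ":", "<" and the quotes
s_name_clean <- str_squish(s_name)               # trim and collapse whitespace

s_name
s_name_clean
```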

What is the recipient email?

Task: Find and extract recipient line

regular expression via rex: (?:to:|To).*

# another approach: pipe the emails column into re_matches(), which returns
# a data frame with one column per named capture group (here "to_line"),
# then bind that column to the rest of the table

to_line_column <- d_split$emails %>% 
  re_matches(rex(
    shortcuts$newline,
    or("to:","To"),
    capture(shortcuts$anything, name = "to_line"),
    shortcuts$newline
  ))

# bind to_line_column
d_split <- bind_cols(d_split, to_line_column)

DT::datatable(d_split[1:5,] %>%  select(to_line))
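As a rough illustration of what re_matches() hands back, the sketch below runs a similar pattern on two made-up emails, one of which has no recipient line. Emails without a match come back as NA, which keeps the rows aligned for bind_cols().

```r
library(rex)

# Two toy emails (addresses invented); the second has no "to:" line
emails <- c(
  "from: a@example.com\nto: bob@example.org\nsubject: hi\n",
  "from: x@example.com\nsubject: no recipient\n"
)

# Named capture group -> a data frame column called "to_line"
to_df <- re_matches(emails, rex(
  shortcuts$newline,
  "to:",
  capture(shortcuts$anything, name = "to_line"),
  shortcuts$newline
))

to_df
```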

Task: Find and extract recipient email.

The regular expression is reused from the sender email step.

re_email # reuse the email regex from the sender step

## [[:alnum:]][^[:blank:]]*@.*[[:alnum:]]

str_extract_all(string = d_split[461, ]$to_line , pattern = re_email)

## [[1]]
## [1] "patelnikes4you@yahoo.ca"

# extract recipient email and store it in the "r_email" column
d_split <- d_split %>% 
  mutate(r_email = str_extract(string = to_line, pattern = re_email))

DT::datatable(d_split[461:466,] %>%  select(to_line, r_email))

What is the date stamp on each email?

Task: Find the line containing the date, then extract the date from it.

regular expression via rex for extracting the date (equivalent form): [[:digit:]]+[[:space:]][[:alpha:]]+[[:space:]][[:digit:]]{4}

# Step 1: find the line with the date
re_date_line <- rex(
  or("Date:","date"),
  shortcuts$anything,
  shortcuts$newline
)
# test
str_view(string = d_split[1, ]$emails , pattern = re_date_line)

# store in tibble as "date_line" column
d_split <- d_split %>% 
  mutate(date_line = str_extract(string = emails, pattern = re_date_line))

# Step 2: extract the date (day, month name, four-digit year) from that line
re_date <- rex(
    shortcuts$digits,
    shortcuts$space,
    shortcuts$alphas,
    shortcuts$space,
    n_times(shortcuts$digit, n = 4)
  )

str_extract(string = d_split[1, ]$date_line , pattern = re_date)

## [1] "31 oct 2002"

# parse the extracted string into a Date with lubridate
d_split <- d_split %>% 
  mutate(date_sent = str_extract(string = date_line, pattern = re_date),
    date_sent = lubridate::dmy(date_sent))

DT::datatable(d_split[1:5,] %>%  select(date_line, date_sent))
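The two date steps can be tried in isolation on a made-up date line, as in the sketch below. It assumes an English locale for the month abbreviation, and it passes quiet = TRUE so that a line without a recognisable date quietly becomes NA.

```r
library(stringr)
library(rex)
library(lubridate)

# Day, blank, month name, blank, four-digit year
re_date <- rex(
  shortcuts$digits,
  shortcuts$space,
  shortcuts$alphas,
  shortcuts$space,
  n_times(shortcuts$digit, n = 4)
)

# Toy date line in the lowercase form produced by the splitting step
date_line <- "date: thu, 31 oct 2002 02:38:20 +0000\n"

date_chr <- str_extract(date_line, re_date)
date_sent <- dmy(date_chr, quiet = TRUE)

# A line without a matching date gives NA instead of an error
date_missing <- dmy(str_extract("date: unknown\n", re_date), quiet = TRUE)
```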

Grand finale via ggplot

Let's get one quick insight from the processed table: how many emails were sent per year?

library(ggplot2)
library(hrbrthemes)

p <- d_split %>% 
  mutate(year = lubridate::year(date_sent)) %>% 
  group_by(year) %>% 
  count() %>% 
  ggplot() +
  geom_line(aes(x = year, y = n), color = "forestgreen", size = 1) +
  theme_modern_rc() +
  scale_y_log10()+
  labs(y = "Count of emails (log10-transformed)",
    title = "Count of Emails Over Years")

plotly::ggplotly(p)
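The aggregation behind the plot can be checked on its own. The sketch below uses a small made-up table and drops rows whose date could not be parsed before counting; that filtering step is my own addition, not something the plot code above does.

```r
library(dplyr)
library(lubridate)

# Toy table: three parsed dates and one email whose date failed to parse
toy <- tibble(date_sent = as.Date(c("2002-10-31", "2002-11-02", "2003-01-15", NA)))

per_year <- toy %>%
  filter(!is.na(date_sent)) %>%       # drop unparsed dates (assumption)
  count(year = year(date_sent))       # one row per year, count in column n

per_year
```

Without the filter, the unparsed emails would show up as an extra row with year NA, which ggplot would then drop from the line with a warning.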