Since my main focus here is to learn the basics of friendly regular expressions with {rex} and to process raw text from emails in a tidy fashion, I will completely ignore existing packages for email data processing, such as REmail.
Secondly, when I first opened the tutorial that introduces Python regex for data scientists on the same Fraudulent Email Corpus from Kaggle, I found it a bit intimidating: lots of looping, control flow, and unfriendly regular expressions. It was also a rather long read for a Sunday attention span :)
…therefore, I present my newbie approach in R:
We will use the Fraudulent Email Corpus from Kaggle. It contains thousands of phishing emails sent between 1998 and 2007. They’re pretty entertaining to read. You can find the corpus here.
#install.packages("dplyr")
#install.packages("rex")
#install.packages("tidytext")
#install.packages("printr")
library(dplyr) # for data wrangling
library(stringr) # stringr: Simple, Consistent Wrappers for Common String Operations
library(rex) # Generate a regular expression.
rex_mode() # While within rex mode, functions used within the rex function are attached, so one can get e.g. auto-completion within editors.
library(tidytext) # This package implements tidy data principles to make many text mining tasks easier, more effective, and consistent with tools already in wide use.
library(printr) # Print R objects in knitr documents nicely
library(lubridate) # for parsing date strings
# DT::datatable() is called explicitly for nice HTML tables
# devtools::install_github("mkearney/kaggler")
# get api token from kaggle
# "https://www.kaggle.com/rtatman/fraudulent-email-corpus/downloads/fradulent_emails.txt/"
# kaggler::kgl_auth(username = "username", key = "key")
#kaggler::kgl_datasets_list(owner_dataset = "rtatman/fraudulent-email-corpus")
fradulent_emails <- readLines("dl/fradulent_emails.txt")
Now we will use the rex package to build our first simple text match, "From r", a pattern that acts as an "email splitter". We split the text column into tokens (in our case, separate emails) with unnest_tokens() from {tidytext}, which relies on the tokenizers package and restructures the table into one token per row.
# Convert to a table
df <- tibble(text = fradulent_emails)
# our regex token as email splitter
re <- rex(
"From r"
)
d_split <- df %>%
unnest_tokens(emails, text, token = "regex", pattern = re)
We managed to get 3977 emails.
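To make the splitting step concrete, here is a minimal sketch on a toy input. The lines in raw_lines are made up for illustration, and the sketch uses stringr::str_split() rather than unnest_tokens(), because whether unnest_tokens() joins rows before splitting may depend on the tidytext version and its collapse argument. Joining the lines explicitly avoids relying on that behavior.

```r
library(stringr)

# Toy stand-in for readLines() output: two messages, one line per element
raw_lines <- c(
  "From r  Wed Oct 30 21:41:56 2002",
  "From: \"Mr. A\" <a@example.com>",
  "To: b@example.com",
  "From r  Thu Oct 31 08:11:39 2002",
  "From: \"Mrs. C\" <c@example.com>"
)

# Join all lines into one string, then split on the literal separator
whole_text <- paste(raw_lines, collapse = "\n")
emails <- str_split(whole_text, fixed("From r"))[[1]]

# Drop the empty piece that precedes the first separator
emails <- emails[emails != ""]

length(emails)
```

Each element of emails now holds one message, headers included, which is the shape the rest of the post works with.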
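Now that we have one email per row in the emails column, we can start extracting chunks of information from each email and store them as related columns. Note that unnest_tokens() lowercases text by default (to_lower = TRUE), which is why the patterns below also match lowercase variants such as "from".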
Task: Find and extract the line beginning with “From:”.
regular expression via rex: (?:From|from).*
#Define the regular expression:
re <- rex(
or("From", "from"),
shortcuts$anything,
shortcuts$newline
)
# store in a tibble as a new column "from_line"
d_split <- d_split %>%
mutate(from_line = str_extract(string = emails, pattern = re))
DT::datatable(d_split[1:5,] %>% select(from_line))
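As a quick check of what this pattern picks up, the sketch below applies it to a single made-up email string (the content of email is hypothetical, lowercased the way tokenized text would be). Because the pattern ends with a newline, the extracted line keeps its trailing line break, so trimming it is a reasonable extra step.

```r
library(rex)
library(stringr)

# Same pattern as above: "From" or "from", then the rest of the line
re_from <- rex(
  or("From", "from"),
  shortcuts$anything,
  shortcuts$newline
)

# Hypothetical email text, already lowercased
email <- "  wed oct 30 21:41:56 2002\nfrom: \"mr. a\" <a@example.com>\nto: b@example.com\n"

# Extract the first matching line and trim the trailing newline
from_line <- str_trim(str_extract(email, re_from))
from_line
```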
Task: Find and extract sender email.
regular expression via rex: [[:alnum:]][^[:blank:]]*@.*[[:alnum:]]
# define rex
re_email <- rex(
shortcuts$alnum,
shortcuts$any_non_blanks,
"@",
shortcuts$anything,
shortcuts$alnum)
# test
str_view(string = d_split[1, ]$from_line , pattern = re_email)
# store extracted email in "s_email" column
d_split <- d_split %>%
mutate(s_email = str_extract(string = from_line, pattern = re_email))
DT::datatable(d_split[1:5,] %>% select(from_line, s_email))
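One thing worth knowing about this pattern is that the part after the "@" is greedy: it runs to the last alphanumeric character on the line. The sketch below shows the effect on two made-up lines. The stricter pattern re_strict is my own hypothetical alternative, not part of the original approach, and is written as a plain regex string for brevity.

```r
library(stringr)

# The pattern from above, written as a plain regex string
re_greedy <- "[[:alnum:]][^[:blank:]]*@.*[[:alnum:]]"

# Hypothetical alternative: after the "@", stop at whitespace or ">"
re_strict <- "[[:alnum:]][^[:blank:]]*@[^[:blank:]>]*[[:alnum:]]"

# Made-up header lines: one clean, one with trailing text
clean_line <- "from: \"mr. a\" <a@example.com>"
noisy_line <- "from: <a@example.com> via relay 7"

clean_greedy <- str_extract(clean_line, re_greedy)  # address only
noisy_greedy <- str_extract(noisy_line, re_greedy)  # address plus trailing text
noisy_strict <- str_extract(noisy_line, re_strict)  # address only
```

On lines that end right after the address, both patterns agree, so the greedy version is fine as long as the From line has that shape.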
Task: Find and extract the sender name, removing :, < and " from the string.
regular expression via rex: :.*<
# define rex
re_name <- rex(
":",
shortcuts$anything,
"<"
)
#test
str_view(string = d_split[1, ]$from_line , pattern = re_name)
# store extracted sender name in "s_name" column
d_split <- d_split %>%
mutate(s_name = str_extract(string = from_line , pattern = re_name),
s_name = str_remove_all(string = s_name, pattern = regex(":|<|\""))) ## plain regex alternation (|) is used here for simplicity
DT::datatable(d_split[1:5,] %>% select(from_line, s_name))
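The two-step cleanup can be traced on a single made-up line. In this sketch the pattern is a plain regex string, and the final str_squish() call is an addition of mine (an assumption about what a tidy name should look like), since removing the delimiters leaves a leading and a trailing space behind.

```r
library(stringr)

# Made-up From line for illustration
from_line <- "from: \"mr. a\" <a@example.com>"

# Step 1: take everything from the first ":" up to the last "<"
raw_name <- str_extract(from_line, ":.*<")

# Step 2: remove the delimiters and the quotes
no_delims <- str_remove_all(raw_name, ":|<|\"")

# Optional extra step: trim the leftover whitespace
s_name <- str_squish(no_delims)
s_name
```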
Task: Find and extract recipient line
regular expression via rex: (?:to:|To).*
# Another approach: pipe the emails column into re_matches(), which returns the
# named capture group as a data frame column that we then bind to the rest of the table
to_line_column <- d_split$emails %>%
re_matches(rex(
shortcuts$newline,
or("to:","To"),
capture(shortcuts$anything, name = "to_line"),
shortcuts$newline
)) # re_matches() already returns a data frame with a to_line column
# bind to_line_column
d_split <- bind_cols(d_split, to_line_column)
DT::datatable(d_split[1:5,] %>% select(to_line))
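For readers new to re_matches(), the sketch below runs the same pattern on one made-up email string. With a named capture group, re_matches() returns a data frame whose column carries the group name. The captured text keeps the space that follows "to:", so the example trims it afterwards; that trimming is my addition.

```r
library(rex)

# Hypothetical email text, already lowercased
email <- "from: a@example.com\nto: b@example.com\nsubject: hello\n"

# Same pattern as above, with the rest of the line in a named capture group
re_to <- rex(
  shortcuts$newline,
  or("to:", "To"),
  capture(shortcuts$anything, name = "to_line"),
  shortcuts$newline
)

# One row per input string, one column per capture group
to_line_column <- re_matches(email, re_to)
names(to_line_column)
trimws(to_line_column$to_line)
```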
Task: Find and extract recipient email.
The regular expression is reused from the sender email step.
## [[:alnum:]][^[:blank:]]*@.*[[:alnum:]]
## [[1]]
## [1] "patelnikes4you@yahoo.ca"
# extract the recipient email and store it in the r_email column
d_split <- d_split %>%
mutate(r_email = str_extract(string = to_line, pattern = re_email))
DT::datatable(d_split[461:466,] %>% select(to_line, r_email))
Task: Find and extract Date of the email after finding the line with the Date itself.
regular expression via rex for extracting date: [[:digit:]]+[[:space:]][[:alpha:]]+[[:space:]][[:digit:]]{4}
# Step 1: find the line with the date
re_date_line <- rex(
or("Date:","date"),
shortcuts$anything,
shortcuts$newline
)
# test
str_view(string = d_split[1, ]$emails , pattern = re_date_line)
# store in tibble as "date_line" column
d_split <- d_split %>%
mutate(date_line = str_extract(string = emails, pattern = re_date_line))
# Step 2: extract the date itself from the date line
re_date <- rex(
shortcuts$digits,
shortcuts$space,
shortcuts$alphas,
shortcuts$space,
n_times(shortcuts$digit, n = 4)
)
str_extract(string = d_split[1, ]$date_line , pattern = re_date)
## [1] "31 oct 2002"
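The extracted text is still a string, so it has to be parsed before it can be used as a date. A minimal sketch, assuming day-month-year order and an English locale for the month abbreviation: lubridate::dmy() is my choice here because lubridate is loaded above for date parsing, and the date line is made up for illustration. The pattern is written as a plain regex string.

```r
library(stringr)
library(lubridate)

# Made-up date line, lowercased like the tokenized emails
date_line <- "date: thu, 31 oct 2002 02:38:20 +0000\n"

# Day, month abbreviation, four-digit year
re_date_plain <- "[[:digit:]]+[[:space:]][[:alpha:]]+[[:space:]][[:digit:]]{4}"

# Pull out the date text, then parse it as day-month-year
date_text <- str_extract(date_line, re_date_plain)
date_sent <- dmy(date_text)

year(date_sent)
```

Strings that do not parse come back as NA, so it is worth counting missing values before aggregating.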
Let's get one quick insight from the processed table: how many emails were sent per year?
library(ggplot2)
library(hrbrthemes)
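p <- d_split %>%
# date_sent is not created above; assumed here to be the date parsed from date_line
mutate(date_sent = lubridate::dmy(str_extract(date_line, re_date)),
year = lubridate::year(date_sent)) %>%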
group_by(year) %>%
count() %>%
ggplot() +
geom_line(aes(x = year, y = n), color = "forestgreen", size = 1) +
theme_modern_rc() +
scale_y_log10()+
labs(y = "Count of emails (log10 scale)",
title = "Count of Emails Over Years")
plotly::ggplotly(p)