Extract information from raw text (emails) by applying R, friendly regular expressions {rex} and tidy concepts

Kristijan Bakaric

2019-08-25


Motivation

Since my main focus here is to learn the basics of friendly regular expressions {rex} and of processing raw email text in a tidy fashion, I will completely ignore the available packages that deal with email data processing, for example REmail.

Secondly, when I first opened this tutorial, which introduces Python regex for data scientists on the same Fraudulent Email Corpus dataset from Kaggle, I found the amount of looping, control flow and unfriendly regular expressions in it a bit intimidating. In addition, it was a bit of a long read for a Sunday attention span :)

…therefore, I present you my newbie approach via R:

Let's begin… by introducing the data

We will use the Fraudulent Email Corpus from Kaggle. It contains thousands of phishing emails sent between 1998 and 2007. They’re pretty entertaining to read. You can find the corpus here.

Import the downloaded data by reading all text lines
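A minimal sketch of this step with base R's readLines(). The file and its contents are stand-ins written to a temporary file so that the snippet runs on its own; with the real corpus you would pass the path of the downloaded file instead.

```r
# Stand-in for the downloaded corpus: two tiny made-up messages in a temp file.
corpus_path <- tempfile(fileext = ".txt")
writeLines(
  c("From r  Wed Oct 30 21:41:56 2002",
    "From: \"Mr. Example\" <sender@example.com>",
    "",
    "Hello, first message.",
    "From r  Thu Oct 31 08:11:39 2002",
    "From: other@example.net",
    "",
    "Hello, second message."),
  corpus_path
)

# Read every line of the file into a character vector, one element per line.
raw_lines <- readLines(corpus_path, warn = FALSE)
length(raw_lines)
```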

Now we will use the rex package to build our first simple text match, From r, a pattern that acts as an “email splitter”. We then use the function unnest_tokens {tidytext}, which relies on the tokenizers package, to split the text column into tokens (in our case, separate emails), giving a table with one token per row.

We managed to get 3977 emails.
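The split can be sketched roughly as follows. The sample lines and the column names text and email are made up for illustration, and the splitter is written as a plain string rather than built with rex. The lines are first pasted into a single string so that the splitter sees the whole file at once.

```r
library(tidytext)

# Made-up stand-in for the lines read from the corpus file.
raw_lines <- c(
  "From r  Wed Oct 30 21:41:56 2002",
  "From: \"Mr. Example\" <sender@example.com>",
  "",
  "Hello, first message.",
  "From r  Thu Oct 31 08:11:39 2002",
  "From: other@example.net",
  "",
  "Hello, second message."
)

# One row holding the whole file as a single string.
corpus <- data.frame(text = paste(raw_lines, collapse = "\n"),
                     stringsAsFactors = FALSE)

# Split on the "From r" separator: one email per row.
# to_lower = TRUE is the default, so the resulting text is lower-cased.
emails <- unnest_tokens(corpus, output = email, input = text,
                        token = "regex", pattern = "From r",
                        to_lower = TRUE)
nrow(emails)
```

Because the text comes back lower-cased, the later patterns have to match lower-case headers such as from: and to:.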

Now that we have one email per row of an email column, we can start extracting chunks of information from each email and store it as related columns.


What are the name and email of the sender?

Task: Find and extract the line beginning with “From:”.

regular expression via rex: (?:From|from).*
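As a sketch, this pattern can be assembled with rex from readable pieces; or() and anything are rex helpers as I understand the package, and the email text below is invented. The extraction itself uses base R's regexpr() and regmatches(), which is my choice for illustration and not necessarily the function used in the original code.

```r
library(rex)

# Build the pattern from readable pieces instead of a hand-written string.
from_pattern <- rex(or("From", "from"), anything)

# Invented, already lower-cased email text with one header per line.
email <- paste(
  "return-path: <sender@example.com>",
  "from: \"mr. example\" <sender@example.com>",
  "to: someone@example.org",
  sep = "\n"
)

# The dot does not match a newline, so the match stops at the end of the line.
from_line <- regmatches(
  email,
  regexpr(as.character(from_pattern), email, perl = TRUE)
)
from_line
```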


Task: Find and extract sender email.

regular expression via rex: [[:alnum:]][^[:blank:]]*@.*[[:alnum:]]
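A small sketch of how this pattern behaves, with the quantifiers written out (a * after the non-blank class and after the dot) and applied through base R to an invented sender line:

```r
# Alphanumeric start, non-blank characters up to "@", then anything
# up to the last alphanumeric character on the line.
email_pattern <- "[[:alnum:]][^[:blank:]]*@.*[[:alnum:]]"

# Invented sender line, lower-cased like the tokenized emails.
from_line <- "from: \"mr. example\" <sender@example.com>"

sender_email <- regmatches(
  from_line,
  regexpr(email_pattern, from_line, perl = TRUE)
)
sender_email
```

Since .* is greedy, the match runs to the last alphanumeric character on the line, so the closing > is dropped, but any trailing words after the address would be included.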


Task: Find and extract the sender name, removing :, < and " from the string.

regular expression via rex: :.*<
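One way to sketch this step in base R: grab everything between the colon and the opening angle bracket, then strip the unwanted characters. The sample line is invented, and the final trimws() call is my own addition to remove the leftover spaces.

```r
# Invented sender line, lower-cased like the tokenized emails.
from_line <- "from: \"mr. example\" <sender@example.com>"

# Everything from the first ":" up to the last "<" on the line.
name_chunk <- regmatches(from_line, regexpr(":.*<", from_line, perl = TRUE))

# Drop ":", "<" and double quotes, then trim surrounding whitespace.
sender_name <- trimws(gsub("[:<\"]", "", name_chunk))
sender_name
```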

What is the recipient email?


Task: Find and extract recipient line

regular expression via rex: (?:to:|To).*


Task: Find and extract recipient email.

The regular expression was reused from the sender step.

## [[:alnum:]][^[:blank:]]*@.*[[:alnum:]]
## [[1]]
## [1] "patelnikes4you@yahoo.ca"
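Reusing one pattern for both sender and recipient is easiest if it lives in a small helper. The function name extract_email() and the sample lines are hypothetical. It returns a list with one element per input line, similar in shape to the output shown above.

```r
# Hypothetical helper: apply the same email pattern to any header line.
extract_email <- function(line) {
  email_pattern <- "[[:alnum:]][^[:blank:]]*@.*[[:alnum:]]"
  # gregexpr() returns all matches per line, so the result is a list.
  regmatches(line, gregexpr(email_pattern, line, perl = TRUE))
}

# Invented header lines for a sender and a recipient.
extract_email("from: other@example.net")
extract_email("to: someone@example.org")
```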


What is the date stamp on each email?

Task: Find the line containing the date, then extract the date of the email from it.

regular expression via rex for extracting date: [[:digit:]]+[[:space:]][[:alpha:]]+[[:space:]][[:digit:]]{4}

## [1] "31 oct 2002"
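A sketch of the date extraction, assuming the pattern means one or more digits, a space, letters, a space and four digits, which is what the sample output 31 oct 2002 suggests. The date line itself is invented.

```r
# Day number, whitespace, month letters, whitespace, four-digit year.
date_pattern <- "[[:digit:]]+[[:space:]][[:alpha:]]+[[:space:]][[:digit:]]{4}"

# Invented date header, lower-cased like the tokenized emails.
date_line <- "date: thu, 31 oct 2002 05:10:00 +0000"

date_stamp <- regmatches(
  date_line,
  regexpr(date_pattern, date_line, perl = TRUE)
)
date_stamp
```

The result is still a character string; it would need converting to a Date before plotting emails over time.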

Grand finale via ggplot

Let's get one quick insight from the processed table: the number of emails aggregated by date stamp.
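A rough sketch of such a plot, assuming the processed table has a date column of class Date. The three dates below are invented, and the choice of count() from dplyr with a bar chart is mine, not necessarily what the original plot used.

```r
library(dplyr)
library(ggplot2)

# Invented stand-in for the processed table: one row per email.
emails <- data.frame(
  date = as.Date(c("2002-10-31", "2002-10-31", "2002-11-02"))
)

# Number of emails per date stamp; count() names the new column "n".
per_day <- count(emails, date)

# Bar chart of email counts over time.
p <- ggplot(per_day, aes(x = date, y = n)) +
  geom_col() +
  labs(x = "Date stamp", y = "Number of emails")
p
```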