Identifying Email Phishing Attempts with Amazon Comprehend

Phishing is the practice of attempting to obtain sensitive information such as usernames, passwords, and credit card numbers by posing as a trustworthy entity in emails, phone calls, or text messages. Phishing takes several forms, each targeting different victims. In email phishing, messages are sent to a group of individuals. Traditional rule-based methods can detect phishing emails, but newer phishing trends are increasingly hard for these approaches to catch. There is therefore a growing need to augment traditional methods with machine learning (ML) techniques for identifying email phishing attempts.

In this blog post, we will explore how to utilize Amazon Comprehend Custom to train and deploy an ML model that classifies whether an email is a phishing attempt. Amazon Comprehend is a natural language processing (NLP) service that leverages ML to extract valuable insights and connections from text. It can identify the language of the text, extract key phrases, recognize places, people, brands, or events, gauge sentiment towards products or services, and summarize the main topics from a collection of documents. You can tailor Amazon Comprehend to meet your specific needs without requiring an extensive background in ML-based NLP solutions. Comprehend Custom creates customized NLP models for you, using the training data you supply. It supports both custom classification and entity recognition.

Solution Overview

This article outlines how to use Amazon Comprehend to easily train and host an ML model designed to detect phishing attempts. The diagram below illustrates the phishing detection process.

You can integrate this solution with your email servers so that incoming emails pass through the phishing detector. When an email is identified as a phishing attempt, the recipient still receives it, but with an additional warning banner. This solution is suitable for experimentation; however, AWS recommends developing a training pipeline tailored to your specific environment. For guidance on creating a classification pipeline with Amazon Comprehend, refer to another blog post here.
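As a rough illustration of the banner step, the sketch below takes a response shaped like the one returned by the Comprehend ClassifyDocument API (a `Classes` list of `Name` and `Score` entries) and prepends a warning when the top class is phishing. The threshold value, the banner text, and the function names are assumptions made for this example and are not part of the original solution.

```python
from __future__ import annotations

from typing import Any

PHISHING_LABEL = "phishing"  # class label used in the training data
THRESHOLD = 0.8              # hypothetical cut-off; tune it on your own data
BANNER = "[WARNING] This message may be a phishing attempt.\n\n"  # illustrative text


def classify_email(comprehend: Any, endpoint_arn: str, text: str) -> dict[str, Any]:
    """Send the email text to a deployed custom classifier endpoint."""
    return comprehend.classify_document(Text=text, EndpointArn=endpoint_arn)


def add_banner_if_phishing(body: str, result: dict[str, Any],
                           threshold: float = THRESHOLD) -> str:
    """Prepend a warning banner when the top-scoring class is phishing."""
    top = max(result["Classes"], key=lambda c: c["Score"])
    if top["Name"] == PHISHING_LABEL and top["Score"] >= threshold:
        return BANNER + body
    # The email is delivered unchanged when it is not flagged.
    return body
```

In a real integration, `comprehend` would be created with `boto3.client("comprehend")` and `endpoint_arn` would be the ARN of the endpoint you deploy for the trained model.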

We will guide you through the following steps to create the phishing detection model:

1. Gather and prepare the dataset.
2. Upload the data to an Amazon Simple Storage Service (Amazon S3) bucket.
3. Create the Amazon Comprehend custom classification model.
4. Set up the Amazon Comprehend custom classification model endpoint.
5. Test the model.

Prerequisites

Before you begin, please ensure you have completed the following prerequisites:

Set up an AWS account.
Create an S3 bucket; for help, see the guide on creating your first S3 bucket.
Download the email-trainingdata.csv file and upload it to your S3 bucket.

Collecting and Preparing the Dataset

Your training dataset should include both phishing and non-phishing emails. Encourage employees within your organization to report phishing attempts through their email clients, then gather these reports along with examples of legitimate emails to compile your training data. You need at least 10 examples for each class. Label phishing emails as “phishing” and legitimate emails as “nonphishing.” To improve classification performance on new inputs, aim for hundreds of examples per class.
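Before training, it can help to confirm that every class has enough examples. The following is a minimal sketch, assuming the data is already loaded as (label, text) pairs; the function name and message wording are made up for illustration.

```python
from collections import Counter

MIN_EXAMPLES_PER_CLASS = 10                    # minimum stated above
EXPECTED_LABELS = {"phishing", "nonphishing"}  # labels used in this post


def check_dataset(rows):
    """Return a list of problems found in an iterable of (label, text) pairs."""
    counts = Counter(label for label, _ in rows)
    problems = []
    # Flag labels that are misspelled or otherwise unexpected.
    for label in sorted(set(counts) - EXPECTED_LABELS):
        problems.append(f"unexpected label: {label!r}")
    # Flag classes that fall below the minimum example count.
    for label in sorted(EXPECTED_LABELS):
        if counts[label] < MIN_EXAMPLES_PER_CLASS:
            problems.append(
                f"{label}: {counts[label]} examples, "
                f"need at least {MIN_EXAMPLES_PER_CLASS}"
            )
    return problems
```

An empty list means the dataset passes both checks; anything else should be fixed before you upload the file.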

For custom classification, you can train the model in either single-label or multi-label mode. In this case, we use single-label mode: each document is categorized as either phishing or non-phishing. The categories are mutually exclusive, so an email is classified as one or the other, never both.

Custom classification allows models to be trained with either plain-text documents or native documents (such as PDFs, Word files, or images). For more details on classifier models and supported document types, refer to the documentation on training classification models. For our case, we will train a plain-text model using CSV format. Each row in the CSV file consists of a class label in the first column and an example text document in the second column, with each row ending in a \n or \r\n line terminator.

Here’s a quick example of a CSV file containing two documents:

CLASS,Text of document 1
CLASS,Text of document 2

For our phishing classifier, the CSV might look like this:

phishing,"Hi, we need account details and SSN information to complete the payment. Please furnish your credit card details in the attached form."
nonphishing,"Dear Sir / Madam, your latest statement was mailed to your communication address. After your payment is received, you will receive a confirmation text message at your mobile number. Thanks, customer support"
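Rows in this two-column layout can be produced with Python's standard `csv` module, which handles the quoting of commas and double quotes inside the email text. This is a sketch only: the function name is hypothetical, and collapsing line breaks inside each email so that every document stays on one row is a conservative assumption, not a requirement stated in this post.

```python
import csv
import io


def to_comprehend_csv(examples):
    """Serialize (label, text) pairs as label,text rows with no header row."""
    buf = io.StringIO()
    writer = csv.writer(buf, quoting=csv.QUOTE_MINIMAL, lineterminator="\n")
    for label, text in examples:
        # Assumption: collapse whitespace and line breaks so each
        # document occupies exactly one row of the file.
        writer.writerow([label, " ".join(text.split())])
    return buf.getvalue()
```

Write the returned string to a file such as `email-trainingdata.csv` with `open(path, "w", newline="", encoding="utf-8")` so that Python does not translate the line terminators.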

For additional guidance on preparing your training documents, check out the resources on preparing classifier training data.

Uploading Data to the S3 Bucket

Once your training data is prepared in CSV format, load it into the S3 bucket you created earlier. For detailed instructions, see the guide on uploading objects.
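If you prefer to script the upload, a minimal sketch with the AWS SDK for Python (boto3) could look like the following. The object key and function names are assumptions for illustration; `upload_file` is the standard boto3 S3 client method, and running it requires boto3 to be installed and AWS credentials to be configured.

```python
def make_s3_client():
    """Create an S3 client (needs boto3 installed and AWS credentials)."""
    import boto3
    return boto3.client("s3")


def upload_training_data(s3_client, local_path, bucket,
                         key="training/email-trainingdata.csv"):
    """Upload the training CSV and return its S3 URI.

    The default key is a hypothetical choice; any key in your bucket works.
    """
    s3_client.upload_file(local_path, bucket, key)
    return f"s3://{bucket}/{key}"
```

The returned S3 URI is the value you later supply as the training data location when creating the classifier.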

Creating the Amazon Comprehend Custom Classification Model

Custom classification supports two types of classifier models: plain-text and native document models. A plain-text model classifies documents based solely on their textual content. You can train this model using documents in English, Spanish, German, Italian, French, or Portuguese, but all training documents for a given classifier must be in the same language. A native document model can process semi-structured documents, including PDFs and Microsoft Word files, and classifies documents based on both text content and layout information. AWS recommends using a plain-text model for plain-text documents and a native model for semi-structured documents.

The data specification for the custom classification model can be visualized as follows.

You can train a custom classifier using either the Amazon Comprehend console or its API. Expect the classification model creation process to take several minutes to a few hours, depending on the size of your input documents.
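For the API route, the sketch below builds a request for the Comprehend `CreateDocumentClassifier` operation and polls `DescribeDocumentClassifier` until training finishes. In the API, single-label mode corresponds to `Mode="MULTI_CLASS"`. The classifier name, the polling interval, and the helper names are assumptions for this example. The IAM role ARN is one you must supply yourself; it needs read access to the training data in your bucket.

```python
import time

# Statuses after which polling can stop.
TERMINAL_STATUSES = {"TRAINED", "TRAINED_WITH_WARNING", "IN_ERROR", "STOPPED"}


def build_classifier_request(name, role_arn, training_s3_uri):
    """Build keyword arguments for create_document_classifier."""
    return {
        "DocumentClassifierName": name,
        "DataAccessRoleArn": role_arn,
        "LanguageCode": "en",
        "Mode": "MULTI_CLASS",  # single-label: exactly one class per document
        "InputDataConfig": {
            "DataFormat": "COMPREHEND_CSV",  # label,text rows in a CSV file
            "S3Uri": training_s3_uri,
        },
    }


def wait_for_training(comprehend, classifier_arn, poll_seconds=60):
    """Poll until the classifier reaches a terminal status and return it."""
    while True:
        props = comprehend.describe_document_classifier(
            DocumentClassifierArn=classifier_arn
        )["DocumentClassifierProperties"]
        if props["Status"] in TERMINAL_STATUSES:
            return props["Status"]
        time.sleep(poll_seconds)
```

With `comprehend = boto3.client("comprehend")`, you would call `comprehend.create_document_classifier(**request)`, read `DocumentClassifierArn` from the response, and pass it to `wait_for_training`.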

Conclusion

In conclusion, leveraging Amazon Comprehend for email phishing detection enhances security measures by incorporating machine learning techniques. This approach not only improves detection accuracy but also supports organizations in safeguarding sensitive information.