Phishing is an attempt to obtain sensitive information, such as usernames, passwords, and credit card numbers, by impersonating a trustworthy entity over email, phone, or text messages. Various phishing techniques exist, depending on the communication channel and the targeted individuals. In email phishing, a fraudulent email is sent to a group of recipients. While traditional rule-based methods can detect some phishing attempts, evolving attack techniques make it increasingly difficult to rely on rules alone. Therefore, there is a growing need to complement rule-based methods with machine learning (ML) techniques for detecting email phishing.
In this article, we will demonstrate how to utilize Amazon Comprehend Custom to train and deploy an ML model capable of classifying emails as phishing attempts or not. Amazon Comprehend is a natural language processing (NLP) service that leverages ML to extract valuable insights and relationships from text. With Amazon Comprehend, you can identify the language of the text, extract key phrases, locations, people, brands, or events, understand sentiment regarding products or services, and pinpoint main topics from a collection of documents. You can also tailor Amazon Comprehend to suit your specific needs without requiring extensive expertise in developing ML-based NLP solutions. Comprehend Custom creates personalized NLP models based on your training data.
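To give a concrete feel for these pre-built capabilities before we move to Comprehend Custom, here is a minimal Python sketch that calls a few of the Comprehend APIs with boto3. The sample text is our own, and the client assumes AWS credentials and a default Region are already configured in your environment.

```python
import boto3

# Assumes AWS credentials and a default Region are configured.
comprehend = boto3.client("comprehend")

text = "Your package could not be delivered. Click the link to confirm your card details."

# Detect the dominant language of the text.
language = comprehend.detect_dominant_language(Text=text)
print(language["Languages"][0]["LanguageCode"])  # e.g. "en"

# Extract key phrases from the text.
phrases = comprehend.detect_key_phrases(Text=text, LanguageCode="en")
print([p["Text"] for p in phrases["KeyPhrases"]])

# Analyze the overall sentiment.
sentiment = comprehend.detect_sentiment(Text=text, LanguageCode="en")
print(sentiment["Sentiment"])
```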
Solution Overview
This post outlines how to effectively use Amazon Comprehend to train and host an ML model designed for phishing detection. The diagram below illustrates the phishing detection process.
You can integrate this solution with your email servers, allowing emails to be analyzed by the phishing detector. When an email is identified as a potential phishing attempt, the recipient can still receive it, but with a warning banner added to alert them. This solution is suitable for experimentation; for production use, AWS recommends establishing a training pipeline tailored to your environment. For more information, refer to the Amazon Comprehend documentation on building a custom classification pipeline.
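As a rough illustration of this integration (not part of the managed service itself), the sketch below shows how an email-processing component could call a deployed custom classifier endpoint and prepend a warning banner when the phishing score exceeds a threshold. The endpoint ARN, threshold value, and function name are hypothetical placeholders.

```python
import boto3

comprehend = boto3.client("comprehend")

# Hypothetical values for illustration; replace with your own endpoint ARN and threshold.
ENDPOINT_ARN = "arn:aws:comprehend:us-east-1:111122223333:document-classifier-endpoint/phishing-detector"
PHISHING_THRESHOLD = 0.8

WARNING_BANNER = "[CAUTION] This email was flagged as a possible phishing attempt.\n\n"

def add_banner_if_phishing(email_body: str) -> str:
    """Classify the email body and prepend a warning banner if it looks like phishing."""
    response = comprehend.classify_document(Text=email_body, EndpointArn=ENDPOINT_ARN)
    scores = {c["Name"]: c["Score"] for c in response["Classes"]}
    if scores.get("phishing", 0.0) >= PHISHING_THRESHOLD:
        return WARNING_BANNER + email_body
    return email_body
```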
Steps to Build the Phishing Detection Model
- Gather and prepare the dataset.
- Upload the data to an Amazon Simple Storage Service (Amazon S3) bucket.
- Create the Amazon Comprehend custom classification model.
- Establish the Amazon Comprehend custom classification model endpoint.
- Test the model.
Prerequisites
Before starting this use case, ensure you complete the following prerequisites:
- Set up an AWS account.
- Create an S3 bucket. For detailed instructions, see Create your first S3 bucket.
- Download the email-trainingdata.csv and upload it to the S3 bucket.
Collecting and Preparing the Dataset
Your training dataset should include both phishing and non-phishing emails. Encourage users within your organization to report phishing attempts through their email clients. Collect these reports alongside examples of legitimate emails to prepare your training data. Aim to have at least 10 examples for each category, labeling phishing emails as “phishing” and non-phishing as “nonphishing.” For optimal performance in classification tasks, it is advisable to provide hundreds of examples per class.
For custom classification, you can train the model in either single-label or multi-label mode. In this scenario, we will use single-label mode, which assigns one category to each document. Therefore, an email can be classified as either phishing or non-phishing, but not both.
Custom classification supports models trained with plain-text documents or native documents (like PDF, Word, or images). For plain-text models, you can submit training data in a CSV file or an augmented manifest file created using Amazon SageMaker Ground Truth. The CSV or manifest file should include the text of each training document along with its associated labels.
For our purposes, we will train a plain-text model using the CSV format, with each row consisting of the class label in the first column and an example text in the second column. Each row must end with \n or \r\n characters.
The following is an example of a CSV file with two entries:
CLASS,Text of document 1
CLASS,Text of document 2
For instance, to train a custom classifier for detecting phishing emails, the CSV entries might look like this:
phishing,"Hello, we need your account details and SSN to complete the payment. Please provide your credit card information in the attached file."
nonphishing,"Dear Sir/Madam, your latest statement was sent to your communication address. After your payment is received, you will receive a confirmation text message at your mobile number. Thank you, customer support."
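If you assemble the training file programmatically, a minimal sketch such as the following writes label/text pairs in this two-column layout. The file name and sample rows are placeholders; Python's csv module handles the quoting and row terminators for you.

```python
import csv

# Placeholder sample rows; in practice these come from reported phishing
# emails and examples of legitimate mail.
rows = [
    ("phishing", "Hello, we need your account details and SSN to complete the payment."),
    ("nonphishing", "Dear Sir/Madam, your latest statement was sent to your communication address."),
]

# Two columns per row: label first, then the email text. No header row.
with open("email-trainingdata.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)  # quotes fields containing commas and ends rows with \r\n
    writer.writerows(rows)
```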
For detailed guidelines on preparing your training documents, please refer to the training data preparation section.
Uploading Data to the S3 Bucket
Upload your training data in CSV format to the S3 bucket you created earlier. For guidance, see Uploading objects.
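Alternatively, you can upload the file with the AWS SDK. The following sketch uses boto3; the bucket name and object key are placeholders, and AWS credentials are assumed to be configured.

```python
import boto3

s3 = boto3.client("s3")

# Placeholder bucket name and key; use the S3 bucket you created in the prerequisites.
s3.upload_file(
    Filename="email-trainingdata.csv",
    Bucket="my-phishing-training-bucket",
    Key="training/email-trainingdata.csv",
)
```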
Creating the Amazon Comprehend Custom Classification Model
Custom classification supports two types of models: plain-text models and native document models. A plain-text model classifies documents based solely on their text content. You can train it using documents written in English, Spanish, German, Italian, French, or Portuguese. All documents for a given classifier must be in the same language.
On the other hand, a native document model can process both scanned and digital semi-structured documents, such as PDFs and Microsoft Word files, in their original formats. This model can also leverage additional signals from the document layout. AWS recommends using a plain-text model for plain-text documents and a native document model for semi-structured documents.
In summary, a plain-text model is trained from a CSV file or an augmented manifest file containing the document text and labels, while a native document model is trained on semi-structured documents such as PDFs, Word files, and images in their original formats.
You can use either the Amazon Comprehend console or API to train a custom classifier. The process may take several minutes to a few hours, depending on the size of your input documents.
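If you go the API route, a minimal boto3 sketch might look like the following. The classifier name, IAM role ARN, and S3 URI are placeholders; the role must grant Amazon Comprehend read access to your training data.

```python
import boto3

comprehend = boto3.client("comprehend")

# Placeholder name, IAM role, and S3 location.
response = comprehend.create_document_classifier(
    DocumentClassifierName="phishing-email-classifier",
    DataAccessRoleArn="arn:aws:iam::111122223333:role/ComprehendDataAccessRole",
    InputDataConfig={
        "DataFormat": "COMPREHEND_CSV",
        "S3Uri": "s3://my-phishing-training-bucket/training/email-trainingdata.csv",
    },
    LanguageCode="en",
    Mode="MULTI_CLASS",  # single-label mode: each email is phishing or nonphishing, not both
)
classifier_arn = response["DocumentClassifierArn"]

# Training runs asynchronously; poll the status until it reaches TRAINED.
status = comprehend.describe_document_classifier(DocumentClassifierArn=classifier_arn)
print(status["DocumentClassifierProperties"]["Status"])
```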