Classifying Sensitive Data in Your Environment with Amazon Macie

In this article, we’ll explore how to create a sample dataset for Amazon Macie and leverage it to enhance data-centric compliance and security analytics within your Amazon S3 environment. We’ll also examine the various types of credentials, document categories, and personally identifiable information (PII) detections that Macie supports. First, let’s walk through the process of generating a “getting started” sample set of artificial data that will allow you to test Macie’s capabilities and establish your own policies and alerts.

Creating a Realistic Data Sample Set in S3

We’ll use the amazon-macie-activity-generator, or “AMG” for short, a sample application created by AWS that generates realistic content and accesses your test account to create the data. AMG uses AWS CloudFormation, AWS Lambda, and the Python Faker library to produce a dataset with artificial yet realistic data classifications and access patterns so that you can test Macie’s features. AMG is released under Amazon Software License 1.0. We welcome pull requests on our GitHub repository and will monitor the issues opened there so we can fix bugs and consider new feature requests.

The diagram below illustrates a high-level architecture overview of the components established in your AWS account for AMG. For further details about these components and their interrelations, please review the CloudFormation setup script.

Depending on the data types defined in your JSON configuration template, AMG periodically generates artificial documents and writes them to the designated S3 target through a PutObject action. By default, the CloudFormation stack uses a configuration file that directs AMG to create a new, private S3 bucket that is accessible only to authorized AWS users or roles within the same account. All S3 objects containing fake data in this bucket have a private ACL and inherit the bucket’s access control settings. Each generated object begins with the header shown in the example below. AMG supports all fake data providers listed in the Faker documentation, along with custom providers requested by our customers, such as aws_creds, slack_creds, and github_creds.

Sample Report – No identification of actual persons or places is intended or should be inferred

74323 Jamie Hart  
Lake Joshuamouth, OR 30055-3905  
1-196-191-4438x974  
53001 David Union  
New John, HI 94740  
Mastercard Amanda Wells  
5135725008183484 09/26  
CVV: 550  
354-70-6172  
242 George Plaza  
East Lawrencefurt, VA 37287-7620  
GB73WAUS0628038988364  
587 Silva Village  
Pearsonburgh, NM 11616-7231  
LDNM1948227117807  
American Express Alex Garcia  
347965534580275 05/20  
CID: 4758  
599.335.2742  
JCB 15 digit  
Michael Arias  
210069190253121 03/27  
CVC: 861
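
As a rough illustration of how a document like the sample above can be assembled with the Faker library, here is a minimal sketch. It is not AMG's actual code: the function name build_sample_report, the records parameter, and the field order are assumptions made for this example, and only standard Faker providers are used (AMG's custom providers such as aws_creds are not reproduced here).

```python
SAMPLE_HEADER = (
    "Sample Report – No identification of actual persons or places "
    "is intended or should be inferred"
)


def build_sample_report(fake, records=2):
    """Return a text report of artificial PII built from a Faker-like object.

    `fake` is expected to offer the standard Faker provider methods used
    below; the layout is a guess at the sample format, not AMG's template.
    """
    lines = [SAMPLE_HEADER, ""]
    for _ in range(records):
        lines.append(fake.name())
        lines.append(fake.address())
        lines.append(fake.phone_number())
        lines.append(fake.ssn())
        lines.append(
            f"{fake.credit_card_provider()} "
            f"{fake.credit_card_number()} {fake.credit_card_expire()}"
        )
        lines.append(f"CVV: {fake.credit_card_security_code()}")
        lines.append(fake.iban())
    return "\n".join(lines)


def main():
    # Imported lazily so build_sample_report can be exercised with any
    # object that offers the same methods.
    from faker import Faker  # third-party: pip install Faker

    print(build_sample_report(Faker("en_US")))
```

Calling main() prints a fresh report on every run, because Faker draws new random values each time.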

Deploying the Amazon-Macie-Activity-Generator CloudFormation Stack

You can set up AMG in your AWS account through the following methods:

Use the CloudFormation template.
Or use the one-click CloudFormation launch stack.

Follow these steps:

Log in to the AWS Console in a region supported by Amazon Macie. Supported regions currently include US East (N. Virginia) and US West (Oregon).
Choose the One-click CloudFormation launch stack or launch CloudFormation using the template provided above.
Review our terms, check the Acknowledgement box, and then select Create.

Setting up the data takes a few minutes, and you can periodically check CloudWatch to monitor progress.
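
If you prefer to script the deployment rather than use the console, the steps above could be approximated with boto3 roughly as follows. This is a sketch under stated assumptions: the template URL is not given in this article, so it is passed in as a parameter; the default stack name is inferred from the bucket prefix used in the search later in this post; and the CAPABILITY_IAM setting is our reading of what the Acknowledgement box grants.

```python
def deploy_amg_stack(cfn, template_url,
                     stack_name="amazon-macie-activity-generator"):
    """Create the AMG stack and wait until CloudFormation reports completion.

    `cfn` is a CloudFormation client such as boto3.client("cloudformation").
    """
    response = cfn.create_stack(
        StackName=stack_name,
        TemplateURL=template_url,
        # Assumption: the console's Acknowledgement box corresponds to
        # allowing CloudFormation to create IAM resources.
        Capabilities=["CAPABILITY_IAM"],
    )
    # Blocks until the stack reaches CREATE_COMPLETE or the waiter gives up.
    cfn.get_waiter("stack_create_complete").wait(StackName=stack_name)
    return response["StackId"]


def main(template_url):
    import boto3  # third-party: pip install boto3

    cfn = boto3.client("cloudformation", region_name="us-east-1")
    print(deploy_amg_stack(cfn, template_url))
```

The client is passed in rather than created inside the function so that the region and credentials stay under the caller's control.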

Adding the Sample Data to Macie

Now, let’s log in to the Macie console and add the newly generated sample data buckets for Macie to analyze.

Note: If you don’t explicitly designate a bucket for S3 targets in CloudFormation, AMG will use the default S3 bucket created by the stack, which will be displayed in the CloudFormation stack’s output.
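
One way to look up the default bucket without opening the console is to read the stack's outputs with boto3. The sketch below assumes the stack name used at deployment; it does not assume any particular output key, since the key names AMG uses are not listed in this article, and simply returns every output.

```python
def stack_outputs(cfn, stack_name="amazon-macie-activity-generator"):
    """Return the stack's outputs as a plain {key: value} dictionary.

    `cfn` is a CloudFormation client such as boto3.client("cloudformation").
    """
    stacks = cfn.describe_stacks(StackName=stack_name)["Stacks"]
    # A stack without outputs has no "Outputs" entry in the response.
    outputs = stacks[0].get("Outputs", [])
    return {item["OutputKey"]: item["OutputValue"] for item in outputs}


def main():
    import boto3  # third-party: pip install boto3

    cfn = boto3.client("cloudformation", region_name="us-east-1")
    for key, value in stack_outputs(cfn).items():
        print(f"{key}: {value}")
```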

To integrate buckets for data classification, follow these steps:

Log in to Amazon Macie.
Select Integrations, and then Services.
Choose your account, and then select Details from the Amazon S3 card.
Select your newly created buckets for full classification, including existing data.

For further guidance on configuring Macie, refer to our getting started documentation. Macie will classify all historical and newly created data in the buckets generated by AMG, and the data will appear in the Macie console as it gets classified. Generally, you can expect the data in the sample set to be classified within 60 minutes of being selected for analysis.

Classifying Objects with Macie

To view the objects in your test sample set, navigate to Macie, open the Research tab, and select the S3 Objects index. We’ll use Macie’s regular expression search capability to find any objects written to buckets starting with “amazon-macie-activity-generator-defaults3bucket”. Enter the following text into the Macie search box and click the magnifying glass icon:

filesystem_metadata.bucket:/amazon-macie-activity-generator-defaults3bucket.*/

From this point, you can view a comprehensive breakdown of the objects classified by Macie, as well as object-specific details. Create an advanced search using Lucene Query Syntax, and save it as an alert to be matched against any newly created data.
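
The search above wraps a regular expression in slashes. To preview which bucket names such a pattern would select, you can test it locally with Python's re module. This is only an approximation: we assume the pattern must match the whole bucket name, as Lucene regular expression queries generally do, and the bucket suffixes in the usage example are made up.

```python
import re

# The same pattern as in the Macie search, without the surrounding slashes.
BUCKET_PATTERN = re.compile(
    r"amazon-macie-activity-generator-defaults3bucket.*"
)


def matching_buckets(bucket_names):
    """Return the names that the pattern matches in full."""
    # fullmatch mimics an anchored match against the whole field value.
    return [name for name in bucket_names if BUCKET_PATTERN.fullmatch(name)]


if __name__ == "__main__":
    candidates = [
        "amazon-macie-activity-generator-defaults3bucket-abc123",  # made up
        "my-other-bucket",
    ]
    print(matching_buckets(candidates))
```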

Analyzing Accesses to Your Test Data

Besides classifying data, Macie monitors all control plane and data plane accesses to your content through CloudTrail. AMG periodically generates such accesses to simulate user activity. To view them, select Research in the Macie navigation bar, choose the CloudTrail data index, and use the following search to identify our generated role activity:

sessionName.key:/amazon-macie-activity-generator-LambdaFunction-.*/

This search allows you to delve into user activity (IAM users, assumed roles, federated users, etc.), summarized in 5-minute aggregations (user sessions). For instance, you can observe that one of our AMG-generated users listed objects once (ListObjects) and uploaded 56 objects to S3 (PutObject) during a 5-minute timeframe.
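
To make the idea of 5-minute aggregations concrete, the sketch below groups a list of events into fixed 5-minute windows per session name and counts each API call. It illustrates the concept only and is not how Macie computes user sessions internally; the tuple layout of the input events and the function name aggregate_sessions are assumptions made for this example.

```python
import re
from collections import Counter, defaultdict
from datetime import datetime, timezone

# The same pattern as in the Macie search, without the surrounding slashes.
SESSION_PATTERN = re.compile(
    r"amazon-macie-activity-generator-LambdaFunction-.*"
)
WINDOW_SECONDS = 5 * 60


def aggregate_sessions(events):
    """Count API calls per (session name, 5-minute window start).

    `events` is an iterable of (session_name, timestamp, event_name)
    tuples, where timestamp is a timezone-aware datetime.
    """
    sessions = defaultdict(Counter)
    for session_name, when, event_name in events:
        if not SESSION_PATTERN.fullmatch(session_name):
            continue  # not one of the AMG-generated sessions
        # Round the timestamp down to the start of its 5-minute window.
        window_start = int(when.timestamp()) // WINDOW_SECONDS * WINDOW_SECONDS
        sessions[(session_name, window_start)][event_name] += 1
    return dict(sessions)


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    demo = [
        ("amazon-macie-activity-generator-LambdaFunction-EXAMPLE", now, "ListObjects"),
        ("amazon-macie-activity-generator-LambdaFunction-EXAMPLE", now, "PutObject"),
    ]
    print(aggregate_sessions(demo))
```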

Macie Alerts

Macie offers both predictive (machine learning-based) and basic (rule-based) alerts, including notifications for unencrypted credentials uploaded to S3 (as this activity may not adhere to compliance best practices), risky actions like data exfiltration, and user-defined alerts based on saved searches. To view alerts generated from AMG’s activity, navigate to Alerts in the Macie menu.

AMG will continue running, periodically uploading content to the specified S3 buckets. To stop AMG, delete the AMG CloudFormation stack and its associated resources.
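
Stopping AMG can also be scripted. The sketch below assumes the stack was created under the name shown, which is inferred from the bucket prefix used earlier in this post. Note that CloudFormation generally cannot delete an S3 bucket that still contains objects, so you may need to empty the sample bucket first if the deletion fails.

```python
def stop_amg(cfn, stack_name="amazon-macie-activity-generator"):
    """Delete the AMG stack and wait until the deletion has finished.

    `cfn` is a CloudFormation client such as boto3.client("cloudformation").
    """
    cfn.delete_stack(StackName=stack_name)
    # Blocks until the stack is gone or the waiter gives up.
    cfn.get_waiter("stack_delete_complete").wait(StackName=stack_name)


def main():
    import boto3  # third-party: pip install boto3

    stop_amg(boto3.client("cloudformation", region_name="us-east-1"))
```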

Costs Involved

Macie provides a free tier that lets you analyze up to 1 GB of content per month at no charge. By default, AMG writes approximately 10 MB of objects to Amazon S3 each day, and you incur data classification charges only after you exceed the 1 GB monthly free tier. If run continuously, AMG generates around 310 MB of content per month (10 MB/day x 31 days), which remains below the free tier. Any usage beyond 1 GB is billed at the Macie rate.
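
The arithmetic above is easy to adapt if you change how much data AMG writes. The following sketch reproduces it; treating 1 GB as 1,000 MB is an assumption, and no price is included because the Macie rate is not stated in this article.

```python
FREE_TIER_MB = 1000  # 1 GB free tier, assuming decimal units


def monthly_volume_mb(mb_per_day=10, days=31):
    """Volume written in one month at a constant daily rate."""
    return mb_per_day * days


def billable_mb(volume_mb, free_tier_mb=FREE_TIER_MB):
    """Volume above the free tier; zero when the tier is not exceeded."""
    return max(0, volume_mb - free_tier_mb)


if __name__ == "__main__":
    volume = monthly_volume_mb()
    print(f"monthly volume: {volume} MB, billable: {billable_mb(volume)} MB")
```

With the defaults, the monthly volume is 310 MB and nothing is billable; at 50 MB per day over 31 days, 550 MB would fall outside the free tier.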
