In this article, we’ll explore how to create a sample dataset for Amazon Macie and leverage it to enhance data-centric compliance and security analytics within your Amazon S3 environment. We’ll also examine the various types of credentials, document categories, and personally identifiable information (PII) detections that Macie supports. First, let’s walk through the process of generating a “getting started” sample set of artificial data that will allow you to test Macie’s capabilities and establish your own policies and alerts.
Creating a Realistic Data Sample Set in S3
We’ll use amazon-macie-activity-generator, or “AMG” for short, a sample application created by AWS that generates realistic content and accesses your test account to create the data. AMG uses AWS CloudFormation, AWS Lambda, and the Python Faker library to produce a dataset with artificial yet realistic data classifications and access patterns, so that you can test Macie’s features. AMG is released under the Amazon Software License 1.0. We welcome pull requests on our GitHub repository and will monitor issues raised there to address bugs and consider new feature requests.
The diagram below illustrates a high-level architecture overview of the components established in your AWS account for AMG. For further details about these components and their interrelations, please review the CloudFormation setup script.
Depending on the data types defined in your JSON configuration template, AMG periodically generates artificial documents and writes them to the designated S3 target through a PutObject action. By default, the CloudFormation stack uses a configuration file that directs AMG to create a new, private S3 bucket that is accessible only to authorized AWS users or roles within the same account. All S3 objects containing fake data in this bucket have a private ACL and inherit the bucket’s access control settings. Each generated object includes the header shown in the example below. AMG supports all fake data providers listed in the Faker documentation, along with custom providers requested by our clients, such as aws_creds, slack_creds, and github_creds.
Sample Report – No identification of actual persons or places is intended or should be inferred
74323 Jamie Hart
Lake Joshuamouth, OR 30055-3905
1-196-191-4438x974
53001 David Union
New John, HI 94740
Mastercard
Amanda Wells
5135725008183484 09/26
CVV: 550
354-70-6172
242 George Plaza
East Lawrencefurt, VA 37287-7620
GB73WAUS0628038988364
587 Silva Village
Pearsonburgh, NM 11616-7231
LDNM1948227117807
American Express
Alex Garcia
347965534580275 05/20
CID: 4758
599.335.2742
JCB 15 digit
Michael Arias
210069190253121 03/27
CVC: 861
Deploying the Amazon-Macie-Activity-Generator CloudFormation Stack
You can set up AMG in your AWS account through the following methods:
- Use the CloudFormation template.
- Or use the one-click CloudFormation launch stack.
Follow these steps:
- Log in to the AWS Console in a region supported by Amazon Macie, which currently includes US East (N. Virginia) and US West (Oregon).
- Choose the One-click CloudFormation launch stack or launch CloudFormation using the template provided above.
- Review our terms, check the Acknowledgement box, and then select Create.
Setting up the data takes a few minutes, and you can periodically check CloudWatch to monitor progress.
Adding the Sample Data to Macie
Now, let’s log in to the Macie console and add the newly generated sample data buckets for Macie to analyze.
Note: If you don’t explicitly designate a bucket for S3 targets in CloudFormation, AMG will use the default S3 bucket created by the stack, which will be displayed in the CloudFormation stack’s output.
To integrate buckets for data classification, follow these steps:
- Log in to Amazon Macie.
- Select Integrations, and then Services.
- Choose your account, and then select Details from the Amazon S3 card.
- Select your newly created buckets for full classification, including existing data.
For further guidance on configuring Macie, refer to our getting started documentation. Macie will classify all historical and newly created data in the buckets generated by AMG, and the data will appear in the Macie console as it gets classified. Generally, you can expect the data in the sample set to be classified within 60 minutes of being selected for analysis.
Classifying Objects with Macie
To view the objects in your test sample set, navigate to Macie, open the Research tab, and select the S3 Objects index. We’ll use Macie’s regular expression search capability to find any objects written to buckets starting with “amazon-macie-activity-generator-defaults3bucket”. Enter the following text into the Macie search box and click the magnifying glass icon:
filesystem_metadata.bucket:/amazon-macie-activity-generator-defaults3bucket.*/
From this point, you can view a comprehensive breakdown of the objects classified by Macie, as well as object-specific details. Create an advanced search using Lucene Query Syntax, and save it as an alert to be matched against any newly created data.
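If the regular expression in that search is unfamiliar, the following sketch mimics its behavior with Python's re module. It assumes the query follows standard Lucene regular-expression semantics, where the pattern must match the entire field value rather than a substring. The bucket names and the matches_query helper are hypothetical and exist only for this illustration.

```python
import re

# The pattern between the slashes in the Macie search above.
BUCKET_PATTERN = re.compile(
    r"amazon-macie-activity-generator-defaults3bucket.*"
)


def matches_query(bucket_name):
    """Return True if the whole bucket name matches the pattern.

    fullmatch is used because Lucene-style regexp queries are anchored
    at both ends of the term (an assumption about Macie's search).
    """
    return BUCKET_PATTERN.fullmatch(bucket_name) is not None


# Hypothetical bucket names for illustration only.
buckets = [
    "amazon-macie-activity-generator-defaults3bucket-1a2b3c",
    "my-amazon-macie-activity-generator-defaults3bucket",
    "unrelated-bucket",
]

matched = [name for name in buckets if matches_query(name)]
print(matched)
```

Only the first name matches: the second has a prefix before the pattern, and an anchored match does not skip over it.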
Analyzing Accesses to Your Test Data
Besides classifying data, Macie monitors all control plane and data plane accesses to your content through CloudTrail. To view accesses to your generated environment (created periodically by AMG to simulate user activity), select Research in the Macie navigation bar, then choose the CloudTrail data index, and use the following search to identify our generated role activity:
sessionName.key:/amazon-macie-activity-generator-LambdaFunction-.*/
This search allows you to delve into user activity (IAM users, assumed roles, federated users, etc.), summarized in 5-minute aggregations (user sessions). For instance, you can observe that one of our AMG-generated users listed objects once (ListObjects) and uploaded 56 objects to S3 (PutObject) during a 5-minute timeframe.
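To make the idea of 5-minute user sessions concrete, here is a small sketch that groups access events into 5-minute windows and counts each API call per session. It only illustrates the aggregation concept, not how Macie computes it internally. The event tuples, the session name, and the timestamps are hypothetical.

```python
from collections import Counter, defaultdict
from datetime import datetime, timedelta, timezone

WINDOW_SECONDS = 5 * 60  # 5-minute aggregation window


def window_start(ts):
    """Round a timezone-aware timestamp down to its 5-minute window."""
    epoch = int(ts.timestamp())
    return datetime.fromtimestamp(epoch - epoch % WINDOW_SECONDS,
                                  tz=timezone.utc)


def summarize_sessions(events):
    """Count API calls per (session name, window start) pair."""
    sessions = defaultdict(Counter)
    for session_name, ts, event_name in events:
        sessions[(session_name, window_start(ts))][event_name] += 1
    return dict(sessions)


# Hypothetical activity: one ListObjects call followed by 56 PutObject
# calls, all inside the same 5-minute window.
base = datetime(2017, 8, 1, 12, 0, tzinfo=timezone.utc)
events = [("amg-session", base, "ListObjects")]
events += [
    ("amg-session", base + timedelta(seconds=i), "PutObject")
    for i in range(1, 57)
]

summary = summarize_sessions(events)
for (session, start), counts in summary.items():
    print(session, start.isoformat(), dict(counts))
```

Each key in the result corresponds to one user session, which is the unit you see when you drill into the CloudTrail data index.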
Macie Alerts
Macie offers both predictive (machine learning-based) and basic (rule-based) alerts, including notifications for unencrypted credentials uploaded to S3 (as this activity may not adhere to compliance best practices), risky actions like data exfiltration, and user-defined alerts based on saved searches. To view alerts generated from AMG’s activity, navigate to Alerts in the Macie menu.
AMG will continue running, periodically uploading content to the specified S3 buckets. To stop AMG, delete the AMG CloudFormation stack and its associated resources.
Costs Involved
Macie provides a free tier that allows you to analyze up to 1GB of content monthly at no charge. By default, AMG will write approximately 10MB of objects to Amazon S3 each day, and you will incur charges for data classification once you exceed the 1GB monthly free tier. If run continuously, AMG will generate around 310MB of content per month (10MB/day x 31 days), which remains below the free tier. Any data usage beyond 1GB will be billed at the Macie rate.
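The arithmetic above is easy to adapt if you change how much data AMG writes. The sketch below reproduces it, assuming 1GB is treated as 1,000MB; the function names and constants are ours, and the actual per-GB price is left out because it is not stated here.

```python
FREE_TIER_MB = 1000       # assumption: the 1GB free tier expressed as 1,000MB
DEFAULT_DAILY_MB = 10     # approximate default AMG output per day


def monthly_volume_mb(daily_mb, days=31):
    """Total content generated in a month, in MB."""
    return daily_mb * days


def billable_mb(total_mb, free_tier_mb=FREE_TIER_MB):
    """Portion of the monthly volume that exceeds the free tier."""
    return max(0, total_mb - free_tier_mb)


monthly = monthly_volume_mb(DEFAULT_DAILY_MB)
print(f"Monthly volume: {monthly}MB")
print(f"Billable volume: {billable_mb(monthly)}MB")
```

With the default of 10MB per day, the monthly volume is 310MB and the billable volume is 0MB, which matches the estimate above.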