Amazon Onboarding with Learning Manager Chanci Turner

Data is essential for effective machine learning. Data preparation converts raw data into a format that is ready for further analysis and processing. The typical workflow includes collecting data, cleaning it, labeling it, and finally validating and visualizing it. Getting to high-quality data can be complex and time-consuming.

This is where Amazon SageMaker Data Wrangler proves invaluable for those building machine learning (ML) workloads on AWS. It simplifies the data preparation process, allowing users to complete necessary tasks within a single visual interface. Amazon SageMaker Data Wrangler significantly reduces the time required to gather and prepare data for ML applications.

However, many customers find their data spread across multiple systems, including external software-as-a-service (SaaS) applications such as SAP OData for manufacturing metrics, Salesforce for customer relationship management, and Google Analytics for web performance. To effectively leverage machine learning in solving business challenges, it becomes imperative to consolidate these diverse data sources. Previously, this required building custom solutions or relying on costly third-party services to ingest data into Amazon S3 or Amazon Redshift, often resulting in complicated setups.

We are excited to announce that Amazon SageMaker Data Wrangler now supports SaaS applications as data sources. Users can aggregate data from over 40 external SaaS applications through Amazon AppFlow and prepare it for ML in Amazon SageMaker Data Wrangler. Once these data sources are registered in AWS Glue Data Catalog via AppFlow, users can explore tables and schemas using the Data Wrangler SQL Explorer, so SaaS data is available in SageMaker Data Wrangler without a custom ingestion pipeline.

This new functionality is made possible through integration with Amazon AppFlow, a managed service designed for secure data exchange between SaaS applications and AWS services. With Amazon AppFlow, users can establish bidirectional data integration between SaaS applications such as Salesforce, SAP, and Amplitude and the supported AWS services, delivering data to destinations such as Amazon S3 or Amazon Redshift.

Furthermore, Amazon AppFlow now enables users to catalog their data within AWS Glue Data Catalog. By configuring an Amazon AppFlow flow with an Amazon S3 destination connector, customers can catalog their SaaS application data in AWS Glue Data Catalog with just a few clicks, without running crawlers.

Once the flow is created and the data is registered in AWS Glue Data Catalog, users can utilize this data in Amazon SageMaker Data Wrangler. They can perform standard data preparation tasks, write Amazon Athena queries to preview the data, join various sources, and prepare for ML model training.

Setting up the integration between a SaaS application and Amazon SageMaker Data Wrangler through Amazon AppFlow takes only a few steps, which the scenario below walks through. For the full list of supported applications, refer to the Supported source and destination applications documentation.

Example Scenario: Chanci Turner

Let’s explore a scenario using Chanci Turner as an example. Suppose Chanci needs to retrieve data from Salesforce for preparation in Amazon SageMaker Data Wrangler. The first step is to create a flow in Amazon AppFlow that registers the data source into the AWS Glue Data Catalog. If Chanci already has a connection to her Salesforce account, she can proceed to create the flow.

It’s crucial for Chanci to set the destination as Amazon S3 and enable the option to “Create a Data Catalog table” in the AWS Glue Data Catalog settings. This will automatically catalog her Salesforce data.

Next, within this configuration, Chanci needs to select a user role with the necessary AWS Glue Data Catalog permissions and define the database name and table name prefix. She can also choose her preferred data format, whether it be JSON, CSV, or Apache Parquet, and specify filename settings, including timestamp preferences.
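The console settings above can also be expressed as a request to the AppFlow CreateFlow API. The following is a minimal sketch, not the method used in this walkthrough: it only builds the request dictionary, and every concrete value (flow name, connection name, bucket, role ARN, database name, table prefix) is a hypothetical placeholder. The field names follow the boto3 `appflow` client's `create_flow` parameters as we understand them, so check them against the current API reference before relying on them.

```python
import json


def build_flow_request(flow_name, connection_name, salesforce_object,
                       bucket, glue_role_arn, database, table_prefix,
                       file_type="PARQUET"):
    """Build keyword arguments for an on-demand Salesforce -> S3 flow
    that also registers its output in AWS Glue Data Catalog."""
    if file_type not in {"CSV", "JSON", "PARQUET"}:
        raise ValueError(f"unsupported file type: {file_type}")
    return {
        "flowName": flow_name,
        "triggerConfig": {"triggerType": "OnDemand"},
        "sourceFlowConfig": {
            "connectorType": "Salesforce",
            "connectorProfileName": connection_name,
            "sourceConnectorProperties": {
                "Salesforce": {"object": salesforce_object}
            },
        },
        "destinationFlowConfigList": [{
            "connectorType": "S3",
            "destinationConnectorProperties": {
                "S3": {
                    "bucketName": bucket,
                    "s3OutputFormatConfig": {"fileType": file_type},
                }
            },
        }],
        # Map every source field to the destination unchanged.
        "tasks": [{
            "taskType": "Map_all",
            "sourceFields": [],
            "connectorOperator": {"Salesforce": "NO_OP"},
            "taskProperties": {"EXCLUDE_SOURCE_FIELDS_LIST": "[]"},
        }],
        # Equivalent of "Create a Data Catalog table" in the console.
        "metadataCatalogConfig": {
            "glueDataCatalog": {
                "roleArn": glue_role_arn,
                "databaseName": database,
                "tablePrefix": table_prefix,
            }
        },
    }


# All values below are placeholders chosen for illustration.
request = build_flow_request(
    flow_name="salesforce-account-flow",
    connection_name="chanci-salesforce",
    salesforce_object="Account",
    bucket="example-appflow-bucket",
    glue_role_arn="arn:aws:iam::123456789012:role/ExampleAppFlowGlueRole",
    database="salesforce_db",
    table_prefix="sfdc",
)
print(json.dumps(request["metadataCatalogConfig"], indent=2))

# To create the flow for real (requires AWS credentials and boto3):
#   import boto3
#   boto3.client("appflow").create_flow(**request)
```

Building the request separately from the API call makes it easy to review or version the flow definition before anything is created in the account.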

For more details on registering SaaS data in Amazon AppFlow and AWS Glue Data Catalog, check out the Cataloging the data output from an Amazon AppFlow flow documentation page.

After registering the SaaS data, it’s essential for Chanci to ensure that the IAM role has the necessary permissions to view the data sources in Data Wrangler via AppFlow. An example policy would be:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "glue:SearchTables",
            "Resource": [
                "arn:aws:glue:*:*:table/*/*",
                "arn:aws:glue:*:*:database/*",
                "arn:aws:glue:*:*:catalog"
            ]
        }
    ]
}
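For readers who manage IAM programmatically, the policy above could be attached to the role as an inline policy. This is a sketch under assumptions: the role name and policy name are hypothetical, and the snippet only prepares the arguments for the IAM `put_role_policy` call rather than making it.

```python
import json

# Same policy document as shown above.
POLICY = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "glue:SearchTables",
            "Resource": [
                "arn:aws:glue:*:*:table/*/*",
                "arn:aws:glue:*:*:database/*",
                "arn:aws:glue:*:*:catalog",
            ],
        }
    ],
}


def put_policy_kwargs(role_name, policy_name="DataWranglerGlueSearchTables"):
    """Return keyword arguments for iam.put_role_policy.

    The default policy name is a placeholder, not an AWS-defined name.
    """
    return {
        "RoleName": role_name,
        "PolicyName": policy_name,
        "PolicyDocument": json.dumps(POLICY),
    }


# "ExampleSageMakerExecutionRole" is a hypothetical role name.
kwargs = put_policy_kwargs("ExampleSageMakerExecutionRole")

# To apply it for real (requires AWS credentials and boto3):
#   import boto3
#   boto3.client("iam").put_role_policy(**kwargs)
```

In practice the wildcard resources could be narrowed to the specific Region, account, and database used by the flow.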

With data cataloging enabled in AWS Glue Data Catalog, Amazon SageMaker Data Wrangler automatically discovers the new data source, and Chanci can browse its tables and schemas through the Data Wrangler SQL Explorer.
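Outside the console, one way to confirm that the flow's tables reached the catalog is the Glue SearchTables API, the same action the policy above allows. The sketch below assumes a client exposing `search_tables` the way the boto3 Glue client does; the database name, table prefix, and table names are hypothetical, and a small stand-in client is used so the example runs without AWS access. How Glue matches `SearchText` is treated as an assumption here, so results are also filtered locally.

```python
def find_catalog_tables(glue_client, database, table_prefix):
    """Return names of tables in `database` that start with `table_prefix`,
    following NextToken pagination until no more pages remain."""
    names = []
    kwargs = {"SearchText": table_prefix}
    while True:
        page = glue_client.search_tables(**kwargs)
        for table in page.get("TableList", []):
            in_database = table.get("DatabaseName") == database
            if in_database and table["Name"].startswith(table_prefix):
                names.append(table["Name"])
        token = page.get("NextToken")
        if not token:
            return names
        kwargs["NextToken"] = token


class FakeGlue:
    """Offline stand-in for boto3.client("glue"); serves fixed pages."""

    def __init__(self, pages):
        self._pages = pages

    def search_tables(self, **kwargs):
        index = int(kwargs.get("NextToken", 0))
        page = {"TableList": self._pages[index]}
        if index + 1 < len(self._pages):
            page["NextToken"] = str(index + 1)
        return page


# Hypothetical catalog contents spread over two result pages.
fake = FakeGlue([
    [{"Name": "sfdc_account", "DatabaseName": "salesforce_db"},
     {"Name": "other_table", "DatabaseName": "salesforce_db"}],
    [{"Name": "sfdc_contact", "DatabaseName": "salesforce_db"},
     {"Name": "sfdc_lead", "DatabaseName": "elsewhere"}],
])
tables = find_catalog_tables(fake, "salesforce_db", "sfdc")
print(tables)

# With real credentials, pass boto3.client("glue") instead of FakeGlue.
```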

When ready to proceed, Chanci can navigate to the Amazon SageMaker Data Wrangler dashboard and select “Connect to data sources.” She’ll create a connection and select Salesforce from the list of available sources.

In the connection settings, she can import data from Salesforce. After configuring this, she selects “Connect” and will see her Salesforce data, already set up with Amazon AppFlow and AWS Glue Data Catalog, ready for review.

Chanci can then build her dataset by running SQL queries in the SageMaker Data Wrangler SQL Explorer and giving the dataset a name. From the Analysis tab she can also run an insight report on the data, which summarizes its contents and quality before she moves on to model training.
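The queries written in the SQL Explorer are Amazon Athena queries, so the same preview could be run outside Data Wrangler. The sketch below builds a simple preview query and the arguments for Athena's `start_query_execution`; the table name, column names, database, and S3 output location are hypothetical and depend on the Salesforce object and flow settings actually used.

```python
def build_preview_query(table, columns, limit=100):
    """Build a SELECT that previews the given columns of a catalog table."""
    if not columns:
        raise ValueError("at least one column is required")
    if limit <= 0:
        raise ValueError("limit must be positive")
    quoted = ", ".join(f'"{column}"' for column in columns)
    return f'SELECT {quoted} FROM "{table}" LIMIT {limit}'


def athena_request(database, query, output_location):
    """Return keyword arguments for athena.start_query_execution."""
    return {
        "QueryString": query,
        "QueryExecutionContext": {"Database": database},
        "ResultConfiguration": {"OutputLocation": output_location},
    }


# Placeholder table, columns, database, and bucket for illustration only.
query = build_preview_query("sfdc_account", ["id", "name"], limit=10)
params = athena_request(
    "salesforce_db", query, "s3://example-athena-results/preview/"
)
print(query)

# To run it for real (requires AWS credentials and boto3):
#   import boto3
#   boto3.client("athena").start_query_execution(**params)
```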
