We are excited to announce the general availability of our new data preparation authoring feature in AWS Glue Studio, which is tailored for business users and data analysts. This intuitive no-code interface, designed with a spreadsheet-style layout, allows users to efficiently run data integration jobs using AWS Glue for Spark. With this new visual experience, data analysts and scientists can easily clean and transform data, making it ready for analytics and machine learning (ML) applications. Users can choose from numerous pre-built transformations to automate their data preparation tasks without any coding required.
Business analysts can collaborate seamlessly with data engineers to create data integration jobs. Data engineers can use AWS Glue Studio’s visual flow-based interface to connect to data sources and define the overall order of the data flow, while business analysts define the necessary data transformations and outputs. Furthermore, you can import your existing AWS Glue DataBrew data cleansing and preparation “recipes” into this new experience, so you can continue authoring them directly within AWS Glue Studio and scale recipes to process petabytes of data at the lower cost of AWS Glue jobs.
Visual ETL Prerequisites
To use the visual ETL feature, users must have the AWSGlueConsoleFullAccess IAM managed policy attached to their IAM identity. This policy provides full access to AWS Glue and read access to Amazon Simple Storage Service (Amazon S3) resources.
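If you prefer to set this up programmatically rather than in the console, here is a minimal sketch using boto3; the user name “data-analyst” is a placeholder for your own IAM user:

```python
import boto3

iam = boto3.client("iam")

# Attach the AWS managed policy that grants AWS Glue console access.
# "data-analyst" is a placeholder user name for this example.
iam.attach_user_policy(
    UserName="data-analyst",
    PolicyArn="arn:aws:iam::aws:policy/AWSGlueConsoleFullAccess",
)
```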
Advanced Visual ETL Flows
With the required AWS Identity and Access Management (IAM) permissions in place, you can start authoring your visual ETL flows in AWS Glue Studio.
Extract
To follow along, I created an S3 bucket in the same Region as my AWS Glue visual ETL job and uploaded a .csv file named “visual ETL conference data.csv” containing the data I want to work with. Begin by adding an Amazon S3 node from the list of sources, select the new node, and browse to your S3 dataset. Then choose “Infer schema” to configure the source node, and the visual interface displays a preview of the data found in the .csv file. It is crucial that the permissions described earlier are set correctly so that AWS Glue can read the S3 bucket; otherwise, an error prevents you from previewing the data.
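Behind the scenes, the visual flow runs as an AWS Glue for Spark job. As a rough sketch of what the S3 source node corresponds to in code (the bucket name “my-visual-etl-bucket” is a placeholder):

```python
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

# Read the .csv file from S3 into a DynamicFrame; the bucket name
# "my-visual-etl-bucket" is a placeholder for this example.
conferences = glue_context.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={
        "paths": ["s3://my-visual-etl-bucket/visual ETL conference data.csv"]
    },
    format="csv",
    format_options={"withHeader": True},
)

conferences.printSchema()  # inspect the inferred schema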
Transform
After configuring the source node, add a Data Preparation Recipe node and start a data preview session, which typically takes about 2–3 minutes. Once the session is ready and the data frame has loaded, choose “Author Recipe” to start adding transformation steps. Throughout the authoring session, you can view the data, apply transformation steps, and see the results interactively, and you can undo, redo, and rearrange steps. The interface also shows the data type and statistical properties of each column. For example, to focus on conferences held in South Africa, I added two recipe steps that filter on conditions where the Location column equals “South Africa” and the Comments column contains a value.
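Since recipe steps also run as AWS Glue for Spark transformations, here is a hedged sketch of equivalent filter logic in code, reusing the conferences DynamicFrame and the Location and Comments column names from the example above:

```python
from awsglue.transforms import Filter

# Keep only conferences held in South Africa whose Comments column is populated.
# This reuses the "conferences" DynamicFrame from the extract step above.
south_africa = Filter.apply(
    frame=conferences,
    f=lambda row: row["Location"] == "South Africa"
    and row["Comments"] is not None
    and row["Comments"] != "",
)

south_africa.toDF().show(5)  # preview a few filtered rows
```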
Load
Once your data is prepared, you can share your work with data engineers, who can enhance it with more advanced visual ETL flows and custom code for seamless integration into their production data pipelines.
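To round out the picture, a minimal sketch of the load step in AWS Glue for Spark code, writing the filtered DynamicFrame from the transform sketch above to a placeholder output path:

```python
# Write the prepared data back to S3 in Parquet format; the output
# path "s3://my-visual-etl-bucket/prepared/" is a placeholder.
glue_context.write_dynamic_frame.from_options(
    frame=south_africa,
    connection_type="s3",
    connection_options={"path": "s3://my-visual-etl-bucket/prepared/"},
    format="parquet",
)
```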
Now Available
The AWS Glue data preparation authoring feature is now available in all commercial AWS Regions where AWS Glue DataBrew is available. To learn more, visit AWS Glue, watch this video, and read the AWS Big Data blog.
— Chanci Turner