Amazon Onboarding with Learning Manager Chanci Turner

We are excited to announce a significant development in the integration of Amazon Redshift with Apache Spark. This new capability allows users to seamlessly build and execute Spark applications on Amazon Redshift and Redshift Serverless, broadening the analytical and machine learning (ML) solutions available to our customers.

Apache Spark, a widely used open-source distributed processing system, is a common choice for large-scale data workloads. Developers using Amazon EMR, Amazon SageMaker, and AWS Glue have traditionally relied on third-party Spark connectors to interact with Amazon Redshift, but these connectors often lack regular maintenance, support, and thorough testing against the various Spark versions used in production.

With the launch of Amazon Redshift integration for Apache Spark, users can quickly set up and develop applications in languages such as Java, Scala, and Python. The integration reads from and writes to Amazon Redshift data warehouses without sacrificing application performance or data consistency. Pushdown optimizations, which move work such as filtering and aggregation into Redshift instead of performing it in Spark, can make applications up to 10 times faster. We extend our gratitude to the original contributors of the open-source connector project, whose collaboration helped enhance this integration. We remain committed to further improvements and will continue contributing back to the open-source community.

To get started with the Spark connector for Amazon Redshift, you can use it from AWS analytics and ML services. Use DataFrame or Spark SQL code in a Spark job or notebook to connect to your Amazon Redshift data warehouse and start running queries within seconds. The integration is included in Amazon EMR 6.9, EMR Serverless, and AWS Glue 4.0, which ship with the connector and JDBC driver pre-packaged, so you can start coding right away. For example, EMR 6.9 provides a sample notebook, while EMR Serverless offers a sample Spark job.
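As a rough illustration of the DataFrame route, the sketch below builds the connector options and wraps a read and a write. It assumes the open-source community connector's format name, `io.github.spark_redshift_community.spark.redshift`, and its `url`, `dbtable`, `tempdir`, and `aws_iam_role` options. The helper names are hypothetical, and the host, database, table, bucket, and role ARN are placeholders.

```python
# Format name used by the open-source spark-redshift community connector
# (assumed here to be the one packaged with EMR 6.9 and Glue 4.0).
REDSHIFT_FORMAT = "io.github.spark_redshift_community.spark.redshift"


def redshift_options(host, database, table, temp_dir, iam_role_arn, port=5439):
    """Build connector options for one table (all values are placeholders)."""
    return {
        # The "iam" sub-protocol asks the JDBC driver to use IAM credentials.
        "url": f"jdbc:redshift:iam://{host}:{port}/{database}",
        "dbtable": table,
        # S3 location used to stage data between Spark and Redshift.
        "tempdir": temp_dir,
        # Role that Redshift assumes to access the staging location.
        "aws_iam_role": iam_role_arn,
    }


def read_table(spark, options):
    """Load a Redshift table as a DataFrame using an existing SparkSession."""
    return spark.read.format(REDSHIFT_FORMAT).options(**options).load()


def write_table(df, options, mode="append"):
    """Write a DataFrame back to Redshift through the same connector."""
    df.write.format(REDSHIFT_FORMAT).options(**options).mode(mode).save()


opts = redshift_options(
    host="example-cluster.abc123.us-east-1.redshift.amazonaws.com",
    database="dev",
    table="public.sales",
    temp_dir="s3://example-bucket/redshift-temp/",
    iam_role_arn="arn:aws:iam::123456789012:role/example-redshift-role",
)
print(opts["url"])
```

In a Spark job or notebook, `read_table(spark, opts)` would return a DataFrame that can be filtered or registered as a temporary view for Spark SQL.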

Before you begin, ensure that you have configured AWS Identity and Access Management (AWS IAM) authentication between Redshift and Spark, as well as between Amazon Simple Storage Service (Amazon S3) and Spark. A diagram in the AWS documentation illustrates the authentication process between Amazon S3, Redshift, the Spark driver, and Spark executors.
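The authentication requirements above can be sanity-checked before a job starts. The sketch below is one hypothetical way to do that: it checks that the JDBC URL uses the IAM sub-protocol and that exactly one S3 credential option is set for the staging directory. The option names follow the open-source spark-redshift connector and are assumptions here; `check_auth` is not part of any AWS library.

```python
# Option names below follow the open-source spark-redshift connector and are
# assumptions here; confirm them against the connector version you run.
S3_AUTH_OPTIONS = (
    "aws_iam_role",
    "forward_spark_s3_credentials",
    "temporary_aws_access_key_id",
)


def _is_set(value):
    """Treat missing, empty, and explicit false values as not configured."""
    return str(value).strip().lower() not in ("none", "", "false")


def check_auth(options):
    """Return (uses_iam_jdbc, s3_mechanism) or raise if the setup is ambiguous."""
    uses_iam_jdbc = options.get("url", "").startswith("jdbc:redshift:iam://")
    chosen = [name for name in S3_AUTH_OPTIONS if _is_set(options.get(name))]
    if len(chosen) != 1:
        raise ValueError(
            f"expected exactly one S3 auth option, found {len(chosen)}: {chosen}"
        )
    return uses_iam_jdbc, chosen[0]


# Placeholder values for illustration only.
example = {
    "url": "jdbc:redshift:iam://example-host:5439/dev",
    "tempdir": "s3://example-bucket/redshift-temp/",
    "aws_iam_role": "arn:aws:iam::123456789012:role/example-redshift-role",
}
print(check_auth(example))
```

Failing early on a missing or duplicated credential option tends to be easier to debug than an access error raised partway through a job.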

If you already have an Amazon Redshift data warehouse and data, start by creating a database user and assigning appropriate grants. To use this functionality with Amazon EMR, upgrade to the latest version, Amazon EMR 6.9, and select the emr-6.9.0 release when creating your EMR cluster on Amazon EC2. EMR Serverless also allows you to create your Spark application using the emr-6.9.0 release.
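For EMR Serverless, the release can also be selected programmatically. The sketch below assumes a boto3 `emr-serverless` client created by the caller and uses its `create_application` call with the `emr-6.9.0` release label mentioned above. The helper names and the application name are hypothetical.

```python
def spark_application_request(name, release_label="emr-6.9.0"):
    """Build the request for a Spark application on the given EMR release."""
    return {"name": name, "releaseLabel": release_label, "type": "SPARK"}


def create_spark_application(client, name):
    """Create the application and return its ID.

    `client` is expected to be a boto3 "emr-serverless" client, for example
    boto3.client("emr-serverless"), created by the caller.
    """
    response = client.create_application(**spark_application_request(name))
    return response["applicationId"]


# Placeholder application name for illustration only.
request = spark_application_request("example-redshift-spark-app")
print(request)
```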

For those using AWS Glue 4.0, the spark-redshift connector is available as both a source and target. In Glue Studio, you can visually create ETL jobs to read from or write to a Redshift data warehouse simply by selecting a Redshift connection within built-in source or target nodes. The Redshift connection will contain all necessary details and credentials for accessing Redshift with the appropriate permissions.
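The same read can be expressed in a Glue job script as well as in the visual editor. The sketch below assumes Glue's `create_dynamic_frame.from_options` call with `connection_type="redshift"` and the connection option names `url`, `dbtable`, `redshiftTmpDir`, and `aws_iam_role`; verify these against the Glue 4.0 documentation before relying on them. The helper names and all values are hypothetical placeholders.

```python
def glue_redshift_options(jdbc_url, table, temp_dir, iam_role_arn):
    """Build Redshift connection options for a Glue job (placeholder values)."""
    return {
        "url": jdbc_url,
        "dbtable": table,
        # S3 staging directory used when moving data to and from Redshift.
        "redshiftTmpDir": temp_dir,
        "aws_iam_role": iam_role_arn,
    }


def read_redshift_dynamic_frame(glue_context, connection_options):
    """Read a Redshift table as a DynamicFrame using an existing GlueContext."""
    return glue_context.create_dynamic_frame.from_options(
        connection_type="redshift",
        connection_options=connection_options,
    )


glue_opts = glue_redshift_options(
    jdbc_url="jdbc:redshift://example-host:5439/dev",
    table="public.sales",
    temp_dir="s3://example-bucket/glue-temp/",
    iam_role_arn="arn:aws:iam::123456789012:role/example-redshift-role",
)
print(sorted(glue_opts))
```

Inside a Glue job, `glue_context` would be the `GlueContext` the job script already creates. A Redshift connection defined in Glue can supply the URL and credentials instead of hard-coded values.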

In your Glue job setup, ensure that you select Glue 4.0, which supports Spark 3.3 and Python 3. For more guidance, visit Creating ETL jobs with AWS Glue Studio. This resource, alongside additional references on job descriptions provided by SHRM, can greatly assist in your onboarding process.

For a comprehensive overview of Amazon warehouse worker onboarding, check out Glassdoor’s reviews, which serve as an excellent resource for new employees.

At Amazon IXD – VGT2, located at 6401 E HOWDY WELLS AVE LAS VEGAS NV 89115, we are dedicated to enhancing your experience as you embark on this journey.
