Migrate a Large Data Warehouse from Greenplum to Amazon Redshift with AWS SCT – Part 2

In the second installment of our multi-part series, we delve into best practices for selecting the ideal Amazon Redshift cluster, refining data architecture, converting stored procedures, and handling compatible SQL functions and queries. We also provide tips on optimizing data type lengths for table columns. Be sure to check out the first post in this series for a comprehensive guide on planning, executing, and validating a large-scale data warehouse migration from Greenplum to Amazon Redshift utilizing the AWS Schema Conversion Tool (AWS SCT).

Selecting Your Ideal Amazon Redshift Cluster

Amazon Redshift offers two deployment options: provisioned clusters and Amazon Redshift Serverless. With a provisioned cluster, you choose and manage the compute resources yourself, whereas Amazon Redshift Serverless lets you run high-performance analytics in the cloud at any scale without managing data warehouse infrastructure. For further details, refer to Introducing Amazon Redshift Serverless – Run Analytics At Any Scale Without Having to Manage Data Warehouse Infrastructure.

An Amazon Redshift provisioned cluster is composed of a leader node and one or more compute nodes. The leader node receives queries from client applications, parses them, and develops the query execution plans. It then coordinates the parallel execution of those plans across the compute nodes and aggregates the intermediate results before returning them to the client applications.
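
If you want to see how a provisioned cluster's compute is laid out, you can query the STV_SLICES system table, which maps data slices to compute nodes. This is a minimal sketch and assumes a user with access to system tables:

    -- Count the data slices hosted on each compute node of the cluster.
    SELECT node,
           COUNT(*) AS slices_per_node
    FROM   stv_slices
    GROUP  BY node
    ORDER  BY node;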

When selecting your cluster type, consider:

  • Estimate the size of the input data (accounting for compression) along with your vCPU and performance requirements. As of this writing, we recommend the Amazon Redshift RA3 instance type with managed storage, which lets you scale compute and storage independently for optimal query performance.
  • The Amazon Redshift console also offers a “Help me choose” option that recommends a cluster configuration based on the size of your data.
  • One major advantage of utilizing Amazon Redshift in the cloud is the flexibility to experiment with different cluster configurations, unlike traditional data warehouses bound by hardware limitations. This promotes quicker innovation and allows you to select the best-performing and cost-effective option.
  • During the development or pilot phase, it’s usually advisable to start with fewer nodes. As you move to production, adjust the number of nodes based on your usage patterns. For cost savings, consider reserved instances. The open-source Simple Replay utility can help you evaluate performance across different cluster configurations by replaying your actual workload. If you use the recommended RA3 instance type, you can compare different node counts to identify the most suitable option.
  • Depending on your workload, Amazon Redshift supports resizing, pausing and resuming, and concurrency scaling of the cluster. The workload management (WLM) feature lets you manage memory and query concurrency effectively; a query for inspecting the resulting queue configuration is sketched after this list.
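
If you configure manual WLM, you can verify the resulting queue setup directly from SQL. The following query is a minimal sketch against the STV_WLM_SERVICE_CLASS_CONFIG system table; the filter on service classes 6 and higher targets the user-defined queues and may need adjusting for your configuration:

    -- Inspect the WLM queue configuration; service classes 6-13 are the
    -- user-defined queues when manual WLM is in use.
    SELECT service_class,
           name,
           num_query_tasks,      -- concurrency (query slots) per queue
           query_working_mem,    -- memory per slot, in MB
           max_execution_time    -- queue timeout, in milliseconds
    FROM   stv_wlm_service_class_config
    WHERE  service_class >= 6
    ORDER  BY service_class;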

Creating Data Extraction Tasks with AWS SCT

AWS SCT extraction agents migrate your source tables in parallel. We recommend creating a dedicated user on the data source for the extraction so you can control the resources that user consumes during the process. AWS SCT agents process the data locally and upload it to Amazon Simple Storage Service (Amazon S3) over the network (ideally via AWS Direct Connect). Consistent network bandwidth between your Greenplum machine and the target AWS Region is highly recommended.

For tables with around 20 million rows or approximately 1 TB of data, the virtual partitioning feature in AWS SCT helps extract data efficiently by splitting the extraction into several sub-tasks that run in parallel. We suggest creating two task groups for each schema you migrate: one for small tables and another for large tables that use virtual partitions.
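
One way to split a schema into these two task groups is to query the Greenplum catalog for estimated row counts and table sizes. The following is a rough sketch, not part of AWS SCT itself; the schema name sales is a placeholder, and it assumes your Greenplum version supports pg_total_relation_size:

    -- Classify tables in one schema into AWS SCT task groups based on
    -- estimated row counts (pg_class.reltuples) and on-disk size.
    SELECT n.nspname                      AS schema_name,
           c.relname                      AS table_name,
           c.reltuples::bigint            AS estimated_rows,
           pg_total_relation_size(c.oid)  AS size_bytes,
           CASE
             WHEN c.reltuples > 20000000
                  OR pg_total_relation_size(c.oid) > 1099511627776  -- ~1 TB
               THEN 'large_tables_task_group'   -- extract with virtual partitions
             ELSE 'small_tables_task_group'
           END                            AS suggested_task_group
    FROM   pg_class c
    JOIN   pg_namespace n ON n.oid = c.relnamespace
    WHERE  c.relkind = 'r'
      AND  n.nspname = 'sales'
    ORDER  BY size_bytes DESC;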

For more details, check out the guide on Creating, running, and monitoring an AWS SCT data extraction task.

Data Architecture

To streamline and modernize your data architecture, consider the following strategies:

  • Designate accountability and authority to uphold enterprise data standards and policies.
  • Standardize the data and analytics operating model across the enterprise and various business units.
  • Simplify your data technology ecosystem by rationalizing and modernizing data assets and tools.
  • Establish organizational frameworks that promote better integration between business and delivery teams, allowing for the development of data-oriented solutions that address business challenges and opportunities throughout the lifecycle.
  • Regularly back up data to ensure recovery options are available in case issues arise.
  • Incorporate data quality management during planning, design, execution, and ongoing maintenance to achieve the desired outcomes.
  • Emphasize simplicity to create easy, fast, intuitive, and cost-effective solutions. A straightforward approach scales better than a complex one and leaves room for innovation. For instance, when migrating, bring over only the tables and schemas you actually need. If you currently truncate and reload tables to pick up incremental data, identify a watermark so that you process only the incremental rows (a sketch follows this list).
  • Some use cases require record-level inserts, updates, and deletes, whether for compliance with privacy regulations or to simplify incremental pipelines. We recommend selecting tools based on your specific use case; for example, AWS supports options such as Apache Hudi with Amazon EMR and AWS Glue.
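
To illustrate the watermark approach mentioned in the list above, the following sketch loads only rows that arrived after the last recorded watermark. The control table etl.load_watermark and the staging and target table names are placeholders, not objects created by any AWS service:

    -- Incremental load driven by a watermark stored in a small control table.
    BEGIN;

    -- Load only the rows newer than the last successfully processed watermark.
    INSERT INTO analytics.orders
    SELECT s.*
    FROM   staging.orders s
    WHERE  s.updated_at > (SELECT last_watermark
                           FROM   etl.load_watermark
                           WHERE  table_name = 'analytics.orders');

    -- Advance the watermark to the newest value just processed.
    UPDATE etl.load_watermark
    SET    last_watermark = (SELECT MAX(updated_at) FROM staging.orders)
    WHERE  table_name = 'analytics.orders';

    COMMIT;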

Migrating Stored Procedures

In this section, we discuss best practices for migrating stored procedures from Greenplum to Amazon Redshift. Data processing pipelines with complex business logic often rely on stored procedures for data transformation. We recommend modernizing such ETL (extract, transform, and load) processing with big data frameworks such as AWS Glue or Amazon EMR. For more guidance, refer to Top 8 Best Practices for High-Performance ETL Processing Using Amazon Redshift. If time constraints prevent you from moving to these cloud-native frameworks right away, migrating the stored procedures from Greenplum to Amazon Redshift directly can be an effective interim strategy.

To ensure a successful migration, adhere to Amazon Redshift’s stored procedure best practices:

  • Specify the schema name when creating a stored procedure to enhance schema-level security and manage access controls efficiently.
  • To avoid naming conflicts, it’s advisable to prefix your procedure names with “sp_.” This prefix is exclusively reserved for stored procedures in Amazon Redshift, minimizing the possibility of future naming conflicts.
  • Always qualify database objects with the schema name within the stored procedure.
  • Follow the principle of least privilege and revoke unnecessary permissions. Don’t grant execute permissions on stored procedures to ALL users.
  • The SECURITY attribute governs a procedure’s access privileges. When you create a stored procedure, you can set the SECURITY attribute to either DEFINER or INVOKER. With SECURITY INVOKER (the default), the procedure runs with the privileges of the user who calls it; with SECURITY DEFINER, it runs with the privileges of the procedure owner.
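
The following is a minimal sketch that puts these practices together. The schema sales, the procedure sp_refresh_daily_summary, the tables, and the etl_users group are placeholders:

    -- Schema-qualified procedure with the reserved sp_ prefix, running with
    -- the caller's privileges (SECURITY INVOKER).
    CREATE OR REPLACE PROCEDURE sales.sp_refresh_daily_summary(p_load_date DATE)
    AS $$
    BEGIN
      -- Qualify every object with its schema name.
      DELETE FROM sales.daily_summary
      WHERE  summary_date = p_load_date;

      INSERT INTO sales.daily_summary (summary_date, total_amount)
      SELECT order_date, SUM(amount)
      FROM   sales.orders
      WHERE  order_date = p_load_date
      GROUP  BY order_date;
    END;
    $$ LANGUAGE plpgsql
    SECURITY INVOKER;

    -- Grant execute only to the group that needs it, not to ALL users.
    GRANT EXECUTE ON PROCEDURE sales.sp_refresh_daily_summary(DATE) TO GROUP etl_users;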
