Founded in 2008, Zalando has grown into Europe’s leading online fashion and lifestyle platform, with over 32 million active customers. As a lead data engineer at Zalando and an integral part of our cloud transformation, I want to share how Amazon Simple Storage Service (Amazon S3) became foundational to our data infrastructure. In this article, I outline Zalando’s need for deeper data insights, the challenges posed by our historical technology stack, and our decision to migrate to AWS, using Amazon S3 to establish a data lake. Finally, I describe how our use of Amazon S3 has progressed, from enabling employee access to data to optimizing storage costs with the various Amazon S3 storage classes. I hope you find valuable insights in our journey toward becoming a data-driven organization at multi-petabyte scale.
Zalando’s Technology Evolution
In 2015, Zalando operated as a fashion retailer with a substantial on-premises monolithic IT environment. The complexity of the systems grew alongside the number of teams needing to contribute, leading to a decision to transition from a traditional online retailer to a comprehensive fashion platform. This shift necessitated scalability, and after careful evaluation, we selected AWS as our cloud provider for its durability, availability, and scalability. Additionally, the extensive range of services offered by AWS presented numerous future opportunities.
Zalando’s monolith was broken up into microservices, with each team taking end-to-end responsibility for developing and operating its own services. This transformation significantly reshaped our data landscape. Central databases gave way to decentralized backends, and communication shifted to REST APIs. Our central data warehouse, which had previously connected directly to the transactional data stores, struggled to keep pace with this decentralized data environment.
To address these issues, we formed a central team tasked with building Zalando’s data lake. The initial motivations included establishing a central data archive within this new distributed framework and creating a distributed computing engine for the organization.
Upon eliminating the size constraints of relational databases, we discovered that Zalando was generating a wealth of potentially valuable data. We needed a storage solution capable of handling this increased volume while remaining scalable, reliable, and cost-effective. After exploring AWS’s offerings, Amazon S3 emerged as the optimal choice for our new central data lake.
Our primary focus during the initial setup of the data lake was integrating the key sources of data. At that time, we had already implemented a central event bus for service-to-service communication among the distributed microservices. Because the messages flowing through it are valuable for analytics, we introduced an archiver component that retains a copy of every published message in the data lake. The ingestion pipeline was built from serverless components and covers basic data preparation needs such as reformatting and repartitioning.
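To make the archiver concrete, here is a minimal sketch of how such a component could be implemented as an AWS Lambda function that writes each message to S3. The bucket name, record fields, and partitioning scheme are illustrative assumptions, not Zalando’s actual implementation.

```python
import gzip
import json
import uuid
from datetime import datetime, timezone

import boto3

s3 = boto3.client("s3")
DATA_LAKE_BUCKET = "example-data-lake-raw"  # placeholder bucket name


def handler(event, context):
    """Archive a batch of event bus messages to S3, partitioned by type and date."""
    now = datetime.now(timezone.utc)
    for record in event.get("records", []):  # the batch shape here is an assumption
        event_type = record.get("event_type", "unknown")
        # Partition by event type and ingestion date so downstream jobs can prune reads.
        key = (
            f"events/{event_type}/"
            f"year={now:%Y}/month={now:%m}/day={now:%d}/"
            f"{uuid.uuid4()}.json.gz"
        )
        body = gzip.compress(json.dumps(record).encode("utf-8"))
        s3.put_object(Bucket=DATA_LAKE_BUCKET, Key=key, Body=body)
```

Partitioning the keys by event type and date keeps the raw archive cheap to query, because downstream jobs only need to read the prefixes they care about.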
While migrating to the cloud, we continued to generate valuable datasets within our original data warehouse. This included critical datasets like the company’s central sales logic. A secondary pipeline ensured that data warehouse datasets were also accessible in the data lake. Additionally, web tracking data played a crucial role in this integration due to its considerable size and value when paired with existing datasets. These three pipelines established a steady inflow of data into our initial data lake.
S3 Features and Their Applications
The continuous growth of our Amazon S3-based data lake led to various scenarios that prompted us to utilize a range of S3 features. Below, I will discuss the features leveraged at Zalando, highlighting their benefits and relevant use cases.
Data Sharing and Access
The first challenge we faced was making data accessible. With teams working in their own AWS accounts, we needed a way to share data across accounts. Initially, we relied on bucket policies, which are attached directly to a bucket and specify which roles may perform actions such as GetObject on particular resources; this kept things simple while the number of consumers was small. As demand for data access grew, however, managing these bucket policies became cumbersome, and we eventually hit the policy size limit.
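For illustration, a cross-account bucket policy of this kind could be applied as in the following sketch; the account ID, role name, bucket name, and prefix are placeholders rather than our real values.

```python
import json

import boto3

s3 = boto3.client("s3")

# Placeholder values: a role in a consumer account is allowed to list the shared
# prefix and read the objects underneath it.
bucket_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowCrossAccountRead",
            "Effect": "Allow",
            "Principal": {"AWS": "arn:aws:iam::123456789012:role/analytics-reader"},
            "Action": "s3:GetObject",
            "Resource": "arn:aws:s3:::example-data-lake-bucket/shared/*",
        },
        {
            "Sid": "AllowCrossAccountList",
            "Effect": "Allow",
            "Principal": {"AWS": "arn:aws:iam::123456789012:role/analytics-reader"},
            "Action": "s3:ListBucket",
            "Resource": "arn:aws:s3:::example-data-lake-bucket",
            "Condition": {"StringLike": {"s3:prefix": "shared/*"}},
        },
    ],
}

s3.put_bucket_policy(
    Bucket="example-data-lake-bucket",
    Policy=json.dumps(bucket_policy),
)
```

Every new consumer adds statements like these to a single document, which is why the approach stops scaling once many teams need access.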
To adopt a more scalable and better isolated approach, we transitioned to IAM roles. They grant the same kind of access as a bucket policy, but the permissions live on a role whose trust relationship allows the consuming account to assume it. This change streamlined our data sharing processes significantly.
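Here is a minimal sketch of the consuming side, assuming the data lake account exposes a role (called data-lake-reader here purely for illustration) whose trust policy allows the consumer account to assume it.

```python
import boto3

# Placeholder role in the data lake account that trusts the consuming account.
DATA_LAKE_ROLE_ARN = "arn:aws:iam::111122223333:role/data-lake-reader"

sts = boto3.client("sts")
credentials = sts.assume_role(
    RoleArn=DATA_LAKE_ROLE_ARN,
    RoleSessionName="analytics-job",
)["Credentials"]

# Use the temporary credentials to read from the shared bucket in the data lake account.
s3 = boto3.client(
    "s3",
    aws_access_key_id=credentials["AccessKeyId"],
    aws_secret_access_key=credentials["SecretAccessKey"],
    aws_session_token=credentials["SessionToken"],
)
obj = s3.get_object(
    Bucket="example-data-lake-bucket",
    Key="shared/sales/part-0000.json.gz",  # placeholder object key
)
print(obj["ContentLength"])
```

Because each consumer gets its own role, access can be granted or revoked per team without touching a single, ever-growing bucket policy.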
Data Backup and Recovery
With a large data lake shared across multiple production use cases, we needed a strategy for data backup and recovery. Enabling versioning on our production buckets was the simplest solution. Versioning retains previous versions of every object, so older versions stay accessible and deleted objects can be recovered, even though they are no longer returned by the standard S3 API calls.
This feature proved invaluable, particularly in instances where a bug in our data pipeline necessitated rolling back outputs. For instance, in 2017, we faced a significant incident where a large volume of web tracking data was mistakenly deleted. Thanks to versioning, we could recover the lost data efficiently.
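As an illustration, the following sketch enables versioning on a bucket and then restores an accidentally deleted object by removing its delete marker; the bucket and key names are placeholders.

```python
import boto3

s3 = boto3.client("s3")
BUCKET = "example-data-lake-bucket"  # placeholder bucket name

# Enable versioning so that overwrites and deletes keep the previous object versions.
s3.put_bucket_versioning(
    Bucket=BUCKET,
    VersioningConfiguration={"Status": "Enabled"},
)

# In a versioned bucket, a delete only adds a delete marker; removing that marker
# makes the most recent real version of the object visible again.
key = "tracking/2017/07/01/part-0000.json.gz"  # placeholder object key
versions = s3.list_object_versions(Bucket=BUCKET, Prefix=key)
for marker in versions.get("DeleteMarkers", []):
    if marker["Key"] == key and marker["IsLatest"]:
        s3.delete_object(Bucket=BUCKET, Key=key, VersionId=marker["VersionId"])
```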
In conclusion, our journey with Amazon S3 has been transformative, enabling Zalando to harness the power of data effectively while enhancing operational efficiency.