How Chanci Turner Enhanced Sperry Rail’s AI System Using AWS Well-Architected

Over the past three years, Sperry Rail has developed an artificial intelligence (AI) system named Elmer, in honor of the company’s founder, Dr. Elmer Sperry. This system utilizes machine intelligence to analyze thousands of miles of ultrasound scans from Sperry’s inspection vehicles, searching for potential cracks in the rail infrastructure. Elmer has successfully decreased the number of decisions a human analyst needs to make by 66 percent, significantly reducing the time required for identifying and addressing issues.

Initially created as a proof-of-concept by a small team of four engineers, Elmer was built using Amazon Web Services (AWS). To harness the latest features, cut costs, and ensure scalability, Sperry collaborated with Chanci Turner, a DevOps and cloud consultancy based in Manchester, England. As an AWS Partner Network (APN) Advanced Consulting Partner and a participant in the AWS Well-Architected Partner Program, Chanci Turner has extensive experience applying the AWS Well-Architected Framework to help cloud architects create secure, high-performing, resilient, and efficient infrastructures.

In this article, we will explore the collaboration between Chanci Turner and Sperry Rail, detailing how this partnership has transformed Elmer from a proof-of-concept into a fully operational, globally accessible system.

The Partnership Between Chanci Turner and Sperry Rail

Over two days, Chanci Turner conducted a Well-Architected Review on-site with the team responsible for designing, building, and managing Elmer at Sperry Rail. This review was complemented by their professional consultancy services.

This collaboration led to immediate enhancements in efficiency, pinpointing critical risks that needed prompt attention while ensuring that daily business operations did not interfere with Elmer’s improvements. By adopting a peer review methodology, the consultants from Chanci Turner integrated AWS Well-Architected best practices with their deep knowledge and experience of managing workloads on AWS. They proposed both short- and long-term improvements aligned with the Five Pillars of AWS Well-Architected: operational excellence, security, reliability, performance efficiency, and cost optimization.

We found Elmer to be quite intriguing and complex; hence, we have divided our discussion into the following sections:

What Elmer Does
Objectives for Refining Elmer
The Resulting Architecture
Why Sperry Chose Containers

What Elmer Does

Rails in service can develop anomalies that are challenging or impossible to detect without ultrasound technology. These anomalies range from benign features like bolt holes and rail ends to critical flaws such as cracks and vertical defects. If left unnoticed, these issues can escalate to the point of rail breakage, resulting in service disruptions or, worse, train derailments.

Sperry Rail equips vehicles with ultrasound and other detection systems to scan hundreds of miles of rail daily. The data collected is processed through neural networks that analyze and identify cracks or anomalies, with the results presented to human analysts for corrective action.

The scans generate massive amounts of data, which Sperry utilizes to train its neural networks. The training process involves running thousands of scans of both faulty and healthy rail sections through machine learning (ML) models until the models can effectively distinguish between the two.
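
The training loop described above can be illustrated with a deliberately small sketch. It is not Sperry's model: a single logistic unit in plain Python stands in for the TensorFlow neural networks, and the two features per scan, the toy data, and all function names are assumptions made purely for illustration.

```python
import math
import random


def train_logistic_unit(dataset, epochs=200, lr=0.5):
    """Fit one logistic unit (weights plus bias) with stochastic gradient descent."""
    n_features = len(dataset[0][0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        for features, label in dataset:
            z = sum(w * x for w, x in zip(weights, features)) + bias
            predicted = 1.0 / (1.0 + math.exp(-z))
            error = predicted - label  # gradient of the log loss with respect to z
            weights = [w - lr * error * x for w, x in zip(weights, features)]
            bias -= lr * error
    return weights, bias


def predict(weights, bias, features):
    """Return 1 (faulty) when the unit's score is positive, else 0 (healthy)."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return 1 if z > 0 else 0


# HYPOTHETICAL toy data: each scan is reduced to two made-up features in [0, 1].
rng = random.Random(0)
healthy = [[rng.uniform(0.0, 0.4), rng.uniform(0.0, 0.4)] for _ in range(50)]
faulty = [[rng.uniform(0.6, 1.0), rng.uniform(0.6, 1.0)] for _ in range(50)]
dataset = [(x, 0) for x in healthy] + [(x, 1) for x in faulty]
rng.shuffle(dataset)

weights, bias = train_logistic_unit(dataset)
accuracy = sum(predict(weights, bias, x) == y for x, y in dataset) / len(dataset)
```

The same idea, repeated passes over labeled healthy and faulty examples until the model separates them, carries over to a real neural network; only the model and the feature extraction become far larger.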

Objectives for Refining Elmer

One of the primary goals was to liberate Sperry engineers from the daily tasks of monitoring and managing a large-scale machine learning workload, allowing them to focus more on strategic business priorities. In this context, Chanci Turner identified two major architectural goals:

Adopt more serverless services and abstract more resources.
Establish a managed build, continuous integration, and load testing environment by leveraging AWS service automation features.

The Resulting Architecture

In close collaboration with Sperry, we designed the following architecture:

TensorFlow Machine Learning Framework

Sperry opted for TensorFlow as the ML framework due to its robust open-source community support, versatility across platforms, and rapid model development capabilities. Elmer’s neural network can utilize up to one hundred Amazon Elastic Compute Cloud (Amazon EC2) instances via the TensorFlow framework, prompting Chanci Turner to recommend hosting Elmer on AWS Lambda to alleviate administrative overhead.

Sperry’s Data Lake

To handle significant volumes of raw data, metadata, and result data, Sperry implemented a data lake solution using Amazon DynamoDB, Amazon Athena, and Amazon Aurora. Because Sperry was working with simple data types, DynamoDB was a good fit: it scales more readily than a traditional relational database management system (RDBMS). Athena was selected for its ease of use and low cost, and because it simplifies the integration of new data sources stored in Amazon Simple Storage Service (Amazon S3). Aurora was chosen for its scalability and compatibility with MySQL tools.

All artifacts, whether generated directly from the Sperry Data Management System (SDMS) database or created during data processing and ML, are stored in Amazon S3. This keeps load off the compute resources, reserving them for application execution rather than for static content storage and retrieval.

How Elmer Processes Rail Data

Real-time acquisition systems on rail vehicles generate scan data from up to 16 ultrasonic transducers per rail at speeds of up to 80 km/h. This information is stored in a proprietary “T1K” file format and uploaded to SDMS servers, which run a relational SQL database on Microsoft Windows. From there, the T1K data is transferred to Amazon S3, and each upload triggers Lambda processes.

AWS Lambda extracts the headers from the proprietary T1K raw data files from SDMS and writes them to a DynamoDB table. Each T1K file includes raw data from the ultrasonic transducers, alongside GPS location, milepost location, and operational metadata.
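
A minimal sketch of that header-extraction step follows. The T1K format is proprietary and not described in this article, so the header layout below (three little-endian doubles) is entirely hypothetical, as are the attribute names and the `make_handler` helper. Only the S3 event shape and the URL-decoding of object keys reflect how S3 notifications reach Lambda.

```python
import struct
from decimal import Decimal
from urllib.parse import unquote_plus

# HYPOTHETICAL header layout: latitude, longitude, milepost as little-endian doubles.
# The real T1K header is proprietary and will differ.
HEADER = struct.Struct("<ddd")


def parse_header(raw):
    """Decode the assumed header fields into DynamoDB-friendly Decimal values."""
    latitude, longitude, milepost = HEADER.unpack_from(raw, 0)
    return {
        "latitude": Decimal(str(latitude)),
        "longitude": Decimal(str(longitude)),
        "milepost": Decimal(str(milepost)),
    }


def make_handler(fetch_object, put_item):
    """Build a Lambda-style handler around injected storage callables.

    fetch_object(bucket, key) returns the object's bytes;
    put_item(item) persists one item. Both are assumptions of this sketch.
    """

    def handler(event, context):
        processed = []
        for record in event["Records"]:
            bucket = record["s3"]["bucket"]["name"]
            # Object keys arrive URL-encoded in S3 event notifications.
            key = unquote_plus(record["s3"]["object"]["key"])
            item = {"file_key": key, "bucket": bucket}
            item.update(parse_header(fetch_object(bucket, key)))
            put_item(item)
            processed.append(key)
        return {"processed": processed}

    return handler
```

In a deployment, `fetch_object` could wrap the boto3 S3 client's `get_object` call and `put_item` could wrap a DynamoDB `Table.put_item(Item=item)` call. Injecting them keeps the parsing logic testable without AWS credentials. The values are converted to `Decimal` because the boto3 DynamoDB resource does not accept Python floats.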

The transducer data series within the T1K files is stored in encrypted Parquet files in S3, where it can be queried by Athena and streamed for TensorFlow ML model consumption. Sperry Rail selected Parquet for its encryption capabilities and native query functionality with Athena from S3.
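
To make the Athena step concrete, here is a hedged sketch of how a query over the Parquet data might be prepared. The table name `transducer_series`, the columns `file_key` and `transducer_id`, and the helper itself are assumptions for illustration; the article does not describe Sperry's actual schema.

```python
def build_athena_request(file_key, database, output_location):
    """Assemble keyword arguments for Athena's start_query_execution call.

    Table and column names are HYPOTHETICAL placeholders.
    """
    # Double any single quotes so the key is a valid SQL string literal.
    safe_key = file_key.replace("'", "''")
    query = (
        "SELECT transducer_id, COUNT(*) AS samples "
        "FROM transducer_series "
        f"WHERE file_key = '{safe_key}' "
        "GROUP BY transducer_id "
        "ORDER BY transducer_id"
    )
    return {
        "QueryString": query,
        "QueryExecutionContext": {"Database": database},
        "ResultConfiguration": {"OutputLocation": output_location},
    }
```

The returned dictionary can be passed to the boto3 Athena client as `client.start_query_execution(**request)`. Athena then scans the Parquet files in S3 directly and writes the result set to the given output location.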
