FactSet, known for developing flexible data and software solutions for investment professionals globally, continuously strives to enhance the value of its products for customers. With the rapid growth of its hybrid-cloud infrastructure that includes AWS, U.S.-based data centers, and global Points of Presence (PoPs), FactSet faced the challenge of establishing an effective and comprehensive network monitoring system. The Network Engineering team is tasked with ensuring seamless connectivity and optimal latency across these diverse locations. This article highlights how FactSet successfully addressed this challenge by deploying a distributed, infrastructure-agnostic monitoring solution that offers critical insights into network performance.
Solution Overview
To maintain optimal network latency while supporting the high transaction demands of its applications, FactSet implemented a shared VPC topology in AWS, allowing multiple AWS accounts to share the same VPC(s) within an AWS Organizational Unit (OU). The following diagram illustrates this architecture (figure 1).
Figure 1: FactSet Network and Telegraf Agent Architecture
The Network Engineering team provisions VPCs and establishes connectivity between them and other FactSet locations, implementing distinct network segmentation across Development, UAT, Production, and Shared Services environments. This segmentation is mirrored in AWS Transit Gateway Routing domains, each with tailored routing and propagation policies.
FactSet adopted a hybrid approach that combines native AWS network monitoring features with Telegraf, an open-source server agent. Telegraf server agents were deployed on Amazon Elastic Compute Cloud (Amazon EC2) instances across various AWS Regions, Availability Zones (AZs), and data centers to collect health metrics using ICMP, HTTP, and DNS probes. This setup allows FactSet to monitor network health within every AWS Region and AZ, ensuring the resilience of its network infrastructure.
In tandem with Telegraf, FactSet leverages native AWS network monitoring tools like Amazon CloudWatch and Network Manager for enhanced visibility into AWS’s global infrastructure. This combination results in a unified monitoring experience across all locations, allowing for intra- and inter-VPC visibility on AWS.
The Telegraf server agents monitor network health by initiating and responding to several types of probes:
- Inter-AZ – ICMP probes to assess response times and packet loss between EC2 instances in different AZs.
- Inter-VPC – ICMP probes to measure response time and packet loss to instances in the Shared Service environment within the same AWS Region, traversing through Transit Gateway.
- Inter-Region – ICMP probes to evaluate network health between instances across other AWS Regions, utilizing Transit Gateway inter-Region peering connections.
- Hybrid-Cloud – ICMP, HTTP, and DNS probes to targets outside the Region, providing insight into the paths through AWS Direct Connect gateways and transit gateways.
- Internet – Various probes from instances to assess internet reachability and performance from each AWS Region.
Technology Stack and Operations
FactSet employs the TICK (Telegraf, InfluxDB, Chronograf, and Kapacitor) stack for its network monitoring needs, offering a robust and scalable architecture. Telegraf serves as the principal data collector, equipped with built-in plugins for efficient metric configuration and capture.
Key plugins utilized in Telegraf include:
- Ping Input Plugin – Pings designated destinations to report round-trip time (RTT) and loss, set to poll every second.
- HTTP Response Input Plugin – Probes HTTP and HTTPS endpoints to validate reachability and reports response times, treating a 200 response code as success.
- DNS Query Input Plugin – Queries configured names and reports the success or failure state, along with response time.
- HTTP Output Plugin – Exports metric data to an internal API for collecting Telegraf metrics.
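The four plugins above map to four sections of a Telegraf configuration file. The following Python sketch renders such a file from a list of targets, which is one way a configuration kept in Git could be generated. It is an illustration only: FactSet's actual configuration is not published, the function name `render_telegraf_config` and every address and URL are hypothetical, and the option names (`urls`, `count`, `interval`, `response_status_code`, `servers`, `domains`, `data_format`) should be checked against the documentation for the Telegraf version in use.

```python
from typing import Sequence


def toml_list(items: Sequence[str]) -> str:
    """Format a sequence of strings as a TOML array of quoted strings."""
    return "[" + ", ".join(f'"{item}"' for item in items) + "]"


def render_telegraf_config(
    ping_targets: Sequence[str],
    http_urls: Sequence[str],
    dns_servers: Sequence[str],
    dns_domains: Sequence[str],
    output_url: str,
) -> str:
    """Render a minimal Telegraf config with ping, HTTP, and DNS inputs
    and a single HTTP output. All targets are supplied by the caller."""
    lines = [
        "[[inputs.ping]]",
        f"  urls = {toml_list(ping_targets)}",
        '  interval = "1s"',  # poll every second, as described in the article
        "  count = 1",
        "",
        "[[inputs.http_response]]",
        f"  urls = {toml_list(http_urls)}",
        '  method = "GET"',
        "  response_status_code = 200",  # expected status code
        "",
        "[[inputs.dns_query]]",
        f"  servers = {toml_list(dns_servers)}",
        f"  domains = {toml_list(dns_domains)}",
        "",
        "[[outputs.http]]",
        f'  url = "{output_url}"',  # hypothetical internal collection API
        '  data_format = "influx"',
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical targets, for illustration only.
    print(
        render_telegraf_config(
            ping_targets=["10.0.0.10", "10.1.0.10"],
            http_urls=["https://service.example.internal/health"],
            dns_servers=["10.0.0.2"],
            dns_domains=["service.example.internal"],
            output_url="https://metrics.example.internal/telegraf",
        )
    )
```

Generating the file from a target list keeps per-Region differences (which peers to ping, which resolvers to query) in data rather than in hand-edited copies of the configuration.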
The collected data is ingested through the API and stored in InfluxDB, a time-series database, from which it is visualized and analyzed using Grafana. This open-source observability tool enables the creation of dashboards, analytics, and alert configurations. The alerting system is integrated with an internal notification system to ensure prompt communication, and notifications are handled according to standard operating procedures (SOPs) for efficient incident response.
To maintain consistency and ease the operational burden of managing Telegraf configurations across AWS Regions, FactSet employs a continuous integration/continuous deployment (CI/CD) process. Configuration files are managed in a Git repository, with updates built into a golden AMI and deployed to EC2 instances using webhooks and AWS CodeBuild. This streamlined approach enhances configuration management efficiency across all environments (figure 2).
Figure 2: FactSet Telegraf Deployment Workflow
Observed Results
The following custom alerts have been established within the rules engine:
- 100% ICMP loss for a target sustained for one minute.
- Greater than 5% sustained ICMP loss for 10 minutes over the last 30 minutes.
- More than 50% sustained deviation from baseline latency for 5 minutes in the last 10 minutes.
- Non-200 HTTP code responses for any target.
- Non-zero DNS response codes for any DNS lookup against configured targets.
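The rules above can be read as "at least N breaching minutes within the last M minutes" checks. The sketch below expresses them that way in Python. It is one interpretation rather than FactSet's rules engine: it assumes metrics have already been aggregated into one sample per minute, that "sustained for X minutes over the last Y minutes" means at least X breaching samples among the last Y, and that the latency baseline is supplied by the caller. All function names are hypothetical.

```python
from typing import Callable, Sequence


def minutes_breaching(
    samples: Sequence[float], window: int, breach: Callable[[float], bool]
) -> int:
    """Count how many of the last `window` per-minute samples breach."""
    return sum(1 for value in samples[-window:] if breach(value))


def total_loss(loss_pct: Sequence[float]) -> bool:
    """100% ICMP loss in the most recent one-minute sample."""
    return len(loss_pct) >= 1 and loss_pct[-1] >= 100.0


def sustained_loss(
    loss_pct: Sequence[float],
    threshold: float = 5.0,
    breach_minutes: int = 10,
    window: int = 30,
) -> bool:
    """More than `threshold` percent loss for at least `breach_minutes`
    of the last `window` minutes."""
    count = minutes_breaching(loss_pct, window, lambda v: v > threshold)
    return count >= breach_minutes


def latency_deviation(
    rtt_ms: Sequence[float],
    baseline_ms: float,
    ratio: float = 0.5,
    breach_minutes: int = 5,
    window: int = 10,
) -> bool:
    """RTT deviating from baseline by more than `ratio` (0.5 = 50%) for
    at least `breach_minutes` of the last `window` minutes."""
    if baseline_ms <= 0:
        return False  # no usable baseline, so no deviation can be computed
    count = minutes_breaching(
        rtt_ms, window, lambda v: abs(v - baseline_ms) / baseline_ms > ratio
    )
    return count >= breach_minutes


def http_alert(status_code: int) -> bool:
    """Any response code other than 200 raises an alert."""
    return status_code != 200


def dns_alert(rcode: int) -> bool:
    """Any non-zero DNS response code (0 is NOERROR) raises an alert."""
    return rcode != 0
```

Requiring several breaching minutes inside a longer window is what lets a rule ignore a single lossy minute while still firing on intermittent degradation that never lasts ten consecutive minutes.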
These measures enable FactSet to swiftly identify network degradation incidents. The customized alerting engine is designed to detect sustained anomalies, minimizing false positives and focusing on significant network issues. This solution has expedited troubleshooting and resolution efforts, benefiting both FactSet and AWS engineering teams while facilitating future migrations.