How FactSet Integrated Network Monitoring Across AWS and On-Premises

This article features insights from FactSet’s network engineering team, including Emily Roberts, Senior Network Architect, and David Lee, Lead Systems Engineer, along with AWS Solutions Architects Alex Johnson and Rachel Smith.

FactSet, as they describe themselves, “provides flexible, open data and software solutions for thousands of investment professionals globally. These solutions deliver immediate access to financial data and analytics that investors rely on to make significant decisions. We are continually striving to enhance the value of our offerings for our clients.”

Introduction

The swift growth of FactSet’s hybrid-cloud infrastructure, which includes AWS, US-based data centers, and global Points of Presence (PoPs), called for a robust and comprehensive network monitoring solution. The Network Engineering team at FactSet is tasked with establishing connectivity across these various locations while optimizing latency. This article delves into how FactSet addressed this challenge by implementing a distributed, infrastructure-agnostic monitoring system, yielding critical insights into network performance across their diverse environments.

Solution Overview

FactSet needed to ensure optimal network latency while catering to the high transactional demands of their applications. Within AWS, they utilize a shared VPC topology, wherein different AWS Accounts share the same VPC(s) in an AWS Organizational Unit (OU). This setup is illustrated in the following diagram (figure 1).

Network Engineering provisions VPCs and establishes connectivity between VPCs and other FactSet locations. There is also a distinct network segmentation across Development, UAT, Production, and Shared Services environments. This environmental segmentation corresponds to AWS Transit Gateway Routing domains, each with its own unique routing policies.

FactSet chose a combination of AWS native network monitoring features and Telegraf, an open-source server agent. They deployed Telegraf server agents on Amazon Elastic Compute Cloud (EC2) instances across AWS Regions, Availability Zones (AZ), and data centers to gather network health metrics through ICMP, HTTP, and DNS probes. By placing agents in all infrastructure locations, FactSet can monitor network health across each AWS Region and AZ. These probes allow FactSet to track response times and packet loss, ensuring the resilience of their network infrastructure.

In conjunction with Telegraf, FactSet employs native AWS network monitoring tools provided by Amazon CloudWatch and the various functionalities of Network Manager for visibility into AWS’s global infrastructure. This solution facilitates consistent and unified monitoring deployable across all locations while also offering intra/inter-VPC visibility within AWS.

Telegraf monitors network health by initiating and responding to probes. Common probes/flows utilized across all environments include:

  • Inter-AZ – Telegraf sends ICMP probes to assess response time and packet loss between EC2 instances in different AZs.
  • Inter-VPC – ICMP probes measure response time and packet loss to instances in the Shared Service environment within the same AWS Region. These flows traverse through Transit Gateway, providing insights into network health across it.
  • Inter-Region – ICMP probes assess response times and packet loss targeting instances in other AWS Regions through Transit Gateway inter-Region peering connections, indicating network health between Regions.
  • Hybrid-Cloud – Telegraf sends ICMP, HTTP, and DNS probes to various targets outside of the Region, traversing through AWS Direct Connect Gateways, Transit Gateways, and Direct Connect Virtual Interfaces (VIFs) to capture overall network health.
  • Internet – Instances conduct various ICMP, HTTP, and DNS probes to evaluate internet reachability from each AWS Region, passing through FactSet Managed Firewalls, NAT Gateways, and Internet Gateways (IGWs) to ensure connectivity and performance.
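The inter-AZ probe described above maps directly to Telegraf's Ping input plugin. A minimal, hypothetical telegraf.conf excerpt might look like the following; the target IP, interval, and tag names are illustrative placeholders, not FactSet's actual configuration:

```toml
# Hypothetical excerpt from telegraf.conf; addresses and tags are placeholders.
[[inputs.ping]]
  ## Peer Telegraf instance running in another Availability Zone
  urls = ["10.1.2.10"]
  count = 1          # one ICMP probe per collection interval
  interval = "1s"    # matches the 1-second polling cadence described below
  [inputs.ping.tags]
    flow = "inter-az"
```

Tagging each probe with its flow type (inter-AZ, inter-VPC, inter-Region, and so on) makes it straightforward to slice dashboards and alerts by network path later.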

Solution Technology Stack and Operations

FactSet utilizes the TIG (Telegraf, InfluxDB, and Grafana) stack for their network monitoring solution, providing a robust and scalable architecture. Telegraf, serving as the primary data collector, features built-in plugins for metric collection and export.

  • Input Plugins gather metrics as defined by the plugin.
  • Output Plugins write metric data to various collectors or destinations.

The following plugins are utilized in the current Telegraf installations:

  • Ping Input Plugin – Used to ping specified destinations and report back round-trip time (RTT) and packet loss. FactSet has set polling to occur every second.
  • HTTP Response Input Plugin – Probes various HTTP(S) endpoints to validate reachability, reporting response times and expecting a 200 HTTP response code.
  • DNS Query Input Plugin – Queries names configured in the telegraf.conf files and reports on the success/failure of queries and response times.
  • HTTP Output Plugin – Exports metrics obtained from the input plugins to an internal API for collecting Telegraf metrics.
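To illustrate how the remaining plugins fit together, here is a hypothetical telegraf.conf excerpt; the endpoints, DNS server, and internal API URL are placeholders rather than FactSet's real values:

```toml
# Hypothetical excerpt from telegraf.conf; all URLs and addresses are placeholders.
[[inputs.http_response]]
  urls = ["https://internal-service.example.com/health"]
  method = "GET"

[[inputs.dns_query]]
  servers = ["10.0.0.2"]
  domains = ["example.internal."]
  record_type = "A"

[[outputs.http]]
  ## Internal API that collects Telegraf metrics
  url = "https://metrics-api.example.com/ingest"
  data_format = "influx"
```

Each input plugin emits its own measurement (for example, `http_response` and `dns_query`), and the HTTP output plugin ships them all to the internal collection API in one stream.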

Once metrics are ingested through the API, the data is stored in a time-series database, InfluxDB. This data is then visualized, queried, and analyzed using Grafana, an open-source observability tool that enables dashboard creation, analytics execution, and alert configuration. Alerts related to network issues are integrated with an internal notification system, ensuring prompt communication. Notifications are resolved through documented standard operating procedures (SOPs) for swift incident response.
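As a sketch of what a Grafana panel query against InfluxDB might look like, the following Flux query averages per-minute packet loss for the inter-AZ flows; the bucket name and `flow` tag are assumptions, not FactSet's actual schema:

```flux
// Hypothetical Flux query; bucket and tag names are placeholders.
from(bucket: "telegraf")
  |> range(start: -1h)
  |> filter(fn: (r) => r._measurement == "ping" and r._field == "percent_packet_loss")
  |> filter(fn: (r) => r.flow == "inter-az")
  |> aggregateWindow(every: 1m, fn: mean)
```

The `percent_packet_loss` field is emitted by Telegraf's Ping input plugin, so a query like this can drive both dashboards and the threshold-based alerts described below.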

FactSet employs a continuous integration/continuous deployment (CI/CD) process to maintain consistency and reduce the operational burden of managing Telegraf configurations across AWS Regions. This is depicted in the following diagram (figure 2). Telegraf configuration files are managed in a Git repository, updates are built into a golden AMI, and changes are deployed to EC2 instances using Webhooks and AWS CodeBuild. This streamlined approach enables efficient configuration management across AWS Regions and environments.
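A pipeline along these lines could be expressed in a CodeBuild buildspec roughly as follows; the file paths, the use of Packer for AMI baking, and the commands themselves are illustrative assumptions rather than FactSet's actual build definition:

```yaml
# Hypothetical CodeBuild buildspec; paths and commands are illustrative.
version: 0.2
phases:
  pre_build:
    commands:
      # Validate the checked-in Telegraf configuration before baking it
      # into the golden AMI (telegraf --test runs the inputs once).
      - telegraf --config configs/telegraf.conf --test
  build:
    commands:
      # Bake the validated configuration into a new golden AMI.
      - packer build ami/telegraf-golden.pkr.hcl
```

Validating the configuration in `pre_build` catches malformed telegraf.conf changes before they reach any Region, keeping the golden AMI a known-good artifact.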

Observed Results

The following custom alerts are defined in the rules engine:

  • 100% ICMP loss for a target sustained for one minute
  • Greater than 5% sustained ICMP loss for 10 minutes in the last 30 minutes
  • Greater than 50% sustained deviation over baseline latency for 5 minutes in the last 10 minutes
  • HTTP Code response other than 200 for any target
  • Non-Zero DNS response code for any DNS lookup against the configured target
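Two of the rules above can be sketched in Python to show the "sustained over a window" logic that keeps false positives down; the function names, per-second sampling assumption, and thresholds are illustrative, not FactSet's actual rules engine:

```python
# Sketch of two alert rules, assuming one metric sample per second.

def total_loss_alert(loss_samples, window=60):
    """100% ICMP loss for a target sustained for one minute.

    loss_samples: list of packet-loss percentages, newest last.
    Fires only if every sample in the trailing window shows total loss.
    """
    recent = loss_samples[-window:]
    return len(recent) == window and all(s >= 100.0 for s in recent)

def latency_deviation_alert(rtt_samples, baseline_ms, window=300, deviation=0.5):
    """Greater than 50% sustained deviation over baseline latency for 5 minutes.

    Fires only if every RTT sample in the trailing window exceeds the
    baseline by more than the deviation fraction (0.5 = 50%).
    """
    recent = rtt_samples[-window:]
    limit = baseline_ms * (1 + deviation)
    return len(recent) == window and all(r > limit for r in recent)
```

Requiring every sample in the window to breach the threshold, rather than any single sample, is what suppresses one-off blips while still catching genuine sustained degradation.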

These alerts have empowered FactSet to promptly detect network degradation incidents. The customized alerting engine is designed to identify sustained anomalies over specified periods, minimizing false positive alerts and focusing on critical network issues. This solution has expedited troubleshooting and resolution processes, benefiting both FactSet and AWS engineering teams and facilitating future migrations.
