Introduction

Handling millions of transactions per minute during major events like e-commerce flash sales, tatkal ticket booking, or the IPL (Indian Premier League) is an exhilarating challenge for any payment processing system. These events trigger a surge in online activity that translates into massive payment volumes, which must be processed quickly and seamlessly.

Imagine the stakes: if the system falters, not only could it lead to lost revenue, but it could also damage the reputation of the merchants and the payment processor. 

So at Cashfree Payments, our goal is to ensure that every transaction is processed promptly and securely, without downtime or performance degradation. By implementing advanced scaling and monitoring strategies, we help our users manage these spikes effectively, ensuring their applications can handle the load without issues.

Identifying Traffic Types

Unpredictable Traffic Spikes

Unpredictable traffic surges can occur suddenly and without warning. We employ proactive capacity planning to manage these, leveraging historical data to estimate potential maximum loads. Our system is designed to handle up to a fivefold increase in traffic, with Horizontal Pod Autoscaling (HPA) and rate limiters ensuring real-time scalability and traffic management.
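The rate limiters mentioned above can be sketched with a simple token-bucket algorithm. This is a minimal illustration rather than our production implementation, and the rate and capacity values are placeholders:

```python
import time

class TokenBucket:
    """Minimal token-bucket rate limiter sketch (values illustrative)."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate            # tokens replenished per second
        self.capacity = capacity    # maximum burst size
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

bucket = TokenBucket(rate=100, capacity=100)  # ~100 req/s, bursts up to 100
print(bucket.allow())  # -> True while tokens remain
```

Requests beyond the bucket's capacity are rejected (or queued) instead of being passed downstream, which keeps a sudden surge from overwhelming services before the autoscaler catches up.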

Predictable Traffic Spikes

Predictable traffic surges can be anticipated based on historical data or scheduled events, such as sales events or sports matches. The traffic generally peaks shortly before and during these events. To manage this, we schedule resource scaling to match these predictable patterns, ensuring our system is ready to handle the increased load.

However, even predictable spikes challenge autoscaling mechanisms, which can take time to bring new nodes online and several minutes to scale services. This delay can cause performance issues if the spike occurs faster than the system can scale. Therefore, proactive scaling and pre-warming resources before the expected spike are critical strategies.

Fig 1: Burst traffic

Monitoring High-Scale Traffic

Monitoring high-scale traffic is crucial for maintaining the performance, security, and reliability of applications and websites. Effective monitoring also helps detect issues early and sustain optimal performance. Here are the steps we follow to set up a robust monitoring system.

Setting Up a Robust Monitoring System for High-Scale Traffic

Step 1: Gathering Requirements and Profiling Merchants

We start by listing the expected traffic from each merchant during the event and use monitoring tools with advanced traces to determine how each external request, such as a payment API call, fans out into multiple internal API requests. These internal requests may include checking merchant configuration, assessing risk, fetching essential data, and pushing information to a queue for asynchronous processing, among others. This process translates external traffic into Queries Per Second (QPS) to internal services, allowing us to profile merchants by identifying the APIs they use and estimating the traffic for each API.
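As a rough sketch of this fan-out accounting, the snippet below converts a merchant's external TPS on one endpoint into per-service QPS. The endpoint, service names, and call counts here are hypothetical, not Cashfree's actual topology:

```python
# Hypothetical fan-out profile: one external payment API call produces
# several internal requests across downstream services.
FANOUT = {
    "/pg/orders/pay": {
        "merchant-config": 1,
        "risk-engine": 2,
        "data-fetch": 3,
        "async-queue-push": 2,
    },
}

def internal_qps(endpoint: str, external_tps: float) -> dict:
    """Convert a merchant's external TPS on one endpoint into per-service QPS."""
    return {svc: calls * external_tps for svc, calls in FANOUT[endpoint].items()}

print(internal_qps("/pg/orders/pay", 500))
# e.g. "risk-engine" sees 1000 QPS when the endpoint receives 500 TPS
```

Summing these maps across all merchants and endpoints gives the expected load per internal service, which feeds directly into capacity planning.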

Step 2: Listing Services and APIs to Monitor

Given our microservice architecture, we compile a comprehensive list of services and their corresponding APIs that require monitoring. This includes identifying critical services such as payment processing, risk assessment, and configuration management, as well as their respective APIs. 

By mapping out these services and APIs, we ensure that all crucial components are covered, allowing us to track performance, error rates, resource utilisation, and other key metrics for each service. This enables targeted monitoring and helps us promptly identify and address any issues that arise.

Step 3: Selecting Metrics to Monitor

We compile a cumulative list of metrics to be monitored, which includes:

  • Traffic Volume/Spike
  • Performance Metrics
  • Error Rates
  • Resource Utilisation
  • Network Metrics
  • Product/User Experience Metrics
  • Comparison of origin-to-service requests with database requests
  • Connection Metrics
  • Cache Performance
  • Queue Lengths and Processing Times

Step 4: Creating a Centralised Monitoring Dashboard

To effectively monitor system performance and resource utilisation, we employ various monitoring tools:

  • Datadog with OpenTelemetry for APM and tracing
  • Grafana powered by VictoriaMetrics for APM custom metrics like product metrics
  • AWS Monitoring for detailed graphs of AWS resources like RDS, Redis, SQS

Although 80% of the major metrics required for monitoring are available on Datadog, we still need other tools. We created a centralised dashboard on Datadog and integrated it with Grafana and AWS using Datadog’s iframe and link features.

Fig 2a: Monitoring System Setup

Fig 2b: Centralised Monitoring Dashboard

Comprehensive Preparation Strategy

Scaling Strategy

Implementing Auto-Scaling to Handle Traffic Surges

We used Horizontal Pod Autoscaling (HPA) and scheduled auto-scaling of resources to ensure our system can handle varying traffic conditions. Here is a sample configuration of our auto-scaler using KEDA (Kubernetes Event-Driven Autoscaling):

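The sketch below is illustrative; the service name, thresholds, Prometheus query, and cron schedule are placeholders rather than our production values:

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: payment-svc-scaler
spec:
  scaleTargetRef:
    name: payment-svc
  minReplicaCount: 4
  maxReplicaCount: 40
  triggers:
    # Scale on CPU utilisation
    - type: cpu
      metricType: Utilization
      metadata:
        value: "60"
    # Scale on memory utilisation
    - type: memory
      metricType: Utilization
      metadata:
        value: "70"
    # Scale on a custom Prometheus metric (request rate)
    - type: prometheus
      metadata:
        serverAddress: http://victoria-metrics:8428
        query: sum(rate(http_requests_total{service="payment-svc"}[2m]))
        threshold: "1000"
    # Pre-scale on a schedule for predictable peak windows
    - type: cron
      metadata:
        timezone: Asia/Kolkata
        start: 0 18 * * *
        end: 0 23 * * *
        desiredReplicas: "20"
```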
This configuration shows how we use multiple triggers to auto-scale our resources based on CPU, memory utilisation, and custom Prometheus metrics, as well as scheduled cron jobs to handle predictable traffic spikes.

Developing a Pre-Scaling Tool

We created an in-house tool designed for precise capacity planning for each service. This tool enables us to input the expected transactions per second (TPS) for specific features during peak times. Using data from Datadog APM, we can accurately calculate the average TPS and convert these into queries per second (QPS) for both primary and downstream services. The tool then calculates the number of pods needed to handle the anticipated traffic, ensuring optimal performance and availability.

For instance, if we anticipate that a feature will handle 500 TPS during a peak event, our capacity planning shows that one transaction translates to 8 internal queries (so 500 TPS converts to 4000 QPS), and a pod can handle 1000 QPS, the tool determines that at least 4 pods are required to manage this load efficiently.
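The arithmetic above can be captured in a small helper. This is an illustrative sketch of the calculation, not the tool itself; the optional headroom factor is an assumption we add for safety margin:

```python
import math

def pods_required(expected_tps: float, queries_per_txn: int,
                  pod_qps_capacity: float, headroom: float = 1.0) -> int:
    """Convert expected transactions/sec into the pod count needed."""
    qps = expected_tps * queries_per_txn           # e.g. 500 TPS * 8 = 4000 QPS
    return math.ceil((qps * headroom) / pod_qps_capacity)

print(pods_required(500, 8, 1000))  # -> 4
```

Running the same calculation with a 1.5x headroom factor would yield 6 pods, trading some cost for resilience against estimation error.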

Additionally, the tool automatically adjusts the KEDA configuration to fine-tune the auto-scaling parameters based on these traffic projections, allowing us to dynamically allocate resources as needed.

Implementing Chaos Mesh Testing

To test the robustness and preparedness of our services, we use Chaos Mesh to introduce faults and disruptions, such as network latency and service failures, while also employing tools to generate unplanned traffic spikes. This approach helps us identify the breaking points of each service under production traffic patterns, ensuring our systems can effectively handle real-world scenarios.

Configuring Alerts for Early Detection

We established basic rules to trigger alerts:

  • Low severity alerts at 60% resource exhaustion
  • High severity alerts at 80% resource utilisation
  • Alerts for critical metrics to notify us of potential issues before they impact users

Here is an example of a monitoring alert configuration for high CPU usage using Datadog.

Fig 3: High CPU Usage Alert Configuration 

This alert monitors the CPU usage of the samplesvc service and triggers notifications when it exceeds the specified thresholds.
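Expressed through Datadog's monitor API, a comparable alert might look like the following JSON. The metric name, service tag, and notification target are illustrative assumptions, while the 60%/80% thresholds mirror the alerting rules above:

```json
{
  "name": "High CPU usage on samplesvc",
  "type": "metric alert",
  "query": "avg(last_5m):avg:kubernetes.cpu.usage.pct{service:samplesvc} by {pod_name} > 80",
  "message": "CPU usage above threshold on {{pod_name.name}}. @slack-payments-oncall",
  "options": {
    "thresholds": {
      "warning": 60,
      "critical": 80
    },
    "notify_no_data": false
  }
}
```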

Building and Managing a High-Performance Monitoring Team

We assembled a high-performance monitoring team comprising representatives from different products to:

  • Validate metrics
  • Monitor service behavior through the centralised dashboard
  • Provide continuous feedback and drive service improvements

Including a member from each team whose services fall under P0 and P1 flows ensures that all teams are working towards a common goal. This setup also allows us to address any issues quickly and efficiently since the necessary expertise is readily available.

Graceful Degradation and Continuous Improvement

Graceful Degradation

We prioritise critical business functions, such as transaction processing (P0), over less critical services, like analytical systems (P2), ensuring uninterrupted service during peak periods. Resilience measures like circuit breakers are implemented based on error rates, latency, and other metrics, maintaining service availability and user trust under stress.
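As an illustration of the circuit-breaker pattern mentioned here, the sketch below opens the circuit after a threshold of consecutive failures and allows a trial request again after a cooldown. The thresholds and timings are placeholders, not our production settings:

```python
import time

class CircuitBreaker:
    """Minimal failure-count circuit breaker sketch (thresholds illustrative)."""

    def __init__(self, failure_threshold: int = 5, reset_timeout: float = 30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None   # None means the circuit is closed

    def allow_request(self) -> bool:
        if self.opened_at is None:
            return True
        # Half-open: permit a trial request after the cooldown period.
        return time.monotonic() - self.opened_at >= self.reset_timeout

    def record_success(self):
        self.failures = 0
        self.opened_at = None   # close the circuit again

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = time.monotonic()   # open the circuit

cb = CircuitBreaker(failure_threshold=3)
for _ in range(3):
    cb.record_failure()
print(cb.allow_request())  # -> False: circuit is open, calls are shed
```

Wrapping calls to a P2 analytical service in a breaker like this lets the P0 transaction path shed that dependency under stress instead of queueing behind it.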

Regular Meetings and Drills

Regular meetings with our high-scale monitoring team validate metrics and track service performance through our centralised dashboard. In-person meetings facilitate quick identification and resolution of issues, while production drills simulate traffic scenarios, allowing proactive upscaling or downscaling adjustments and improving system resilience and readiness.

Fig 4: Correlation Between Request Volume and Resource Utilisation

During High-Traffic Events

Enforcing a Deployment Freeze During Peak Hours

We freeze deployments during peak hours (not the entire day) to avoid disruptions. While we safeguard our deployments under feature flags and rollout percentages, we still impose a deployment embargo during peak hours to ensure maximum stability.

Coordinating Through a High Traffic Day War Room

We have created a War Room for major events. In the War Room, our high-scale team and each service owner check the metrics and give the go-ahead for scaling both compute and database resources. The War Room link is shared with all stakeholders:

  • Any on-call personnel must join the link during the event if they receive an alert.
  • Anyone who wants to see our readiness can join and observe our preparations.

Real-Time Alerts and Merchant Escalation Matrix

Using a robust escalation matrix, we promptly notify merchants of any issues through real-time alerts. This proactive communication ensures transparency and swift resolution of potential disruptions.

Conducting a Post-Event Review for Continuous Improvement

Following each event, we conduct a comprehensive post-event review to analyse outcomes, identify areas for improvement, and celebrate successes. Recognising achievements fosters team motivation and underscores our commitment to continuous enhancement.

Cultivating Team Engagement and Positivity

Monitoring high-scale traffic is inherently a meticulous and demanding task. However, at Cashfree Payments, the teams are always supportive of taking on new challenges.

To keep everyone engaged and excited, we introduced a game that brought a sense of fun and competition to our daily routines. Team members would guess the traffic volume for specific events, placing friendly bets on their predictions. This lighthearted competition not only made the monitoring process more enjoyable but also encouraged active participation and attentiveness. The thrill of seeing whose guess was closest kept the team motivated and fostered a collaborative spirit.

Each milestone, especially scaling the system beyond previous records, has fueled the team's enthusiasm and kept us motivated to do more. We are driven by a continuous hunger for innovation, improvement, and breaking new ground.
