Introduction

The Indian Premier League (IPL) isn’t just a cricketing phenomenon — it’s a digital rush of transactions that demands robust payment systems. At Cashfree Payments, a pioneering digital payments and banking API company, our journey during IPL 2023 was a roller coaster of challenges, innovations, and ultimate success. This blog post delves into our rigorous efforts to scale our systems for the intense IPL season, from the first CSK vs. GT match on March 31st to the final where CSK clinched its fifth title.

1. Laying the Foundation: Preparing for IPL

Our journey kicked off with meticulous preparation. Knowing that UPI transactions would rule the IPL landscape, we conducted exhaustive performance testing on the UPI flow. Our initial tests revealed a significant hiccup: our services hit CPU throttling at an RPS far short of the anticipated IPL volumes. We acted swiftly, allocating additional resources and boosting our capacity to handle the expected scale.
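
As a flavour of how this kind of RPS testing can be driven, here is a minimal load-generator sketch in Go. The endpoint, rate, and duration are placeholders rather than our actual test setup.

```go
// loadgen.go: a minimal fixed-RPS load generator sketch (illustrative only).
package main

import (
	"fmt"
	"net/http"
	"sync"
	"sync/atomic"
	"time"
)

func main() {
	const (
		targetRPS = 500                            // placeholder rate, not our real test volume
		duration  = 2 * time.Minute                // placeholder test window
		endpoint  = "https://test.example/upi/pay" // placeholder UPI endpoint
	)

	client := &http.Client{Timeout: 5 * time.Second}
	ticker := time.NewTicker(time.Second / targetRPS)
	defer ticker.Stop()

	var wg sync.WaitGroup
	var sent, failed int64
	deadline := time.Now().Add(duration)

	for time.Now().Before(deadline) {
		<-ticker.C // pace requests to the target RPS
		wg.Add(1)
		go func() {
			defer wg.Done()
			atomic.AddInt64(&sent, 1)
			resp, err := client.Get(endpoint)
			if err != nil || resp.StatusCode >= 500 {
				atomic.AddInt64(&failed, 1)
			}
			if resp != nil {
				resp.Body.Close()
			}
		}()
	}
	wg.Wait()
	fmt.Printf("sent=%d failed=%d\n", sent, failed)
}
```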

2. March 31st: The Journey for Improved Systems Begins

March 31st marked the beginning of our drive towards scalable systems. The RPS soared to more than a thousand, and we encountered our first roadblock—a critical service crashed due to an Out-of-Memory (OOM) kill. The coming days were riddled with challenges:

  • CPU and Memory Throttling: Our services experienced periodic CPU and memory throttling, straining their efficiency and slowing transactions
  • High Latencies in Go Services: Critical Go services showed a build-up of pending Goroutines alongside high latencies (see the monitoring sketch after this list)
  • Latency Spikes in Spring Services: CRUD operations in critical Spring services slowed as pending HikariCP pool connections piled up, introducing high latencies
  • Downstream Service Choking: Some downstream services failed to scale up, causing bottlenecks and impeding critical transactions
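
To show how that goroutine build-up becomes visible, here is a minimal instrumentation sketch: exposing Go's pprof endpoints and periodically logging the goroutine count. The port and interval are illustrative, not our production values.

```go
// A sketch of lightweight goroutine visibility for a Go service (illustrative values).
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* handlers on the default mux
	"runtime"
	"time"
)

func main() {
	// Expose pprof so goroutine dumps and profiles can be pulled during incidents,
	// e.g. go tool pprof http://localhost:6060/debug/pprof/goroutine
	go func() {
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()

	// Periodically log the goroutine count; a steadily rising number during
	// match traffic is a strong hint that requests are piling up somewhere.
	go func() {
		for range time.Tick(10 * time.Second) {
			log.Printf("goroutines=%d", runtime.NumGoroutine())
		}
	}()

	select {} // stand-in for the real service's main loop
}
```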

3. Swift Actions Amidst Crunch Time

With the pressure of league matches demanding quick solutions, we devised a three-pronged approach: tuning resources/configurations, performance enhancements, and tier management settings. Our strategies included:

  • Horizontal and Vertical Scaling: Pods were scaled both horizontally and vertically, alongside adjustments to allocated CPU/memory resources
  • HikariCP Connection Pool Tuning: We optimised HikariCP connection pool size to resolve pending connection issues
  • Tomcat Configuration Tweaks: Fine-tuning Tomcat maxThreads and minSpareThreads configurations was essential
  • Optimising JVM Heap Size: We increased the JVM heap size to accommodate the extra memory required by the larger HikariCP and Tomcat connection pools
  • Garbage Collector Transition: We transitioned from the Serial GC to the G1GC garbage collector and observed a reduction in stop-the-world GC events post transition
  • Performance Enhancement and Query Tuning: Our optimisation journey wasn’t confined to configurations alone. We delved into identifying and resolving slow database queries and APIs. We achieved remarkable results, significantly reducing one of the critical API’s latency from ~200 ms to ~15 ms through strategic database query enhancement.
  • Bulkheading and Effective Testing: To further streamline our services, we isolated non-critical flows into a separate instance, ensuring only vital flows were directed to our main production instance (see the routing sketch after this list). While implementing these enhancements, we recognised that our existing test suite lacked efficacy in replicating production traffic. The QA team collaborated to create a more accurate test suite.
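
The bulkheading itself was done at the deployment level, but the idea can be sketched as a simple routing rule: traffic that is not on the payment path goes to a separate instance so that spikes there cannot starve critical flows. The hosts and paths below are hypothetical.

```go
// A minimal bulkheading sketch: route non-critical traffic to a separate
// instance so the critical payment flow keeps dedicated capacity.
// Hosts and paths are hypothetical.
package main

import (
	"log"
	"net/http"
	"net/http/httputil"
	"net/url"
	"strings"
)

func main() {
	critical, _ := url.Parse("http://payments-critical.internal")     // hypothetical upstream
	nonCritical, _ := url.Parse("http://payments-reporting.internal") // hypothetical upstream

	criticalProxy := httputil.NewSingleHostReverseProxy(critical)
	nonCriticalProxy := httputil.NewSingleHostReverseProxy(nonCritical)

	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		// Anything that is not on the payment path goes to the isolated instance.
		if strings.HasPrefix(r.URL.Path, "/reports") || strings.HasPrefix(r.URL.Path, "/analytics") {
			nonCriticalProxy.ServeHTTP(w, r)
			return
		}
		criticalProxy.ServeHTTP(w, r)
	})

	log.Fatal(http.ListenAndServe(":8080", nil))
}
```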

4. Unraveling a Hidden Challenge: The 3000 RPS Ceiling

As IPL viewership peaked during matches involving CSK, RCB, and MI, we encountered an interesting challenge. Despite internal payment processing running smoothly, our edge layer (written in Go) struggled with peak latencies as traffic neared 3000 RPS. This marked the beginning of extensive collaboration through long war rooms, with teams pooling their expertise to find a solution.

5. The Pursuit of Smoothness: Debugging the Latency Spikes

Intensive investigations led us to experiment with Golang’s HTTP connection pool. We fine-tuned the connection pool values and scaled vertically, but the real key lay in maintaining a delicate balance between pod count and connection pool size. In-depth HTTP connection pool metrics and custom trace logs finally revealed DNS lookup as the primary culprit.
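
To illustrate the kind of instrumentation involved, here is a minimal sketch of a tuned http.Transport together with net/http/httptrace hooks that log DNS lookup time and whether a pooled connection was reused. The pool sizes and URL are placeholders; the right values depend on pod count and downstream capacity.

```go
// Sketch: tuned HTTP connection pool plus trace hooks that expose DNS time
// and connection reuse per request. Pool sizes and the URL are illustrative.
package main

import (
	"log"
	"net/http"
	"net/http/httptrace"
	"time"
)

func newClient() *http.Client {
	transport := &http.Transport{
		MaxIdleConns:        200, // placeholder; balance against pod count
		MaxIdleConnsPerHost: 100, // placeholder; the default of 2 is far too low at high RPS
		IdleConnTimeout:     90 * time.Second,
	}
	return &http.Client{Transport: transport, Timeout: 5 * time.Second}
}

func tracedGet(client *http.Client, url string) error {
	var dnsStart time.Time
	trace := &httptrace.ClientTrace{
		DNSStart: func(httptrace.DNSStartInfo) { dnsStart = time.Now() },
		DNSDone: func(info httptrace.DNSDoneInfo) {
			log.Printf("dns_lookup=%s err=%v", time.Since(dnsStart), info.Err)
		},
		GotConn: func(info httptrace.GotConnInfo) {
			// reused=false on most requests means the pool is not being used
			// and every call is paying for DNS + TCP setup again.
			log.Printf("conn_reused=%t idle=%t", info.Reused, info.WasIdle)
		},
	}

	req, err := http.NewRequest(http.MethodGet, url, nil)
	if err != nil {
		return err
	}
	req = req.WithContext(httptrace.WithClientTrace(req.Context(), trace))

	resp, err := client.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	return nil
}

func main() {
	client := newClient()
	if err := tracedGet(client, "http://downstream.internal/health"); err != nil { // placeholder URL
		log.Println(err)
	}
}
```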

6. Slaying the Latency Dragon: Unveiling Solutions

Resolving latency spikes meant tackling DNS lookup delays. We took two key actions:

  • Maximising Connection Reuse: Creating a fresh HTTP connection for every request is costly; it involves a DNS lookup, a TCP handshake with the server, and more. Connections should therefore be reused as much as possible instead of being recreated. Our downstream Spring Boot services were not receiving the Connection: keep-alive header, so connections were being closed at the client side and the edge layer ended up opening a new connection for every request. To solve this, we added a keep-alive header to every request made to downstream services (see the sketch after this list).
  • Forcing the Cgo DNS Resolver: By default, Go uses its native DNS resolver, which sends DNS queries directly over the network and does not cache the results. Our final hunch was that this was behind the slow DNS lookups at the edge layer, and we fixed the problem by forcing the use of the cgo DNS resolver (also shown in the sketch below).
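
A minimal sketch of what these two fixes can look like at the edge layer, assuming a Go HTTP client: a wrapper RoundTripper that stamps the keep-alive header on every outbound request, plus the standard knobs for switching to the cgo resolver. The transport settings and downstream URL are illustrative, not our production configuration.

```go
// Sketch of the two edge-layer fixes (illustrative).
//
// Forcing the cgo DNS resolver happens outside the code itself, either at
// build time:    go build -tags netcgo ./...
// or at runtime: GODEBUG=netdns=cgo ./edge-service
package main

import (
	"log"
	"net/http"
	"time"
)

// keepAliveTransport wraps another RoundTripper and adds the keep-alive header
// to every outbound request so downstream services hold connections open.
type keepAliveTransport struct {
	base http.RoundTripper
}

func (t keepAliveTransport) RoundTrip(req *http.Request) (*http.Response, error) {
	// Clone first so we respect the RoundTripper contract of not mutating the caller's request.
	r := req.Clone(req.Context())
	r.Header.Set("Connection", "keep-alive")
	return t.base.RoundTrip(r)
}

func main() {
	client := &http.Client{
		Timeout: 5 * time.Second,
		Transport: keepAliveTransport{
			base: &http.Transport{
				MaxIdleConnsPerHost: 100, // placeholder; keep warm connections to each downstream
				IdleConnTimeout:     90 * time.Second,
			},
		},
	}

	resp, err := client.Get("http://downstream.internal/status") // placeholder downstream URL
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()
	log.Println("status:", resp.Status)
}
```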

7. Collaborating with Merchants and Banks

We proactively engaged with merchants who reported connection drops, worked closely with them to diagnose and debug their systems, and assisted them in implementing connection pooling. We also experienced delays in receiving webhooks from banks, which in turn affected transaction processing times. Recognising the critical nature of this issue, we initiated direct communication with the respective bank and included them in our war room sessions. This collaborative effort allowed us to promptly pinpoint and resolve a configuration issue within the bank’s systems, ensuring that transactions moved smoothly to success.

8. The Journey’s Culmination: A Flawless Production

With changes in place, testing and monitoring were essential. Our new test suite, mirroring production traffic patterns, allowed us to confidently deploy modifications. The results were astonishing. Two days after implementation, the system breezed through the once-dreaded 3000 RPS without any hitches. As the knockout matches approached, we were equipped to handle even higher traffic loads.

9. Reflecting on Progress and Optimising Efficiency

With challenges conquered, we reassessed our resource allocation and realised that our initial resource bump could be scaled down. To address the peculiar pre-match traffic patterns, we scripted custom scaling on top of the autoscaler to meet our unique scaling demands, optimising resource utilisation during peak hours. Doing this brought down our AWS costs by ~USD 20K, the additional cost we would have incurred had we kept using the same resources.
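
As an illustration of the idea, here is a minimal sketch of pre-match scheduled scaling, assuming the services run on Kubernetes and using client-go; the namespace, deployment name, replica count, and timing are all hypothetical. In practice this sat on top of the regular autoscaler rather than replacing it.

```go
// Sketch: bump replicas ahead of a match window, on top of the regular autoscaler.
// Namespace, deployment name, replica count, and timing are hypothetical.
package main

import (
	"context"
	"log"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}

	// Pre-warm capacity shortly before the match, when traffic starts ramping up.
	scaleAt(client, "payments", "upi-edge", 40, time.Now().Add(30*time.Minute))
}

// scaleAt waits until the given time and then sets the deployment's replica count.
func scaleAt(client kubernetes.Interface, namespace, deployment string, replicas int32, when time.Time) {
	time.Sleep(time.Until(when))

	ctx := context.Background()
	scale, err := client.AppsV1().Deployments(namespace).GetScale(ctx, deployment, metav1.GetOptions{})
	if err != nil {
		log.Fatal(err)
	}
	scale.Spec.Replicas = replicas
	if _, err := client.AppsV1().Deployments(namespace).UpdateScale(ctx, deployment, scale, metav1.UpdateOptions{}); err != nil {
		log.Fatal(err)
	}
	log.Printf("scaled %s/%s to %d replicas", namespace, deployment, replicas)
}
```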

Conclusion:

Cashfree Payments’ journey in scaling systems for IPL 2023 was an embodiment of teamwork, innovation, and the relentless pursuit of excellence. From battling latency spikes to enhancing connection pools, our engineers collaborated seamlessly to fortify our infrastructure. As CSK clinched its fifth IPL title, we celebrated not only a cricketing triumph but also our success in engineering scalability, resilience, and exceptional payment experiences.

As we gear up for the 2023 World Cup, we’ve implemented a series of critical enhancements, including automatic horizontal scaling, circuit breakers, improved transactional management and more, to make our infrastructure ready for the intense demands of the tournament. More on this in the next blog post. Stay tuned!

Does this sound exciting to you? Well then, there are some great opportunities in store for intrepid engineers just like you!
