Since its inception in 2015, Cashfree Payments has seen near-exponential growth. With aggressive growth comes the push to ship features and products faster for our customers, at scale.

Much of the value we provide to our users comes from a fast feedback loop and small iterations.

The only way to do this at scale is to be able to deploy with confidence. Years of books, articles, and talks have produced ideas that are proven to work at scale: practices such as XP/TDD with pair programming, and CI/CD, are no longer new or risky ways to enable small release cycles.

But unless we can observe a system at runtime, we cannot deploy with confidence, and we will always be constrained to deployments at low-traffic hours with constant human support. Manual effort and coordination are error-prone; computers are far better than humans at repeated, predictable tasks. The way to scale observability and monitoring is through automation.

With our applications split between Java and Go, we wanted a way to measure the overall health of our systems at runtime and be able to respond to it. 

Measure and Improve 

“If You Can’t Measure It, You Can’t Improve It.”

The key to making changes at scale without worry is being able to measure: getting near real-time stats and insights into system execution.

This is especially true in distributed systems, where one failure can cause cascading effects. These side effects can eventually lead to multiple system failures.

What to Measure:

  • How do we set timeouts for downstream services? 
  • How do I attribute latency issues to a service or database call? 
  • How many requests is my service currently serving? 
  • What are the error rates and request durations? 

How to Measure

Our monitoring stack consists of Prometheus + Alertmanager + Grafana, which adds monitoring and transparent observability to services. These, along with our distributed tracing and logging infrastructure (Elasticsearch + Kibana), support most of our monitoring requirements. 
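For context, a minimal Prometheus scrape configuration for such a stack might look like the following; the job name, port, and targets are placeholders, not our actual setup:

```yaml
# Illustrative prometheus.yml fragment.
scrape_configs:
  - job_name: "payments-service"   # placeholder service name
    metrics_path: /metrics
    scrape_interval: 15s
    static_configs:
      - targets: ["payments-service:8080"]
```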

What’s missing

While most libraries provide out-of-the-box support for basic metrics, a lot of observability is still missing: 

  • Queue execution durations
  • Queue depth for asynchronous workloads
  • Background processing stats 
  • Business-level metrics (success/failure counts, etc.) 
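Queue depth, for instance, can be exposed with a few lines of code. A dependency-free Go sketch using the standard `expvar` package; in our real stack this would be a Prometheus gauge, and the `queue_depth` name and channel-backed queue are assumptions for illustration:

```go
package main

import (
	"expvar"
	"fmt"
)

// queueDepth exposes the depth of a buffered channel as an expvar
// gauge, so it can be scraped or dumped alongside other runtime stats.
var queueDepth = expvar.NewInt("queue_depth")

// sampleDepth records the current number of pending jobs; call it
// periodically (e.g. from a ticker goroutine) to keep the gauge fresh.
func sampleDepth(jobs chan int) int64 {
	d := int64(len(jobs))
	queueDepth.Set(d)
	return d
}

func main() {
	jobs := make(chan int, 100)
	for i := 0; i < 7; i++ {
		jobs <- i
	}
	fmt.Println(sampleDepth(jobs)) // 7
}
```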

Once these metrics are available, the next step is to add alerts and notifications based on defined SLOs.
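As an illustration, an SLO-style Prometheus alerting rule could look like the following; the metric name, threshold, and labels are placeholders, not our actual SLOs:

```yaml
# Illustrative alerting rule: page when the 5xx ratio exceeds 1%
# for 10 minutes. http_requests_total is a placeholder metric name.
groups:
  - name: slo-alerts
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m])) > 0.01
        for: 10m
        labels:
          severity: page
        annotations:
          summary: "Error rate above 1% for 10 minutes"
```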

Java Apps

Spring Boot's actuator library delivers most monitoring capabilities out of the box, covering the basic observability requirements (RED: Rate, Errors, Duration): 

  • Requests per minute (or per second, depending on scale).
  • Error rates.
  • The ability to easily add custom metrics using the Micrometer or Prometheus libraries. 
  • Metrics for many components that ship with Spring. 
  • Request duration percentiles (p90 and p99).
    • At scale, averages are no good since they hide outliers.

Go Apps 

Most services at Cashfree Payments use Echo as the web framework, along with the echo-prometheus package, which provides basic HTTP server metrics. 

But this leaves a lot to be desired. For example: 

  • Measuring latencies on external HTTP calls.
  • Database performance metrics (errors, durations, etc.).
  • The ability to easily add custom, business-specific metrics.

Also, how do we determine what component is causing latency for one of our endpoints? 

To solve these, we created a generic library that captures metrics across components with minimal code changes. 

What it provides: 

  • Standard RED (Rate, Errors, Duration) metrics for HTTP requests. 
  • The ability to create a custom counter, histogram, or summary vector with ease. 
  • HTTP client connection stats (p90/p99 latency for each external call).
  • DB connection stats:
    • Open/idle/active connections.
    • Query durations (p90/p99 latency).

How It Helps

The library packages an HTTP client and a database client with Prometheus metrics bundled into most operations. Given our focus on compatibility, we ensure it can be introduced into applications without much code change: both the HTTP and database clients are compatible with Go's standard library interfaces.  

A common dashboard with overall application metrics also helps us quickly correlate patterns, such as high latencies caused by external services or database workloads. 

Here are a few snippets from our Grafana dashboards to help demonstrate the added value post-migration to the library:

[Grafana dashboard snippets]

We have also started introducing many custom, business-context-sensitive metrics that can alert us as soon as things start going south, enabling a better customer experience.

While this has helped us a lot in understanding the runtime health of our Go apps, we have plans for many more features. 

We plan to eventually integrate with distributed tracing as well, making the library a one-stop shop for our monitoring needs. 

We also want the ability to sample by enabling pprof at short intervals, collecting vital statistics to help us understand probable performance issues. 

To say the least, we are excited about all our upcoming experiments.

Is this something that hits all the right spots, cerebrally speaking? Well then, we have some exciting opportunities lined up for awesome engineers just like you!
