At Cashfree Payments, millions of transactions happen every day, and each of them journeys across tens of microservices. Our motivation was to be able to make software changes at scale without worrying about measuring system behaviour, with near real-time stats and insights into execution.
Especially in distributed systems like ours, a single failure can have cascading effects that eventually lead to multiple system failures. Unless we can observe a system, we cannot deploy services with confidence and will always be constrained by time, traffic, hours, and support.
Manual effort and coordination are error-prone; the only way observability and monitoring scale is through automation.

What are the Problem Statements?
- How can we attribute latency to each service?
- How can we track the error rate per service?
- How can we identify the upstream and downstream services of any given service?
- How can we measure API call duration and view the span list for optimisation?
- How can we build alerts on the above metrics?
Goals
At Cashfree Payments, we wanted to move to a system where teams can seamlessly monitor and observe their services. This includes being able to look at traffic patterns and latencies, and the ability to trace a request's journey.
With the growing number of services, we needed to move to either a managed solution or build enough capability and allocate bandwidth to be able to support different needs. We used open-tracing libraries and APIs for instrumenting tracing in our applications and configured alerting on top of it.
First Things First, Let’s Learn the Jargon
- Request Tracing & Span – A trace is a representation of a series of related distributed events that encode the end-to-end request flow through a distributed system. This is critical for distributed systems, as it allows debugging of requests that span multiple services, helping identify the source of latency or increased resource utilisation. Each span in the trace represents a single unit of work during that journey, such as an API call or database query.
In a nutshell, distributed tracing is the method of tracking application requests as they flow from frontend applications to backend services and databases. Developers can use distributed tracing to troubleshoot requests that exhibit high latency or errors.
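To make the trace/span relationship concrete, here is a minimal, stdlib-only Go sketch of the data model. The field names, operations, and durations are illustrative, not our production schema:

```go
package main

import "fmt"

// Span is a single unit of work within a trace. A trace is the set of
// spans sharing one TraceID; ParentID links each span to its caller.
type Span struct {
	TraceID    string
	SpanID     string
	ParentID   string // empty for the root span
	Operation  string
	DurationMs int64
}

// latencyBreakdown sums span durations per operation, making each
// service's share of the end-to-end latency visible.
func latencyBreakdown(spans []Span) map[string]int64 {
	out := make(map[string]int64)
	for _, s := range spans {
		out[s.Operation] += s.DurationMs
	}
	return out
}

func main() {
	// One trace: an API gateway call fanning out to a service and a DB query.
	trace := []Span{
		{TraceID: "t1", SpanID: "s1", Operation: "gateway", DurationMs: 120},
		{TraceID: "t1", SpanID: "s2", ParentID: "s1", Operation: "payments-svc", DurationMs: 80},
		{TraceID: "t1", SpanID: "s3", ParentID: "s2", Operation: "db-query", DurationMs: 35},
	}
	fmt.Println(latencyBreakdown(trace)["db-query"]) // prints 35
}
```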

- APM – An Application Performance Monitoring (APM) system gives deep visibility into applications, services, queues, and databases: it monitors requests, API hit counts, error tracking, latency, and infrastructure metrics across hosts, containers, proxies, and serverless functions.

Working of Distributed Tracing Tools
Distributed tracing tools typically support three phases of request tracing:
- Instrumentation
First, you modify your code so requests can be recorded as they pass through your stack. In our case, we added a custom metric tracing library (built on the OpenTracing libraries) to each codebase repository. Below is a Go code snippet:
```go
// InjectTracing starts a child span for an outgoing HTTP request and
// injects its context into the request headers so downstream services
// can continue the trace.
func InjectTracing(ctx context.Context, req *http.Request) (*http.Request, *HttpSpan) {
	parentSpan := opentracing.SpanFromContext(ctx)
	tracer := opentracing.GlobalTracer()
	if parentSpan != nil {
		carrier := opentracing.HTTPHeadersCarrier(req.Header)
		spanCtx, _ := tracer.Extract(opentracing.HTTPHeaders, carrier)
		op := "[external.http]_" + req.Host
		sp := tracer.StartSpan(
			op,
			ext.RPCServerOption(spanCtx),
			opentracing.ChildOf(parentSpan.Context()),
		)
		ext.HTTPMethod.Set(sp, req.Method)
		ext.HTTPUrl.Set(sp, req.URL.String())
		ext.Component.Set(sp, op)
		req = req.WithContext(opentracing.ContextWithSpan(req.Context(), sp))
		if err := tracer.Inject(sp.Context(), opentracing.HTTPHeaders, carrier); err != nil {
			fmt.Printf("error injecting trace %v", err)
		}
		return req, &HttpSpan{&sp}
	}
	return req, &HttpSpan{nil}
}
```
- Data collection
Second, configure the metrics scraper agent; the code is already instrumented in the step above. Traces are then scraped from the instrumented services by the agent and sent to the backend.
- Analysis and visualisation
Finally, the spans are unified into a single distributed trace and encoded with business-relevant tags for analysis. Depending on the distributed tracing tool being used, traces may be visualised as flame graphs or other types of diagrams.
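As a concrete (hypothetical) example of the data-collection step, a Go service using the Jaeger client can point its reporter at the agent purely via environment variables; the values below are illustrative, not our production setup:

```shell
# Example agent configuration via environment variables, assuming a
# Jaeger-style backend (the actual scraper/agent setup may differ).
export JAEGER_SERVICE_NAME="payments-svc"   # hypothetical service name
export JAEGER_AGENT_HOST="localhost"        # agent sidecar/daemonset address
export JAEGER_AGENT_PORT="6831"             # default UDP port for jaeger.thrift
export JAEGER_SAMPLER_TYPE="probabilistic"
export JAEGER_SAMPLER_PARAM="0.1"           # sample 10% of requests
```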
How We Did It for Hundreds of Microservices
At Cashfree Payments, we have 150+ code repositories in different languages such as Java, Go, and Python, each of which required a code change to instrument tracing and alert monitoring.
Let’s see how we solved this challenge seamlessly:
- We created a core team of two people responsible for end-to-end execution of this project, from POC to code change to going live. Needless to say, this included supporting any production issues caused by these changes for a period of one quarter.
- To optimise this, we created an internal custom metric tracing library and added it as a dependency in each codebase repository's client. This not only avoided duplicated code across repositories but also saved effort and time.
- To configure alerts and per-team routing rules in each code repository based on its collected metrics, we created an automation script that added the desired alerts based on language and domain in each service's repository, then raised a pull request for service owners to review and deploy to production. This automation further saved time and effort and avoided human error.
- Rather than reacting to issues introduced by this change, we proactively enabled SonarQube so that problems could be caught in the code review phase itself.
- To be doubly sure about not breaking anything in production, we first deployed services to the beta environment and, after successfully observing their behaviour, shipped the code to the production environment.
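As an illustration of what the automation script generates, a per-service error-rate alert might look like the following Prometheus-style rule. The metric names, labels, and thresholds here are hypothetical, not our production configuration:

```yaml
# Illustrative alert the automation script could template per service.
groups:
  - name: payments-svc-alerts
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{service="payments-svc",status=~"5.."}[5m]))
            / sum(rate(http_requests_total{service="payments-svc"}[5m])) > 0.05
        for: 10m
        labels:
          team: payments   # routing rule: which team gets paged
        annotations:
          summary: "payments-svc 5xx error rate above 5% for 10 minutes"
```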
Wins from Distributed Tracing
- Reduced Time to Detect and Respond
Now, if a customer reports that a feature in an application is slow or broken, the support team reviews distributed traces to determine whether it is a backend issue. Engineers then analyse the traces generated by the affected service to quickly troubleshoot the pinpointed problem.
- Alerts via Monitors and Watchdogs
Alerts are now set up on metrics, log patterns, and APM data, which helps us proactively check the system's health.
- Gaining deep insight into services
Understanding service dependencies has become very easy with an auto-generated service map from traces alongside service performance metrics and monitor alert statuses. Now, we can analyse:
- Individual database queries
- Endpoints correlated with infrastructure
- Service performance
- Comparisons between versions for rolling, blue/green, shadow, or canary deployments
- Latency breakdowns and service metrics
By viewing distributed traces, developers now understand cause-and-effect relationships between services and can optimise their performance. Tracing gives us insight into how services communicate and where time is spent in execution.
- Improve Collaboration and Productivity
In a microservice architecture, different teams own the services involved in completing a single request. Distributed tracing makes it clear where an error occurred and which team is responsible for fixing it.
- Maintain Service Level Agreements (SLAs)
Distributed tracing tools aggregate performance data from specific services, so teams can readily evaluate whether they comply with SLAs.
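As a sketch of what such an SLA check can look like: take per-request durations (e.g. from root spans) and compare a high percentile against the agreed threshold. This uses a nearest-rank percentile with illustrative numbers, not our actual tooling:

```go
package main

import (
	"fmt"
	"sort"
)

// percentile returns the nearest-rank p-th percentile of durationsMs.
func percentile(durationsMs []float64, p float64) float64 {
	sorted := append([]float64(nil), durationsMs...)
	sort.Float64s(sorted)
	idx := int(p/100*float64(len(sorted))+0.5) - 1
	if idx < 0 {
		idx = 0
	}
	if idx >= len(sorted) {
		idx = len(sorted) - 1
	}
	return sorted[idx]
}

func main() {
	// Root-span durations in ms for one service (illustrative).
	latencies := []float64{80, 95, 110, 120, 450}
	p95 := percentile(latencies, 95)
	// Hypothetical SLA: p95 latency under 300ms.
	fmt.Println(p95 <= 300) // prints false (one slow outlier blows the SLA)
}
```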
Conclusion
While implementing distributed tracing does introduce some overhead and complexity, the benefits it brings far outweigh the challenges.
As software systems continue to evolve and become more complex and distributed, distributed tracing plays an increasingly vital role in ensuring the smooth operation and performance of these systems.
In this blog, we have shared a sneak peek into our journey with distributed tracing. We can definitely say this is not the end of our tech evolution at Cashfree Payments, where we embrace the challenge of exceeding yesterday’s success!
Does this sound exciting and intriguing to you? Well then, there are some great opportunities in store for intrepid engineers just like you!
