At Cashfree Payments we keep a sharp eye on transaction Success Rates. We do our best to ensure that our Success Rate (SR) consistently exceeds 90%. Monitoring the transaction Success Rate helps us make the right decisions at the right time by rerouting transactions to stable channels. We also contact the affected banks and notify customers in case of an SR drop.

Earlier System Architecture 

Cashfree Payments had built an SR analytics system around SQL-exporter: every 10 seconds, SQL-exporter ran queries against our transaction data and pushed the results to the time-series database VictoriaMetrics. We used VMRule (VictoriaMetrics Rule) to define a ruleset that triggered VMAlert (VictoriaMetrics Alert).

However, there were some challenges with this system as mentioned below:

  • SQL queries intermittently timed out as transaction volume and cardinality grew, and scaling up the SQL database was not cost-effective.
  • The system spanned multiple components (an S3 bucket, a MySQL database, SQL-exporter, VMRule, the VictoriaMetrics TSDB, VMAlert, etc.). Debugging issues was very difficult since it lacked a rich logging system.
  • The minimum-transaction-threshold configuration did not work reliably due to the behaviour of Prometheus metrics and VictoriaMetrics storage.

What is the solution to the above problems?

We needed a solution that met the following requirements:

  • A scalable framework that can compute SR for high transaction volumes.
  • A system that can process transaction data in real time.
  • A low-latency, high-throughput system for calculating the transaction SR.
  • A system with better rule configuration and better logging.

We explored frameworks that could support the above use cases. Apache Spark and Apache Flink are both open-source, distributed data processing frameworks widely used for big data processing and analytics. Apache Spark is a third-generation data processing framework that processes data in micro-batches. Apache Flink is a fourth-generation data processing framework that processes data in near real-time. We chose Apache Flink because its advanced features fit our use cases.

Apache Flink: Architecture and Components

Apache Flink is a distributed processing engine for stateful computations over unbounded and bounded data streams. A Flink application runs on a cluster consisting of a job manager and one or more task managers. The job manager is responsible for the effective allocation and management of computing resources. Task managers are responsible for actually doing the computation.

The client system submits a job graph to the job manager. The JobGraph is the low-level representation of a Flink dataflow program that the JobManager accepts. The job manager converts the job graph into an execution graph and submits execution tasks to the task managers. The task managers perform the computation and report results back to the job manager.

Building Realtime Success Rate System in Cashfree Payments

Architecture

After we decided to use Apache Flink as our stream processing framework, there were other choices to be made for the input source, caching, and alert delivery.

Maxwell CDC pipeline: All transaction events are pushed from the payment service into the Maxwell pipeline, and we chose the Maxwell Kafka topic as our input source for calculating transaction SR.

Redis: We store all SR rules in the cache so the Flink application can fetch them quickly and apply them in real time.

Kafka: Once we identify that the SR has dropped below the configured threshold, an alert message is published to a Kafka topic and delivered to our merchants and internal teams.

Flink Application Components

Flink Kafka Consumer: Consumes the stream of transaction data from the Kafka topic.

Watermark Strategies: Out-of-order transactions are handled by configuring a watermark strategy. The notion of time used here is event time (the transaction time attached to each record). This allows for consistent results even with out-of-order or late events.
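Conceptually, a bounded-out-of-orderness watermark tracks the highest event time seen and lags it by a fixed slack, so event-time windows close only once late records are unlikely. A minimal sketch in plain Python (not Flink code; the slack value is illustrative):

```python
# Sketch: how a bounded-out-of-orderness watermark decides when an
# event-time window can safely close. `max_out_of_orderness_ms` is an
# illustrative parameter, not a value from the actual system.

class BoundedOutOfOrdernessWatermark:
    def __init__(self, max_out_of_orderness_ms):
        self.max_out_of_orderness_ms = max_out_of_orderness_ms
        self.max_event_time = 0

    def on_event(self, event_time_ms):
        # Track the highest event time (transaction time) seen so far.
        self.max_event_time = max(self.max_event_time, event_time_ms)

    def current_watermark(self):
        # Everything with event time <= watermark is considered complete.
        return self.max_event_time - self.max_out_of_orderness_ms

wm = BoundedOutOfOrdernessWatermark(max_out_of_orderness_ms=5_000)
for t in [1_000, 4_000, 2_500, 9_000]:  # out-of-order event times
    wm.on_event(t)
print(wm.current_watermark())  # 4000: windows ending at or before 4s can close
```

In Flink this corresponds to `WatermarkStrategy.forBoundedOutOfOrderness`, with the slack chosen from how late transactions typically arrive.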

DynamicKeyFunction: This performs data enrichment with a dynamic key. The subsequent keyBy hashes this dynamic key and partitions the data accordingly among all parallel instances of the following operator.
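A sketch of the idea in plain Python (the function and field names are assumptions for illustration, not the actual Cashfree code): each transaction is fanned out once per rule, tagged with a composite key built from that rule's grouping fields, so the subsequent keyBy partitions the stream per rule.

```python
# Illustrative sketch of a DynamicKeyFunction: emit one (key, txn)
# pair per rule, where the key is built from the rule's group-by fields.

def dynamic_keys(txn, rules):
    """Yield ((rule_id, composite_key), txn) pairs, one per rule."""
    for rule in rules:
        # Build the composite key from the fields this rule groups on.
        key = "|".join(str(txn[f]) for f in rule["group_by"])
        yield (rule["id"], key), txn

txn = {"gateway": "GW1", "mode": "UPI", "merchant": "M42", "status": "SUCCESS"}
rules = [
    {"id": "r1", "group_by": ["gateway", "mode"]},
    {"id": "r2", "group_by": ["merchant"]},
]
keys = [k for k, _ in dynamic_keys(txn, rules)]
print(keys)  # [('r1', 'GW1|UPI'), ('r2', 'M42')]
```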

Flink Aggregator: This uses Flink's keyBy operator to group by fields such as payment gateway, payment mode, and merchant identifier, and computes aggregate transaction counts to calculate SR in real time.
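The aggregation itself reduces to keeping two counters per key. A plain-Python sketch of the accumulator (field names are assumptions; in Flink this would be an AggregateFunction applied per window):

```python
# Sketch of the per-key aggregation: accumulate total and successful
# transaction counts, from which SR = success / total is derived.

from collections import defaultdict

counts = defaultdict(lambda: {"success": 0, "total": 0})

def aggregate(key, txn):
    acc = counts[key]
    acc["total"] += 1
    if txn["status"] == "SUCCESS":
        acc["success"] += 1

for status in ["SUCCESS", "FAILED", "SUCCESS", "SUCCESS"]:
    aggregate(("GW1", "UPI"), {"status": status})

acc = counts[("GW1", "UPI")]
print(acc["success"] / acc["total"])  # 0.75
```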

How DynamicKeyFunction works:

Rule Executor: The rule executor calculates SR using the formula SR = success transaction count / total transaction count. It then checks whether any rule threshold is breached; if so, alerts are deduplicated and a valid alert is pushed to Kafka.
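The steps above can be sketched in a few lines of plain Python (structure and names are assumptions for illustration): compute SR, compare against the rule's threshold, and suppress repeat alerts for the same rule and key.

```python
# Sketch of the rule executor: SR computation, threshold check, and
# alert deduplication before an alert would be pushed to Kafka.

already_alerted = set()

def evaluate(rule_id, key, success, total, threshold):
    sr = success / total if total else None
    if sr is None or sr >= threshold:
        return None                      # no breach
    if (rule_id, key) in already_alerted:
        return None                      # duplicate alert, suppressed
    already_alerted.add((rule_id, key))
    return {"rule": rule_id, "key": key, "sr": sr}  # valid alert

a1 = evaluate("r1", "GW1|UPI", success=70, total=100, threshold=0.90)
a2 = evaluate("r1", "GW1|UPI", success=65, total=100, threshold=0.90)
print(a1 is not None, a2 is None)  # True True: second breach is deduplicated
```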

Challenges faced

Building a production-ready application with a new framework comes with its challenges. We were able to overcome all of them and ship a production-ready system.

Restrictions on the number of windows:

Flink windows split the stream into "buckets" of finite size, over which we can apply computations like aggregation. More details are available in the official Flink windowing documentation.

We can't have many sliding windows with different window sizes. This is because the same transaction is present in all of these windows and is processed under each of them. Since all processing happens in memory, we can't hold state for a large number of windows.

So, we restricted the real-time sliding windows to four sizes: 15m, 30m, 1h and 12h.
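The state blow-up is easy to see from window assignment: with sliding windows, a single event belongs to window_size / slide windows, so every additional window size multiplies the in-memory state. A small sketch (plain Python; the sizes below are illustrative, not the production configuration):

```python
# Sketch: which sliding windows a single event falls into. Mirrors the
# standard start-alignment used by sliding event-time windows.

def assigned_windows(event_time, window_size, slide):
    """Return the [start, end) sliding windows containing event_time."""
    last_start = event_time - (event_time % slide)
    starts = range(last_start, event_time - window_size, -slide)
    return [(s, s + window_size) for s in starts]

# A 15-minute window sliding every 5 minutes: each event sits in 3 windows.
wins = assigned_windows(event_time=17, window_size=15, slide=5)
print(len(wins))  # 3
```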

Flink Application Deployment

Apache Flink is an adaptable framework that supports multiple deployment options: Flink on Amazon EKS, the AWS managed service Kinesis Data Analytics, and EMR on EC2, among others. Based on our requirements and cost feasibility, we decided to deploy the Flink application on Amazon EKS using the Flink Kubernetes Operator, which acts as a control plane managing the complete deployment lifecycle of Apache Flink applications.
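With the operator, a deployment is declared as a FlinkDeployment custom resource. A minimal sketch of such a manifest (names, versions, resource sizes, and the jar path are illustrative placeholders, not our production values):

```yaml
apiVersion: flink.apache.org/v1beta1
kind: FlinkDeployment
metadata:
  name: sr-analytics            # illustrative name
spec:
  image: flink:1.17             # illustrative Flink version
  flinkVersion: v1_17
  serviceAccount: flink
  jobManager:
    resource:
      memory: "2048m"
      cpu: 1
  taskManager:
    resource:
      memory: "2048m"
      cpu: 1
  job:
    jarURI: local:///opt/flink/usrlib/sr-analytics.jar  # illustrative path
    parallelism: 2
    upgradeMode: stateless
```

The operator watches this resource and reconciles the cluster: creating the job manager and task managers, submitting the job, and handling upgrades and restarts.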

The current state of the SR system at Cashfree Payments

The impact of using Apache Flink

  • A real-time system: We built a better system that computes SR for customers with sub-second latency. It significantly reduced noise and false alerts.
  • A more scalable system: Apache Flink scales horizontally, so the system can handle a high number of transactions. We load-tested it by creating ~20k rules across multiple merchants with 15m, 30m, 1h, and 12h sliding windows enabled, and it handled the load without any scaling issues.
  • A better logging system to debug issues: Multiple components are reduced to a single Flink component. Flink comes with rich logging configurations, and we can add custom logs for more information. We push logs to Kibana (info logs) and Datadog (warn and error logs).
  • A better monitoring and alerting system: Flink has an inbuilt dashboard and alerting to track system health, and it supports custom metrics and alerts. We have integrated Prometheus for metrics, Grafana for visualisation, and Grafana OnCall for alerts.

What next?  

Currently, we are working on Merchant-level SR analytics of our APIs using Apache Flink. 

Did this post intrigue you? If so, get in touch with us! We have some exciting opportunities in store for excellent engineers just like you!

