Motivation
At Cashfree Payments, we move billions of dollars every day, and each of these transactions has a journey across hundreds of microservices. These high-throughput services communicate with each other using asynchronous channels such as SQS and Kafka.
For example, for every transaction we need to send webhooks and also perform post-processing of “side effects”, such as computing merchant charges, updating the ledger, and scheduling a successful transaction for settlement to merchant accounts.
Asynchronous communication ensures that:
- Services are not tightly coupled and can evolve independently.
- Faults are handled gracefully. Even if there is an issue with a component that sends webhooks or communications, there is no impact on actual transactions.
- Messages can be retried and, if consumers are idempotent, replayed in case of partial downtimes to restore the system to the right state.
Challenges
Most of the challenges of service communication can be traced back to issues with distributed systems and communication over the network.

Transactions in Distributed Systems
Transactions, by the old-school definition, should be ACID compliant (emphasis on Atomic: all or nothing), and this is really difficult to achieve in distributed systems. Lots of moving parts, loosely coupled communication using APIs, and a high probability of failures make system behaviour hard to predict.

In the above example, once service B crashes or the network times out, there is no easy way for A to make the next decision. It’s even harder if the request from the user is supposed to be atomic; for example, payments and settlements.
To solve this synchronous communication challenge, most teams add a layer of indirection using a queuing solution. While this allows for retries and brings the other benefits of typical producer-consumer patterns, it also poses an interesting challenge: how do we ensure that the database commit and the queue publish work as a single atomic transaction? Consider this example: a payment process commits and pushes to a notification queue. What would happen if: a) the payment commit fails but the notification gets delivered, leading to a bad user experience, or b) the payment succeeds but no notification is sent, so the payment eventually gets stuck and seems abandoned?
A service must atomically update the database and send messages to a queue in order to avoid data inconsistencies and bugs. The queue should also maintain the order in which messages are sent, so that the final state of every system is aligned once all messages or events are consumed.
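The dual-write hazard can be sketched as a small simulation. This is purely illustrative: the in-memory lists stand in for a real database and broker, and the class and method names are our own, not production code.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: in-memory stand-ins for a real database and message broker.
class DualWriteDemo {
    static final List<String> db = new ArrayList<>();
    static final List<String> queue = new ArrayList<>();

    // Naive dual write: the DB commit and the queue publish are two
    // independent operations with no shared transaction boundary.
    static void processPayment(String paymentId, boolean crashAfterCommit) {
        db.add(paymentId);      // 1) payment committed to the database
        if (crashAfterCommit) {
            return;             // 2) service crashes before publishing
        }
        queue.add(paymentId);   // 3) notification published to the queue
    }

    public static void main(String[] args) {
        processPayment("pay_1", false); // happy path: both sides agree
        processPayment("pay_2", true);  // crash window hit: DB has it, queue doesn't
        System.out.println("db    = " + db);
        System.out.println("queue = " + queue); // pay_2's event is lost
    }
}
```

Swapping the order of the two writes only flips the failure mode (scenario a instead of scenario b); no ordering of independent writes makes them atomic.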
Probable Solutions
There are a handful of options at this point:
2PC (Two-Phase Commit) Transactions

From Wikipedia:
In transaction processing, databases, and computer networking, the two-phase commit protocol (2PC) is a type of atomic commitment protocol (ACP). It is a distributed algorithm that coordinates all the processes that participate in a distributed atomic transaction on whether to commit or abort (roll back) the transaction.
Pros:
- Very simple to implement for the caller
Cons:
- Breaks service-level encapsulation, as cross-DB coordination is required
- Performance issues due to network coordination and locks, especially in our environments where failures are expected and traffic can be unpredictable
- Can block and stop the world if the coordinator is down
- Deadlocks 🙁
Outbox Pattern
The outbox pattern relies on the database itself to record events, with an external relayer emitting the “outbox” messages to the message broker.
In this pattern, a service that uses a relational database inserts messages/events into an outbox table as part of the local transaction. A service that uses a NoSQL database appends the messages/events to the attribute of the record (e.g. document or item) being updated.
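For the relational case, a minimal outbox table might look like the following. This is an illustrative schema; the column names and types are assumptions, not our production DDL.

```sql
CREATE TABLE payment_event (
  id             BIGINT AUTO_INCREMENT PRIMARY KEY,
  external_id    VARCHAR(64)    NOT NULL,
  payment_status VARCHAR(32)    NOT NULL,  -- Completed, Pending, etc.
  amount         DECIMAL(19, 4) NOT NULL,
  created_at     TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
```

Because the insert into this table happens inside the same local transaction as the domain update, the event row exists if and only if the update committed.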
CDC (change data capture) systems connect to database write-ahead logs and use them to propagate changes through queuing systems such as Kafka. They work on the same principle as a database replica, connecting to the log and interpreting each statement.
Given the highly consistent nature of this data, CDC is mostly used in data engineering for creating warehouses, lakes, etc. It also presents a great way of generating event streams outside the service with high consistency. Using a broker such as Kafka adds durability, multiple consumers, and replayability, which allows new consumers and extensions to be added easily.
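As an example, a CDC connector streaming the outbox table into Kafka could be configured roughly like this. This is a sketch using Debezium's MySQL connector; the connector name, hostnames, and database/table names are illustrative assumptions.

```json
{
  "name": "payment-outbox-connector",
  "config": {
    "connector.class": "io.debezium.connector.mysql.MySqlConnector",
    "database.hostname": "payments-db",
    "database.include.list": "payments",
    "table.include.list": "payments.payment_event",
    "topic.prefix": "cdc"
  }
}
```

With such a configuration, every committed row in the outbox table appears as a change event on a `cdc.payments.payment_event` topic, which a downstream relayer can consume.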
CDC Outbox Pattern
For cross-service coordination, we have started moving towards a CDC-backed outbox pattern. We use a relayer layer that also converts event data to a proto message, creating a contract between services.
How it works:

The relay layer takes care of cross-cutting concerns such as monitoring, alerting, dashboards, retries, and de-duplications.
This ensures that consumers receive a highly reliable event stream backed by a proto-based contract, which keeps communication overhead low.
Consider the classes below: one for a payment, and another carrying a domain event on a change of payment status, which downstream apps might be interested in.
@Entity
@Table(name = "payment")
class Payment {
    @Id
    @GeneratedValue
    private Long id;
    private String externalId;
    private BigDecimal amount;
    // Skipped rest for brevity
}

@Entity
@Table(name = "payment_event")
class PaymentEvent {
    @Id
    @GeneratedValue
    private Long id;
    private BigDecimal amount;
    private String paymentStatus; // Completed, Pending, etc.
    private String externalId;
}
Now, we publish domain updates and outgoing events (hence, the outbox) to an events table within the same transaction:
@Transactional
void process(Payment payment) {
    PaymentEvent paymentEvent = new PaymentEvent(payment);
    entityManager.persist(payment);
    entityManager.persist(paymentEvent);
}
On the relayer layer, we configure a simple converter that understands the columns of the payment events table above and converts them into the defined proto objects. The relayer layer adds an additional 200 microseconds of latency at the 99th percentile, while giving its consumers a defined proto contract and out-of-the-box observability and monitoring across the entire pipeline.
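The proto contract for such an event might look like the following. This is a hypothetical definition whose field names simply mirror the payment event columns; the package and field numbering are assumptions.

```protobuf
syntax = "proto3";

package payments.events;

// Hypothetical contract mirroring the payment_event outbox table.
message PaymentEvent {
  string external_id    = 1;
  string payment_status = 2; // Completed, Pending, etc.
  string amount         = 3; // decimal carried as a string to avoid float loss
}
```

Because the contract lives in a shared repo and fields are only ever added, never renumbered, producers and consumers can evolve independently.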
Relayer
In our systems, the Relayer is a simple Java (Spring Boot) service which listens to the Kafka topics generated by the CDC provider.
Based on defined configurations, it performs two tasks:
- Unmarshal the incoming message on a Kafka topic and decide whether it belongs to a table for which conversion is required.
- If a configuration is present, convert the incoming JSON blob to the predefined proto contract from the configuration and publish it to the target Kafka topic.
This way, any application that needs to consume an event can read directly from the target topic and get the proto from the common git repo for proto events. From a platform perspective, we also generate sources from the protos and publish them to Nexus for applications to use directly.
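The two relayer tasks can be sketched as follows. The routing table, topic name, and field mapping are illustrative assumptions; a real relayer would deserialize the CDC JSON and build the actual proto message rather than a plain map.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative sketch of the relayer's routing and conversion decision.
class RelayerSketch {
    // Hypothetical configuration: source table -> target Kafka topic.
    static final Map<String, String> routes =
            Map.of("payment_event", "payments.events.v1");

    // Task 1: decide whether this table has a conversion configured.
    static boolean conversionRequired(String table) {
        return routes.containsKey(table);
    }

    // Task 2: map the decoded CDC row onto the outgoing event's fields
    // (stands in for building the actual proto message).
    static Map<String, String> convert(Map<String, String> row) {
        Map<String, String> event = new LinkedHashMap<>();
        event.put("external_id", row.get("externalId"));
        event.put("payment_status", row.get("paymentStatus"));
        event.put("amount", row.get("amount"));
        return event;
    }

    public static void main(String[] args) {
        Map<String, String> row = Map.of(
                "externalId", "ord_42",
                "paymentStatus", "Completed",
                "amount", "499.00");
        if (conversionRequired("payment_event")) {
            System.out.println("publish to " + routes.get("payment_event")
                    + ": " + convert(row));
        }
    }
}
```

Keeping both decisions configuration-driven means onboarding a new table is a config change, not a code change.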
Wins
The relayer layer, being heavily configuration driven, helps teams focus on business requirements and get onboarded quickly to consume cross-service events. The emphasis on proto as a message exchange format also forces teams to discuss contracts instead of relying on schemaless definitions, which can be prone to issues.
Using Kafka as the communication layer also enables new consumers to be onboarded quickly, given that both the topics and the contracts are public. It has helped us scale up to millions of events per second at peak while remaining highly available and eventually consistent in our post-processing systems.


What’s Next
While the system works well for our use cases, we still have some ground to cover in making it easier for teams to add their own layers of deduplication and have visibility around idempotency and retries.
There are also a few interesting use cases around being able to clean up the events table, which we will soon pick up.
If you find this interesting, get in touch with us for exciting opportunities for engineers like you!
