At Cashfree Payments, we have multiple microservices running on Kubernetes, alongside a few monolithic services on EC2, both contributing to a significant volume of logs. These logs are essential for debugging and are also retained for extended periods to meet compliance requirements. 

However, this leads to escalating storage costs. In this blog, we share our insights on building efficient log storage systems to address these challenges.

Challenges for Storing Logs

  • Log explosion:
    As the business grows, existing services produce more logs and new services are added. This increases storage requirements and drives up cost.
  • Identifying the optimal storage:
    • ELK (Kibana): The growing log volume and the need for a longer retention policy added enormous cost to our ELK-based logging infrastructure.
    • CloudWatch: Monolithic legacy services running on EC2 were costly because we used CloudWatch for log storage.

How to Solve a Log Explosion?

To solve this, we ran an initial round of log-reduction exercises for our microservices. Here is how we did it:

  • Changed the log format from text to JSON: Switching to JSON enabled easier parsing of logs, simplified field searches using JSON keys, and reduced the index size.
  • Removed redundant logs: We identified and eliminated logs that were no longer needed.

As a result, we successfully reduced the daily log volume by 30%.
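
The switch to structured logging can be sketched with a minimal, stdlib-only JSON formatter. The field names below are illustrative, not our actual log schema:

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON line, so fields can be
    parsed and searched by key instead of by free-text matching."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        return json.dumps(payload)

# Attach the formatter to a handler as usual
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("payments")
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.info("order created")
```

Each line is now a self-describing JSON object, which is what makes key-based searches in Kibana cheap and keeps the index compact.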

How Did We Optimise the Storage?

Despite efforts to reduce log volume, we continued to incur significant ELK (Kibana) costs, with a retention period limited to just 12 days. To address this, we explored using a data lake, which allowed us to process data at a much lower cost and with fewer resources. By leveraging S3 storage, which is more cost-effective than other data storage options, we were able to store more logs at a reduced cost.

Comparison of ELK (Kibana) and Data Lake Table 

Since ELK (Kibana) had lower latency but was considerably more expensive, and the Data Lake solution offered significant cost benefits at the expense of higher latency, we developed a hybrid solution for our log storage and retention. As part of this initiative, we also addressed issues related to our log archival procedures.

We resolved almost 98% of our debugging use cases by reviewing logs from the previous four to five days. However, we discovered that the debugging process occasionally required logs older than five to seven days.

Initially, we indexed all keys in the ELK cluster present in the logs, but most were not utilized in our search queries. To optimise this, we limited the keys by using a curated list of the most frequently used indexed keys in ELK.
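
One common way to cap the indexed keys in Elasticsearch is an index template that disables dynamic mapping and declares only the curated fields; anything outside the list is stored in the document source but not indexed. The field list and index pattern below are illustrative, not our production set:

```python
import json

# Illustrative curated field list -- not the actual production set.
CURATED_KEYS = ["timestamp", "level", "service", "trace_id", "message"]

def build_index_template(fields):
    """Build an Elasticsearch index template that indexes only `fields`.

    "dynamic": False tells Elasticsearch to keep unknown keys in _source
    without building index structures for them.
    """
    return {
        "index_patterns": ["app-logs-*"],
        "template": {
            "mappings": {
                "dynamic": False,
                "properties": {name: {"type": "keyword"} for name in fields},
            }
        },
    }

template = build_index_template(CURATED_KEYS)
# Applied to the cluster with: PUT _index_template/app-logs
print(json.dumps(template, indent=2))
```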

ELK (Kibana) and Data Lake for Application Logs

  1. Reduced the log retention in ELK from 12 days to 5 days. This shrank our cluster size by nearly 50%, cutting our ELK expenditure.
  2. Reduced the number of indexed keys. Since every key was indexed initially, indexing consumed far more memory than necessary. By limiting the set of indexed keys, we reduced the storage and compute required to build and store indexes, shrinking our storage footprint.
  3. Added a review process for indexing new keys. This prevented redundant keys from being added to the index list.
  4. Set log retention in the Data Lake table to 30 days. This allows us to troubleshoot issues with logs up to 30 days old.
  5. Deprecated the existing archival process and moved to a new archival flow for long-term retention.
    • We used the Parquet files from the Data Lake table instead of raw text files for archival.
    • File transfer cost dropped by a factor of 100.
    • Metadata overhead became negligible as the average file size increased.
    • Since we moved the processed files to the S3 Glacier Instant Retrieval storage class, we can store the logs as Data Lake tables with no retrieval delay.
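
The retention-and-archival split above maps naturally onto an S3 lifecycle configuration: transition processed files to Glacier Instant Retrieval after the Data Lake window, then expire them at the end of the compliance period. The bucket prefix and day counts below are illustrative assumptions, not our exact rules:

```python
# Illustrative lifecycle rules: after the 30-day Data Lake window, move
# files to Glacier Instant Retrieval; expire them after the (assumed)
# compliance retention period. Prefix and day values are made up.
LIFECYCLE_CONFIG = {
    "Rules": [
        {
            "ID": "datalake-logs-retention",
            "Filter": {"Prefix": "datalake/app_logs/"},
            "Status": "Enabled",
            "Transitions": [
                {"Days": 30, "StorageClass": "GLACIER_IR"}
            ],
            "Expiration": {"Days": 365},
        }
    ]
}

# Applied with boto3 (not executed here):
# s3 = boto3.client("s3")
# s3.put_bucket_lifecycle_configuration(
#     Bucket="logs-bucket", LifecycleConfiguration=LIFECYCLE_CONFIG)
```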

These changes have helped us decrease the cost of the logging system by more than 50%. As a result, we now have a more efficient and more capable log storage process in place.

New Log Retention Flow

Components Used in Running the Log Processing in Data Lake

  • Apache Spark framework to process the logs
  • EMR on EKS to deploy the Spark applications
    • Runs on spot nodes in our EKS cluster
  • Parquet file format to store the logs for faster retrieval and querying
    • Snappy compression along with Parquet to store the data
  • AWS Glue Catalog to store the table metadata
  • Athena to access the logs
  • Airflow for scheduling
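
A sketch of how the write step of such a pipeline might look with PySpark. The paths, table layout, and partition column are assumptions, not the actual pipeline; the Spark-dependent code is kept inside a function so the write options are visible on their own:

```python
# Options for writing log batches as snappy-compressed Parquet.
WRITE_OPTIONS = {"compression": "snappy"}
SOURCE_PATH = "s3://raw-logs/app/"        # illustrative path
TARGET_PATH = "s3://datalake/app_logs/"   # illustrative path

def run_batch():
    """Read raw JSON log lines and write them as date-partitioned Parquet.

    Requires a Spark runtime (e.g. EMR on EKS); the import is kept local
    so the module can be inspected without pyspark installed.
    """
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("log-batch").getOrCreate()
    logs = spark.read.json(SOURCE_PATH)
    (logs.withColumn("dt", F.to_date("timestamp"))
         .write.mode("append")
         .partitionBy("dt")
         .options(**WRITE_OPTIONS)
         .parquet(TARGET_PATH))
```

Partitioning by date keeps Athena scans narrow, and snappy trades a little compression ratio for fast decompression on query.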

Table Structure

Number of Logs Processed and the Run Time (The batch run happens every 20 mins)
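
Since the batch runs every 20 minutes, each run needs to know which time window of logs to process. A minimal stdlib sketch of flooring a timestamp to the most recently closed 20-minute window (the windowing scheme itself is an assumption about how such a scheduler would carve up time):

```python
from datetime import datetime, timedelta

BATCH_MINUTES = 20

def batch_window(now: datetime) -> tuple[datetime, datetime]:
    """Return (start, end) of the 20-minute window that just closed."""
    floored = now.replace(minute=now.minute - now.minute % BATCH_MINUTES,
                          second=0, microsecond=0)
    return floored - timedelta(minutes=BATCH_MINUTES), floored

start, end = batch_window(datetime(2024, 1, 1, 10, 47))
# start = 2024-01-01 10:20, end = 2024-01-01 10:40
```

Deriving the window from the clock (rather than "last run time") makes reruns and backfills idempotent: the same wall-clock input always maps to the same window.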

Monolithic Service Logs Optimisation

Some monolithic applications running on EC2 machines could not push logs to ELK (Kibana) due to architectural constraints. Initially, we used the AWS log agent to collect and push logs to CloudWatch, but the cost of storing, querying, and alerting with CloudWatch was prohibitive.

We initiated two exercises for these logs: first, optimising the logs themselves, and second, migrating them to the data lake.

Log Optimisation

We reduced the CloudWatch log retention from indefinite to 30 days. This made a significant difference in cost, but there was a limit to how much we could save this way.

Migrate Logs to Data Lake

Since we were collecting the logs directly from the EC2 machines, we used log rotation to create a new log file every hour and then push the completed file to S3. Once the logs are in S3, we process them and store them in the data lake table using the Spark data pipeline.
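
The hourly rotation can be sketched with Python's `TimedRotatingFileHandler`; the S3 push would then run on each completed file. The logger name, paths, and the `push_to_s3` helper are illustrative, not our actual agent:

```python
import logging
from logging.handlers import TimedRotatingFileHandler

def make_rotating_logger(path: str) -> logging.Logger:
    """Create a logger whose file rolls over every hour.

    Rotated files get a timestamp suffix; a separate job (or the
    handler's namer/rotator hooks) can push each completed file to S3.
    """
    handler = TimedRotatingFileHandler(path, when="H", utc=True)
    logger = logging.getLogger("monolith")
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger

def push_to_s3(local_path: str, bucket: str, key: str) -> None:
    """Upload a completed log file to S3 (requires boto3 and credentials)."""
    import boto3  # imported lazily; illustrative, not executed here
    boto3.client("s3").upload_file(local_path, bucket, key)
```

An OS-level `logrotate` config plus a cron'd sync would achieve the same hourly file-then-push behaviour without touching application code.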

This approach eliminated our reliance on CloudWatch for monolithic services, and the expenses incurred are substantially lower than what we were paying CloudWatch.

Monolith Stack Flow

For this, we had to set up a separate data pipeline, as the log structure differed from the Kibana logs. With this in place, we can store the logs for longer periods without costs growing exponentially.

Monolith Stack Table Structure

Conclusion

Cutting log storage costs is both feasible and critical for businesses handling increasing data volumes. Strategies like tiered storage, log compression, and retention policies enabled significant cost savings, with reductions of 80% in application log storage and 70% for monolithic logs. By adopting similar measures, organisations can achieve scalable, efficient, and budget-friendly log management. 

The Road Ahead 

Here are our focus points for the near future:

  • Log Archival: Moving logs from standard S3 to Glacier.
  • Audit Logs: Moving audit logs to a longer retention period for internal audits.
  • Canonical Logs: Building customer debugging logs for more visibility.