How to Build a Reliable Kafka Data Processing Pipeline, Focusing on Contention, Uptime and Latency
10 Feb 2024

System Overview
- The company sends around 13 billion push notifications daily, with a team of 10 engineers managing the backend.
- The original system used synchronous PostgreSQL writes, blocking the HTTP request until the write completed.
- Traffic spikes occurred at specific times (hourly and half-hourly) due to customers scheduling notifications.
Solution Implementation
- Introduced a layer of queuing using Apache Kafka, making the system asynchronous.
- Kafka is a distributed streaming platform that uses topics to logically group messages.
- Each message has a numerical ID called an offset, which starts at zero and increases within its partition as new messages are appended.
- Consumers pull messages from Kafka topics and process them.
- A topic is split into numbered partitions, each an independent log of messages, so multiple instances of a consumer can read different partitions in parallel.
- Subpartition processing splits the messages of each partition across in-memory queues inside the consumer, so messages within a single partition can be processed concurrently.
- Created more queues so that updates to different rows are processed concurrently, while updates to the same row go through the same queue and stay in order.
- Added a cap on the number of messages each consumer instance can hold in memory to prevent overloading.
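
The subpartition idea and the in-memory cap can be sketched roughly as follows. This is a minimal illustration, not the pipeline described in the talk: it uses plain Python threads and tuples in place of a real Kafka consumer, and the names `NUM_SUBPARTITIONS`, `MAX_IN_MEMORY`, `subpartition_for` and `process_partition` are hypothetical. Offset commits and retries are left out.

```python
import queue
import threading
import zlib
from collections import defaultdict

NUM_SUBPARTITIONS = 4   # hypothetical: in-memory queues per partition
MAX_IN_MEMORY = 8       # hypothetical cap on messages held in memory at once


def subpartition_for(row_id: str) -> int:
    """Map a row ID to a queue index; the same row always maps to the same queue."""
    return zlib.crc32(row_id.encode("utf-8")) % NUM_SUBPARTITIONS


def process_partition(messages, apply_update):
    """Process (offset, row_id, payload) tuples read from one partition.

    Different rows may be handled in parallel. Updates to one row keep their
    offset order because a single worker drains each queue.
    Returns a list of (offset, exception) pairs for updates that failed.
    """
    queues = [queue.Queue() for _ in range(NUM_SUBPARTITIONS)]
    in_memory = threading.BoundedSemaphore(MAX_IN_MEMORY)
    errors = []

    def worker(q):
        while True:
            item = q.get()
            if item is None:            # sentinel: no more messages
                return
            try:
                apply_update(*item)
            except Exception as exc:    # record the failure and keep draining
                errors.append((item[0], exc))
            finally:
                in_memory.release()     # free a slot for the reader

    workers = [threading.Thread(target=worker, args=(q,)) for q in queues]
    for w in workers:
        w.start()

    for offset, row_id, payload in messages:
        in_memory.acquire()             # blocks once the cap is reached
        queues[subpartition_for(row_id)].put((offset, row_id, payload))

    for q in queues:
        q.put(None)
    for w in workers:
        w.join()
    return errors


if __name__ == "__main__":
    applied = defaultdict(list)
    lock = threading.Lock()

    def record(offset, row_id, payload):
        with lock:
            applied[row_id].append(offset)

    msgs = [(i, f"row-{i % 3}", {"n": i}) for i in range(12)]
    process_partition(msgs, record)
    print(dict(applied))
```

Routing by a hash of the row ID is one way to keep per-row ordering; the semaphore makes the reader wait instead of buffering without limit when the workers fall behind.
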
Issue Identification and Resolution
- Observed high consumer lag alongside low CPU usage, which was unexpected: a consumer that is falling behind would normally be busy.
- Implemented centralized logging to gain more observability.
- Discovered that a single customer (Closely) was dominating the updates, with a single row ID receiving constant incompatible updates.
- Identified that the updates were related to the "set email" method in the SDK, which was causing 4.8 million user updates to be mirrored to a single record.
- Updates to the Closely app admin record were skipped, and limits were added to prevent customers from linking too many records together.
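
The investigation and the two mitigations above could look something like the sketch below. It is illustrative only: the log format, the limit value and the names `MAX_LINKED_RECORDS`, `SKIPPED_ROW_IDS`, `hottest_rows`, `should_skip` and `can_link_another` are assumptions, not details from the talk.

```python
from collections import Counter

MAX_LINKED_RECORDS = 1000         # hypothetical limit, not from the source
SKIPPED_ROW_IDS = {"admin-row"}   # hypothetical ID standing in for the skipped record


def hottest_rows(updates, top_n=3):
    """Given (customer, row_id) pairs taken from logs, return the top_n
    (row_id, share_of_all_updates) pairs, most frequent first."""
    counts = Counter(row_id for _customer, row_id in updates)
    total = sum(counts.values())
    return [(row_id, n / total) for row_id, n in counts.most_common(top_n)]


def should_skip(row_id):
    """True if updates to this row are dropped rather than processed."""
    return row_id in SKIPPED_ROW_IDS


def can_link_another(linked_count):
    """True if a record that already has linked_count links may take one more."""
    return linked_count < MAX_LINKED_RECORDS
```

A row whose share of updates is far above the rest is a candidate for the kind of hot spot described above, since all of its updates queue behind one another.
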
Lessons Learned
- Shifting intensive API workloads to asynchronous workers reduces operational burden.
- Subpartition queuing increases consumer concurrency.
- Centralized observability is crucial in tracking down issues.
- Customers can be more creative than engineering, design, and product teams in finding unexpected use cases.
