How to Build a Reliable Kafka Data Processing Pipeline, Focusing on Contention, Uptime and Latency

10 Feb 2024

System Overview

The company sends around 13 billion push notifications daily, with a team of 10 engineers managing the backend.
The original system used synchronous PostgreSQL writes, blocking the HTTP request until the write completed.
Traffic spikes occurred at specific times (hourly and half-hourly) due to customers scheduling notifications.
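
For a sense of scale, the daily volume above can be converted into an average per-second rate. This is only back-of-the-envelope arithmetic on the 13 billion figure; because of the scheduled spikes, the peak rate is higher than this average.

```python
# Average notification rate implied by roughly 13 billion notifications per day.
NOTIFICATIONS_PER_DAY = 13_000_000_000
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds

average_per_second = NOTIFICATIONS_PER_DAY / SECONDS_PER_DAY
print(f"{average_per_second:,.0f} notifications per second on average")  # roughly 150,000
```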

The team introduced a queuing layer using Apache Kafka, making the system asynchronous so that the HTTP request no longer waits for the PostgreSQL write.
Kafka is a distributed streaming platform that uses topics to logically group messages.
Consumers pull messages from Kafka topics and process them.
Each topic is split into partitions: numbered logs of messages that can be consumed independently by multiple instances of a consumer.
Each message in a partition has a numerical ID called an offset, which starts at zero and increases as messages are appended.
Subpartition processing is a technique for processing the messages within each partition concurrently, in memory, which allows more concurrency and flexibility than the partition count alone.
Created more in-memory queues, with updates for the same row going to the same queue, so that they are processed in order while updates for different rows proceed concurrently.
Added a cap on the number of messages each consumer instance can hold in memory to prevent overloading.
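
The subpartition idea can be sketched in a few lines. This is a minimal illustration, not the team's implementation: it assumes messages are routed by row ID so that updates to one row stay in order while different rows proceed in parallel. The names `NUM_QUEUES`, `MAX_IN_MEMORY`, `queue_index`, and `process_partition` are hypothetical, and the database write is replaced by a placeholder.

```python
import asyncio
import zlib

NUM_QUEUES = 4        # hypothetical number of in-memory queues per partition
MAX_IN_MEMORY = 100   # hypothetical cap on messages held in memory at once


def queue_index(row_id: str, num_queues: int = NUM_QUEUES) -> int:
    """Map a row ID to a queue with a stable hash, so one row always uses one queue."""
    return zlib.crc32(row_id.encode("utf-8")) % num_queues


async def worker(queue: asyncio.Queue, cap: asyncio.Semaphore, applied: dict) -> None:
    """Drain one queue in FIFO order; a single worker per queue keeps per-row order."""
    while True:
        row_id, value = await queue.get()
        await asyncio.sleep(0)  # stand-in for the real database write
        applied.setdefault(row_id, []).append(value)
        cap.release()           # the message has left memory
        queue.task_done()


async def process_partition(messages) -> dict:
    """Fan one partition's messages out to per-row queues and wait for them to drain."""
    cap = asyncio.Semaphore(MAX_IN_MEMORY)
    queues = [asyncio.Queue() for _ in range(NUM_QUEUES)]
    applied: dict = {}
    workers = [asyncio.create_task(worker(q, cap, applied)) for q in queues]

    for row_id, value in messages:
        await cap.acquire()  # waits when MAX_IN_MEMORY messages are already buffered
        queues[queue_index(row_id)].put_nowait((row_id, value))

    for q in queues:
        await q.join()       # wait until every queued message has been applied
    for w in workers:
        w.cancel()
    await asyncio.gather(*workers, return_exceptions=True)
    return applied


if __name__ == "__main__":
    updates = [("row-1", "a"), ("row-2", "x"), ("row-1", "b")]
    print(asyncio.run(process_partition(updates)))
```

The sketch leaves out offset handling. With messages finishing out of order, a real consumer should commit an offset only once every earlier message in that partition has been processed, otherwise a restart could skip unprocessed work.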

Observed high consumer lag alongside low CPU usage, which contradicted the expectation that lagging consumers would be busy.
Implemented centralized logging to gain more observability.
Discovered that a single customer (Closely) was dominating the updates, with a single row ID receiving constant incompatible updates.
Identified that the updates were related to the "set email" method in the SDK, which was causing 4.8 million user updates to be mirrored to a single record.
Updates to the Closely app admin record were skipped, and limits were implemented to prevent customers from linking too many records together.
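
The investigation and the two mitigations can be illustrated with a small sketch. It assumes the centralized logs can be reduced to a list of row IDs, one per update; the function names, the 50% share threshold, and the `MAX_LINKED_RECORDS` value are all hypothetical, since the source gives no numbers for them.

```python
from collections import Counter

MAX_LINKED_RECORDS = 1_000  # hypothetical limit; the source does not state one


def dominant_rows(row_ids, share=0.5):
    """Return row IDs that each account for more than `share` of all updates."""
    counts = Counter(row_ids)
    total = sum(counts.values())
    return [row for row, n in counts.items() if n / total > share]


def drop_skipped(updates, skip_rows):
    """Filter out updates aimed at rows on the skip list."""
    return [u for u in updates if u["row_id"] not in skip_rows]


def can_link(existing_links: int, limit: int = MAX_LINKED_RECORDS) -> bool:
    """Allow a new link only while the record is still below the limit."""
    return existing_links < limit


if __name__ == "__main__":
    logged = ["admin"] * 9 + ["user-1"]
    hot = set(dominant_rows(logged))
    print(hot)
    print(drop_skipped([{"row_id": "admin"}, {"row_id": "user-1"}], hot))
```

Counting updates per row is what makes a hot row visible. With per-row ordering, all updates to that one row are applied one at a time, which is consistent with lag growing while CPU usage stays low.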

Shifting intensive API workloads to asynchronous workers reduces operational burden.
Subpartition queuing increases consumer concurrency.
Centralized observability is crucial in tracking down issues.
Customers can be more creative than engineering, design, and product teams in finding unexpected use cases.