YouTube video summary

How to Build a Reliable Kafka Data Processing Pipeline, Focusing on Contention, Uptime and Latency

10 Feb 2024 · 2 min summary · From InfoQ

System Overview

The company sent around 13 billion push notifications daily, with a team of 10 engineers managing the backend.
The original system used synchronous PostgreSQL writes, blocking the HTTP request until the write completed.
Traffic spikes occurred at specific times (hourly and half-hourly) due to customers scheduling notifications.

Solution Implementation

Introduced a layer of queuing using Apache Kafka, making the system asynchronous.
Kafka is a distributed streaming platform that uses topics to logically group messages.
Each message has a numerical ID called an offset, assigned per partition, that starts at zero and increases as messages are appended.
Consumers pull messages from Kafka topics and process them.
Partitions are numbered logs of messages within a topic that can be consumed independently by multiple instances of a consumer.
Subpartition processing splits each partition's messages across several in-memory queues inside a consumer instance, so a single partition can be processed concurrently.
Created more queues, keyed so that updates for the same row land in the same queue and are processed in order, while updates for different rows are processed concurrently.
Added a cap on the number of messages each consumer instance can hold in memory to prevent overloading.
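
The subpartition idea and the in-memory cap described above can be sketched roughly as follows. This is a minimal illustration, not the speaker's implementation: the names `subpartition_for`, `process_partition`, `NUM_SUBPARTITIONS` and `MAX_IN_FLIGHT` are hypothetical, messages are assumed to be `(offset, row_id, payload)` tuples, and the Kafka client itself is left out.

```python
import queue
import threading
import zlib

NUM_SUBPARTITIONS = 4  # hypothetical number of in-memory queues per partition
MAX_IN_FLIGHT = 100    # hypothetical cap on messages held in memory


def subpartition_for(row_id, num_subpartitions=NUM_SUBPARTITIONS):
    """Pick a queue index with a stable hash, so one row always maps to one queue."""
    return zlib.crc32(row_id.encode("utf-8")) % num_subpartitions


def process_partition(messages, handler,
                      num_subpartitions=NUM_SUBPARTITIONS,
                      max_in_flight=MAX_IN_FLIGHT):
    """Process (offset, row_id, payload) tuples from one partition.

    Updates for the same row are handled in order by a single worker;
    different rows may be handled concurrently. Returns a list of
    (offset, exception) pairs for messages whose handler raised.
    """
    # Split the cap across the queues; put() blocks when a queue is full,
    # which stops the consumer from pulling more than it can hold.
    per_queue = max(1, max_in_flight // num_subpartitions)
    queues = [queue.Queue(maxsize=per_queue) for _ in range(num_subpartitions)]
    errors = []
    errors_lock = threading.Lock()

    def worker(q):
        while True:
            item = q.get()
            if item is None:  # sentinel: no more messages for this queue
                break
            try:
                handler(*item)
            except Exception as exc:  # keep draining so producers never block forever
                with errors_lock:
                    errors.append((item[0], exc))

    threads = [threading.Thread(target=worker, args=(q,)) for q in queues]
    for t in threads:
        t.start()
    for offset, row_id, payload in messages:
        queues[subpartition_for(row_id, num_subpartitions)].put((offset, row_id, payload))
    for q in queues:
        q.put(None)
    for t in threads:
        t.join()
    return errors
```

A real consumer would also need to track which offsets have finished processing before committing them back to Kafka, since messages in different queues can complete out of order; that bookkeeping is omitted here.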

Issue Identification and Resolution

Observed high lag and low CPU usage, contradicting expectations.
Implemented centralized logging to gain more observability.
Discovered that a single customer (Closely) was dominating the updates, with a single row ID receiving constant incompatible updates.
Identified that the updates were related to the "set email" method in the SDK, which was causing 4.8 million user updates to be mirrored to a single record.
Updates to the Closely app's admin record were skipped, and limits were implemented to prevent customers from linking too many records together.
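
The kind of analysis that surfaces a dominating row, and the kind of limit added afterwards, might look like the sketch below. It assumes the centralized logs can be reduced to a sequence of row IDs, one per update; `top_rows`, `can_link` and `MAX_LINKED_RECORDS` are hypothetical names and values, not taken from the talk.

```python
from collections import Counter

MAX_LINKED_RECORDS = 1000  # hypothetical limit on records linked to one row


def top_rows(row_ids, n=3):
    """Return (row_id, count, share_of_all_updates) for the n busiest rows."""
    counts = Counter(row_ids)
    total = sum(counts.values())
    if total == 0:
        return []
    return [(row_id, count, count / total)
            for row_id, count in counts.most_common(n)]


def can_link(current_link_count, max_linked=MAX_LINKED_RECORDS):
    """Allow a new link only while the record is below the limit."""
    return current_link_count < max_linked
```

With subpartition queues keyed by row ID, a single row that receives a large share of all updates ends up serialized in one queue, which is consistent with the high lag and low CPU usage observed.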

Lessons Learned

Shifting intensive API workloads to asynchronous workers reduces operational burden.
Subpartition queuing increases consumer concurrency.
Centralized observability is crucial in tracking down issues.
Customers can be more creative than engineering, design, and product teams in finding unexpected use cases.
