YouTube video summary

How to Build a Reliable Kafka Data Processing Pipeline, Focusing on Contention, Uptime and Latency

Technology · 10 Feb 2024 · 2 min summary · From InfoQ

System Overview

  • The company sends around 13 billion push notifications daily, with a team of 10 engineers managing the backend.
  • The original system used synchronous PostgreSQL writes, blocking the HTTP request until the write completed.
  • Traffic spikes occurred at specific times (hourly and half-hourly) due to customers scheduling notifications.

Solution Implementation

  • Introduced a layer of queuing using Apache Kafka, making the system asynchronous.
  • Kafka is a distributed streaming platform that uses topics to logically group messages.
  • Each message has a numerical ID called an offset, which is assigned per partition, starts at zero, and increases as messages are appended.
  • Consumers pull messages from Kafka topics and process them.
  • Partitions are numbered logs of messages within a topic. Each partition can be consumed independently, so multiple instances of a consumer can share the work.
  • Subpartition processing is a technique used to process Kafka messages concurrently within each partition in memory, allowing for increased concurrency and flexibility.
  • Created more in-memory queues, keyed so that updates for the same row land in the same queue and are processed in order, while updates for different rows are processed concurrently.
  • Added a cap on the number of messages each consumer instance can hold in memory to prevent overloading.
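
The subpartition queuing and in-memory cap described above can be sketched roughly as follows. This is a minimal illustration, not the speaker's implementation: the class name `SubpartitionRouter`, its methods, and the choice of SHA-256 for hashing are all assumptions made for the example, and offset committing is left out entirely.

```python
import hashlib
from collections import deque


class SubpartitionRouter:
    """Hypothetical router: fans one partition's messages out to in-memory queues."""

    def __init__(self, num_queues: int, max_in_memory: int):
        self.queues = [deque() for _ in range(num_queues)]
        self.max_in_memory = max_in_memory  # cap on messages held in memory
        self.in_memory = 0

    def queue_index(self, row_id: str) -> int:
        # Stable hash, so the same row ID always maps to the same queue.
        digest = hashlib.sha256(row_id.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % len(self.queues)

    def try_enqueue(self, offset: int, row_id: str, payload: dict) -> bool:
        # At the cap, refuse new work; the caller is expected to pause polling.
        if self.in_memory >= self.max_in_memory:
            return False
        self.queues[self.queue_index(row_id)].append((offset, row_id, payload))
        self.in_memory += 1
        return True

    def pop_next(self, index: int):
        # Each queue is drained in FIFO order, which keeps per-row ordering.
        queue = self.queues[index]
        if not queue:
            return None
        self.in_memory -= 1
        return queue.popleft()
```

In this sketch, one worker per queue would call `pop_next` with its own index, so different rows proceed in parallel while updates to any single row stay ordered. A side effect of this design is that one very hot row ID serializes all of its updates behind a single queue.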

Issue Identification and Resolution

  • Observed high consumer lag alongside low CPU usage, which was unexpected because a growing backlog normally comes with busy workers.
  • Implemented centralized logging to gain more observability.
  • Discovered that a single customer (Closely) was dominating the updates, with a single row ID receiving constant incompatible updates.
  • Identified that the updates were related to the "set email" method in the SDK, which was causing 4.8 million user updates to be mirrored to a single record.
  • Updates to the Closely app admin record were skipped, and limits were implemented to prevent customers from linking too many records together.
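
The kind of check that surfaces a single dominating row, and the skip applied afterwards, might look like the sketch below. The function names, the `row_id` field, and the 50% threshold are hypothetical choices for illustration; the summary does not say how the team actually queried its logs or enforced the skip.

```python
from collections import Counter


def dominant_row(row_ids, threshold=0.5):
    """Return (row_id, share) if one row ID exceeds `threshold` of all updates."""
    counts = Counter(row_ids)
    if not counts:
        return None
    row_id, count = counts.most_common(1)[0]
    share = count / sum(counts.values())
    return (row_id, share) if share > threshold else None


def filter_updates(updates, skip_row_ids):
    """Drop updates that target rows on a skip list (a hypothetical denylist)."""
    return [u for u in updates if u["row_id"] not in skip_row_ids]
```

Feeding `dominant_row` the row IDs extracted from centralized logs gives a quick signal that one record is absorbing most of the traffic, and `filter_updates` shows the simplest form of skipping that record before it reaches the queues.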

Lessons Learned

  • Shifting intensive API workloads to asynchronous workers reduces operational burden.
  • Subpartition queuing increases consumer concurrency.
  • Centralized observability is crucial in tracking down issues.
  • Customers can be more creative than engineering, design, and product teams in finding unexpected use cases.