adyen

WhatsApp at Scale: Conversations and Clusters

Staff Engineer Retrospective · Bird (2019 – 2024)
100,000 channels · 12,000 RPS peak · 6 GKE clusters · 300,000 containers


Two Layers, Both Mine

The performance story lives in the conversations layer. The scaling story lives in the clusters. I owned both.

Bird Platform
customers trigger campaigns, service messages, OTP

3,000 RPS outbound requests ↓↑ 9,000 RPS webhook callbacks

I OWNED THIS

Conversations Layer

Queue Management: Receives, prioritises and paces outbound messages. Per-customer isolated queues with HIGH/MEDIUM/LOW lanes.

Rate Enforcement: Distributes load across channels respecting Meta's 1,000 RPS ceiling per number.

Webhook Processing: Ingests status callbacks (sent, delivered, read) and routes to customers.

State Store (Cloud Spanner): Message state, delivery status, and per-customer queue metadata backed by Spanner. Strong consistency across regions without per-channel MySQL sharding overhead.

routed per channel ↓↑ status webhooks per message

I OWNED THIS TOO

WhatsApp On-Premise Clusters

6 GKE clusters

100,000 channels

300,000 containers

100,000 Cloud SQL instances

↓↑ Meta Network

The Scale Challenge: 3,000 RPS Into a 1,000 RPS Ceiling

Meta enforces a hard rate limit per phone number. Large customers like Shein blew past it on day one.

The inbound pressure

Shein runs a flash sale campaign: 3,000 messages per second into the conversations layer. All valid, all expected, all must be delivered.

3,000 RPS inbound

Meta's hard ceiling

Meta accepts a maximum of 1,000 messages per second per phone number. Non-negotiable. Exceed it and messages are rejected, not queued.

1,000 RPS ceiling per number

The webhook multiplier

Every message generates 3 status callbacks from Meta: sent, delivered, read. 3,000 RPS outbound becomes 9,000 RPS of webhooks to process simultaneously.

9,000 RPS webhook storm

The routing problem: At 3,000 RPS and a 1,000 RPS ceiling, Shein needs at least 3 phone numbers active simultaneously. The conversations layer must distribute load across channels while tracking each number's live RPS budget.

Peak load on conversations layer: 3,000 RPS outbound + 9,000 RPS webhooks = 12,000 RPS total through a single service, with millisecond internal latency targets.
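The routing arithmetic above can be sketched in a few lines. This is a minimal illustration, not the production router: `Router`, `ChannelBudget`, and the one-second fixed window are assumptions made for the example, and the only figures taken from the deck are the 1,000 RPS ceiling and the three callbacks per message.

```python
import math
from dataclasses import dataclass

META_CEILING_RPS = 1_000     # per-number ceiling quoted in this deck
CALLBACKS_PER_MESSAGE = 3    # sent, delivered, read


def numbers_needed(outbound_rps: int, ceiling: int = META_CEILING_RPS) -> int:
    """Minimum phone numbers required to carry outbound_rps under the ceiling."""
    return math.ceil(outbound_rps / ceiling)


def peak_load(outbound_rps: int) -> int:
    """Outbound sends plus the status callbacks they trigger."""
    return outbound_rps + outbound_rps * CALLBACKS_PER_MESSAGE


@dataclass
class ChannelBudget:
    number: str
    ceiling: int = META_CEILING_RPS
    used: int = 0  # sends counted in the current one-second window

    @property
    def remaining(self) -> int:
        return self.ceiling - self.used


class Router:
    """Hypothetical router: send via the number with the most budget left."""

    def __init__(self, numbers):
        self.channels = [ChannelBudget(n) for n in numbers]

    def pick(self):
        best = max(self.channels, key=lambda c: c.remaining)
        if best.remaining <= 0:
            return None  # every number is at its ceiling: hold the message
        best.used += 1
        return best.number

    def tick(self):
        """Call once per second to open a fresh budget window."""
        for channel in self.channels:
            channel.used = 0
```

With three numbers, 3,000 picks in one window succeed and the next one returns `None` until `tick()` runs. A real implementation would more likely use a sliding window or token bucket so that bursts at window edges do not exceed the ceiling.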

What Broke First: The Queue and the Noisy Neighbour

The queue grew because Meta drained slowly. One customer's campaign made everyone else wait.

The problem

Shein sends 3,000 RPS into a single shared queue

Meta drains at 1,000 RPS. Queue grows at 2,000 messages per second

A 10-minute Shein campaign injects 1.8M messages; with Meta draining 1,000 per second, 1.2M of them are still queued when the campaign ends

Every other customer's messages — including OTP logins — sit behind Shein's backlog for minutes

The business impact: A user trying to log in via OTP gets their code 4 minutes late. That's an authentication failure, a support ticket, and a customer trust problem.
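The queue arithmetic can be checked with a simplified model. The sketch below assumes a single FIFO queue, constant rates, and no traffic other than the campaign; the function names are made up for illustration.

```python
def backlog_after(inbound_rps: int, drain_rps: int, seconds: int) -> int:
    """Messages still queued after `seconds` of constant inbound and drain rates."""
    return max(0, inbound_rps - drain_rps) * seconds


def fifo_wait_seconds(inbound_rps: int, drain_rps: int, arrival_second: int) -> float:
    """How long a message arriving at `arrival_second` waits behind the backlog."""
    return backlog_after(inbound_rps, drain_rps, arrival_second) / drain_rps


injected = 3_000 * 600                          # messages sent in 10 minutes
queued_at_end = backlog_after(3_000, 1_000, 600)  # net backlog when the campaign stops
otp_wait = fifo_wait_seconds(3_000, 1_000, 120)   # OTP enqueued 2 minutes in
```

Under these assumptions an OTP that arrives two minutes into the campaign waits 240 seconds, which is consistent with the 4-minute delay described above.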

The optimization journey

Phase 1: Vertical scaling. Scaled up queue processor capacity. The drain got faster, but fairness did not improve: Shein still starved everyone else, only for a slightly shorter time. Hit a ceiling quickly.

Phase 2: Priority-based queuing. Three lanes: HIGH (OTP), MEDIUM (service messages), LOW (marketing campaigns). OTP is now guaranteed fast delivery regardless of campaign volume. Fixed the critical path. Still noisy across customers within each lane.

Phase 3: Isolated customer queues. Each customer gets their own queue per priority lane. Shein's backlog of over a million messages stays in Shein's own queues. No other customer is affected. Per-customer rate limiting enforces Meta's ceiling independently.
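A minimal sketch of the Phase 3 shape: one queue per customer per lane, strict priority between lanes, and round-robin between customers inside a lane. `IsolatedQueues` and its methods are hypothetical names, and the per-customer rate limiting mentioned above is left out to keep the example short.

```python
from collections import OrderedDict, deque
from enum import IntEnum


class Lane(IntEnum):
    HIGH = 0    # OTP
    MEDIUM = 1  # service messages
    LOW = 2     # marketing campaigns


class IsolatedQueues:
    def __init__(self):
        # lane -> ordered map of customer -> that customer's own queue
        self._lanes = {lane: OrderedDict() for lane in Lane}

    def enqueue(self, customer: str, lane: Lane, message: str) -> None:
        self._lanes[lane].setdefault(customer, deque()).append(message)

    def dequeue(self):
        """Return (customer, lane, message) or None when everything is empty."""
        for lane in Lane:  # strict priority: HIGH before MEDIUM before LOW
            customers = self._lanes[lane]
            if not customers:
                continue
            customer, queue = next(iter(customers.items()))
            message = queue.popleft()
            del customers[customer]
            if queue:
                customers[customer] = queue  # rotate to the back: round-robin
            return customer, lane, message
        return None
```

In this sketch a small customer with one LOW message is served after at most one message from each other customer, however deep Shein's queue is. Strict priority can starve the LOW lane under sustained HIGH traffic, so a production scheduler would likely add weights or ageing.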

The Atom: One Channel = One WABA Phone Number

WhatsApp's on-premise container imposes a hard 1:1:1 binding between a phone number, a database, and a volume

ONE CHANNEL (e.g. +31 6 1234 5678)

GKE Pod: WhatsApp On-Premise Container Stack

coreapp

500m CPU · 2 GB

message bus

master

500m CPU · 2 GB

1 pod always

web ×N

500m CPU · 2 GB

scales with RPS

Persistent Storage

Persistent Volume

100 GB SSD

media & message store

Cloud SQL Instance

dedicated per channel

no shared MySQL possible

Hard constraint #1

WhatsApp on-premise is not multi-tenant. One MySQL instance per channel. Non-negotiable.

Hard constraint #2

Max 64 shards per account. Horizontal scale is bounded per channel.

Design choice

Most channels run <5 RPS. 500m CPU · 2 GB RAM per container was sufficient.

Growth Was Linear. The Infrastructure Was Exponential.

Every new phone number a customer onboards adds one more full copy of the stack

One customer with 50 channels

150 containers (3 per channel)

50 Cloud SQL instances

5 TB provisioned storage

→

At 100,000 channels (peak)

300,000 containers running simultaneously

100,000 Cloud SQL instances across GCP projects

~10 petabytes provisioned storage

6 GKE clusters · 100+ GCP projects

Adding one customer was a product decision. But every phone number they onboarded was an infrastructure provisioning event at full-stack cost.
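The per-channel multipliers on this slide reduce to a small calculation. The constants come from the deck (3 containers, one Cloud SQL instance, and a 100 GB volume per channel); the `Footprint` type and function name are illustrative, and storage is reported in decimal terabytes.

```python
from dataclasses import dataclass

CONTAINERS_PER_CHANNEL = 3      # coreapp, master, one web pod
CLOUD_SQL_PER_CHANNEL = 1       # dedicated instance per channel
VOLUME_GB_PER_CHANNEL = 100     # persistent volume per channel


@dataclass(frozen=True)
class Footprint:
    channels: int
    containers: int
    cloud_sql_instances: int
    storage_tb: float


def footprint(channels: int) -> Footprint:
    """Infrastructure provisioned for a given number of channels."""
    return Footprint(
        channels=channels,
        containers=channels * CONTAINERS_PER_CHANNEL,
        cloud_sql_instances=channels * CLOUD_SQL_PER_CHANNEL,
        storage_tb=channels * VOLUME_GB_PER_CHANNEL / 1_000,
    )
```

`footprint(50)` gives 150 containers, 50 instances, and 5 TB; `footprint(100_000)` gives 300,000 containers, 100,000 instances, and 10,000 TB (about 10 PB).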

Current Architecture: 6 Clusters, One Rule

Rule: One channel is assigned to exactly one cluster and never moves. Customers can span multiple clusters.

Cluster 1 · eu-west1

~16,700 channels

~50,000 containers

GCP projects: Cloud-SQL-01 → 17

channels from many customers

Cluster 2 · eu-west4

~16,700 channels

~50,000 containers

GCP projects: Cloud-SQL-18 → 34

channels from many customers

Cluster 3 · us-east1

~16,700 channels

~50,000 containers

GCP projects: Cloud-SQL-35 → 51

channels from many customers

Cluster 4 · us-west2

~16,700 channels

~50,000 containers

GCP projects: Cloud-SQL-52 → 68

channels from many customers

Cluster 5 · ap-southeast1

~16,700 channels

~50,000 containers

GCP projects: Cloud-SQL-69 → 85

channels from many customers

Cluster 6 · eu-north1

~16,500 channels

~49,500 containers

GCP projects: Cloud-SQL-86 → 100

channels from many customers

Why pin each channel to one cluster?
A channel's state (MySQL, volume, pod) is fully contained in one cluster. No cross-cluster state to synchronise. Provisioning and deprovisioning are local operations.

The cost of this rule:
A customer with 50 channels could span 3 different clusters. Debugging their incident meant hopping clusters. Support tooling had to federate across all 6.
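The rule and its cost can be illustrated with a small placement registry. This is a sketch under assumptions: the least-loaded placement policy and the `ChannelRegistry` API are invented for the example and are not taken from Bird's tooling.

```python
from collections import defaultdict


class ChannelRegistry:
    """Sticky channel-to-cluster placement with a per-customer lookup."""

    def __init__(self, clusters):
        self._load = {cluster: 0 for cluster in clusters}
        self._placement = {}                   # channel -> cluster, never changes
        self._by_customer = defaultdict(set)   # customer -> channels

    def assign(self, customer: str, channel: str) -> str:
        if channel in self._placement:
            return self._placement[channel]    # a channel never moves
        cluster = min(self._load, key=self._load.get)  # assumed policy: least loaded
        self._load[cluster] += 1
        self._placement[channel] = cluster
        self._by_customer[customer].add(channel)
        return cluster

    def clusters_for(self, customer: str) -> set:
        """Every cluster support has to look at for this customer."""
        return {self._placement[ch] for ch in self._by_customer.get(customer, ())}
```

Because placement follows cluster load rather than customer identity, three channels for one customer can land on three clusters, which is exactly why support tooling had to federate.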

The Three Hard Walls

Each wall was a GCP or Kubernetes hard limit: not something you tune, but something you architect around

1,000

Nodes per GKE Cluster

GKE default hard limit. At ~500m CPU per container (three containers per channel) and ~16 channels/node, this capped each cluster at roughly 16,000 channels before degraded scheduling.

Hit first → spawned Cluster 2

Trigger: node pressure alerts, evictions

100

Cloud SQL Instances per GCP Project

Default GCP quota. 100k channels meant ~16,700 dedicated Cloud SQL instances per cluster at scale, far more than one project could hold. Quota raise requests were slow. Multi-project became the only path.

Chronic pain → 100+ GCP projects

Trigger: provisioning failures for new channels

128

Persistent Disk Attachments per Node

GCE limit on how many PDs can attach to one VM. With 100 GB volumes per channel, scheduling had to stay aware of disk attachment counts, not just CPU/RAM. Added a constraint to the bin-packing problem.

Ongoing constraint → scheduling complexity

Trigger: pending pods with "too many volume attachments"

The compounding problem: Hitting Wall 1 forced multi-cluster. Each new cluster then needed its own range of Cloud SQL projects, which accelerated Wall 2. All three walls were hit within 18 months of reaching scale.
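The three walls can be expressed as simple capacity checks. The limits (1,000 nodes, ~16 channels per node, 128 disk attachments) are the ones quoted on this slide; the per-channel request of 1,500m CPU and 6 GB assumes three containers with a single web pod, and the `Node` type is hypothetical.

```python
import math
from dataclasses import dataclass

NODE_LIMIT = 1_000        # nodes per GKE cluster, as quoted above
CHANNELS_PER_NODE = 16    # approximate density quoted above
PD_LIMIT = 128            # persistent disk attachments per node


def cluster_channel_ceiling(nodes: int = NODE_LIMIT,
                            per_node: int = CHANNELS_PER_NODE) -> int:
    return nodes * per_node


def projects_needed(instances: int, quota_per_project: int) -> int:
    """GCP projects required for a given per-project Cloud SQL quota."""
    return math.ceil(instances / quota_per_project)


@dataclass
class Node:
    cpu_m: int               # allocatable CPU in millicores
    ram_gb: int              # allocatable memory
    channels: int = 0        # channels already scheduled here
    pd_slots: int = PD_LIMIT

    def fits_another(self, cpu_m: int = 1_500, ram_gb: int = 6, disks: int = 1) -> bool:
        """True only if CPU, RAM and disk attachments all have room."""
        n = self.channels + 1
        return (n * cpu_m <= self.cpu_m
                and n * ram_gb <= self.ram_gb
                and n * disks <= self.pd_slots)
```

The third condition is the one that surprises people: a large node can have spare CPU and RAM and still reject a channel because its attachment slots are full.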
  

The Operational Tax: Linear Complexity, Exponential Toil

Every one of these needed to be automated and kept working across 6 clusters simultaneously

🚀

Channel provisioning: New channel = create namespace, deploy 3 containers, provision Cloud SQL, attach PV, register with control plane. Fully automated, but any step failing left the channel in a broken half-live state.

🔍

Debugging a broken channel: First: which of 6 clusters? Then: which namespace? Then: is the PV mounted? Is Cloud SQL reachable? Which GCP project is its instance in? A support ticket was a 30-min cluster-hopping exercise without tooling.

🔄

WhatsApp version upgrades: Meta releases new on-premise versions. Rolling 100k channels across 6 clusters without downtime required sequenced rollouts, rollback windows, and health checks per cluster, for every release.

💾

Cloud SQL backups: 100k instances × automated backup windows = significant GCP billing and scheduling. Retention policies, snapshot audits, and restoration testing were recurring operational overhead.

⬆️

Cluster Kubernetes upgrades: GKE version upgrades had to be coordinated across all 6 clusters. A bad upgrade on Cluster 1 would pause work on 2–6. Control plane upgrades required maintenance windows.

📊

Quota management: ~100 GCP projects required continuous quota monitoring. A new Cloud SQL project had to be pre-created before the current one hit 100 instances. Quota request lead times were 2–5 business days.

The engineering cost didn't scale with users. It scaled with channels. Adding 10,000 new channels meant 10,000 new things that could break at 3am.
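One way to avoid the half-live state described under channel provisioning is to pair every step with an undo action and unwind on failure. The sketch below shows that pattern in isolation; the step names mirror the deck, but the `provision` function and its callables are illustrative and do not call any real GCP or Kubernetes API.

```python
def provision(steps):
    """Run (name, do, undo) steps in order; on failure, undo completed steps in reverse.

    Returns (True, None) on success or (False, failed_step_name) after rollback.
    """
    completed = []
    for name, do, undo in steps:
        try:
            do()
        except Exception:
            for _, rollback in reversed(completed):
                rollback()  # best effort: a real system would also log and retry
            return False, name
        completed.append((name, undo))
    return True, None


def demo():
    """Simulate a failure while attaching the PV and record what happened."""
    log = []

    def fail():
        raise RuntimeError("volume attach failed")

    steps = [
        ("namespace", lambda: log.append("create namespace"), lambda: log.append("delete namespace")),
        ("containers", lambda: log.append("deploy containers"), lambda: log.append("remove containers")),
        ("cloud_sql", lambda: log.append("create cloud sql"), lambda: log.append("delete cloud sql")),
        ("pv", fail, lambda: log.append("detach pv")),
        ("register", lambda: log.append("register"), lambda: log.append("unregister")),
    ]
    return provision(steps), log
```

In `demo()` the failed PV step triggers deletion of the Cloud SQL instance, the containers, and the namespace, in that order, and the registration step never runs.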

Strategic Trade-offs: What We Chose and What We Sacrificed

Every architectural decision was a deliberate bet. These are the three that mattered most.

DECISION: Single shared queue for all customers initially

One queue, simpler to build and operate. Priority lanes added later. Customer isolation added last after the noisy neighbour incident.

→

GAINED:
Fast to ship. No per-customer queue provisioning overhead. Simple mental model early on.

SACRIFICED:
Fairness. One large customer (Shein) made the entire platform feel slow for everyone else. OTP messages waited behind marketing campaigns.

DECISION: 500m CPU / 2 GB RAM minimum per channel

Allocate the minimum viable resources. Most channels run <5 RPS and rarely spike.

→

GAINED:
Dense node packing. Cost-efficient at scale. High channel-per-node density.

SACRIFICED:
Burst headroom. High-traffic campaigns caused OOM kills or throttling until manual HPA tuning.

DECISION: Cloud SQL over self-managed MySQL

Use GCP managed MySQL. No DBAs needed, automated backups, replication handled by GCP.

→

GAINED:
Zero MySQL ops overhead at launch. Automated failover. Encryption at rest out of the box.

SACRIFICED:
Hit Cloud SQL instance quota (100/project) with no good workaround. Cost at 100k instances was significant vs. self-managed.

Two Futures: The Fork in the Road

When on-premise hit its architectural ceiling, Bird had two credible paths forward

PATH A: Not Taken

MySQL Proxy Layer + K8s Operator

Build a MySQL proxy (ProxySQL / Vitess) that makes a shared MySQL cluster appear as isolated instances to each WA container: one schema per channel, one connection pool per channel.

A Kubernetes Operator handles the full channel lifecycle: provision, upgrade, migrate, deprovision, automatically, across all clusters.

This is what would have enabled Bird to pitch itself to Meta as a WhatsApp hosting partner. We had the operational depth. The missing piece was the proxy abstraction.

Outcome: Keep on-premise, reduce cost 10×, pitch Meta for partnership

PATH B: Taken ✓

WhatsApp Cloud API Migration

Meta launched the WhatsApp Cloud API, a hosted version that eliminates the on-premise container entirely.

Bird migrated channels to the Cloud API, replacing the entire on-premise stack with API calls. No containers. No volumes. No Cloud SQL. No cluster management.

The operational tax evaporates. Engineering capacity shifts from keeping the lights on to building product features.

Outcome: Zero infra burden, faster feature velocity, Meta owns availability

    Path A was the right technical bet. Path B was the right strategic bet. Meta building Cloud API meant the proxy layer would have competed with the platform vendor, a losing position long-term.
  

Future Architecture: What the Migration Actually Looks Like

Before and after: the same business capability, radically different infrastructure footprint

Before: On-Premise (per channel)

Bird Platform

↓ deploys & manages

WA Container Stack
coreapp + master + web

↓ connects to

Cloud SQL Instance
dedicated per channel

↓ mounts

Persistent Volume
100 GB SSD

↓ tunnels to

Meta Infrastructure
WhatsApp network

300,000 containers · 100k Cloud SQL · ~10 PB storage · 6 clusters · 100+ GCP projects

After: WhatsApp Cloud API

Bird Platform

↓ HTTPS API call

WhatsApp Cloud API
Meta-hosted · globally available

↓ delivers to

End Users

0 containers · 0 Cloud SQL instances · 0 volumes · 0 clusters to manage

Trade-off accepted: Availability SLA now owned by Meta, not Bird. Outage = vendor dependency. Rate limits enforced per phone number, not per cluster.

The Road Not Taken: MySQL Proxy Layer

How the architecture would have looked if Bird had solved the constraint instead of escaping it

Proposed: Proxy-mediated MySQL per cluster

WA Container (channel X)
connects to mysql://proxy:3306/channel_x

↓

ProxySQL / Vitess Proxy (per cluster)
routes to schema channel_x · enforces isolation · connection pooling

↓

Shared MySQL cluster (1 per GCP project)
1,000 schemas per instance · schema-per-channel isolation

Result: 100,000 channels → ~100 MySQL instances (vs. 100,000 Cloud SQL instances)
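The consolidation arithmetic and the routing idea can be sketched as follows. This assumes channels are numbered sequentially and packed 1,000 schemas per instance, as on this slide; the function names and the `proxy` hostname in the DSN are placeholders, and no real ProxySQL or Vitess configuration is implied.

```python
import math

SCHEMAS_PER_INSTANCE = 1_000  # density proposed on this slide


def instances_needed(channels: int, per_instance: int = SCHEMAS_PER_INSTANCE) -> int:
    return math.ceil(channels / per_instance)


def route(channel_index: int, per_instance: int = SCHEMAS_PER_INSTANCE):
    """Map a channel to (backend instance index, schema name)."""
    return channel_index // per_instance, f"channel_{channel_index}"


def container_dsn(channel_index: int) -> str:
    """What the WA container would see: one endpoint, its own schema."""
    _, schema = route(channel_index)
    return f"mysql://proxy:3306/{schema}"
```

The container keeps believing it has a private database, while `route` decides which shared instance actually holds the schema.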

K8s Operator: full lifecycle automation

Provision: The operator watches ChannelCR. On creation: creates namespace, deploys pod stack, creates schema on shared MySQL via proxy, mounts volume. Single CRD triggers the entire workflow.

Upgrade: The operator rolling-restarts channel pods with new WA image. Tracks upgrade state per channel. Supports paused rollouts. No human intervention per cluster.

Migrate / Deprovision: On customer churn, the operator drains channel, snapshots schema, releases PV, removes from proxy routing table. Fully reversible. Audit trail in CRD status.

Why we didn't build it: WhatsApp Cloud API made it moot. But the Operator pattern was the missing abstraction: we were doing this manually across 6 clusters.
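The heart of the Operator pattern is an idempotent reconcile loop that compares desired state with recorded status. The sketch below models only that loop in plain Python: `ChannelCR`, `STEPS`, and the `actuator` callback are hypothetical stand-ins, and nothing here uses a real Kubernetes client.

```python
from dataclasses import dataclass, field

# Provisioning steps named on this slide, in order.
STEPS = ("namespace", "pod_stack", "schema", "volume")


@dataclass
class ChannelCR:
    name: str
    desired_image: str
    status: dict = field(default_factory=lambda: {"done": [], "image": None})


def reconcile(cr: ChannelCR, actuator) -> dict:
    """One pass: perform only what is missing, then record it in status."""
    for step in STEPS:
        if step not in cr.status["done"]:
            actuator(cr.name, step)
            cr.status["done"].append(step)  # progress survives a crash mid-way
    if cr.status["image"] != cr.desired_image:
        actuator(cr.name, f"rollout:{cr.desired_image}")
        cr.status["image"] = cr.desired_image
    return cr.status
```

Running `reconcile` twice in a row performs no work the second time, and changing `desired_image` triggers exactly one rollout. That property is what removes the per-cluster manual steps.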

Staff-Level Hindsight: What I'd Architect Differently

Three things I know now that would have changed Day 1 decisions, and what Bird uniquely learned

What we did: Started with a single shared queue. When it became a noisy neighbour problem, we added priority lanes reactively. Customer isolation came last, after the production incident.

What I'd do today: Design for fairness from day one. A shared queue is fast to build but a single large customer will always find the ceiling. Per-customer isolated queues with priority lanes are the correct default for a multi-tenant messaging platform.

What we did: Discovered Cloud SQL quota limits and GKE node limits by hitting them in production. Each wall was a reactive response: new GCP project, new cluster, quota raise request under pressure.

What I'd do today: Model quota ceilings into capacity planning before onboarding the 1,000th channel, not the 95,000th. Channels → containers → nodes → Cloud SQL → GCP projects. Automated alerts at 70% of each ceiling.

What we did: Processed outbound messages and inbound webhooks in the same pipeline. The 3x webhook amplification (9,000 RPS) competed with outbound sends for the same resources during peak campaigns.

What I'd do today: Separate the write path (outbound sends) from the read path (webhook ingestion) from the start. The webhook storm at peak is predictable and should never compete with the send pipeline for capacity.
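The 70% alerting idea from the capacity-planning point above is small enough to sketch. The ceiling values are the three walls quoted earlier in this deck; the dictionary keys and the `quota_alerts` function are illustrative names, and the usage figures in the example are invented.

```python
CEILINGS = {
    "gke_nodes_per_cluster": 1_000,
    "cloud_sql_per_project": 100,
    "pd_attachments_per_node": 128,
}


def quota_alerts(usage: dict, ceilings: dict = CEILINGS, threshold: float = 0.70) -> list:
    """Names of every resource at or above `threshold` of its ceiling, sorted."""
    return sorted(
        name for name, used in usage.items()
        if used / ceilings[name] >= threshold
    )


# Invented sample reading: nodes and disk attachments are past 70%, Cloud SQL is not.
example = quota_alerts({
    "gke_nodes_per_cluster": 720,
    "cloud_sql_per_project": 65,
    "pd_attachments_per_node": 90,
})
```

Given the 2–5 business day lead time on quota requests mentioned earlier, the threshold would in practice be tuned to the onboarding rate rather than fixed at one value.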

The unique asset Bird built was not the infra. It was the operational knowledge: 100k channels, 12,000 RPS peak, every failure mode from noisy neighbour to quota wall documented in production. That is what a Meta partnership pitch would have been worth.