WhatsApp at Scale: Conversations and Clusters

Staff Engineer Retrospective  ·  Bird (2019 – 2024)
100,000 channels · 12,000 RPS peak · 6 GKE clusters · 300,000 containers



Two Layers, Both Mine

The performance story lives in the conversations layer. The scaling story lives in the clusters. I owned both.

Bird Platform
customers trigger campaigns, service messages, OTP
3,000 RPS outbound requests ↓↑ 9,000 RPS webhook callbacks
I OWNED THIS
Conversations Layer
Queue Management: Receives, prioritises and paces outbound messages. Per-customer isolated queues with HIGH/MEDIUM/LOW lanes.
Rate Enforcement: Distributes load across channels while respecting Meta's 1,000 RPS ceiling per number.
Webhook Processing: Ingests status callbacks (sent, delivered, read) and routes them to customers.
State Store (Cloud Spanner): Message state, delivery status, and per-customer queue metadata, all backed by Spanner. Strong consistency across regions without per-channel MySQL sharding overhead.
routed per channel ↓↑ status webhooks per message
I OWNED THIS TOO
WhatsApp On-Premise Clusters
6 GKE clusters
100,000 channels
300,000 containers
100,000 Cloud SQL instances
↓↑ Meta Network

The Scale Challenge: 3,000 RPS Into a 1,000 RPS Ceiling

Meta enforces a hard rate limit per phone number. Large customers like Shein blew past it on day one.

The inbound pressure

Shein runs a flash sale campaign: 3,000 messages per second into the conversations layer. All valid, all expected, all must be delivered.
3,000 RPS inbound

Meta's hard ceiling

Meta accepts a maximum of 1,000 messages per second per phone number. Non-negotiable. Exceed it and messages are rejected, not queued.
1,000 RPS ceiling per number

The webhook multiplier

Every message generates 3 status callbacks from Meta: sent, delivered, read. 3,000 RPS outbound becomes 9,000 RPS of webhooks to process simultaneously.
9,000 RPS webhook storm
The routing problem:
At 3,000 RPS and a 1,000 RPS ceiling, Shein needs at least 3 phone numbers active simultaneously. The conversations layer must distribute load across channels while tracking each number's live RPS budget.
Peak load on conversations layer:
3,000 RPS outbound + 9,000 RPS webhooks = 12,000 RPS total through a single service, with millisecond internal latency targets.
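The routing problem above can be sketched as a per-number budget that refills every second. This is a minimal illustration, not Bird's implementation: `NumberBudget`, `route`, and `simulate_one_second` are hypothetical names, and the once-per-second refill is a simplifying assumption.

```python
import math
from dataclasses import dataclass
from typing import List, Optional, Tuple

META_CEILING_RPS = 1_000  # per-number ceiling quoted on this slide


def min_numbers_needed(offered_rps: int, ceiling_rps: int = META_CEILING_RPS) -> int:
    """Smallest number of phone numbers that can absorb the offered load."""
    return math.ceil(offered_rps / ceiling_rps)


@dataclass
class NumberBudget:
    """Hypothetical per-number budget, reset to the ceiling once per second."""
    phone_number: str
    capacity: int = META_CEILING_RPS
    tokens: int = META_CEILING_RPS

    def refill(self) -> None:
        self.tokens = self.capacity


def route(budgets: List[NumberBudget]) -> Optional[str]:
    """Pick the number with the most budget left; None means 'hold in queue'."""
    best = max(budgets, key=lambda b: b.tokens)
    if best.tokens == 0:
        return None
    best.tokens -= 1
    return best.phone_number


def simulate_one_second(offered: int, budgets: List[NumberBudget]) -> Tuple[int, int]:
    """Return (sent, held) for one second of offered traffic."""
    for b in budgets:
        b.refill()
    sent = sum(1 for _ in range(offered) if route(budgets) is not None)
    return sent, offered - sent
```

With three numbers the full 3,000 RPS fits inside the budgets; with two, 1,000 messages per second have to wait in the queue rather than be rejected by Meta.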

What Broke First: The Queue and the Noisy Neighbour

The queue grew because Meta drained slowly. One customer's campaign made everyone else wait.

The problem
1. Shein sends 3,000 RPS into a single shared queue
2. Meta drains at 1,000 RPS, so the queue grows by 2,000 messages per second
3. A 10-minute Shein campaign sends 1.8M messages; about 1.2M of them are still queued when it ends
4. Every other customer's messages, including OTP logins, sit behind Shein's backlog for minutes
The business impact: A user trying to log in via OTP gets their code 4 minutes late. That's an authentication failure, a support ticket, and a customer trust problem.
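The queue arithmetic behind this slide can be checked in a few lines. This is a back-of-the-envelope sketch that assumes a single FIFO queue and constant rates. The function names are illustrative.

```python
INBOUND_RPS = 3_000  # Shein campaign rate from this slide
DRAIN_RPS = 1_000    # Meta's per-number ceiling


def total_sent(t_seconds: int) -> int:
    """Messages the campaign has pushed into the queue after t seconds."""
    return INBOUND_RPS * t_seconds


def backlog(t_seconds: int) -> int:
    """Messages still waiting after t seconds of campaign traffic."""
    return max(0, (INBOUND_RPS - DRAIN_RPS) * t_seconds)


def fifo_wait_seconds(arrival_t: int) -> float:
    """Wait for a message that joins the back of the shared queue at arrival_t."""
    return backlog(arrival_t) / DRAIN_RPS
```

Under these assumptions a 10-minute campaign sends 1.8M messages and leaves 1.2M queued at the end. An OTP that arrives two minutes in waits about four minutes, which is consistent with the delay described above.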
The optimization journey
Phase 1: Vertical scaling. Scaled up queue processor capacity. The queue drained faster, but fairness was unchanged: Shein still starved everyone else, just for slightly less time. Hit a ceiling quickly.
Phase 2: Priority-based queuing. Three lanes: HIGH (OTP), MEDIUM (service messages), LOW (marketing campaigns). OTP is now guaranteed fast delivery regardless of campaign volume. Fixed the critical path. Still noisy across customers within each lane.
Phase 3: Isolated customer queues. Each customer gets their own queue per priority lane. Shein's backlog stays in Shein's lane. No other customer is affected. Per-customer rate limiting enforces Meta's ceiling independently.
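A minimal sketch of the Phase 3 idea follows: one queue per customer per lane, drained highest lane first and round-robin across customers within a lane. The class and method names are hypothetical, and round-robin is an assumed fairness policy, not necessarily the one used in production.

```python
from collections import deque
from enum import IntEnum
from typing import Deque, Dict, Optional, Tuple


class Lane(IntEnum):
    HIGH = 0    # OTP
    MEDIUM = 1  # service messages
    LOW = 2     # marketing campaigns


class IsolatedQueues:
    def __init__(self) -> None:
        self._queues: Dict[Tuple[str, Lane], Deque[str]] = {}
        self._customers: Deque[str] = deque()  # round-robin order

    def enqueue(self, customer: str, lane: Lane, message: str) -> None:
        if customer not in self._customers:
            self._customers.append(customer)
        self._queues.setdefault((customer, lane), deque()).append(message)

    def dequeue(self) -> Optional[Tuple[str, Lane, str]]:
        """Highest non-empty lane first; within a lane, rotate across customers."""
        for lane in Lane:
            for _ in range(len(self._customers)):
                customer = self._customers[0]
                self._customers.rotate(-1)  # next call starts with the next customer
                queue = self._queues.get((customer, lane))
                if queue:
                    return customer, lane, queue.popleft()
        return None
```

Because each customer has its own queue, a large campaign can only delay that customer's own LOW-lane traffic. Another customer's OTP or campaign message is served on the next rotation.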

The Atom: One Channel = One WABA Phone Number

WhatsApp's on-premise container imposes a hard 1:1:1 binding between a phone number, a database, and a volume

ONE CHANNEL (e.g. +31 6 1234 5678)
GKE Pod: WhatsApp On-Premise Container Stack
coreapp
500m CPU · 2 GB
message bus
master
500m CPU · 2 GB
1 pod always
web ×N
500m CPU · 2 GB
scales with RPS
Persistent Storage
Persistent Volume
100 GB SSD
media & message store
Cloud SQL Instance
dedicated per channel
no shared MySQL possible
Hard constraint #1
WhatsApp on-premise is not multi-tenant.
One MySQL instance per channel. Non-negotiable.
Hard constraint #2
Max 64 shards per account.
Horizontal scale is bounded per channel.
Design choice
Most channels run <5 RPS.
500m CPU · 2 GB RAM was sufficient.

Growth Was Linear. The Infrastructure Was Exponential.

Every new phone number a customer onboards adds one more full copy of the entire stack

One customer with 50 channels
→ 150 containers (3 per channel)
→ 50 Cloud SQL instances
→ 5 TB provisioned storage
At 100,000 channels (peak)
→ 300,000 containers running simultaneously
→ 100,000 Cloud SQL instances across GCP projects
→ ~10 petabytes provisioned storage
→ 6 GKE clusters · 100+ GCP projects
Adding one customer was a product decision. But every phone number they onboarded was an infrastructure provisioning event, at full-stack cost.
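The per-channel multiplier can be written down directly. The sketch below uses only the ratios quoted on these slides (3 containers, 1 Cloud SQL instance, and one 100 GB volume per channel). `Footprint` and `footprint` are illustrative names, and storage uses decimal units (1 TB = 1,000 GB).

```python
from dataclasses import dataclass

CONTAINERS_PER_CHANNEL = 3  # coreapp + master + web
VOLUME_GB_PER_CHANNEL = 100


@dataclass(frozen=True)
class Footprint:
    containers: int
    cloud_sql_instances: int
    storage_tb: float


def footprint(channels: int) -> Footprint:
    """Infrastructure implied by a given number of channels."""
    return Footprint(
        containers=channels * CONTAINERS_PER_CHANNEL,
        cloud_sql_instances=channels,  # hard 1:1 binding
        storage_tb=channels * VOLUME_GB_PER_CHANNEL / 1_000,
    )
```

`footprint(50)` gives the single-customer figures above, and `footprint(100_000)` gives 300,000 containers, 100,000 instances, and 10,000 TB (about 10 PB).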

Current Architecture: 6 Clusters, One Rule

Rule: One channel is assigned to exactly one cluster and never moves. Customers can span multiple clusters.

Cluster 1 · eu-west1
~16,700 channels
~50,000 containers
GCP projects: Cloud-SQL-01 → 17
channels from many customers
Cluster 2 · eu-west4
~16,700 channels
~50,000 containers
GCP projects: Cloud-SQL-18 → 34
channels from many customers
Cluster 3 · us-east1
~16,700 channels
~50,000 containers
GCP projects: Cloud-SQL-35 → 51
channels from many customers
Cluster 4 · us-west2
~16,700 channels
~50,000 containers
GCP projects: Cloud-SQL-52 → 68
channels from many customers
Cluster 5 · ap-southeast1
~16,700 channels
~50,000 containers
GCP projects: Cloud-SQL-69 → 85
channels from many customers
Cluster 6 · eu-north1
~16,500 channels
~49,500 containers
GCP projects: Cloud-SQL-86 → 100
channels from many customers
Why pin each channel to one cluster?
A channel's state (MySQL, volume, pod) is fully contained in one cluster. No cross-cluster state to synchronise. Provisioning and deprovisioning are local operations.
The cost of this rule:
A customer with 50 channels could span 3 different clusters. Debugging their incident meant hopping clusters. Support tooling had to federate across all 6.
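The rule and its cost can be illustrated with a small directory that pins a channel to a cluster on first assignment and answers the support question "which clusters does this customer touch?". `ChannelDirectory` is a hypothetical name, and least-loaded placement is an assumption made only for the example.

```python
from collections import Counter
from typing import Dict, List, Set


class ChannelDirectory:
    def __init__(self, clusters: List[str]) -> None:
        self._clusters = list(clusters)
        self._cluster_of: Dict[str, str] = {}   # channel -> cluster
        self._customer_of: Dict[str, str] = {}  # channel -> customer

    def assign(self, channel: str, customer: str) -> str:
        """Pin the channel to a cluster; later calls never move it."""
        if channel in self._cluster_of:
            return self._cluster_of[channel]
        load = Counter(self._cluster_of.values())
        cluster = min(self._clusters, key=lambda c: load[c])  # assumed policy
        self._cluster_of[channel] = cluster
        self._customer_of[channel] = customer
        return cluster

    def clusters_for_customer(self, customer: str) -> Set[str]:
        """Clusters a support engineer must look at for this customer."""
        return {
            self._cluster_of[ch]
            for ch, owner in self._customer_of.items()
            if owner == customer
        }
```

A lookup like `clusters_for_customer` is the kind of federation the support tooling needed, since one customer's channels end up spread across clusters.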

The Three Hard Walls

Each wall was a GCP or Kubernetes hard limit. Not something you tune, something you architect around

1,000
Nodes per GKE Cluster
GKE default hard limit. At ~500m CPU per container (three containers per channel) and ~16 channels/node, this capped each cluster at roughly 16,000 channels before scheduling degraded.
Hit first → spawned Cluster 2
Trigger: node pressure alerts, evictions
100
Cloud SQL Instances per GCP Project
Default GCP quota. With one dedicated instance per channel, each cluster needed ~16,700 Cloud SQL instances at scale. Quota raise requests were slow. Multi-project became the only path.
Chronic pain → 100+ GCP projects
Trigger: provisioning failures for new channels
128
Persistent Disk Attachments per Node
GCE limit on how many PDs can attach to one VM. With 100 GB volumes per channel, scheduling had to stay aware of disk attachment counts, not just CPU/RAM. Added a constraint to the bin-packing problem.
Ongoing constraint → scheduling complexity
Trigger: pending pods with "too many volume attachments"
The compounding problem: Hitting Wall 1 forced multi-cluster. Multi-cluster increased the number of Cloud SQL instances per cluster boundary, accelerating Wall 2. All three walls were hit within 18 months of reaching scale.
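The third wall turns placement into a three-dimensional fit check: CPU, memory, and disk attachments. A minimal sketch, assuming one volume per channel and the 128-attachment limit quoted above; `Node` and `fits` are hypothetical names, not the Kubernetes scheduler API.

```python
from dataclasses import dataclass


@dataclass
class Node:
    cpu_millis_free: int
    mem_gb_free: int
    disks_attached: int
    max_disks: int = 128  # PD attachment limit from this slide


def fits(node: Node, cpu_millis: int, mem_gb: int, volumes: int = 1) -> bool:
    """A channel fits only if CPU, RAM, and disk attachments all have room."""
    return (
        node.cpu_millis_free >= cpu_millis
        and node.mem_gb_free >= mem_gb
        and node.disks_attached + volumes <= node.max_disks
    )
```

A node with plenty of CPU and RAM still rejects a channel once its attachment count is exhausted, which is the "pending pods" symptom listed as the trigger.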

The Operational Tax: Linear Complexity, Exponential Toil

Every one of these needed to be automated and kept working across 6 clusters simultaneously

🚀 Channel provisioning: New channel = create namespace, deploy 3 containers, provision Cloud SQL, attach PV, register with control plane. Fully automated, but any step failing left the channel in a broken half-live state.
🔍 Debugging a broken channel: First: which of 6 clusters? Then: which namespace? Then: is the PV mounted? Is Cloud SQL reachable? Which GCP project is its instance in? A support ticket was a 30-min cluster-hopping exercise without tooling.
🔄 WhatsApp version upgrades: Meta releases new on-premise versions. Rolling 100k channels across 6 clusters without downtime required sequenced rollouts, rollback windows, and health checks per cluster, for every release.
💾 Cloud SQL backups: 100k instances × automated backup windows = significant GCP billing and scheduling. Retention policies, snapshot audits, and restoration testing were recurring operational overhead.
⬆️ Cluster Kubernetes upgrades: GKE version upgrades had to be coordinated across all 6 clusters. A bad upgrade on Cluster 1 would pause work on 2–6. Control plane upgrades required maintenance windows.
📊 Quota management: ~100 GCP projects required continuous quota monitoring. A new Cloud SQL project had to be pre-created before the current one hit 100 instances. Quota request lead times were 2–5 business days.
The engineering cost didn't scale with users. It scaled with channels. Adding 10,000 new channels meant 10,000 new things that could break at 3am.
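One common way to avoid the half-live state described under channel provisioning is to pair every step with an undo action and unwind on failure. The sketch below shows that pattern in isolation. It is not a description of Bird's provisioning code, and the step names and the simulated quota failure are made up for the demo.

```python
from typing import Callable, List, Tuple

# (name, do, undo): every provisioning step carries its own compensation.
Step = Tuple[str, Callable[[], None], Callable[[], None]]


def provision_channel(steps: List[Step]) -> List[str]:
    """Run steps in order; on failure, undo completed steps in reverse."""
    completed: List[Step] = []
    try:
        for step in steps:
            _name, do, _undo = step
            do()
            completed.append(step)
    except Exception:
        for _name, _do, undo in reversed(completed):
            undo()
        raise
    return [name for name, _, _ in completed]


def demo() -> List[str]:
    log: List[str] = []

    def ok(name: str) -> Step:
        return (name,
                lambda: log.append(f"create {name}"),
                lambda: log.append(f"delete {name}"))

    def fail_cloud_sql() -> None:
        raise RuntimeError("simulated quota failure")

    steps = [
        ok("namespace"),
        ok("containers"),
        ("cloud-sql", fail_cloud_sql, lambda: log.append("delete cloud-sql")),
        ok("persistent-volume"),
    ]
    try:
        provision_channel(steps)
    except RuntimeError:
        log.append("provisioning failed cleanly")
    return log
```

In the demo the Cloud SQL step fails, so the containers and namespace are removed again and nothing half-provisioned is left behind.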

Strategic Trade-offs: What We Chose and What We Sacrificed

Every architectural decision was a deliberate bet. These are the three that mattered most.

DECISION: Single shared queue for all customers initially
One queue, simpler to build and operate. Priority lanes added later. Customer isolation added last after the noisy neighbour incident.
GAINED:
Fast to ship. No per-customer queue provisioning overhead. Simple mental model early on.
SACRIFICED:
Fairness. One large customer (Shein) made the entire platform feel slow for everyone else. OTP messages waited behind marketing campaigns.
DECISION: 500m CPU / 2 GB RAM minimum per container
Allocate the minimum viable resources. Most channels run <5 RPS and rarely spike.
GAINED:
Dense node packing. Cost-efficient at scale. High channel-per-node density.
SACRIFICED:
Burst headroom. High-traffic campaigns caused OOM kills or throttling until manual HPA tuning.
DECISION: Cloud SQL over self-managed MySQL
Use GCP managed MySQL. No DBAs needed, automated backups, replication handled by GCP.
GAINED:
Zero MySQL ops overhead at launch. Automated failover. Encryption at rest out of the box.
SACRIFICED:
Hit Cloud SQL instance quota (100/project) with no good workaround. Cost at 100k instances was significant vs. self-managed.

Two Futures: The Fork in the Road

When on-premise hit its architectural ceiling, Bird had two credible paths forward

PATH A: Not Taken
MySQL Proxy Layer + K8s Operator
Build a MySQL proxy (ProxySQL / Vitess) that makes a shared MySQL cluster appear as isolated instances to each WA container: one schema per channel, one connection pool per channel.

A Kubernetes Operator handles the full channel lifecycle: provision, upgrade, migrate, deprovision, automatically, across all clusters.

This is what would have enabled Bird to pitch itself to Meta as a WhatsApp hosting partner. We had the operational depth. The missing piece was the proxy abstraction.
Outcome: Keep on-premise, reduce cost 10×, pitch Meta for partnership
PATH B: Taken ✓
WhatsApp Cloud API Migration
Meta launched the WhatsApp Cloud API, a hosted version that eliminates the on-premise container entirely.

Bird migrated channels to the Cloud API, replacing the entire on-premise stack with API calls. No containers. No volumes. No Cloud SQL. No cluster management.

The operational tax evaporates. Engineering capacity shifts from keeping the lights on to building product features.
Outcome: Zero infra burden, faster feature velocity, Meta owns availability
Path A was the right technical bet. Path B was the right strategic bet. Meta building Cloud API meant the proxy layer would have competed with the platform vendor, a losing position long-term.

Future Architecture: What the Migration Actually Looks Like

Before and after: the same business capability, radically different infrastructure footprint

Before: On-Premise (per channel)
Bird Platform
↓ deploys & manages
WA Container Stack
coreapp + master + web
↓ connects to
Cloud SQL Instance
dedicated per channel
↓ mounts
Persistent Volume
100 GB SSD
↓ tunnels to
Meta Infrastructure
WhatsApp network
300,000 containers · 100k Cloud SQL · ~10 PB storage · 6 clusters · 100+ GCP projects
After: WhatsApp Cloud API
Bird Platform
↓ HTTPS API call
WhatsApp Cloud API
Meta-hosted · globally available
↓ delivers to
End Users
0 containers · 0 Cloud SQL instances · 0 volumes · 0 clusters to manage
Trade-off accepted: Availability SLA now owned by Meta, not Bird. Outage = vendor dependency. Rate limits enforced per phone number, not per cluster.
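For contrast with the per-channel stack, the "HTTPS API call" in the after-picture is roughly one request per message. The sketch below builds, but does not send, a text-message request in the shape of Meta's Cloud API messages endpoint. The API version string is an assumption that changes over time, and the IDs and token in any call are placeholders.

```python
import json
import urllib.request

GRAPH_API_VERSION = "v19.0"  # assumption: use whichever version is current


def build_text_message_request(
    phone_number_id: str, access_token: str, to: str, body: str
) -> urllib.request.Request:
    """Build (not send) a Cloud API request for a plain text message."""
    url = (
        f"https://graph.facebook.com/{GRAPH_API_VERSION}/"
        f"{phone_number_id}/messages"
    )
    payload = {
        "messaging_product": "whatsapp",
        "to": to,
        "type": "text",
        "text": {"body": body},
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {access_token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Passing the result to `urllib.request.urlopen` would send it. Everything that used to be a pod, a volume, and a database is now on Meta's side of this call.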

The Road Not Taken: MySQL Proxy Layer

How the architecture would have looked if Bird had solved the constraint instead of escaping it

Proposed: Proxy-mediated MySQL per cluster
WA Container (channel X)
connects to mysql://proxy:3306/channel_x
ProxySQL / Vitess Proxy (per cluster)
routes to schema channel_x · enforces isolation · connection pooling
Shared MySQL cluster (1 per GCP project)
1,000 schemas per instance · schema-per-channel isolation
Result: 100,000 channels → ~100 MySQL instances (vs. 100,000 Cloud SQL instances)
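The consolidation arithmetic and the channel-to-schema mapping can be sketched as follows. The 1,000 schemas per instance figure and the `mysql://proxy:3306/channel_x` shape come from this slide. The sequential placement rule and the function names are assumptions for illustration.

```python
import math
from typing import Tuple

SCHEMAS_PER_INSTANCE = 1_000  # figure from this slide


def instances_needed(channels: int) -> int:
    """Shared MySQL instances required at one schema per channel."""
    return math.ceil(channels / SCHEMAS_PER_INSTANCE)


def placement(channel_index: int) -> Tuple[int, str]:
    """Assumed rule: fill instances in order; returns (instance, schema)."""
    return channel_index // SCHEMAS_PER_INSTANCE, f"channel_{channel_index}"


def proxy_dsn(channel_index: int, proxy_host: str = "proxy", port: int = 3306) -> str:
    """DSN the WA container would be given; the proxy hides the real instance."""
    _instance, schema = placement(channel_index)
    return f"mysql://{proxy_host}:{port}/{schema}"
```

The container only ever sees the proxy DSN, so which backing instance holds its schema becomes a routing-table detail instead of a dedicated Cloud SQL instance.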
K8s Operator: full lifecycle automation
Provision: Operator watches ChannelCR. On creation: creates namespace, deploys pod stack, creates schema on shared MySQL via proxy, mounts volume. Single CRD triggers the entire workflow.
Upgrade: Operator rolling-restarts channel pods with new WA image. Tracks upgrade state per channel. Supports paused rollouts. No human intervention per cluster.
Migrate / Deprovision: On customer churn, operator drains channel, snapshots schema, releases PV, removes from proxy routing table. Fully reversible. Audit trail in CRD status.
Why we didn't build it: WhatsApp Cloud API made it moot. But the Operator pattern was the missing abstraction: we were doing this manually across 6 clusters.
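The core of the Operator pattern is an idempotent reconcile function that compares desired state with observed state and acts on the difference. The sketch below models that loop in plain Python with an in-memory `FakeCluster`. It does not use the Kubernetes API, and every name in it is hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ChannelCR:
    """Desired state for one channel (stand-in for a custom resource)."""
    name: str
    image: str
    deleted: bool = False
    status: Dict[str, str] = field(default_factory=dict)


@dataclass
class FakeCluster:
    """Observed state: which channels run which image, plus an action log."""
    running: Dict[str, str] = field(default_factory=dict)
    actions: List[str] = field(default_factory=list)


def reconcile(cr: ChannelCR, cluster: FakeCluster) -> str:
    """Move the cluster one step towards the desired state; safe to repeat."""
    if cr.deleted:
        if cr.name in cluster.running:
            cluster.actions += [f"drain {cr.name}",
                                f"snapshot schema {cr.name}",
                                f"release pv {cr.name}"]
            del cluster.running[cr.name]
        phase = "Deprovisioned"
    elif cr.name not in cluster.running:
        cluster.actions += [f"create namespace {cr.name}",
                            f"deploy pods {cr.name}",
                            f"create schema {cr.name}",
                            f"mount volume {cr.name}"]
        cluster.running[cr.name] = cr.image
        phase = "Ready"
    else:
        if cluster.running[cr.name] != cr.image:
            cluster.actions.append(f"rolling restart {cr.name} -> {cr.image}")
            cluster.running[cr.name] = cr.image
        phase = "Ready"
    cr.status["phase"] = phase  # audit trail lives on the resource
    return phase
```

Calling `reconcile` twice with no change in desired state performs no extra actions, which is the property that makes it safe to run continuously across clusters.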

Staff-Level Hindsight: What I'd Architect Differently

Three things I know now that would have changed Day 1 decisions, and what Bird uniquely learned

1. What we did: Started with a single shared queue. When it became a noisy neighbour problem, we added priority lanes reactively. Customer isolation came last, after the production incident.
   What I'd do today: Design for fairness from day one. A shared queue is fast to build but a single large customer will always find the ceiling. Per-customer isolated queues with priority lanes is the correct default for a multi-tenant messaging platform.
2. What we did: Discovered Cloud SQL quota limits and GKE node limits by hitting them in production. Each wall was a reactive response: new GCP project, new cluster, quota raise request under pressure.
   What I'd do today: Model quota ceilings into capacity planning before onboarding the 1,000th channel, not the 95,000th. Channels → containers → nodes → Cloud SQL → GCP projects. Automated alerts at 70% of each ceiling.
3. What we did: Processed outbound messages and inbound webhooks in the same pipeline. The 3x webhook amplification (9,000 RPS) competed with outbound sends for the same resources during peak campaigns.
   What I'd do today: Separate the write path (outbound sends) from the read path (webhook ingestion) from the start. The webhook storm at peak is predictable and should never compete with the send pipeline for capacity.
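The "alerts at 70% of each ceiling" idea from the second point can be sketched as a simple check against the three limits named earlier in the deck. The metric names are hypothetical, and a real setup would read usage from monitoring rather than a dict.

```python
from typing import Dict, List

ALERT_PERCENT = 70  # threshold suggested above

# Ceilings quoted in "The Three Hard Walls".
LIMITS: Dict[str, int] = {
    "gke_nodes_per_cluster": 1_000,
    "cloud_sql_instances_per_project": 100,
    "pd_attachments_per_node": 128,
}


def ceilings_at_risk(usage: Dict[str, int]) -> List[str]:
    """Names of ceilings whose usage is at or above the alert threshold."""
    return sorted(
        name
        for name, used in usage.items()
        if used * 100 >= LIMITS[name] * ALERT_PERCENT
    )
```

Run against each cluster, project, and node, a check like this flags a wall while there is still time to open a new project or request quota.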
The unique asset Bird built was not the infra. It was the operational knowledge: 100k channels, 12,000 RPS peak, every failure mode from noisy neighbour to quota wall documented in production. That is what a Meta partnership pitch would have been worth.