
Message Queues & Event-Driven Architecture

A message queue is a buffer that decouples the producer of a message from its consumer. Instead of Service A calling Service B directly and waiting for a response, A drops a message into a queue and continues working. B processes it at its own pace.

This simple shift — from synchronous to asynchronous — unlocks reliability, scalability, and flexibility patterns that are impossible in direct request-response architectures.


User Request → [Order Service] → [Payment Service] → [Inventory Service] → [Email Service]

If any service in this chain is slow or down, the entire request fails.

In a synchronous chain:

  • Availability multiplies downward: If each service is 99.9% available, a 4-service chain is ~99.6% available
  • Latency adds up: Each hop adds latency; the user waits for the entire chain
  • Cascading failures: One slow service blocks all upstream services
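The arithmetic behind that availability figure is worth checking directly:

```python
# Each hop must succeed, so availabilities multiply in a synchronous chain.
per_service = 0.999            # 99.9% availability per service
chain = per_service ** 4       # four services in series
print(f"{chain:.4%}")          # 99.6006% -- roughly 1 request in 250 now fails
```

Each extra dependency compounds the loss, which is why long synchronous chains are fragile even when every individual service looks healthy.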
User Request → [Order Service] → [Queue] → [Payment Consumer]
                                         → [Inventory Consumer]
                                         → [Email Consumer]
  • Order Service succeeds immediately after enqueuing — it doesn’t wait for downstream services
  • Payment, Inventory, and Email process the message independently and at their own pace
  • If Email is down, messages queue up; it processes them when it recovers — no messages lost
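The asynchronous flow above can be sketched with Python's standard-library queue standing in for a real broker (the `place_order` and `email_consumer` names are illustrative):

```python
import json
import queue
import time

order_queue: "queue.Queue[str]" = queue.Queue()

def place_order(order_id: str) -> dict:
    """Order Service: enqueue and return immediately -- no waiting on downstream."""
    event = json.dumps({"type": "OrderPlaced", "order_id": order_id, "ts": time.time()})
    order_queue.put(event)                   # hand off to the queue
    return {"status": "accepted", "order_id": order_id}

def email_consumer() -> None:
    """A downstream consumer drains the queue at its own pace, even after downtime."""
    while not order_queue.empty():
        event = json.loads(order_queue.get())
        print("sending confirmation for", event["order_id"])

resp = place_order("o-42")     # succeeds even if the email consumer is down
email_consumer()               # later, the consumer catches up; nothing was lost
```

The producer's latency is now just the enqueue, and a consumer outage only delays delivery instead of failing the request.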

Point-to-point (queue model): one producer, one consumer. Each message is delivered to exactly one consumer instance.

[Producer] → [Queue] → [Consumer A]
(Consumer B never sees this message)
  • Use for: task distribution, job queues, load-leveling
  • Each message represents a unit of work to be done exactly once
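A minimal sketch of point-to-point semantics using Python's thread-safe `queue.Queue`: two competing workers, and each job is handled by exactly one of them:

```python
import queue
import threading

jobs: "queue.Queue[int]" = queue.Queue()
processed: dict[int, str] = {}      # job -> which worker handled it
lock = threading.Lock()

def worker(name: str) -> None:
    while True:
        try:
            job = jobs.get_nowait()  # each job is taken by exactly one worker
        except queue.Empty:
            return
        with lock:
            processed[job] = name

for i in range(10):
    jobs.put(i)

threads = [threading.Thread(target=worker, args=(n,)) for n in ("A", "B")]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(len(processed), "jobs processed, each by exactly one worker")
```

Adding more workers raises throughput without changing the producer, which is the load-leveling property the pattern is used for.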

Publish-subscribe (pub/sub): one producer, many consumers. Each subscriber gets its own copy of every message.

                     → [Consumer: Analytics]
[Producer] → [Topic] → [Consumer: Notification Service]
                     → [Consumer: Audit Logger]
  • Use for: event broadcasting, fan-out, event-driven architectures
  • Publishers don’t know who’s listening; adding a new subscriber requires no change to the publisher
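The fan-out behavior can be modeled with a toy in-memory `Topic` class (illustrative only, not any real client library):

```python
from collections import defaultdict
from typing import Callable

class Topic:
    """Minimal in-memory pub/sub: every subscriber gets its own copy of each message."""
    def __init__(self) -> None:
        self.subscribers: list[Callable[[dict], None]] = []

    def subscribe(self, handler: Callable[[dict], None]) -> None:
        self.subscribers.append(handler)   # the publisher never learns who this is

    def publish(self, event: dict) -> None:
        for handler in self.subscribers:
            handler(dict(event))           # independent copy per subscriber

received = defaultdict(list)
orders = Topic()
orders.subscribe(lambda e: received["analytics"].append(e))
orders.subscribe(lambda e: received["notifications"].append(e))
orders.publish({"type": "OrderPlaced", "order_id": "o-1"})

# Adding the audit logger later requires no change to the publisher.
orders.subscribe(lambda e: received["audit"].append(e))
orders.publish({"type": "OrderPlaced", "order_id": "o-2"})
```

After the second publish, analytics and notifications have seen two events each, while the late-arriving audit logger has seen only the one published after it subscribed.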

Kafka and RabbitMQ are the two dominant open-source messaging systems, with fundamentally different designs.

Kafka is a distributed commit log. Messages are written to an immutable, ordered log (a topic), partitioned for parallelism. Consumers read from the log at their own offset and can replay messages.

Architecture:

  • Topic: A named log stream
  • Partition: Ordered, immutable sub-log within a topic. The unit of parallelism.
  • Consumer Group: Multiple consumers sharing a topic, each partition consumed by one member at a time
  • Offset: The position of a consumer in the log. Consumer-managed; can be replayed.
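A toy model of a single partition makes the offset and replay semantics concrete (this illustrates the log abstraction, not the Kafka client API):

```python
class PartitionLog:
    """Toy model of one Kafka partition: an append-only log read by offset."""
    def __init__(self) -> None:
        self.log: list[str] = []            # ordered, immutable messages
        self.offsets: dict[str, int] = {}   # consumer-group positions

    def append(self, msg: str) -> int:
        self.log.append(msg)
        return len(self.log) - 1            # offset of the new record

    def poll(self, group: str, max_records: int = 10) -> list[str]:
        start = self.offsets.get(group, 0)
        batch = self.log[start:start + max_records]
        self.offsets[group] = start + len(batch)  # advance (commit) the offset
        return batch

    def seek(self, group: str, offset: int) -> None:
        self.offsets[group] = offset        # rewind to replay history

p = PartitionLog()
for msg in ("a", "b", "c"):
    p.append(msg)

print(p.poll("analytics"))     # ['a', 'b', 'c']
p.seek("analytics", 0)         # replay from the beginning
print(p.poll("analytics"))     # ['a', 'b', 'c'] again -- the log retained them
```

Note that polling does not delete anything: a second consumer group starting at offset 0 would see the full history, which is what makes replay and multiple independent readers possible.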

Strengths:

  • Extremely high throughput (millions of messages/sec)
  • Message retention: messages stay on disk for a configurable period (hours to weeks) — consumers can replay history
  • Naturally ordered within a partition
  • Excellent for event sourcing, audit logs, stream processing (Kafka Streams, Flink)

Weaknesses:

  • Operationally complex (Zookeeper / KRaft, replication, partition management)
  • No native routing / header-based filtering
  • Not designed for individual message acknowledgement / dead-lettering at a fine-grained level

Use Kafka for: High-volume event streams, audit logs, data pipelines, analytics, inter-service event buses.

RabbitMQ is a traditional message broker. Messages are routed through exchanges to queues based on routing rules. Once a consumer acknowledges a message, it’s gone.

Architecture:

  • Exchange: Receives messages from producers and routes them to queues based on type (direct, topic, fanout, headers)
  • Queue: Holds messages waiting for a consumer
  • Binding: Rules connecting an exchange to a queue
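A simplified sketch of exchange/binding routing (direct key matching plus a fanout flag; real RabbitMQ also supports topic patterns and header matching):

```python
from collections import defaultdict

class Broker:
    """Toy model of RabbitMQ routing: exchanges route to queues via bindings."""
    def __init__(self) -> None:
        self.queues: dict[str, list[str]] = defaultdict(list)
        # exchange -> list of (routing key, queue name) bindings
        self.bindings: dict[str, list[tuple[str, str]]] = defaultdict(list)

    def bind(self, exchange: str, routing_key: str, queue: str) -> None:
        self.bindings[exchange].append((routing_key, queue))

    def publish(self, exchange: str, routing_key: str, msg: str,
                fanout: bool = False) -> None:
        for key, queue_name in self.bindings[exchange]:
            if fanout or key == routing_key:   # direct match, or copy to every binding
                self.queues[queue_name].append(msg)

b = Broker()
b.bind("payments", "payment.failed", "alerts")
b.bind("payments", "payment.ok", "ledger")

b.publish("payments", "payment.failed", "card declined")
print(b.queues["alerts"])   # ['card declined'] -- only the matching binding fired
print(b.queues["ledger"])   # []
```

The producer only names an exchange and a routing key; which queues receive the message is entirely a function of the bindings, so routing can be rewired without touching producers.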

Strengths:

  • Rich routing (header-based, pattern-based, fanout)
  • Per-message acknowledgement and dead-letter queues
  • Simpler operational model than Kafka for low-to-medium throughput
  • Priority queues, delayed messages, TTLs on messages

Weaknesses:

  • No built-in message replay (messages are deleted after acknowledgement)
  • Throughput ceiling lower than Kafka

Use RabbitMQ for: Task queues, workflow orchestration, routing-heavy patterns, anything requiring per-message acknowledgement and dead-lettering.

Aspect             Kafka                            RabbitMQ
Model              Distributed log (pull)           Message broker (push)
Message retention  Days to weeks (replay possible)  Until acknowledged (no replay)
Throughput         Very high (millions/sec)         High (tens of thousands/sec)
Routing            Topic + partition key            Exchange types (direct, topic, fanout)
Ordering           Per partition                    Per queue (single consumer)
Ops complexity     High                             Low-Medium
Best for           Event streams, analytics         Task queues, microservice messaging
The major clouds also offer fully managed messaging services:

Service             Provider  Based On
Amazon SQS          AWS       Custom (queue semantics)
Amazon SNS          AWS       Pub/Sub
Amazon EventBridge  AWS       Event bus with schema registry
Azure Service Bus   Azure     Enterprise messaging
Google Pub/Sub      GCP       Global pub/sub

One of the most critical properties of a messaging system is what it promises about message delivery.

Guarantee      Meaning                                Risk
At-most-once   Delivered 0 or 1 times; may be lost    Data loss if consumer crashes before processing
At-least-once  Delivered 1+ times; may be duplicated  Must handle duplicates (idempotency required)
Exactly-once   Delivered exactly once                 Hardest to achieve; high overhead

An idempotent consumer can safely receive and process the same message multiple times:

def handle_payment_event(event: dict):
    payment_id = event["payment_id"]
    # Check: has this payment already been processed?
    if db.exists("SELECT 1 FROM processed_payments WHERE id = ?", payment_id):
        return  # Already processed; silently skip
    # Process the payment
    process_payment(event)
    # Mark as processed
    db.execute("INSERT INTO processed_payments (id) VALUES (?)", payment_id)

The deduplication key (payment_id here) is the message’s idempotency key. This should be a stable, unique identifier generated by the producer.
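One caveat with the check-then-insert pattern above: two consumers handling the same duplicate concurrently can both pass the existence check before either inserts. Letting a unique constraint do the deduplication closes that window. A sketch with SQLite (the `process_payment` side effect is simulated):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE processed_payments (id TEXT PRIMARY KEY)")
calls: list[str] = []

def process_payment(event: dict) -> None:
    calls.append(event["payment_id"])      # stands in for the real side effect

def handle_payment_event(event: dict) -> None:
    try:
        # Claim the idempotency key first; a duplicate violates the PRIMARY KEY.
        db.execute("INSERT INTO processed_payments (id) VALUES (?)",
                   (event["payment_id"],))
    except sqlite3.IntegrityError:
        return                             # duplicate delivery; skip
    process_payment(event)

handle_payment_event({"payment_id": "p-1"})
handle_payment_event({"payment_id": "p-1"})   # redelivered duplicate
print(calls)   # ['p-1'] -- processed exactly once
```

Claiming the key before processing means a crash mid-processing loses that message's work, so production systems typically perform the insert and the side effect inside one transaction.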


A DLQ is a queue that receives messages that couldn’t be processed successfully after N retry attempts. Instead of losing the message or blocking normal flow, it’s parked for inspection and reprocessing.

[Normal Queue] → [Consumer] (fails 3 times)
                      ↓
                   [DLQ] → [Ops team investigates / replays]

Always configure a DLQ in production. Without one, poison messages (messages that always cause processing failures) block queues or cause infinite retry loops.
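A minimal retry-then-DLQ consumer loop, sketched with in-memory queues (the three-attempt limit mirrors the diagram above; `handle` is a stand-in for real processing):

```python
import queue

MAX_ATTEMPTS = 3
main_q: "queue.Queue[dict]" = queue.Queue()
dlq: "queue.Queue[dict]" = queue.Queue()

def handle(msg: dict) -> None:
    if msg.get("poison"):
        raise ValueError("cannot process")   # simulated poison message

def consume_once() -> None:
    msg = main_q.get()
    try:
        handle(msg)
    except Exception:
        msg["attempts"] = msg.get("attempts", 0) + 1
        if msg["attempts"] >= MAX_ATTEMPTS:
            dlq.put(msg)        # park it for inspection; don't block the queue
        else:
            main_q.put(msg)     # requeue for another attempt

main_q.put({"id": 1, "poison": True})
main_q.put({"id": 2})
while not main_q.empty():
    consume_once()

print(dlq.qsize())   # 1 -- the poison message ended up in the DLQ
```

The healthy message flows through unimpeded while the poison message is retried a bounded number of times and then quarantined, instead of looping forever at the head of the queue.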


Choreography: services react to events independently, with no central orchestrator.

[Order Placed Event]
→ Payment Service (subscribes, charges card, emits "Payment Processed")
→ Inventory Service (subscribes, reserves stock, emits "Stock Reserved")
→ Email Service (subscribes to "Payment Processed", sends confirmation)
  • Pros: Loose coupling, services evolve independently
  • Cons: Hard to trace the overall flow; debugging requires aggregating events across services
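Choreography can be sketched as handlers subscribed to event types, each emitting follow-up events (all names are illustrative):

```python
from collections import defaultdict
from typing import Callable

handlers: dict[str, list[Callable[[dict], None]]] = defaultdict(list)
log: list[str] = []

def on(event_type: str):
    """Decorator: subscribe a handler to an event type."""
    def register(fn):
        handlers[event_type].append(fn)
        return fn
    return register

def emit(event_type: str, **data) -> None:
    log.append(event_type)
    for fn in handlers[event_type]:
        fn(data)

@on("OrderPlaced")
def charge_card(data: dict) -> None:
    emit("PaymentProcessed", order_id=data["order_id"])

@on("OrderPlaced")
def reserve_stock(data: dict) -> None:
    emit("StockReserved", order_id=data["order_id"])

@on("PaymentProcessed")
def send_confirmation(data: dict) -> None:
    log.append(f"email:{data['order_id']}")

emit("OrderPlaced", order_id="o-7")
print(log)   # ['OrderPlaced', 'PaymentProcessed', 'email:o-7', 'StockReserved']
```

No function calls another by name: the whole flow emerges from subscriptions, which is exactly why it is flexible to extend and hard to trace end to end.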

Orchestration: a central orchestrator service coordinates the workflow.

[Order Orchestrator]
→ calls Payment Service (waits for result)
→ calls Inventory Service (waits for result)
→ calls Email Service (waits for result)
  • Pros: Clear single source of truth for workflow state; easier to monitor and debug
  • Cons: Orchestrator becomes a coupling point; can become a bottleneck

Use choreography for simple, loosely coupled event flows. Use orchestration for complex, long-running workflows with compensation logic (see Saga pattern below).

For distributed transactions spanning multiple services (each with their own database), the Saga pattern breaks the transaction into a sequence of local transactions, each publishing an event to trigger the next.

If any step fails, compensating transactions roll back the previous steps:

1. Order Service: Create Order → emit "OrderCreated"
2. Payment Service: Charge Card → emit "PaymentProcessed"
3. Inventory: Reserve Stock → emit "StockReserved"
// If step 3 fails:
3'. Inventory: (fails) → emit "StockReservationFailed"
2'. Payment Service: Refund Card → emit "PaymentRefunded" (compensating)
1'. Order Service: Cancel Order → (saga complete)
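The compensation logic can be sketched as a list of (action, compensation) pairs, unwinding completed steps in reverse on failure (a simplification: real sagas persist state and run across service boundaries):

```python
completed: list[str] = []

def create_order(order: dict) -> bool:
    completed.append("create_order")
    return True

def charge_card(order: dict) -> bool:
    completed.append("charge_card")
    return True

def reserve_stock(order: dict) -> bool:
    return order["in_stock"]              # the step that may fail

def cancel_order(order: dict) -> None:
    completed.append("cancel_order")      # compensates create_order

def refund_card(order: dict) -> None:
    completed.append("refund_card")       # compensates charge_card

SAGA = [(create_order, cancel_order),
        (charge_card, refund_card),
        (reserve_stock, None)]            # final step: nothing defined to undo it

def run_saga(steps, order: dict) -> str:
    """Run local transactions in order; on failure, compensate in reverse."""
    done = []
    for action, compensate in steps:
        if action(order):
            done.append(compensate)
        else:
            for comp in reversed(done):   # roll back what already committed
                if comp:
                    comp(order)
            return "rolled back"
    return "committed"

print(run_saga(SAGA, {"in_stock": False}))   # rolled back
# completed: create_order, charge_card, refund_card, cancel_order
```

When stock reservation fails, the refund runs before the order cancellation, mirroring steps 2' and 1' above: compensations always run in reverse order of the original transactions.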

Backpressure is the mechanism by which a consumer signals to a producer to slow down when it’s overwhelmed.

Without backpressure, a fast producer can overwhelm a slow consumer, causing queue depth to grow unboundedly and eventually crash the consumer or exhaust memory.

Techniques:

  • Consumer-controlled polling: Consumer pulls messages at its own pace (Kafka’s model — consumers poll, they’re not pushed to)
  • Prefetch limit: RabbitMQ basic.qos limits how many unacknowledged messages a consumer receives
  • Queue depth alerting: Alert when queue depth exceeds threshold; scale consumers or investigate slow processing
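The simplest form of backpressure, a bounded buffer that blocks the producer when full, can be shown with Python's `queue.Queue(maxsize=...)` (analogous in spirit to a prefetch limit):

```python
import queue
import threading
import time

buf: "queue.Queue[int]" = queue.Queue(maxsize=5)   # bounded buffer = built-in backpressure
produced: list[int] = []
consumed: list[int] = []

def producer() -> None:
    for i in range(20):
        buf.put(i)            # blocks while the buffer is full: the producer slows down
        produced.append(i)

def consumer() -> None:
    for _ in range(20):
        item = buf.get()
        time.sleep(0.001)     # deliberately slow consumer
        consumed.append(item)

t1 = threading.Thread(target=producer)
t2 = threading.Thread(target=consumer)
t1.start(); t2.start()
t1.join(); t2.join()

print(len(consumed), "messages consumed; queue depth never exceeded", buf.maxsize)
```

The producer's rate is automatically throttled to the consumer's, and memory use is capped by the buffer size: the unbounded-queue failure mode described above cannot occur.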