If you build serious .NET systems on Azure long enough, you eventually hit the wall.
Not the “we need to refactor this controller” wall, but the “we can’t scale this anymore without redesigning the whole thing” wall.
For many teams, that wall shows up as REST-heavy microservices, synchronous calls everywhere, and a database that’s on fire every time the business runs a campaign.
This is where event-driven architecture stops being a buzzword and becomes a survival tactic.
1. When Event-Driven Starts Beating REST in Real Systems
REST and request/response are perfect until they aren’t. The failure modes are very predictable:
- Services become chatty: one frontend request fans out to 5–10 downstream HTTP calls.
- Everything is tightly coupled in time: if one service is slow, the user is slow.
- Databases are the only “integration point”, so every new feature adds more cross-service joins.
- Traffic spikes become production incidents instead of nice graphs on your dashboard.
In more than one project, I’ve seen a “simple” order placement flow involve:
- API → Orders service → Customers service → Pricing service → Inventory service → Payments service → Notifications service
Every arrow is synchronous HTTP. When Black Friday hits, you just chained your p99 latencies and multiplied your failure domains.
When it’s time to move to events
I usually push teams toward event-driven when the combination of these shows up:
- Throughput pressure: sustained high RPS, or unpredictable spikes (marketing, partners, IoT).
- Resilience requirements: “we can’t afford to lose orders, but we also can’t afford to block users.”
- Team autonomy: many teams want to react to the same business facts without tight coupling.
- Integration explosion: dozens of systems need to “know” that something happened.
Event-driven architectures help by:
- Decoupling producers and consumers in space and time.
- Smoothing load (queue-based load leveling is not theory; it saves your bacon when a 10x spike hits).
- Allowing multiple independent consumers to subscribe to the same events.
Real problems teams hit when they “go messaging”
Unfortunately, “let’s use Kafka” or “throw it on Service Bus” doesn’t magically solve anything. What I see repeatedly:
- Throughput ceilings: single consumer instance, no partitioning strategy, async APIs used synchronously.
- Hot partitions: all messages for a tenant or region hashed to a single partition.
- Ordering bugs: multiple consumers plus retries silently changing the processing order.
- Duplicate processing: at-least-once delivery mixed with non-idempotent consumers.
- Backpressure blindness: consumer lag and queue depth grow silently until you’re days behind.
On Azure, .NET + Kafka (or Event Hubs with Kafka protocol) + Azure Service Bus is a very strong combo for enterprise workloads. The trick is being deliberate about which tool does what, and designing for high throughput and low latency from day one.
2. Getting the Vocabulary Right: Events, Streams, and Messaging
Commands, events, and integration events
- Command: “Do this”. Directed to a specific service or aggregate. Has intent and often expectations about success/failure.
- Domain event: “This happened” inside a bounded context. Often internal to the service, but may be published externally.
- Integration event: A domain event shaped for other services. Stable contract, versioned, may omit internal details.
In practice, I advise teams:
- Use Service Bus queues/topics for commands and workflow-style integration events.
- Use Kafka / Event Hubs for high-volume domain events, telemetry, and streaming analytics.
Topics, queues, partitions, and consumer groups
Concept mapping matters for .NET devs coming from pure HTTP:
- Azure Service Bus Queue: point-to-point. One consumer gets each message.
- Azure Service Bus Topic + Subscriptions: pub/sub. Each subscription gets a copy.
- Kafka Topic: a named log, internally split into partitions.
- Kafka Partition: ordered, append-only log. Ordering is per partition.
- Kafka Consumer Group: the unit of parallelism and scaling. Each partition is consumed by at most one consumer in a group.
On Azure, when you use Event Hubs with Kafka protocol, the mental model is Kafka-like: topics + partitions + consumer groups.
Pull vs push consumption models
- Kafka: pull model. Your .NET consumer polls the broker, requesting batches. Backpressure is natural; if you slow down, lag increases.
- Service Bus: often used in push mode (message handlers / processors), but pull (ReceiveMessageAsync / ReceiveMessagesAsync) works as well. Backpressure is tuned via prefetch and concurrency.
Be careful with auto-complete / auto-acknowledge: that’s how teams accidentally turn “at-least-once” into “best-effort”.
Delivery semantics: at-most-once, at-least-once, exactly-once
- At-most-once: you won’t see duplicates, but you may lose messages. Acceptable for logs, metrics, “best-effort” analytics.
- At-least-once: you won’t lose messages, but you may see duplicates. This is the default for Service Bus and Kafka.
- Exactly-once: each message is processed once, or the effect is as-if-once. In practice, this means “exactly-once effects” via idempotency + transactions.
Both Azure Service Bus and Kafka provide mechanisms that get you close:
- Service Bus: transactions, duplicate detection, sessions.
- Kafka: idempotent producers, transactions (producer + consumer offsets), partition-level ordering.
Event sourcing vs event-carried state vs simple integration events
- Event sourcing: the event log is the source of truth. Complex, powerful, not required for 90% of systems.
- Event-carried state transfer: events include enough state for consumers to update their own models (no read-after-write HTTP calls back).
- Simple integration events: “Something happened” with a minimal payload; consumers can call back if needed.
For high-throughput systems, I lean heavily on event-carried state transfer to avoid N+1 integration calls. Event sourcing I reserve for domains that absolutely need an audit log and full replay (finance, critical workflows).
3. Kafka vs Azure Service Bus: Making the Right Call on Azure
Architectural differences that matter
- Kafka / Event Hubs: log-based, stream-first. Topics are partitioned logs, consumers track offsets, multiple consumer groups can read the same stream independently, replay is natural.
- Azure Service Bus: brokered messaging, queue semantics. Each message is “owned” by one consumer in a subscription or queue. Replay is possible via dead-letter / requeue but not a first-class concept.
Throughput and latency on Azure
- Kafka / Event Hubs
- Designed for very high throughput, up to millions of events/second on Event Hubs Dedicated.
- Excellent for high-volume telemetry, logging, event streams.
- Latency can be low (single-digit to tens of milliseconds), but it is often traded off for batching efficiency.
- Service Bus Premium
- Thousands of messages/sec per queue/topic with predictable low latency.
- Premium uses dedicated resources, so tail latency is generally better than Standard.
- Better fit for “business messages” than raw telemetry.
Operational considerations
- Kafka on AKS
- Full control, full responsibility: brokers, ZooKeeper/KRaft, storage, tuning, upgrades.
- This is operations work. Don’t underestimate it.
- Azure Event Hubs (Kafka protocol)
- You get the Kafka client model with a managed backend.
- No broker management, but some features differ from vanilla Kafka.
- Azure Service Bus
- Fully managed, very stable, battle-tested in enterprise workflows.
- Advanced features: sessions, dead-letter queues, transactions, duplicate detection.
Workload fit and a pragmatic decision matrix
My typical guidance:
- Use Kafka / Event Hubs when:
- You need high-volume streams (IoT, telemetry, clickstreams, logs).
- You want multiple independent consumers (real-time analytics, AI, monitoring, ML features).
- Replay and time-travel are important.
- Use Service Bus when:
- You have business workflows, commands, sagas, and orchestrations.
- You need per-entity or per-session ordered processing.
- You care about transactional send/receive semantics and dead-lettering.
- Use both when:
- You have transactional workflows that emit high-volume events consumed by analytics pipelines.
- You want a clean separation: “business messaging” vs “data/analytics streams”.
4. Designing High-Throughput Producers and Consumers in .NET
Partitioning and keying: the heart of scalability
For Kafka/Event Hubs, your partitioning strategy decides your future:
- Messages with the same key go to the same partition (hash(key)), preserving per-key ordering.
- Partitions are the unit of parallelism: max effective consumer instances ≈ partition count.
Common patterns:
- Key by aggregate or entity ID (OrderId, CustomerId) to preserve per-entity ordering.
- Key by logical shard (TenantId, Region) if you can tolerate reordering across entities within that shard.
- Avoid “all events same key” unless you want a bottleneck (I’ve seen “global” key used in prod… it did not end well).
For Service Bus:
- Sessions act like lightweight partitions with ordering per SessionId.
- Be careful not to create a “hot session” with all messages sharing one SessionId.
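To make the Kafka side concrete, here is a minimal keyed-produce sketch with Confluent.Kafka. The topic name and event shape are illustrative, not from a real system:

```csharp
using System.Text.Json;
using Confluent.Kafka;

// Sketch: key by OrderId so every event for a given order hashes to the
// same partition, preserving per-order ordering.
var config = new ProducerConfig { BootstrapServers = "localhost:9092" };
using var producer = new ProducerBuilder<string, string>(config).Build();

var orderId = Guid.NewGuid().ToString();
var payload = JsonSerializer.Serialize(new { OrderId = orderId, Status = "Placed" });

// Same key => same partition. Different orders spread across partitions,
// which is what gives you parallelism without losing per-entity order.
await producer.ProduceAsync("order-events", new Message<string, string>
{
    Key = orderId,
    Value = payload
});

producer.Flush(TimeSpan.FromSeconds(10)); // drain outstanding messages on shutdown
```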
Batch vs single-message send/receive
High throughput requires batching on both ends.
- Kafka producer
- linger.ms: wait a few milliseconds so more messages can be batched together.
- batch.size: the maximum batch size in bytes.
- You almost never want “send one message, flush, wait” in a tight loop.
- Service Bus
- Use SendMessagesAsync / ReceiveMessagesAsync to send and receive in batches (see the sketch below).
- Tune PrefetchCount and MaxConcurrentCalls to keep the pipeline full.
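Here is a sketch of batched sending with Azure.Messaging.ServiceBus; the queue name and the commands collection are illustrative:

```csharp
using System.Linq;
using Azure.Messaging.ServiceBus;

// Sketch: send in size-limited batches rather than one call per message.
var commands = Enumerable.Range(0, 1_000).Select(i => new { OrderId = i });

await using var client = new ServiceBusClient("<connection-string>");
ServiceBusSender sender = client.CreateSender("order-commands");

ServiceBusMessageBatch batch = await sender.CreateMessageBatchAsync();
foreach (var command in commands)
{
    var message = new ServiceBusMessage(BinaryData.FromObjectAsJson(command));
    if (!batch.TryAddMessage(message))
    {
        // Batch is full: flush it and start a new one.
        await sender.SendMessagesAsync(batch);
        batch.Dispose();
        batch = await sender.CreateMessageBatchAsync();
        if (!batch.TryAddMessage(message))
            throw new InvalidOperationException("Single message too large for a batch.");
    }
}
await sender.SendMessagesAsync(batch); // flush the final partial batch
batch.Dispose();
```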
Async I/O, connection pooling, serialization
A few battle-tested rules:
- Use singletons of producer/consumer clients (Kafka, Service Bus). Don’t create per-message clients.
- Use fully async code; avoid blocking on async with .Result or .Wait().
- Consider binary formats (Protobuf, Avro) when payload size dominates. For typical business events, JSON with System.Text.Json is usually OK.
- Be explicit about compression if your events are large and repetitive (Kafka supports producer compression).
Tuning Service Bus in .NET
For high-throughput consumers using Azure.Messaging.ServiceBus:
- Prefetch: set PrefetchCount to a multiple of MaxConcurrentCalls (e.g., 5–20×).
- Concurrency: set MaxConcurrentCalls based on CPU cores and downstream dependencies (DB, HTTP).
- Lock duration: ensure processing + retries can complete before lock expiry, or implement auto-renewal.
- Sessions: use MaxConcurrentSessions and MaxConcurrentCallsPerSession to parallelise across sessions.
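As a sketch of that tuning with ServiceBusProcessor (the values are illustrative starting points and the queue name is made up):

```csharp
using Azure.Messaging.ServiceBus;

await using var client = new ServiceBusClient("<connection-string>");

var processor = client.CreateProcessor("order-commands", new ServiceBusProcessorOptions
{
    MaxConcurrentCalls = 16,        // sized to CPU cores + downstream headroom
    PrefetchCount = 160,            // ~10x concurrency keeps the pipeline full
    AutoCompleteMessages = false,   // complete explicitly after successful work
    MaxAutoLockRenewalDuration = TimeSpan.FromMinutes(5)
});

processor.ProcessMessageAsync += async args =>
{
    await HandleAsync(args.Message);               // business handler (stand-in below)
    await args.CompleteMessageAsync(args.Message); // failure => message is abandoned/retried
};

processor.ProcessErrorAsync += args =>
{
    Console.WriteLine(args.Exception); // log properly in real code
    return Task.CompletedTask;
};

await processor.StartProcessingAsync();
// ... run until shutdown ...
await processor.StopProcessingAsync();

static Task HandleAsync(ServiceBusReceivedMessage message) => Task.CompletedTask;
```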
Tuning Kafka clients from .NET
For Confluent.Kafka in .NET, a few key knobs:
- Producer:
- acks=all for stronger durability vs acks=1 for lower latency.
- linger.ms (e.g., 5–20 ms) to allow batching.
- batch.size tuned to your payload size and throughput.
- compression.type (snappy, lz4) for better throughput when CPU is cheap.
- Consumer:
- fetch.min.bytes and fetch.max.wait.ms to control batching.
- Manual commit with idempotent processing to avoid message loss.
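Here is a sketch of those knobs with Confluent.Kafka; all values are illustrative starting points, not recommendations:

```csharp
using Confluent.Kafka;

// Producer side: producerConfig would feed a ProducerBuilder as in section 4.
var producerConfig = new ProducerConfig
{
    BootstrapServers = "localhost:9092",
    Acks = Acks.All,                      // durability over a little latency
    LingerMs = 10,                        // allow ~10 ms for batching
    BatchSize = 128 * 1024,               // max batch size in bytes
    CompressionType = CompressionType.Lz4 // cheap CPU for better throughput
};

// Consumer side: batch-friendly fetches plus manual commits.
var consumerConfig = new ConsumerConfig
{
    BootstrapServers = "localhost:9092",
    GroupId = "order-projections",
    EnableAutoCommit = false,             // commit only after successful work
    AutoOffsetReset = AutoOffsetReset.Earliest,
    FetchMinBytes = 64 * 1024,            // wait for ~64 KB per fetch...
    FetchWaitMaxMs = 100                  // ...or 100 ms, whichever comes first
};

using var consumer = new ConsumerBuilder<string, string>(consumerConfig).Build();
consumer.Subscribe("order-events");

while (true)
{
    var result = consumer.Consume(TimeSpan.FromSeconds(1));
    if (result is null) continue;

    Process(result.Message.Value); // must be idempotent: see section 5
    consumer.Commit(result);       // manual commit => at-least-once
}

static void Process(string value) { /* update the read model */ }
```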
5. Ordering, Idempotency, and “Exactly-Once” in the Real World
Accepting that true exactly-once is rare
End-to-end, cross-service, cross-datastore exactly-once processing is mostly a fiction outside tightly controlled, specialized systems.
What you can realistically achieve is exactly-once effects for your consumers:
- Each event may be delivered more than once.
- Your consumer processes it in an idempotent way such that the externally observable state is as if it was processed once.
Designing idempotent consumers in .NET
Common approaches I’ve used in production:
- Idempotency keys per event
- Use message ID, event ID, or a deterministic business key (e.g., OrderId + EventType + Version).
- Store processed IDs in a deduplication store (SQL, Redis) with TTL if appropriate.
- Upserts instead of inserts
- Design your write model so that applying the same event twice leads to the same result.
- State-based guards
- “If the order is already in Shipped status, ignore another ShipOrder event.”
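A minimal sketch of the dedup-store variant, assuming Dapper and SQL Server (table, column, and event names are all illustrative):

```csharp
using Dapper;
using Microsoft.Data.SqlClient;

public record OrderShipped(Guid EventId, Guid OrderId);

public class OrderShippedHandler
{
    // The ProcessedEvents insert and the state change commit atomically;
    // a redelivered event hits the unique key and is skipped.
    public async Task HandleAsync(OrderShipped evt, SqlConnection conn) // conn is open
    {
        using var tx = conn.BeginTransaction();
        try
        {
            // Fails with a unique-key violation if this event was already processed.
            await conn.ExecuteAsync(
                "INSERT INTO ProcessedEvents (EventId) VALUES (@EventId)",
                new { evt.EventId }, tx);

            await conn.ExecuteAsync(
                "UPDATE Orders SET Status = 'Shipped' WHERE Id = @OrderId",
                new { evt.OrderId }, tx);

            tx.Commit();
        }
        catch (SqlException ex) when (ex.Number is 2627 or 2601) // unique key violated
        {
            tx.Rollback(); // duplicate delivery: already processed, complete the message
        }
    }
}
```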
The outbox pattern (store outgoing messages in the same transaction as your state change, then publish from there) is critical to avoid “state changed but event lost” scenarios.
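Here is what the outbox looks like as a sketch with EF Core; the entity and table names are mine, and a background publisher (not shown) reads OutboxMessages and sends them to the broker:

```csharp
using System.Text.Json;
using Microsoft.EntityFrameworkCore;

public class Order { public Guid Id { get; set; } public decimal Total { get; set; } }

public class OutboxMessage
{
    public Guid Id { get; set; }
    public string Type { get; set; } = "";
    public string Payload { get; set; } = "";
    public DateTime OccurredAtUtc { get; set; }
    public DateTime? PublishedAtUtc { get; set; } // set by the background publisher
}

public class AppDbContext : DbContext
{
    public AppDbContext(DbContextOptions<AppDbContext> options) : base(options) { }
    public DbSet<Order> Orders => Set<Order>();
    public DbSet<OutboxMessage> OutboxMessages => Set<OutboxMessage>();
}

public class OrderService
{
    private readonly AppDbContext _db;
    public OrderService(AppDbContext db) => _db = db;

    public async Task PlaceOrderAsync(Order order)
    {
        _db.Orders.Add(order);
        _db.OutboxMessages.Add(new OutboxMessage
        {
            Id = Guid.NewGuid(),
            Type = "OrderPlacedV1",
            Payload = JsonSerializer.Serialize(new { order.Id, order.Total }),
            OccurredAtUtc = DateTime.UtcNow
        });

        // One SaveChanges = one transaction: the state change and the outgoing
        // event are persisted together or not at all.
        await _db.SaveChangesAsync();
    }
}
```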
Ordering with Kafka partitions and Service Bus sessions
- Kafka: ordering is guaranteed per partition. To maintain ordering for a given entity, always use the same partition key (e.g., OrderId).
- Service Bus: ordering is guaranteed within a session. Use SessionId as your key (e.g., OrderId).
A common production bug I’ve seen: a team introduces a new consumer group (or new session handler) for the same logical flow without respecting ordering keys. Suddenly events are processed out of order and downstream aggregates get corrupted.
Schemas that survive replay and versioning
Your event schema should support:
- Schema evolution: add fields with defaults, don’t rename/remove without a plan.
- Replay: events must be self-contained enough for a new consumer to build state.
- Event versioning: include an explicit version field or use envelope metadata.
In Kafka ecosystems, Avro or Protobuf with schema registry is common. On Azure with .NET, JSON + explicit versioning is often sufficient as long as you’re disciplined.
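As an illustration, here is a simple versioned envelope in C# (the field names are mine, not a standard):

```csharp
using System.Text.Json.Serialization;

// Consumers dispatch on Type + Version; new payload fields get defaults
// so old events still deserialize after the schema evolves.
public record EventEnvelope<TPayload>(
    [property: JsonPropertyName("eventId")]    Guid EventId,
    [property: JsonPropertyName("type")]       string Type,    // e.g. "OrderPlaced"
    [property: JsonPropertyName("version")]    int Version,    // e.g. 1
    [property: JsonPropertyName("occurredAt")] DateTimeOffset OccurredAt,
    [property: JsonPropertyName("payload")]    TPayload Payload);

// V2 adds a field with a default so V1 events can still be replayed.
public record OrderPlaced(Guid OrderId, decimal Total, string Currency = "USD");
```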
Compensating actions and out-of-order events
When you can’t guarantee strict ordering end to end:
- Make events carry sequence numbers or logical timestamps.
- Use a small reordering window in your consumer (buffer recent events per key).
- Design compensating events when strict order can’t be preserved (e.g., “PaymentReversed” after “PaymentCompleted”).
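A minimal per-key sequence guard as a sketch; in production the last-applied sequence belongs in the read model or a durable store, not in process memory:

```csharp
// Drop (or route to compensation) events that arrive with a sequence number
// at or below the last one applied for that key.
public class SequenceGuard
{
    private readonly Dictionary<string, long> _lastApplied = new();

    public bool ShouldApply(string key, long sequence)
    {
        if (_lastApplied.TryGetValue(key, out var last) && sequence <= last)
            return false; // stale or duplicate: ignore, or trigger a compensating action

        _lastApplied[key] = sequence;
        return true;
    }
}
```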
6. Backpressure, Throttling, and Failure Handling
How backpressure manifests in .NET event-driven services
Look for:
- Growing queue depth (Service Bus) or consumer lag (Kafka).
- CPU at 100% and high GC activity in consumers.
- DB / downstream service saturation (connection pool exhaustion, timeouts).
- Increasing processing latency per message.
Consumer concurrency vs partition count vs CPU
A rule of thumb I use:
- Kafka: target consumer instances ≈ partitions, then tune intra-process concurrency carefully.
- Service Bus: tune MaxConcurrentCalls and PrefetchCount per process based on CPU cores, then horizontally scale instances.
Retries, DLQs, and poison messages
- Service Bus:
- Use automatic retries for transient errors.
- Configure dead-letter queues for messages that fail too many times.
- Build a DLQ inspection and replay tool. You will need it.
- Kafka:
- Use retry topics or DLQ topics with delayed retries instead of tight loops.
- Be careful not to block an entire partition on a single poison message.
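On the Service Bus side, the DLQ inspection-and-replay tool mentioned above can start as small as this sketch (queue name illustrative; real tooling needs filtering and safety checks before blind resends):

```csharp
using Azure.Messaging.ServiceBus;

await using var client = new ServiceBusClient("<connection-string>");

// The dead-letter queue is a sub-queue of the main entity.
ServiceBusReceiver dlq = client.CreateReceiver("order-commands",
    new ServiceBusReceiverOptions { SubQueue = SubQueue.DeadLetter });

ServiceBusSender sender = client.CreateSender("order-commands");

foreach (var msg in await dlq.ReceiveMessagesAsync(maxMessages: 50))
{
    Console.WriteLine(
        $"{msg.MessageId}: {msg.DeadLetterReason} - {msg.DeadLetterErrorDescription}");

    // Replay: clone the body into a fresh message, then complete the original.
    await sender.SendMessageAsync(new ServiceBusMessage(msg.Body));
    await dlq.CompleteMessageAsync(msg);
}
```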
Azure-native throttling and auto-scaling
- AKS / Container Apps: scale out consumer pods/containers based on queue depth or consumer lag.
- Azure Functions: built-in scaling for Service Bus and Event Hubs triggers, but you still need to respect downstream limits.
- Combine with circuit breakers and bulkheads when hitting external services.
Flow control in .NET clients
- Service Bus: decrease PrefetchCount and MaxConcurrentCalls to reduce pressure; increase them to raise throughput.
- Kafka: adjust consumer poll frequency and batch sizes; a slowdown manifests as lag, not immediate failure.
7. A Concrete Azure Reference Architecture
Let’s sketch a realistic architecture I’ve used variations of:
```
Public API → Command Bus (Service Bus) → Domain Service(s)
                                              │
                                              ├─ Persist business state (SQL/NoSQL)
                                              └─ Publish integration/domain events
                                                         │
                                                         ▼
                                          Kafka / Event Hubs topics
                                                         │
                                                         ▼
                 Read models, projections, analytics, ML features, notifications
```
Mixing Kafka and Service Bus cleanly
- Service Bus handles:
- Commands from API or other systems.
- Transactional workflows, sagas, orchestrations.
- Kafka/Event Hubs handles:
- Domain events at scale.
- Telemetry, clickstreams, IoT data.
- Analytics, projections, materialized views.
.NET building blocks: minimal APIs and worker services
I favour three kinds of processes:
- API services: ASP.NET Core minimal APIs, accept commands, validate, send to Service Bus.
- Domain workers: ASP.NET Core Worker Services hosted in AKS/Container Apps, consuming from Service Bus, using outbox to publish domain events to Kafka or back to Service Bus topics.
- Projection/analytics workers: dedicated Kafka consumers updating read models or pushing to downstream analytics stores.
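To make the first role concrete, here is a minimal API sketch that validates a request and enqueues a command; the route, queue name, and PlaceOrder shape are illustrative:

```csharp
using Azure.Messaging.ServiceBus;

var builder = WebApplication.CreateBuilder(args);

// Singleton sender, per the client-lifetime guidance earlier.
builder.Services.AddSingleton(sp =>
    new ServiceBusClient(builder.Configuration["ServiceBus:ConnectionString"]!)
        .CreateSender("order-commands"));

var app = builder.Build();

app.MapPost("/orders", async (PlaceOrder cmd, ServiceBusSender sender) =>
{
    if (cmd.Total <= 0) return Results.BadRequest("Total must be positive.");

    await sender.SendMessageAsync(new ServiceBusMessage(BinaryData.FromObjectAsJson(cmd))
    {
        SessionId = cmd.OrderId.ToString() // per-order ordering (session-enabled queue)
    });

    return Results.Accepted($"/orders/{cmd.OrderId}"); // 202: work happens asynchronously
});

app.Run();

public record PlaceOrder(Guid OrderId, decimal Total);
```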
Azure Functions, AKS, Container Apps
- Azure Functions: great for bursty or glue workloads (e.g., reacting to a subset of events to send emails).
- AKS: favoured for long-running, stateful-ish, or latency-sensitive consumers.
- Container Apps: nice middle ground with simpler ops than AKS for many teams.
Multi-tenant and multi-region thoughts
- Partition by tenant (Kafka partition key or Service Bus SessionId) to keep tenant traffic somewhat isolated.
- Multi-region:
- Kafka: use geo-replication (MirrorMaker 2, Confluent Replicator, or Event Hubs geo features).
- Service Bus: active/passive or active/active with clear ownership of entities.
8. Observability and Capacity Planning
What to measure
- Kafka:
- Consumer lag (e.g., records-lag-max).
- Produce/consume throughput.
- Broker I/O, disk, and network.
- Service Bus:
- Queue/topic depth, dead-letter queue depth.
- Processing latency and failure rates.
- Lock lost / abandon / defer patterns.
- .NET services:
- Event processing time per handler.
- DB and downstream dependency latency/error rates.
- GC and thread pool metrics.
Tools that actually get used
- Azure Monitor & Application Insights: first stop for metrics and distributed traces.
- Prometheus/Grafana: especially for Kafka on AKS or custom exporters.
- Log aggregation (Azure Monitor Logs, ELK, etc.): for correlation IDs and incident analysis.
Capacity planning for event-driven systems
For each topic/queue, define:
- Expected and peak events/sec.
- Typical message size.
- Max tolerable lag (seconds/minutes).
- Number of partitions / messaging units / premium messaging units.
Then back into:
- How many consumer instances you need.
- How much you need to batch per poll/receive.
- What your storage and downstream capacity must be.
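To make the arithmetic concrete with illustrative numbers: a peak of 10,000 events/sec with consumers that sustain ~500 events/sec per instance means ~20 instances, which on Kafka implies at least 20 partitions on that topic (one consumer per partition per group, plus headroom for growth). At 2 KB per event, that same peak is ~20 MB/sec of ingress, which is the number your network, broker, and downstream storage have to absorb.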
Load testing event-driven flows
Don’t only load test the HTTP layer. Simulate:
- Peak RPS into APIs that enqueue commands.
- Peak event volumes on Kafka/Event Hubs topics.
- Slowdowns and outages in downstream dependencies while keeping ingestion high.
Watch for lag and queue depth, not just CPU.
Runbooks you actually need
- Hot partition: detect (lag only on one partition), add partitions if possible, redistribute keys, or reshard tenants.
- Consumer lag spike: scale out consumers, temporarily relax processing, or shed non-critical workloads.
- Dead-letter spikes: inspect messages, spot systemic issues (bad deployment, schema change) vs real poison messages.
9. Implementation Guidelines for .NET Teams
Standardizing event contracts
- Define clear event naming conventions (e.g., OrderPlacedV1).
- Keep integration event schemas in a shared repository with versioning.
- Document which topics/queues carry which event types and who owns them.
Library choices
- Kafka: Confluent.Kafka for the low-level client, possibly wrapped in your own abstractions.
- Service Bus: Azure.Messaging.ServiceBus as the go-to SDK.
- Frameworks (where appropriate): MassTransit, NServiceBus, Dapr, etc. can help if you accept their abstractions and conventions.
Secure messaging on Azure
- Use managed identities for auth from your .NET services to Service Bus / Event Hubs / storage.
- Lock down with RBAC and least privilege (send vs listen vs manage).
- Ensure encryption in transit (TLS) and at rest (Azure handles storage, but be mindful of any custom persistence).
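In code, managed-identity auth is a one-line change from connection strings; a sketch with Azure.Identity (the namespace value is illustrative, and the service’s identity needs the “Azure Service Bus Data Sender” role on the entity):

```csharp
using Azure.Identity;
using Azure.Messaging.ServiceBus;

// No connection string: the client authenticates via the managed identity
// (or developer credentials locally) through DefaultAzureCredential.
await using var client = new ServiceBusClient(
    "mybus.servicebus.windows.net",
    new DefaultAzureCredential());

var sender = client.CreateSender("order-commands");
await sender.SendMessageAsync(new ServiceBusMessage("hello"));
```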
Coding patterns: channels and bounded queues
Inside your .NET service, separate broker I/O from business processing using bounded channels. This lets you:
- Control internal concurrency.
- Avoid blocking the broker client when downstream is slow.
- Implement backpressure internally.
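A sketch with System.Threading.Channels; the capacity and worker count are illustrative:

```csharp
using System.Linq;
using System.Threading.Channels;

// Bounded channel between the broker callback and the business handlers.
// When handlers fall behind, writers wait, slowing the broker pump instead
// of ballooning memory.
var channel = Channel.CreateBounded<string>(new BoundedChannelOptions(500)
{
    FullMode = BoundedChannelFullMode.Wait // backpressure: block producers when full
});

// Processing side: a few workers drain the channel, decoupled from broker concurrency.
var workers = Enumerable.Range(0, 4).Select(_ => Task.Run(async () =>
{
    await foreach (var message in channel.Reader.ReadAllAsync())
    {
        await Task.Delay(10); // stand-in for business processing
    }
})).ToArray();

// Broker side: in real code this is the Service Bus / Kafka receive callback.
for (var i = 0; i < 1_000; i++)
    await channel.Writer.WriteAsync($"msg-{i}"); // waits while the channel is full

channel.Writer.Complete();
await Task.WhenAll(workers);
```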
Migration from REST-heavy to event-driven
- Start by introducing asynchronous messaging for non-critical flows that are causing pain (notifications, some integrations).
- Introduce integration events from key domain services and let new consumers subscribe.
- Gradually remove synchronous HTTP calls between services, replacing them with commands & events.
10. The Bottom Line
High-throughput, low-latency event-driven architecture on Azure with .NET, Kafka, and Service Bus isn’t a framework choice; it’s a set of design decisions:
- Picking the right tool for the job (Kafka/Event Hubs vs Service Bus).
- Designing partitions, sessions, and keys for scalability and ordering.
- Implementing idempotent consumers with at-least-once delivery.
- Handling backpressure and failure as first-class concerns.
- Investing in observability and runbooks before you’re in an incident.
Done right, you end up with a system that can absorb spikes, evolve faster, and give teams more autonomy. Done casually, you just move your outages from HTTP to Kafka or Service Bus.
Design it deliberately, measure it ruthlessly, and treat messaging infrastructure as a core part of your architecture, not an afterthought.