Most teams discover idempotency and high-throughput messaging the hard way: after they’ve corrupted data in production or melted a downstream system under load.
In this post, I’ll walk through how I design high-throughput, idempotent message processing pipelines using Azure Service Bus and .NET, based on years of running these patterns in production.
Why idempotent, high-throughput messaging changes how you design .NET systems
Let’s level-set the constraints first, because they drive almost every design decision:
- Azure Service Bus is at-least-once delivery. Duplicates will happen. Network blips, lock expiry, client retries – all can cause redelivery.
- High throughput means concurrency. You’ll run multiple consumers (competing consumers pattern), each with multiple concurrent handlers.
- State mutation is where you get hurt. Idempotency is usually not about your message handler code; it’s about how many times you mutate your database or external systems.
- Back-pressure is non-negotiable. Your queue will absorb spikes; your processing pipeline must not blindly hammer downstream dependencies.
Once you accept these, you stop asking “How do I avoid duplicates?” and start asking “How do I survive duplicates while keeping throughput high?”
Choosing the right Azure Service Bus primitives for high throughput
Before writing any .NET code, you need to shape the messaging topology around your workload. Azure Service Bus gives you just enough flexibility to shoot yourself in the foot if you don’t think it through.
Queues vs topics: point-to-point vs fan-out
- Queue – single logical consumer group. Perfect for work queues, background processing, batch jobs.
- Topic + subscriptions – pub/sub and fan-out. Multiple independent subscribers can process the same event stream differently.
For high-throughput pipelines:
- Use queues for command-like “do this work” scenarios where exactly one processing lane is required.
- Use topics when multiple services need to react to the same business event independently.
Standard vs Premium, partitioned vs non-partitioned
You’re trading cost, throughput, and latency:
| Tier / Option | Pros | Cons | When I choose it |
|---|---|---|---|
| Standard, non-partitioned | Simpler, cheaper | Lower throughput, no isolation | Dev/test, low-volume workloads |
| Standard, partitioned | Higher throughput, better availability | No sessions across partitions | High throughput but cost-sensitive, no strict FIFO per key |
| Premium | Dedicated resources, predictable latency, AZ support | More expensive | Mission-critical, very high throughput, noisy neighbors unacceptable |
Sessions: ordered, related message streams
Sessions give you FIFO per SessionId. Service Bus will ensure:
- Messages with the same SessionId are processed in order.
- Only one ServiceBusSessionReceiver has a given session at a time.
They’re a good fit when:
- You’re processing a per-user or per-aggregate stream (e.g., all operations for an OrderId).
- Ordering inside that stream actually matters.
The catch: sessions hurt raw parallelism if you only have a few hot keys. You need enough distinct SessionId values to spread load across multiple consumers.
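On the sending side, opting into sessions is just a matter of stamping the SessionId. A minimal sketch, where orderEvent, orderId, and sender are assumed to exist:

```csharp
// Sketch: route every message for one order into the same session.
// The receiving entity must be created with sessions enabled.
var message = new ServiceBusMessage(BinaryData.FromObjectAsJson(orderEvent))
{
    SessionId = orderId // FIFO is guaranteed per SessionId
};
await sender.SendMessageAsync(message);
```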
Partitioning and partition keys
Partitioned queues/topics in Standard tier distribute messages across partitions; each partition is an independent broker and store. You can optionally provide a PartitionKey to keep related messages together.
- Use a stable, high-cardinality partition key (e.g., CustomerId) to spread load.
- Don’t use the same fixed key for everything unless you like single-partition bottlenecks.
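On the publisher this just means setting PartitionKey on the outgoing message – a small sketch, with customerId, body, and sender assumed:

```csharp
// Sketch: keep all messages for one customer on the same partition
var message = new ServiceBusMessage(body)
{
    PartitionKey = customerId // stable, high-cardinality key; avoid a single fixed value
};
await sender.SendMessageAsync(message);
```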
Duplicate detection on the broker
Service Bus can do some deduplication for you via duplicate detection:
- Configured per entity with a duplicate detection window (20 seconds to 7 days).
- Uses MessageId history to discard duplicates sent within that window.
It’s useful, but not enough by itself:
- Deduplication happens at send time; it does nothing for receive semantics.
- It doesn’t protect you from consumer-side duplicates (e.g., message lock expiry and redelivery).
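For reference, this is roughly what the entity-level setting looks like when creating the queue from code – a sketch using the administration client, with the connection string and queue name assumed:

```csharp
using Azure.Messaging.ServiceBus.Administration;

// Sketch: create a queue with broker-side duplicate detection enabled
// (RequiresDuplicateDetection can only be set at creation time)
var adminClient = new ServiceBusAdministrationClient(connectionString);
await adminClient.CreateQueueAsync(new CreateQueueOptions("order-commands")
{
    RequiresDuplicateDetection = true,
    DuplicateDetectionHistoryTimeWindow = TimeSpan.FromHours(1) // 20 seconds to 7 days
});
```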
Designing idempotent handlers in .NET: stop thinking in “try/catch” and start thinking in “state transitions”
Most idempotency failures I’ve debugged weren’t about exotic distributed systems theory. They were simple things like “we created the invoice twice” or “we charged the customer twice”.
The rule I drill into teams is:
If the same message arrives 10 times, your final system state must be the same as if it arrived once.
— hard-earned production lesson
Anatomy of a robust message handler
In .NET with Azure.Messaging.ServiceBus, your handler usually ends up shaped like this:
public class OrderCreatedMessage
{
public string MessageId { get; set; } = default!; // business message id, not broker id
public string OrderId { get; set; } = default!;
public decimal Amount { get; set; }
public DateTimeOffset OccurredAt { get; set; }
}
public class OrderCreatedHandler
{
private readonly IIdempotencyStore _idempotencyStore;
private readonly IOrdersService _ordersService;
private readonly ILogger<OrderCreatedHandler> _logger;
public OrderCreatedHandler(
IIdempotencyStore idempotencyStore,
IOrdersService ordersService,
ILogger<OrderCreatedHandler> logger)
{
_idempotencyStore = idempotencyStore;
_ordersService = ordersService;
_logger = logger;
}
public async Task HandleAsync(
OrderCreatedMessage message,
ServiceBusReceivedMessage raw,
CancellationToken cancellationToken)
{
// 1. Check idempotency first – fast path if already processed
if (await _idempotencyStore.IsProcessedAsync(message.MessageId, cancellationToken))
{
_logger.LogInformation("Skipping already processed message {MessageId}", message.MessageId);
return;
}
// 2. Attempt to mark as "in-progress" (idempotency guard)
var acquired = await _idempotencyStore.TryAcquireAsync(message.MessageId, cancellationToken);
if (!acquired)
{
_logger.LogInformation("Another worker is processing message {MessageId}", message.MessageId);
return; // let the other worker finish; duplicate will be safe
}
try
{
// 3. Execute side effects (DB, HTTP, etc.)
await _ordersService.CreateOrderAsync(message, cancellationToken);
// 4. Mark as completed
await _idempotencyStore.MarkCompletedAsync(message.MessageId, cancellationToken);
}
catch (Exception ex)
{
_logger.LogError(ex, "Failed to handle message {MessageId}", message.MessageId);
// optional: mark as failed with reason; let retry policy decide
await _idempotencyStore.MarkFailedAsync(message.MessageId, ex, cancellationToken);
throw; // surface to Service Bus processor so it can retry / DLQ
}
}
}
This pattern relies on a durable IIdempotencyStore that can coordinate between workers.
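The handler above implies the following contract; this is the shape of IIdempotencyStore as I typically define it:

```csharp
public interface IIdempotencyStore
{
    // True if the message was already fully processed (fast-path skip).
    Task<bool> IsProcessedAsync(string messageId, CancellationToken ct);

    // Atomically claims the message; false if another worker already claimed it.
    Task<bool> TryAcquireAsync(string messageId, CancellationToken ct);

    Task MarkCompletedAsync(string messageId, CancellationToken ct);
    Task MarkFailedAsync(string messageId, Exception ex, CancellationToken ct);
}
```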
What makes a handler truly idempotent?
You must audit every side effect and make sure it’s either:
- Idempotent by contract (e.g., PUT /resource/123 with same payload multiple times).
- Protected by idempotency keys (e.g., payment provider APIs that accept an idempotency key).
- Wrapped in your own deduplication logic using a durable store.
Things that are not naturally idempotent:
- INSERT-only SQL operations without unique constraints.
- External calls that create new resources without idempotency keys.
- Email/SMS sending (users notice duplicates very quickly).
Deduplication strategies: getting to “exactly once effect”
You will not get exactly-once delivery from Service Bus. You can, however, get exactly-once effect if you control how side effects are applied.
Choosing your idempotency key
The most important design choice is what uniquely identifies the business operation. Options:
- Business operation ID – e.g., PaymentId or OrderId. Good when 1:1 with the effect.
- Client-generated message ID – a GUID generated by the publisher and carried through as MessageId in the payload.
- Composite key – e.g., (OrderId, EventType, Version) when one aggregate receives multiple event types.
Use that as your primary idempotency key; don’t rely on the broker-assigned ServiceBusReceivedMessage.MessageId unless you’re very deliberate about it.
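In practice that means the publisher derives the key from the business operation and stamps it onto the payload (and usually onto the broker MessageId too, so duplicate detection keys off the same value). A sketch, with order and sender assumed:

```csharp
// Sketch: one deterministic key per business operation, reused on every re-send
var businessMessageId = $"order-created-{order.OrderId}";

var message = new ServiceBusMessage(BinaryData.FromObjectAsJson(new OrderCreatedMessage
{
    MessageId = businessMessageId,
    OrderId = order.OrderId,
    Amount = order.Amount,
    OccurredAt = DateTimeOffset.UtcNow
}))
{
    MessageId = businessMessageId // aligns broker duplicate detection with your idempotency key
};
await sender.SendMessageAsync(message);
```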
Implementing a durable idempotency store
You can implement IIdempotencyStore in a few ways; each has trade-offs.
1. SQL-based idempotency table (my default for most systems)
Example table:
CREATE TABLE MessageProcessingStatus
(
MessageId NVARCHAR(100) NOT NULL,
Status TINYINT NOT NULL, -- 0=InProgress,1=Completed,2=Failed
ProcessedAtUtc DATETIME2 NULL,
FailedReason NVARCHAR(500) NULL,
CONSTRAINT PK_MessageProcessingStatus PRIMARY KEY (MessageId)
);
And a simple implementation using EF Core-style pseudo code:
public class SqlIdempotencyStore : IIdempotencyStore
{
private readonly MyDbContext _db;
public SqlIdempotencyStore(MyDbContext db) => _db = db;
public async Task<bool> IsProcessedAsync(string messageId, CancellationToken ct)
{
return await _db.MessageStatuses
.AnyAsync(x => x.MessageId == messageId && x.Status == Status.Completed, ct);
}
public async Task<bool> TryAcquireAsync(string messageId, CancellationToken ct)
{
try
{
_db.MessageStatuses.Add(new MessageStatus
{
MessageId = messageId,
Status = Status.InProgress,
ProcessedAtUtc = null
});
await _db.SaveChangesAsync(ct);
return true;
}
catch (DbUpdateException ex) when (IsUniqueConstraintViolation(ex))
{
// Someone else inserted it first; let them process
return false;
}
}
public async Task MarkCompletedAsync(string messageId, CancellationToken ct)
{
var entity = await _db.MessageStatuses.FindAsync(new object[] { messageId }, ct);
if (entity is null) return; // or throw
entity.Status = Status.Completed;
entity.ProcessedAtUtc = DateTime.UtcNow;
await _db.SaveChangesAsync(ct);
}
public async Task MarkFailedAsync(string messageId, Exception ex, CancellationToken ct)
{
var entity = await _db.MessageStatuses.FindAsync(new object[] { messageId }, ct);
if (entity is null) return;
entity.Status = Status.Failed;
entity.FailedReason = ex.Message.Length > 500 ? ex.Message[..500] : ex.Message;
entity.ProcessedAtUtc = DateTime.UtcNow;
await _db.SaveChangesAsync(ct);
}
}
This pattern leans on the primary key constraint to ensure only one worker “wins” the right to process a given message.
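The IsUniqueConstraintViolation helper is the only provider-specific piece. For SQL Server it's roughly this (2627 and 2601 are the unique constraint / unique index violation error numbers):

```csharp
// Requires Microsoft.Data.SqlClient (or System.Data.SqlClient on older stacks)
private static bool IsUniqueConstraintViolation(DbUpdateException ex)
    => ex.InnerException is SqlException sqlEx
       && (sqlEx.Number == 2627 || sqlEx.Number == 2601);
```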
2. Redis / distributed cache
For ultra-low latency and simple “processed key” checks, Redis works well:
- Use SETNX (set if not exists) to acquire.
- Set a TTL based on your retention requirements.
But beware:
- Caches can be evicted; your guarantees are weaker than with a database.
- You need to tune TTLs carefully to balance memory vs duplicate risk.
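With StackExchange.Redis the acquire step is a one-liner – a sketch, assuming an IDatabase instance named db:

```csharp
// Sketch: SET <key> NX with a TTL – false means another worker already owns the key
var acquired = await db.StringSetAsync(
    $"msg:{messageId}",
    "in-progress",
    TimeSpan.FromDays(1), // TTL: tune to your duplicate window
    When.NotExists);      // SETNX semantics
```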
3. Piggybacking on business tables
Sometimes the cleanest solution is not a separate idempotency table. For example:
- Orders has a unique constraint on ExternalOrderId.
- If you try to insert the same order twice, you get a constraint violation.
This can give you idempotent effects without a separate idempotency store – just treat unique constraint violations as “already done”.
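In code this looks almost identical to the SQL idempotency store, just against the real business table – a sketch, with an Order entity and ExternalOrderId column assumed:

```csharp
// Sketch: the unique constraint on ExternalOrderId is the idempotency guard
try
{
    _db.Orders.Add(new Order { ExternalOrderId = message.OrderId, Amount = message.Amount });
    await _db.SaveChangesAsync(ct);
}
catch (DbUpdateException ex) when (IsUniqueConstraintViolation(ex))
{
    // The order already exists – this is a duplicate delivery, nothing left to do
}
```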
Failures, retries, and poison messages: how to not burn the house down
Every high-throughput pipeline eventually meets its poison messages – payloads that always fail because of validation issues, data assumptions, or downstream invariants.
Understanding Service Bus delivery and DLQ behavior
With peek-lock mode (which you should use), the flow is:
- Message is locked for a duration (default 30s, up to 5 minutes).
- You process it, then explicitly settle it with CompleteMessageAsync, AbandonMessageAsync, DeferMessageAsync, or DeadLetterMessageAsync.
- If your handler throws or the lock expires, the message becomes available again and the delivery count increments.
- Once MaxDeliveryCount is exceeded (default 10), Service Bus moves it to the dead-letter queue (DLQ).
Retry policies: transient vs permanent failures
I split errors into roughly two buckets:
- Transient – timeouts, 5xx from downstream, throttling. Worth retrying with exponential backoff + jitter.
- Permanent – validation errors, domain rule violations, bad data. Retrying blindly just delays the inevitable and increases noise.
I usually implement a layered strategy:
- Inside the handler: use Polly (or similar) to retry transient downstream calls a few times with short backoff.
- If still failing, throw; let Service Bus redeliver up to MaxDeliveryCount.
- After MaxDeliveryCount is exceeded, the message moves to the DLQ – treat the DLQ as an inbox for manual / semi-automated inspection.
var retryPolicy = Policy
.Handle<HttpRequestException>()
.Or<TimeoutException>()
.WaitAndRetryAsync(3, attempt => TimeSpan.FromSeconds(Math.Pow(2, attempt))); // 2,4,8s
public async Task HandleAsync(OrderCreatedMessage msg, CancellationToken ct)
{
await retryPolicy.ExecuteAsync(async () =>
{
await _ordersService.CreateOrderAsync(msg, ct);
});
}
Explicit dead-lettering with context
Don’t always rely on MaxDeliveryCount. Sometimes you know immediately the message is hopeless (e.g., invalid schema). In that case:
await args.DeadLetterMessageAsync(
args.Message,
deadLetterReason: "ValidationFailed",
deadLetterErrorDescription: "CustomerId missing");
That metadata makes DLQ triage way more efficient later.
Delayed retries via scheduled messages
For downstream systems that don’t like being hammered, you can use scheduled messages to implement delayed retries:
// schedule retry in 5 minutes
var scheduledMessage = new ServiceBusMessage(body)
{
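    // Careful: if broker duplicate detection is enabled, a re-send with the same MessageId inside
    // the detection window can be silently dropped – consider a fresh MessageId for the retry and
    // carrying the business id in ApplicationProperties instead.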
MessageId = originalMessageId,
ApplicationProperties =
{
["RetryAttempt"] = attempt + 1
}
};
await sender.ScheduleMessageAsync(scheduledMessage, DateTimeOffset.UtcNow.AddMinutes(5));
Circuit breakers around fragile dependencies
If a downstream system is consistently failing, hammering it from 50 consumers in parallel won’t help. Wrap your outbound calls in a circuit breaker:
var circuitBreaker = Policy
.Handle<Exception>()
.CircuitBreakerAsync(
exceptionsAllowedBeforeBreaking: 10,
durationOfBreak: TimeSpan.FromMinutes(1));
public Task HandleAsync(OrderCreatedMessage msg, CancellationToken ct)
    => circuitBreaker.ExecuteAsync(token => _ordersService.CreateOrderAsync(msg, token), ct);
Once open, you can choose to:
- Abandon messages so they’re retried later.
- Dead-letter quickly with a clear “Downstream unavailable” reason and move work to a separate recovery lane.
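In the processor callback, the open-circuit case then becomes an explicit branch instead of yet another failed attempt – a sketch of the first option (abandon and let redelivery happen after the break):

```csharp
// Sketch: don't keep calling a dependency whose circuit is already open.
// BrokenCircuitException lives in Polly.CircuitBreaker.
try
{
    await handler.HandleAsync(body, args.Message, args.CancellationToken);
    await args.CompleteMessageAsync(args.Message);
}
catch (BrokenCircuitException)
{
    // The circuit is open – put the message back; note this still increments DeliveryCount
    await args.AbandonMessageAsync(args.Message);
}
```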
Preserving ordering and causality without killing throughput
Ordering is expensive. I only guarantee it where the business truly demands it.
When ordering actually matters
Some realistic examples:
- Balance updates for the same account.
- State transitions on the same aggregate (e.g., Order Placed → Paid → Shipped).
- Event streams that feed projections (where out-of-order events break invariants).
Where ordering doesn’t matter (but teams mistakenly think it does):
- Independent operations across different customers or tenants.
- Notification events that can be processed in any order.
- Bulk ingestion tasks where each record is logically independent.
Using sessions for per-key ordering
For per-aggregate ordering, I often combine:
- SessionId = AggregateId for strict ordering.
- Enough aggregate IDs to keep many concurrent sessions open across workers.
In .NET, you’d use ServiceBusSessionProcessor:
var options = new ServiceBusSessionProcessorOptions
{
MaxConcurrentSessions = 16, // how many sessions processed in parallel
MaxConcurrentCallsPerSession = 1, // FIFO per session
AutoCompleteMessages = false,
PrefetchCount = 0
};
var processor = client.CreateSessionProcessor(queueName, options);
processor.ProcessMessageAsync += async args =>
{
    var sessionId = args.Message.SessionId;
    // handle the message for this session in order, then settle it explicitly
    await args.CompleteMessageAsync(args.Message);
};
processor.ProcessErrorAsync += args => Task.CompletedTask; // log properly in real code
await processor.StartProcessingAsync();
Partitioning and logical ordering
When you don’t need strict FIFO but want to avoid reordering within some scope, you can:
- Use partition keys (Standard partitioned entities) so messages for a key stay on the same partition.
- Accept eventual ordering: some minor reordering is tolerated, but all operations are still idempotent.
Throughput tuning in .NET: prefetch, concurrency, batch receive
Once correctness is nailed, you can safely turn the knobs for throughput.
ServiceBusProcessor basics
With Azure.Messaging.ServiceBus, the usual pattern is:
var client = new ServiceBusClient(connectionString);
var options = new ServiceBusProcessorOptions
{
MaxConcurrentCalls = 32,
AutoCompleteMessages = false,
PrefetchCount = 128
};
var processor = client.CreateProcessor(queueName, options);
processor.ProcessMessageAsync += ProcessMessageHandler;
processor.ProcessErrorAsync += ErrorHandler;
await processor.StartProcessingAsync();
- MaxConcurrentCalls – how many messages your process handles concurrently.
- PrefetchCount – how many messages to buffer client-side to reduce round trips.
- AutoCompleteMessages = false – you control CompleteMessageAsync once business logic is done.
Prefetch: use it, but don’t overshoot
Prefetch can significantly increase throughput by keeping a local buffer of messages.
- Start with PrefetchCount ~ 2–4 × MaxConcurrentCalls.
- Adjust based on latency vs memory trade-offs.
- Be cautious with very long processing times; prefetching too much can lead to many locked-but-unprocessed messages.
Concurrency tuning: CPU vs I/O bound
Ask yourself: is your handler mostly CPU-bound or I/O-bound?
- CPU-bound (heavy computation) – don’t set MaxConcurrentCalls much higher than your core count.
- I/O-bound (HTTP, DB, storage) – you can often safely go higher and let async I/O do its thing.
A real example: a team had capped MaxConcurrentCalls at 4 based on early tests. The handler was 95% I/O. Bumping it to 64 immediately doubled throughput without stressing CPU.
Batch receive and send
For some workloads, especially ETL-style jobs, it’s beneficial to handle messages in batches:
- Receive multiple messages with ReceiveMessagesAsync in a loop (when not using ServiceBusProcessor).
- Use send batching (SendMessagesAsync with a collection) on the publisher side.
But remember: batching complicates per-message idempotency and error handling, so I use it where the cost savings are worth the complexity.
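For reference, a pull-based batch receive loop with a plain ServiceBusReceiver looks roughly like this (queue name and cancellation token assumed; per-message idempotency still applies inside the loop):

```csharp
// Sketch: batch receive without ServiceBusProcessor
var receiver = client.CreateReceiver("order-commands");
while (!ct.IsCancellationRequested)
{
    var batch = await receiver.ReceiveMessagesAsync(
        maxMessages: 50,
        maxWaitTime: TimeSpan.FromSeconds(5),
        cancellationToken: ct);

    foreach (var msg in batch)
    {
        // run the same idempotent handling per message, then settle individually
        await receiver.CompleteMessageAsync(msg, ct);
    }
}
```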
Resiliency and observability: timeouts, idempotency metrics, tracing
If you can’t see your pipeline misbehaving early, you’ll only notice it once you’ve lost money or data.
Timeout discipline
Every external call in your handler must have:
- A reasonable timeout (HTTP client, DB commands, etc.).
- A retry policy (where appropriate).
- A cancellation path tied to the message handling CancellationToken.
Long-running handlers increase the chance of message lock expiry, which leads to duplicates even if your code “looks fine”. Consider lock renewal for genuine long tasks, but avoid making every handler long-running by accident.
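With ServiceBusProcessor, lock renewal is a single option – a sketch of what I'd reach for when a handler legitimately runs long:

```csharp
// Sketch: let the processor renew locks in the background for genuinely long handlers
var options = new ServiceBusProcessorOptions
{
    MaxConcurrentCalls = 8,
    AutoCompleteMessages = false,
    MaxAutoLockRenewalDuration = TimeSpan.FromMinutes(10) // default is 5 minutes; TimeSpan.Zero disables renewal
};
```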
What to measure for idempotent pipelines
Aside from the usual queue length and DLQ depth, I track:
- Idempotency hit rate – how many messages are skipped as already processed.
- Duplicate attempt count – how many times the same key is attempted within a window.
- Handler latency – including breakdown by status (success, transient failure, permanent failure).
- Downstream error rates – per dependency.
Emit metrics from your IIdempotencyStore so you know when duplicate behavior is spiking (often a symptom of timeouts, lock expiry, or buggy publishers).
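With the built-in System.Diagnostics.Metrics API (.NET 6+), that's a couple of counters – a sketch, with the meter and instrument names being my own:

```csharp
// Sketch: counters emitted from the idempotency store (names are illustrative)
private static readonly Meter Meter = new("OrderProcessor.Idempotency");
private static readonly Counter<long> DuplicatesSkipped =
    Meter.CreateCounter<long>("idempotency.duplicates_skipped");
private static readonly Counter<long> AcquireConflicts =
    Meter.CreateCounter<long>("idempotency.acquire_conflicts");

// e.g. inside IsProcessedAsync, when the message was already completed:
// DuplicatesSkipped.Add(1);
```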
End-to-end tracing with correlation IDs
For distributed tracing, I usually:
- Propagate traceparent / tracestate (W3C Trace Context) in message headers.
- Include a correlation ID (the CorrelationId property or a value in ApplicationProperties).
- Log the same correlation ID across HTTP, Service Bus, and DB logs.
In .NET, when sending:
var activity = Activity.Current; // from System.Diagnostics
var message = new ServiceBusMessage(body)
{
MessageId = businessMessageId,
CorrelationId = activity?.TraceId.ToString() ?? Guid.NewGuid().ToString()
};
if (activity != null)
{
message.ApplicationProperties["traceparent"] = activity.Id;
}
And on the consumer side, start a new activity using traceparent and CorrelationId so traces stitch together in Application Insights or your tracing system.
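A minimal consumer-side sketch (assuming the default W3C activity ID format):

```csharp
// Sketch: link the consumer's activity back to the publisher's trace
using var activity = new Activity("order-commands.process");
if (args.Message.ApplicationProperties.TryGetValue("traceparent", out var traceparent)
    && traceparent is string parentId)
{
    activity.SetParentId(parentId);
}
activity.Start();

// ... handle the message; logs and spans now share the publisher's trace id
```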
Sample production-ready pipeline architecture
Let’s pull this together into a concrete example you can map to a real system: say an “Order Processing” backend.
High-level flow
+-------------+ +---------------------+ +-----------------------------+
| API / UI | ---> | OrderCommandTopic | ---> | Order Processor Service(s) |
+-------------+ +---------------------+ +-----------------------------+
| |
| +-----+-----+
| | SQL DB |
| +-----------+
| |
v v
+-------------------+ +---------------+
| OrderEventsTopic | ----> | Other services|
+-------------------+ +---------------+
Focus on the Order Processor Service which consumes commands and mutates state.
Service Bus configuration
- Tier: Premium.
- Entity: order-commands queue (non-session, non-partitioned) or topic + subscription depending on use case.
- Duplicate detection: enabled, 1–7 day window.
- MaxDeliveryCount: tuned based on downstream stability (e.g., 10–20).
.NET worker layout
src/
OrderProcessor.Worker/
Program.cs
Messaging/
ServiceBusProcessorHostedService.cs
OrderCreatedHandler.cs
IIdempotencyStore.cs
SqlIdempotencyStore.cs
Infrastructure/
MyDbContext.cs
HttpClients/
Logging/
In Program.cs:
var builder = Host.CreateApplicationBuilder(args);
builder.Services.AddDbContext<MyDbContext>(...);
builder.Services.AddSingleton(new ServiceBusClient(
builder.Configuration["ServiceBus:ConnectionString"]));
builder.Services.AddScoped<IIdempotencyStore, SqlIdempotencyStore>(); // scoped: it depends on the scoped DbContext
builder.Services.AddScoped<OrderCreatedHandler>();
builder.Services.AddHostedService<ServiceBusProcessorHostedService>();
var app = builder.Build();
await app.RunAsync();
The hosted service wires up the processor:
public class ServiceBusProcessorHostedService : IHostedService
{
private readonly ServiceBusClient _client;
private readonly IServiceProvider _services;
private ServiceBusProcessor? _processor;
private readonly ILogger<ServiceBusProcessorHostedService> _logger;
public ServiceBusProcessorHostedService(
ServiceBusClient client,
IServiceProvider services,
ILogger<ServiceBusProcessorHostedService> logger)
{
_client = client;
_services = services;
_logger = logger;
}
public async Task StartAsync(CancellationToken cancellationToken)
{
var options = new ServiceBusProcessorOptions
{
MaxConcurrentCalls = 32,
AutoCompleteMessages = false,
PrefetchCount = 128
};
_processor = _client.CreateProcessor("order-commands", options);
_processor.ProcessMessageAsync += ProcessMessageAsync;
_processor.ProcessErrorAsync += ErrorHandlerAsync;
await _processor.StartProcessingAsync(cancellationToken);
}
private async Task ProcessMessageAsync(ProcessMessageEventArgs args)
{
using var scope = _services.CreateScope();
var handler = scope.ServiceProvider.GetRequiredService<OrderCreatedHandler>();
var body = args.Message.Body.ToObjectFromJson<OrderCreatedMessage>();
await handler.HandleAsync(body, args.Message, args.CancellationToken);
await args.CompleteMessageAsync(args.Message);
}
private Task ErrorHandlerAsync(ProcessErrorEventArgs args)
{
_logger.LogError(args.Exception,
"Service Bus error. Entity: {EntityPath}, ErrorSource: {ErrorSource}",
args.EntityPath, args.ErrorSource);
return Task.CompletedTask;
}
public async Task StopAsync(CancellationToken cancellationToken)
{
if (_processor != null)
{
await _processor.StopProcessingAsync(cancellationToken);
await _processor.DisposeAsync();
}
}
}
Key properties of this design:
- Scoped DI per message via CreateScope() – clean lifetime management.
- Explicit completion after handler success.
- Centralized error handling in ProcessErrorAsync.
- Separate idempotency store used inside OrderCreatedHandler.
Throughput then becomes a function of MaxConcurrentCalls and replica count. Idempotency and DLQ behavior keep your data safe.
Operational habits, testing strategies, and the traps I’ve seen teams fall into
Operational best practices
- Alert on DLQ growth – not just absolute size, but rate of increase.
- Watch consumer lag – queue length vs historical baseline; use this to guide scale-out.
- Separate “replay” tools – build a small internal tool to browse DLQs and re-submit messages after fixes.
- Version your contracts – be explicit when breaking changes occur; remember that old messages may still be waiting in queues.
Testing idempotency and failure behavior
My rule for any critical handler: if we can’t confidently prove it’s idempotent under weird conditions, it’s not ready.
Tests I insist on:
- Duplicate delivery tests – send the exact same message 2–5 times; assert final DB state is correct and no double side effects occurred (see the sketch after this list).
- Lock expiry simulation – make the handler sleep past lock duration; ensure idempotency logic still protects state on redelivery.
- Partial failure tests – make downstream call fail after DB changes; confirm idempotency handles re-run safely.
- DLQ routing tests – ensure poison messages end up in DLQ with correct reasons and that your replay tooling works.
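As an example of the first kind of test, a duplicate-delivery check can stay very small – an xUnit-style sketch, with the handler, DbContext, and an ExternalOrderId column assumed:

```csharp
[Fact]
public async Task Handling_the_same_message_twice_creates_only_one_order()
{
    var message = new OrderCreatedMessage
    {
        MessageId = "order-created-42",
        OrderId = "42",
        Amount = 10m,
        OccurredAt = DateTimeOffset.UtcNow
    };

    await _handler.HandleAsync(message, raw: null!, CancellationToken.None);
    await _handler.HandleAsync(message, raw: null!, CancellationToken.None); // simulated redelivery

    Assert.Single(_db.Orders.Where(o => o.ExternalOrderId == "42").ToList());
}
```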
Common pitfalls I’ve debugged in production
- Using the broker MessageId as the only dedup key and accidentally regenerating IDs at each send – defeated duplicate detection completely.
- Long-running synchronous handlers (blocking I/O, no timeouts) – causing locks to expire, leading to “random” duplicates.
- Idempotency table in a different database than main state – leading to cross-DB distributed transaction issues and inconsistent writes.
- Relying solely on retries for what were actually data quality issues – DLQ got flooded, downstream team got paged for no reason.
- Global FIFO assumptions – making everything sequential and then wondering why throughput was terrible.
The bottom line
High-throughput, idempotent message processing with Azure Service Bus and .NET isn’t about a single magic feature. It’s the combination of:
- Choosing the right Service Bus constructs (queues vs topics, sessions, partitioning, Premium vs Standard).
- Designing idempotent handlers with a clear idempotency key and a durable store.
- Using DLQs, retries, and circuit breakers to handle transient vs permanent failures.
- Tuning concurrency and prefetch for your CPU/I/O profile.
- Investing in observability – correlation IDs, metrics for duplicates, DLQ monitoring.
Get those right, and you can safely crank up load, scale out workers, and sleep a lot better when your traffic spikes. Ignore them, and your “fast” pipeline will be the fastest way to corrupt data that you’ve ever deployed.
If you’re building or refactoring a pipeline today, start by making one handler provably idempotent end-to-end. Once that’s solid, scale it out and apply the same patterns everywhere else.