I have a long history of shipping background processing at scale – everything from small worker jobs that run nightly to high-throughput, event-driven pipelines that process thousands of messages per minute. In this post I collect the patterns that actually worked for me in production, the mistakes that burned us, and pragmatic advice for building resilient consumers using Azure Service Bus and .NET. Expect code, concrete tradeoffs, and the sort of shoulder-scar stories you only get after debugging at 03:00 on a Friday.

Why Azure Service Bus – and why not

Azure Service Bus is not the cheapest or the simplest queue. What it gives you is predictable semantics: a durable broker, at-least-once delivery, sessions, dead-letter queues, deferred messages, transactional send/receive, and first-class support in the Azure SDK. Use it when you need those features – otherwise a simple storage queue or a Kafka-style alternative may be better.

Design goals I aim for

- End-to-end reliability – no lost messages in normal failure modes
- Predictable retries and backoff – to avoid thundering herd and double work
- Observability – metrics and tracing so I can tell why a job failed
- Idempotency – safe reprocessing
- Operational control – pause, replay, move to DLQ, inspect payloads

Core building blocks

- Producer: sends messages with correlation metadata and business keys
- Outbox pattern: for atomicity between DB and message production
- Consumer: processes messages using bounded concurrency and idempotency checks
- DLQ & poison handling: move hard failures to DLQ with diagnostic metadata
- Monitoring: Prometheus/Grafana or Application Insights metrics and traces

An ASCII diagram I often draw in meetings

    +--------+     +---------+     +-----------+     +-----------+
    | Client | --> | API App | --> |  SQL DB   | --> |  Outbox   |
    +--------+     +---------+     +-----------+     +-----------+
                                         |                 |
                                         v                 v
                                  +------------+    +------------+
                                  | Dispatcher |    | ServiceBus |
                                  +------------+    +------------+
                                         |                 |
                                         v                 v
                                     +--------+        +-------+
                                     | Worker |        |  DLQ  |
                                     +--------+        +-------+

Pattern 1 – Outbox with transactional safety

This is the single most effective production-safety improvement I recommend. If your API writes to a relational DB and must publish a message, write the event to an outbox table inside the same transaction. A background dispatcher reads the outbox and sends to Service Bus. This avoids the classic dual-write problem where your DB commit succeeds but the message send fails – or vice versa.

    CREATE TABLE Outbox (
        Id UNIQUEIDENTIFIER PRIMARY KEY,
        AggregateId UNIQUEIDENTIFIER NOT NULL,
        Payload NVARCHAR(MAX) NOT NULL,
        Type NVARCHAR(200) NOT NULL,
        OccurredUtc DATETIME2 NOT NULL DEFAULT SYSUTCDATETIME(),
        Sent BIT NOT NULL DEFAULT 0,
        SentUtc DATETIME2 NULL
    );
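
The dispatcher side is deliberately boring: poll for unsent rows, publish, mark as sent. Here is a minimal sketch assuming Dapper and Azure.Messaging.ServiceBus – the class name, batch size and row type are placeholders of mine, and a real implementation also needs row locking (UPDLOCK/READPAST or similar) so multiple dispatchers do not collide:

    // Hypothetical dispatcher sketch: polls the Outbox table and publishes unsent rows.
    using Azure.Messaging.ServiceBus;
    using Dapper;
    using Microsoft.Data.SqlClient;

    public sealed class OutboxDispatcher
    {
        private readonly string _connectionString;
        private readonly ServiceBusSender _sender;

        public OutboxDispatcher(string connectionString, ServiceBusSender sender)
            => (_connectionString, _sender) = (connectionString, sender);

        public async Task DispatchPendingAsync(CancellationToken ct)
        {
            await using var db = new SqlConnection(_connectionString);

            // Small batches keep the dispatcher cheap and the failure blast radius small.
            var rows = await db.QueryAsync<OutboxRow>(
                "SELECT TOP 50 Id, Payload, Type FROM Outbox WHERE Sent = 0 ORDER BY OccurredUtc");

            foreach (var row in rows)
            {
                var message = new ServiceBusMessage(row.Payload)
                {
                    MessageId = row.Id.ToString(),   // stable id helps duplicate detection downstream
                    Subject = row.Type
                };

                await _sender.SendMessageAsync(message, ct);

                await db.ExecuteAsync(
                    "UPDATE Outbox SET Sent = 1, SentUtc = SYSUTCDATETIME() WHERE Id = @Id",
                    new { row.Id });
            }
        }

        private sealed record OutboxRow(Guid Id, string Payload, string Type);
    }

Note the ordering: send first, then mark as sent. If the process dies in between, the row is sent again on the next pass – which is exactly why the consumer side needs idempotency.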

Pattern 2 – Idempotency and idempotency keys

Service Bus guarantees at-least-once delivery. That means your handler must be idempotent, or must check for prior processing before applying side effects. Two practical approaches:

- Idempotency table keyed by business key or message id. Insert-if-not-exists to detect replays.
- Detect duplicates by checking the target resource state before mutating it – e.g. ‘if invoice already marked paid, return success’.

    // Example idempotency insert using Dapper. The WHERE NOT EXISTS makes the insert a
    // no-op on replay; the catch covers the race where two consumers insert concurrently.
    var sql = @"INSERT INTO ProcessedMessages (MessageId, ProcessedUtc)
                SELECT @id, GETUTCDATE()
                WHERE NOT EXISTS (SELECT 1 FROM ProcessedMessages WHERE MessageId = @id)";
    try
    {
        var rows = await db.ExecuteAsync(sql, new { id = messageId });
        if (rows == 0) return AlreadyProcessed();
    }
    catch (SqlException ex) when (ex.Number == 2627) // PK violation - lost the race
    {
        return AlreadyProcessed();
    }

Pattern 3 – Controlled retries, exponential backoff, and poison handling

Two retry levels are useful: application-level retries with increasing delay for transient errors, and broker-level delivery attempts. Configure Service Bus retry behaviour conservatively and implement your own retry policy when you need fine control.
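
On the broker side, the client-level knobs live on ServiceBusClientOptions. Something like this is where I start – the numbers are illustrative, not a recommendation, and connectionString is assumed to be in scope:

    // Conservative client-level retry settings. These cover transient transport failures
    // on sends and receives; they do not replace application retries or dead-lettering.
    using Azure.Messaging.ServiceBus;

    var clientOptions = new ServiceBusClientOptions
    {
        RetryOptions = new ServiceBusRetryOptions
        {
            Mode = ServiceBusRetryMode.Exponential,
            MaxRetries = 3,
            Delay = TimeSpan.FromSeconds(1),
            MaxDelay = TimeSpan.FromSeconds(30)
        }
    };

    var client = new ServiceBusClient(connectionString, clientOptions);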

- Transient errors – retry with exponential backoff and jitter
- Business errors – do not retry; move to DLQ quickly with reason
- Poison messages – after N delivery attempts, dead-letter with diagnostic metadata

    // pseudocode consumer pattern (simplified)
    protected override async Task ProcessMessageAsync(ProcessMessageEventArgs args)
    {
        var messageId = args.Message.MessageId;
        try
        {
            await EnsureIdempotent(messageId);
            await HandleBusinessLogic(args.Message);
            await args.CompleteMessageAsync(args.Message);
        }
        catch (TransientException)
        {
            // abandon so Service Bus redelivers. Consider deferring with delay if needed.
            await args.AbandonMessageAsync(args.Message);
        }
        catch (BusinessException bEx)
        {
            // move to DLQ with reason
            await args.DeadLetterMessageAsync(args.Message, "BusinessError", bEx.Message);
        }
        catch (Exception ex)
        {
            // unknown - increment delivery count and decide
            if (args.Message.DeliveryCount >= 5)
                await args.DeadLetterMessageAsync(args.Message, "ExceededRetries", ex.Message);
            else
                await args.AbandonMessageAsync(args.Message);
        }
    }
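
Abandoning gives you immediate redelivery, which is often too aggressive for a struggling downstream dependency. One option I have used for the "delay" part is to complete the original message and schedule a delayed copy instead. A rough sketch – the RetryCount property name and the delay formula are my own conventions, and sender is assumed to be a ServiceBusSender for the same queue:

    // Sketch: application-level backoff by scheduling a delayed copy of the message.
    using Azure.Messaging.ServiceBus;

    static class RetryScheduling
    {
        public static async Task ScheduleRetryAsync(
            ProcessMessageEventArgs args, ServiceBusSender sender, CancellationToken ct = default)
        {
            var received = args.Message;

            // "RetryCount" is a custom application property - Service Bus does not track this for you.
            var attempt = received.ApplicationProperties.TryGetValue("RetryCount", out var value)
                ? (int)value
                : 0;

            // Exponential backoff with jitter: 2^attempt seconds plus up to one second of noise.
            var delay = TimeSpan.FromSeconds(Math.Pow(2, attempt))
                      + TimeSpan.FromMilliseconds(Random.Shared.Next(0, 1000));

            var retry = new ServiceBusMessage(received); // copies body, id and application properties
            retry.ApplicationProperties["RetryCount"] = attempt + 1;
            // Note: with duplicate detection enabled on the queue you would need a fresh MessageId
            // here, and the idempotency guard then has to key on a business id instead.

            await sender.ScheduleMessageAsync(retry, DateTimeOffset.UtcNow.Add(delay), ct);
            await args.CompleteMessageAsync(received, ct);
        }
    }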

Pattern 4 – Concurrency limits and partitioning

Unbounded concurrent message handlers are an easy way to overload downstream resources like SQL or external APIs. Set a sane concurrency cap and tune it against resource limits. A configuration sketch follows the list below.

- Start with 2-4 message handlers per CPU core and use a circuit-breaker for downstream failures
- Use sessions for ordered processing when business requires single-threaded handling per key
- Use partition keys to spread load across Service Bus partitions if you need scale
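
Here is the configuration sketch for the first two bullets, using the session processor for per-key ordering. The queue name, numbers and handler are placeholders, and the queue must have sessions enabled:

    // Sketch: bounded concurrency plus ordered, per-key processing via sessions.
    using Azure.Messaging.ServiceBus;

    var client = new ServiceBusClient(connectionString);

    var sessionProcessor = client.CreateSessionProcessor("orders", new ServiceBusSessionProcessorOptions
    {
        MaxConcurrentSessions = 8,          // overall cap across session keys
        MaxConcurrentCallsPerSession = 1,   // single-threaded per business key
        AutoCompleteMessages = false
    });

    sessionProcessor.ProcessMessageAsync += async args =>
    {
        // args.SessionId is the key chosen by the sender, so messages that share it
        // are handled one at a time, in order.
        await HandleAsync(args.Message);                 // placeholder for your handler
        await args.CompleteMessageAsync(args.Message);
    };

    sessionProcessor.ProcessErrorAsync += args => Task.CompletedTask; // log properly in real code

    await sessionProcessor.StartProcessingAsync();

    // On the send side, SessionId (and optionally PartitionKey) carries the ordering key:
    var sender = client.CreateSender("orders");
    await sender.SendMessageAsync(new ServiceBusMessage("{ }") { SessionId = orderId.ToString() });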

Operational patterns – DLQ, move, pause, replay

Operators must be able to inspect failing messages, fix the payload or the consumer, and replay messages. My operational playbook, with a replay sketch after the list:

- Capture failed message + metadata into a diagnostics store on dead-letter
- Provide a small admin UI to inspect and re-publish messages after fixes
- Support a pause toggle for the dispatcher so you can stop new sends while fixing the consumer
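
The heart of the replay tooling is small: receive from the dead-letter sub-queue and re-publish to the main queue. A sketch, to be run only once the consumer or the payload problem is actually fixed (queueName and connectionString assumed in scope):

    // Sketch: drain a batch from the DLQ and re-publish to the main queue.
    using Azure.Messaging.ServiceBus;

    var client = new ServiceBusClient(connectionString);
    var dlqReceiver = client.CreateReceiver(queueName, new ServiceBusReceiverOptions
    {
        SubQueue = SubQueue.DeadLetter
    });
    var sender = client.CreateSender(queueName);

    var deadLettered = await dlqReceiver.ReceiveMessagesAsync(maxMessages: 50, maxWaitTime: TimeSpan.FromSeconds(5));

    foreach (var message in deadLettered)
    {
        var copy = new ServiceBusMessage(message); // keeps body, id and application properties
        copy.ApplicationProperties["ReplayedUtc"] = DateTime.UtcNow;

        await sender.SendMessageAsync(copy);
        await dlqReceiver.CompleteMessageAsync(message); // remove from the DLQ only after the re-send
    }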

Telemetry – what I always surface

| Metric | Why it matters |
|---|---|
| Messages/sec in | Capacity planning |
| Processing latency (p50, p95, p99) | Backpressure detection |
| DeliveryCount distribution | Poison message detection |
| DLQ size and age | Operational alerts |
| Consumer error rate | Regression detection |

Correlate messages via CorrelationId and propagate trace context using W3C traceparent headers so your distributed traces connect end-to-end.
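
The Azure SDK ships its own diagnostics integration, but when I want explicit control I stamp the context onto the message myself. A sketch – using "traceparent" as the application property name is my convention, not an SDK contract, and payload, orderId and receivedMessage are placeholders:

    // Sketch: explicit W3C trace-context propagation through message properties.
    using System.Diagnostics;
    using Azure.Messaging.ServiceBus;

    // Producer side: stamp correlation and trace context onto the outgoing message.
    var message = new ServiceBusMessage(payload) { CorrelationId = orderId.ToString() };
    if (Activity.Current is { } current)
    {
        message.ApplicationProperties["traceparent"] = current.Id; // W3C format on modern .NET
    }

    // Consumer side: start a child activity under the producer's trace.
    using var activity = new Activity("ProcessMessage");
    if (receivedMessage.ApplicationProperties.TryGetValue("traceparent", out var traceparent))
    {
        activity.SetParentId((string)traceparent);
    }
    activity.Start();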

Code snippet – a small, production-grade consumer bootstrap

    // 'services' is the app's IServiceProvider and 'logger' a host-level ILogger;
    // IOrderHandler and OrderPlaced stand in for your own handler and event types.
    var client = new ServiceBusClient(connectionString);
    var processor = client.CreateProcessor(queueName, new ServiceBusProcessorOptions
    {
        MaxConcurrentCalls = 8,
        AutoCompleteMessages = false
    });

    processor.ProcessMessageAsync += async args =>
    {
        // One DI scope per message so the handler gets scoped dependencies (EF Core, Dapper, ...)
        using var scope = services.CreateScope();
        var handler = scope.ServiceProvider.GetRequiredService<IOrderHandler>();
        var db = scope.ServiceProvider.GetRequiredService<IDbConnection>();
        var messageId = args.Message.MessageId;

        try
        {
            await IdempotencyGuard.ExecuteIfNotProcessed(db, messageId, async () =>
            {
                // domain handler that may use EF Core, external HTTP, etc.
                await handler.Handle(JsonSerializer.Deserialize<OrderPlaced>(args.Message.Body.ToString()));
            });

            await args.CompleteMessageAsync(args.Message);
        }
        catch (TransientException tex)
        {
            logger.LogWarning(tex, "Transient failure - abandoning");
            await args.AbandonMessageAsync(args.Message);
        }
        catch (Exception ex)
        {
            logger.LogError(ex, "Unhandled failure");
            if (args.Message.DeliveryCount >= 5)
                await args.DeadLetterMessageAsync(args.Message, "Unhandled", ex.Message);
            else
                await args.AbandonMessageAsync(args.Message);
        }
    };

    processor.ProcessErrorAsync += args =>
    {
        // log broker-level issues (connection drops, lock renewal failures, ...)
        logger.LogError(args.Exception, "Processor error on {EntityPath}", args.EntityPath);
        return Task.CompletedTask;
    };

    await processor.StartProcessingAsync();
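
IdempotencyGuard above is an application helper of mine, not an SDK type. A minimal sketch of what it could look like, reusing the ProcessedMessages insert from Pattern 2:

    // Minimal sketch of the IdempotencyGuard helper referenced above.
    using System.Data;
    using Dapper;
    using Microsoft.Data.SqlClient;

    public static class IdempotencyGuard
    {
        public static async Task ExecuteIfNotProcessed(IDbConnection db, string messageId, Func<Task> action)
        {
            const string sql = @"INSERT INTO ProcessedMessages (MessageId, ProcessedUtc)
                                 SELECT @id, GETUTCDATE()
                                 WHERE NOT EXISTS (SELECT 1 FROM ProcessedMessages WHERE MessageId = @id)";

            int rows;
            try
            {
                rows = await db.ExecuteAsync(sql, new { id = messageId });
            }
            catch (SqlException ex) when (ex.Number == 2627) // lost the race to another consumer
            {
                return;
            }

            if (rows == 0) return; // already processed - skip the side effects

            // Caveat: the marker is written before the action, so a failed action will not be retried.
            // In production, write the marker in the same transaction as the side effects,
            // or record it only after the action succeeds.
            await action();
        }
    }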

Failure story – what burned me

In one project we used AutoCompleteMessages = true and optimistic processing. A downstream HTTP call intermittently timed out, but the exception was swallowed, so the message was marked complete anyway. The result – a silent data-loss pattern – surfaced only when users noticed missing downstream state. The fix: set AutoCompleteMessages = false, make the handler transactional with an idempotency guard, and improve the logs to include the message body hash and correlation id.

Testing and local development ergonomics

- Use the Azure Service Bus emulator for basic tests, but prefer a dedicated test namespace for integration tests
- Write consumer unit tests by mocking the handler logic and testing idempotency and retry logic separately (a test sketch follows this list)
- For load tests, push messages directly to the broker and measure end-to-end processing under realistic concurrency
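
For the idempotency part, the unit test is short. A sketch with xUnit – TestDb.CreateOpenConnection stands in for whatever database fixture you already use (LocalDB, Testcontainers, ...):

    // Sketch: the second delivery of the same message must not re-run the side effects.
    using Xunit;

    public class IdempotencyGuardTests
    {
        [Fact]
        public async Task Second_delivery_of_same_message_is_skipped()
        {
            using var db = TestDb.CreateOpenConnection();   // hypothetical test fixture
            var messageId = Guid.NewGuid().ToString();
            var executions = 0;

            await IdempotencyGuard.ExecuteIfNotProcessed(db, messageId, () =>
            {
                executions++;
                return Task.CompletedTask;
            });

            await IdempotencyGuard.ExecuteIfNotProcessed(db, messageId, () =>
            {
                executions++;
                return Task.CompletedTask;
            });

            Assert.Equal(1, executions);
        }
    }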

When to favour other approaches

Service Bus is great but not always the answer:

- High-volume event streaming – choose Kafka or Event Hubs
- Simple, low-throughput background jobs – Azure Storage Queue is cheaper and simpler
- Transactional, relational change capture – consider change feed or a streaming platform

Final thoughts

Resilient background processing is more about operational hygiene than fancy code. The techniques I reach for most are simple: outbox for atomicity, idempotency guards, reasonable concurrency, sensible retries and dead-letter handling, and observability that tells a story the moment something goes wrong. These patterns have kept us out of ugly postmortems more than once.

The next time you design a consumer pipeline, sketch the failure paths and ask: how would I recover at 03:00, when no engineer in a convenient time zone is available? If you can answer that, you are on the right track.