I have a long history of shipping background processing at scale – everything from small worker jobs that run nightly to high-throughput, event-driven pipelines that process thousands of messages per minute. In this post I collect the patterns that actually worked for me in production, the mistakes that burned us, and pragmatic advice for building resilient consumers using Azure Service Bus and .NET. Expect code, concrete tradeoffs, and the sort of shoulder-scar stories you only get after debugging at 03:00 on a Friday.

Why Azure Service Bus – and why not

Azure Service Bus is not the cheapest or the simplest queue. What it gives you is predictable semantics: a durable broker, at-least-once delivery, sessions, dead-letter queues, deferred messages, transactional send/receive, and first-class support in the Azure SDK. Use it when you need those features – otherwise a simple storage queue or a Kafka-style alternative may be better.

Design goals I aim for

- End-to-end reliability – no lost messages in normal failure modes
- Predictable retries and backoff – to avoid thundering herd and double work
- Observability – metrics and tracing so I can tell why a job failed
- Idempotency – safe reprocessing
- Operational control – pause, replay, move to DLQ, inspect payloads

Core building blocks

- Producer: sends messages with correlation metadata and business keys
- Outbox pattern: for atomicity between DB and message production
- Consumer: processes messages using bounded concurrency and idempotency checks
- DLQ & poison handling: move hard failures to DLQ with diagnostic metadata
- Monitoring: Prometheus/Grafana or Application Insights metrics and traces

An ASCII diagram I often draw in meetings

    +--------+     +---------+     +-----------+     +-----------+
    | Client | --> | API App | --> |  SQL DB   | --> |  Outbox   |
    +--------+     +---------+     +-----------+     +-----------+
                                         |                 |
                                         v                 v
                                  +------------+    +------------+
                                  | Dispatcher |    | ServiceBus |
                                  +------------+    +------------+
                                         |                 |
                                         v                 v
                                     +--------+        +-------+
                                     | Worker |        |  DLQ  |
                                     +--------+        +-------+

Pattern 1 – Outbox with transactional safety

This is the single most effective production-safety improvement I recommend. If your API writes to a relational DB and must publish a message, write the event to an outbox table inside the same transaction. A background dispatcher reads the outbox and sends to Service Bus. This avoids the classic dual-write problem where your DB commit succeeds but the message send fails – or vice versa.

    CREATE TABLE Outbox (
        Id UNIQUEIDENTIFIER PRIMARY KEY,
        AggregateId UNIQUEIDENTIFIER NOT NULL,
        Payload NVARCHAR(MAX) NOT NULL,
        Type NVARCHAR(200) NOT NULL,
        OccurredUtc DATETIME2 NOT NULL DEFAULT SYSUTCDATETIME(),
        Sent BIT NOT NULL DEFAULT 0,
        SentUtc DATETIME2 NULL
    );
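
The dispatcher side is deliberately boring: poll for unsent rows, publish, mark as sent. Here is a minimal sketch assuming Dapper and Azure.Messaging.ServiceBus – the class name, batch size and row type are placeholders of mine, and a real implementation also needs row locking (UPDLOCK/READPAST or similar) so multiple dispatchers do not collide:

    // Hypothetical dispatcher sketch: polls the Outbox table and publishes unsent rows.
    using Azure.Messaging.ServiceBus;
    using Dapper;
    using Microsoft.Data.SqlClient;

    public sealed class OutboxDispatcher
    {
        private readonly string _connectionString;
        private readonly ServiceBusSender _sender;

        public OutboxDispatcher(string connectionString, ServiceBusSender sender)
            => (_connectionString, _sender) = (connectionString, sender);

        public async Task DispatchPendingAsync(CancellationToken ct)
        {
            await using var db = new SqlConnection(_connectionString);

            // Small batches keep the dispatcher cheap and the failure blast radius small.
            var rows = await db.QueryAsync<OutboxRow>(
                "SELECT TOP 50 Id, Payload, Type FROM Outbox WHERE Sent = 0 ORDER BY OccurredUtc");

            foreach (var row in rows)
            {
                var message = new ServiceBusMessage(row.Payload)
                {
                    MessageId = row.Id.ToString(),   // stable id helps duplicate detection downstream
                    Subject = row.Type
                };

                await _sender.SendMessageAsync(message, ct);

                await db.ExecuteAsync(
                    "UPDATE Outbox SET Sent = 1, SentUtc = SYSUTCDATETIME() WHERE Id = @Id",
                    new { row.Id });
            }
        }

        private sealed record OutboxRow(Guid Id, string Payload, string Type);
    }

Note the ordering: send first, then mark as sent. If the process dies in between, the row is sent again on the next pass – which is exactly why the consumer side needs idempotency.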

Pattern 2 – Idempotency and idempotency keys

Service Bus guarantees at-least-once delivery. That means your handler must be idempotent, or must check for prior processing before applying side effects. Two practical approaches:

- Idempotency table keyed by business key or message id. Insert-if-not-exists to detect replays.
- Detect duplicates by checking the target resource state before mutating it – e.g. ‘if invoice already marked paid, return success’.

    // Example idempotency insert using Dapper. The WHERE NOT EXISTS makes the insert a
    // no-op on replay; the catch covers the race where two consumers insert concurrently.
    var sql = @"INSERT INTO ProcessedMessages (MessageId, ProcessedUtc)
                SELECT @id, GETUTCDATE()
                WHERE NOT EXISTS (SELECT 1 FROM ProcessedMessages WHERE MessageId = @id)";
    try
    {
        var rows = await db.ExecuteAsync(sql, new { id = messageId });
        if (rows == 0) return AlreadyProcessed();
    }
    catch (SqlException ex) when (ex.Number == 2627) // PK violation - lost the race
    {
        return AlreadyProcessed();
    }

Pattern 3 – Controlled retries, exponential backoff, and poison handling

Two retry levels are useful: application-level retries with increasing delay for transient errors, and broker-level delivery attempts. Configure Service Bus retry behaviour conservatively and implement your own retry policy when you need fine control.
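
On the broker side, the client-level knobs live on ServiceBusClientOptions. Something like this is where I start – the numbers are illustrative, not a recommendation, and connectionString is assumed to be in scope:

    // Conservative client-level retry settings. These cover transient transport failures
    // on sends and receives; they do not replace application retries or dead-lettering.
    using Azure.Messaging.ServiceBus;

    var clientOptions = new ServiceBusClientOptions
    {
        RetryOptions = new ServiceBusRetryOptions
        {
            Mode = ServiceBusRetryMode.Exponential,
            MaxRetries = 3,
            Delay = TimeSpan.FromSeconds(1),
            MaxDelay = TimeSpan.FromSeconds(30)
        }
    };

    var client = new ServiceBusClient(connectionString, clientOptions);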

- Transient errors – retry with exponential backoff and jitter
- Business errors – do not retry; move to DLQ quickly with reason
- Poison messages – after N delivery attempts, dead-letter with diagnostic metadata

    // pseudocode consumer pattern (simplified)
    protected override async Task ProcessMessageAsync(ProcessMessageEventArgs args)
    {
        var messageId = args.Message.MessageId;
        try
        {
            await EnsureIdempotent(messageId);
            await HandleBusinessLogic(args.Message);
            await args.CompleteMessageAsync(args.Message);
        }
        catch (TransientException)
        {
            // abandon so Service Bus redelivers. Consider deferring with delay if needed.
            await args.AbandonMessageAsync(args.Message);
        }
        catch (BusinessException bEx)
        {
            // move to DLQ with reason
            await args.DeadLetterMessageAsync(args.Message, "BusinessError", bEx.Message);
        }
        catch (Exception ex)
        {
            // unknown - increment delivery count and decide
            if (args.Message.DeliveryCount >= 5)
                await args.DeadLetterMessageAsync(args.Message, "ExceededRetries", ex.Message);
            else
                await args.AbandonMessageAsync(args.Message);
        }
    }
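
Abandoning gives you immediate redelivery, which is often too aggressive for a struggling downstream dependency. One option I have used for the "delay" part is to complete the original message and schedule a delayed copy instead. A rough sketch – the RetryCount property name and the delay formula are my own conventions, and sender is assumed to be a ServiceBusSender for the same queue:

    // Sketch: application-level backoff by scheduling a delayed copy of the message.
    using Azure.Messaging.ServiceBus;

    static class RetryScheduling
    {
        public static async Task ScheduleRetryAsync(
            ProcessMessageEventArgs args, ServiceBusSender sender, CancellationToken ct = default)
        {
            var received = args.Message;

            // "RetryCount" is a custom application property - Service Bus does not track this for you.
            var attempt = received.ApplicationProperties.TryGetValue("RetryCount", out var value)
                ? (int)value
                : 0;

            // Exponential backoff with jitter: 2^attempt seconds plus up to one second of noise.
            var delay = TimeSpan.FromSeconds(Math.Pow(2, attempt))
                      + TimeSpan.FromMilliseconds(Random.Shared.Next(0, 1000));

            var retry = new ServiceBusMessage(received); // copies body, id and application properties
            retry.ApplicationProperties["RetryCount"] = attempt + 1;
            // Note: with duplicate detection enabled on the queue you would need a fresh MessageId
            // here, and the idempotency guard then has to key on a business id instead.

            await sender.ScheduleMessageAsync(retry, DateTimeOffset.UtcNow.Add(delay), ct);
            await args.CompleteMessageAsync(received, ct);
        }
    }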

Pattern 4 – Concurrency limits and partitioning

Unbounded concurrent message handlers are an easy way to overload downstream resources like SQL or external APIs. Set a sane concurrency cap and tune it against resource limits. A configuration sketch follows the list below.

- Start with 2-4 message handlers per CPU core and use a circuit-breaker for downstream failures
- Use sessions for ordered processing when business requires single-threaded handling per key
- Use partition keys to spread load across Service Bus partitions if you need scale
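
Here is the configuration sketch for the first two bullets, using the session processor for per-key ordering. The queue name, numbers and handler are placeholders, and the queue must have sessions enabled:

    // Sketch: bounded concurrency plus ordered, per-key processing via sessions.
    using Azure.Messaging.ServiceBus;

    var client = new ServiceBusClient(connectionString);

    var sessionProcessor = client.CreateSessionProcessor("orders", new ServiceBusSessionProcessorOptions
    {
        MaxConcurrentSessions = 8,          // overall cap across session keys
        MaxConcurrentCallsPerSession = 1,   // single-threaded per business key
        AutoCompleteMessages = false
    });

    sessionProcessor.ProcessMessageAsync += async args =>
    {
        // args.SessionId is the key chosen by the sender, so messages that share it
        // are handled one at a time, in order.
        await HandleAsync(args.Message);                 // placeholder for your handler
        await args.CompleteMessageAsync(args.Message);
    };

    sessionProcessor.ProcessErrorAsync += args => Task.CompletedTask; // log properly in real code

    await sessionProcessor.StartProcessingAsync();

    // On the send side, SessionId (and optionally PartitionKey) carries the ordering key:
    var sender = client.CreateSender("orders");
    await sender.SendMessageAsync(new ServiceBusMessage("{ }") { SessionId = orderId.ToString() });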

Operational patterns – DLQ, move, pause, replay

Operators must be able to inspect failing messages, fix the payload or the consumer, and replay messages. My operational playbook, with a replay sketch after the list:

- Capture failed message + metadata into a diagnostics store on dead-letter
- Provide a small admin UI to inspect and re-publish messages after fixes
- Support a pause toggle for the dispatcher so you can stop new sends while fixing the consumer
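
The heart of the replay tooling is small: receive from the dead-letter sub-queue and re-publish to the main queue. A sketch, to be run only once the consumer or the payload problem is actually fixed (queueName and connectionString assumed in scope):

    // Sketch: drain a batch from the DLQ and re-publish to the main queue.
    using Azure.Messaging.ServiceBus;

    var client = new ServiceBusClient(connectionString);
    var dlqReceiver = client.CreateReceiver(queueName, new ServiceBusReceiverOptions
    {
        SubQueue = SubQueue.DeadLetter
    });
    var sender = client.CreateSender(queueName);

    var deadLettered = await dlqReceiver.ReceiveMessagesAsync(maxMessages: 50, maxWaitTime: TimeSpan.FromSeconds(5));

    foreach (var message in deadLettered)
    {
        var copy = new ServiceBusMessage(message); // keeps body, id and application properties
        copy.ApplicationProperties["ReplayedUtc"] = DateTime.UtcNow;

        await sender.SendMessageAsync(copy);
        await dlqReceiver.CompleteMessageAsync(message); // remove from the DLQ only after the re-send
    }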

Telemetry – what I always surface

| Metric | Why it matters |
|---|---|
| Messages/sec in | Capacity planning |
| Processing latency (p50, p95, p99) | Backpressure detection |
| DeliveryCount distribution | Poison message detection |
| DLQ size and age | Operational alerts |
| Consumer error rate | Regression detection |

Correlate messages via CorrelationId and propagate trace context using W3C traceparent headers so your distributed traces connect end-to-end.
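
The Azure SDK ships its own diagnostics integration, but when I want explicit control I stamp the context onto the message myself. A sketch – using "traceparent" as the application property name is my convention, not an SDK contract, and payload, orderId and receivedMessage are placeholders:

    // Sketch: explicit W3C trace-context propagation through message properties.
    using System.Diagnostics;
    using Azure.Messaging.ServiceBus;

    // Producer side: stamp correlation and trace context onto the outgoing message.
    var message = new ServiceBusMessage(payload) { CorrelationId = orderId.ToString() };
    if (Activity.Current is { } current)
    {
        message.ApplicationProperties["traceparent"] = current.Id; // W3C format on modern .NET
    }

    // Consumer side: start a child activity under the producer's trace.
    using var activity = new Activity("ProcessMessage");
    if (receivedMessage.ApplicationProperties.TryGetValue("traceparent", out var traceparent))
    {
        activity.SetParentId((string)traceparent);
    }
    activity.Start();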

Code snippet – a small, production-grade consumer bootstrap

    // 'services' is the app's IServiceProvider and 'logger' a host-level ILogger;
    // IOrderHandler and OrderPlaced stand in for your own handler and event types.
    var client = new ServiceBusClient(connectionString);
    var processor = client.CreateProcessor(queueName, new ServiceBusProcessorOptions
    {
        MaxConcurrentCalls = 8,
        AutoCompleteMessages = false
    });

    processor.ProcessMessageAsync += async args =>
    {
        // One DI scope per message so the handler gets scoped dependencies (EF Core, Dapper, ...)
        using var scope = services.CreateScope();
        var handler = scope.ServiceProvider.GetRequiredService<IOrderHandler>();
        var db = scope.ServiceProvider.GetRequiredService<IDbConnection>();
        var messageId = args.Message.MessageId;

        try
        {
            await IdempotencyGuard.ExecuteIfNotProcessed(db, messageId, async () =>
            {
                // domain handler that may use EF Core, external HTTP, etc.
                await handler.Handle(JsonSerializer.Deserialize<OrderPlaced>(args.Message.Body.ToString()));
            });

            await args.CompleteMessageAsync(args.Message);
        }
        catch (TransientException tex)
        {
            logger.LogWarning(tex, "Transient failure - abandoning");
            await args.AbandonMessageAsync(args.Message);
        }
        catch (Exception ex)
        {
            logger.LogError(ex, "Unhandled failure");
            if (args.Message.DeliveryCount >= 5)
                await args.DeadLetterMessageAsync(args.Message, "Unhandled", ex.Message);
            else
                await args.AbandonMessageAsync(args.Message);
        }
    };

    processor.ProcessErrorAsync += args =>
    {
        // log broker-level issues (connection drops, lock renewal failures, ...)
        logger.LogError(args.Exception, "Processor error on {EntityPath}", args.EntityPath);
        return Task.CompletedTask;
    };

    await processor.StartProcessingAsync();
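
IdempotencyGuard above is an application helper of mine, not an SDK type. A minimal sketch of what it could look like, reusing the ProcessedMessages insert from Pattern 2:

    // Minimal sketch of the IdempotencyGuard helper referenced above.
    using System.Data;
    using Dapper;
    using Microsoft.Data.SqlClient;

    public static class IdempotencyGuard
    {
        public static async Task ExecuteIfNotProcessed(IDbConnection db, string messageId, Func<Task> action)
        {
            const string sql = @"INSERT INTO ProcessedMessages (MessageId, ProcessedUtc)
                                 SELECT @id, GETUTCDATE()
                                 WHERE NOT EXISTS (SELECT 1 FROM ProcessedMessages WHERE MessageId = @id)";

            int rows;
            try
            {
                rows = await db.ExecuteAsync(sql, new { id = messageId });
            }
            catch (SqlException ex) when (ex.Number == 2627) // lost the race to another consumer
            {
                return;
            }

            if (rows == 0) return; // already processed - skip the side effects

            // Caveat: the marker is written before the action, so a failed action will not be retried.
            // In production, write the marker in the same transaction as the side effects,
            // or record it only after the action succeeds.
            await action();
        }
    }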

Failure story – what burned me

In one project we used AutoCompleteMessages = true and optimistic processing. A downstream HTTP call intermittently timed out, but the exception was swallowed, so the message was marked complete anyway. The result – a silent data-loss pattern – surfaced only when users noticed missing downstream state. The fix: set AutoCompleteMessages = false, make the handler transactional with an idempotency guard, and improve the logs to include the message body hash and correlation id.

Testing and local development ergonomics

- Use the Azure Service Bus emulator for basic tests, but prefer a dedicated test namespace for integration tests
- Write consumer unit tests by mocking the handler logic and testing idempotency and retry logic separately (a test sketch follows this list)
- For load tests, push messages directly to the broker and measure end-to-end processing under realistic concurrency
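
For the idempotency part, the unit test is short. A sketch with xUnit – TestDb.CreateOpenConnection stands in for whatever database fixture you already use (LocalDB, Testcontainers, ...):

    // Sketch: the second delivery of the same message must not re-run the side effects.
    using Xunit;

    public class IdempotencyGuardTests
    {
        [Fact]
        public async Task Second_delivery_of_same_message_is_skipped()
        {
            using var db = TestDb.CreateOpenConnection();   // hypothetical test fixture
            var messageId = Guid.NewGuid().ToString();
            var executions = 0;

            await IdempotencyGuard.ExecuteIfNotProcessed(db, messageId, () =>
            {
                executions++;
                return Task.CompletedTask;
            });

            await IdempotencyGuard.ExecuteIfNotProcessed(db, messageId, () =>
            {
                executions++;
                return Task.CompletedTask;
            });

            Assert.Equal(1, executions);
        }
    }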

When to favour other approaches

Service Bus is great but not always the answer:

- High-volume event streaming – choose Kafka or Event Hubs
- Simple, low-throughput background jobs – Azure Storage Queue is cheaper and simpler
- Transactional, relational change capture – consider change feed or a streaming platform

Final thoughts

Resilient background processing is more about operational hygiene than fancy code. The techniques I reach for most are simple: outbox for atomicity, idempotency guards, reasonable concurrency, sensible retries and dead-letter handling, and observability that tells a story the moment something goes wrong. These patterns have kept us out of ugly postmortems more than once.

The next time you design a consumer pipeline, sketch the failure paths and ask: how would I recover at 03:00, when no engineer in a convenient time zone is available? If you can answer that, you are on the right track.