
when waiting for an answer stops being a good idea

Kerman Sanjuan Malax-Echevarria
Cloud Engineer & DevOps

This week I gave a talk at the Speaker Junior mentoring program about something I’ve been running into in real projects for a while: the exact moment when waiting for a synchronous response stops making sense. To make it concrete, I built a full demo on Azure. Real infrastructure, real code. This post walks through how it works and why the decisions matter.

The scenario is a simple order system. A client submits an order. Three things need to happen: Inventory reserves the stock, Billing charges the customer, Notification sends a confirmation. In a synchronous model, those three calls are chained one after another. The total response time is the sum of all three. If Billing takes five seconds and Notification two, the client waits seven seconds before getting a response. And that’s before anything goes wrong. If Billing throws an exception, the whole order fails and Notification never runs.
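The latency arithmetic is easy to see in miniature. A sketch in pure Python, with sleeps standing in for the three service calls and the durations invented for illustration:

```python
import asyncio
import time

async def inventory(): await asyncio.sleep(0.03)   # stand-in for the stock reservation
async def billing():   await asyncio.sleep(0.05)   # the slow one
async def notify():    await asyncio.sleep(0.02)   # stand-in for the confirmation

async def sync_chain() -> float:
    # Synchronous model: each call waits for the previous one.
    start = time.perf_counter()
    await inventory(); await billing(); await notify()
    return time.perf_counter() - start

async def parallel() -> float:
    # Event-driven model: three consumers work on their own copies at once.
    start = time.perf_counter()
    await asyncio.gather(inventory(), billing(), notify())
    return time.perf_counter() - start

chained = asyncio.run(sync_chain())   # roughly the sum of the three durations
fanned = asyncio.run(parallel())      # roughly the max of the three durations
```

The chained total grows with every service you add; the fan-out total is pinned to the slowest single consumer.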

The async version flips this. The API receives the order, writes the initial state to Cosmos DB, publishes a single event to Azure Service Bus, and returns 202 Accepted in about 100 milliseconds. The client is done. The three services pick up the event independently, in parallel, each from its own subscription on the same topic.

EDA architecture: Portal → Order API publishes to Service Bus → three Functions (Inventory, Billing, Notification) consume in parallel → all patch Cosmos DB.

The endpoint looks like this. Notice what it leaves out: no direct calls to Inventory, Billing, or Notification.

@app.post("/orders", response_model=CreateOrderResponse, status_code=202)
async def create_order(request: CreateOrderRequest):
    order_id = str(uuid.uuid4())
    event_id = str(uuid.uuid4())

    # Write initial state to Cosmos DB — source of truth for the portal
    await write_initial_state(order_id, request)

    # Publish one event — Service Bus fans it out to all three subscriptions
    event = OrderCreated(order_id=order_id, event_id=event_id, ...)
    await publish_order_created(event)

    # Return immediately — 202 means "accepted", not "done"
    return CreateOrderResponse(order_id=order_id)

The publishing side uses DefaultAzureCredential: no secrets in the code, and it works with Managed Identity in production and with az login locally. The correlation_id travels in the application_properties of every Service Bus message, so you can filter the entire flow of a specific order in App Insights with a single query.

message = ServiceBusMessage(
    body=json.dumps(event.to_dict()),
    content_type="application/cloudevents+json",
    application_properties={
        "event_type": event.event_type,
        "order_id": event.order_id,
        "correlation_id": event.event_id,  # ties all three consumers together in logs
    },
    message_id=event.event_id,  # enables deduplication if activated on the namespace
)
await sender.send_messages(message)

Each consumer is an Azure Function (v2 programming model) triggered by its own subscription. The Billing one looks like this:

@app.service_bus_topic_trigger(
    arg_name="msg",
    topic_name="orders",
    subscription_name="billing-sub",
    connection="AzureWebJobsServiceBus",
)
async def billing_function(msg: func.ServiceBusMessage) -> None:
    props = msg.application_properties or {}
    order_id = props.get("order_id", "unknown")
    correlation_id = props.get("correlation_id", "unknown")

    # Idempotency check — at-least-once delivery means duplicates are possible
    if await _check_already_processed(order_id, "billing"):
        return

    await _update_status(order_id, "billing", "processing")
    # ... process payment ...
    await _update_status(order_id, "billing", "done")
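The post doesn't show _check_already_processed, but the idea is a read-before-write guard: look up the order's status slot before doing any work. A minimal sketch, with an in-memory dict standing in for the Cosmos DB document (the real helper would read the order item instead):

```python
import asyncio

# In-memory stand-in for the order documents in Cosmos DB.
_orders: dict[str, dict] = {}

async def _check_already_processed(order_id: str, service: str) -> bool:
    # A consumer is done with an order once its own status slot says so.
    order = _orders.get(order_id, {})
    return order.get("status", {}).get(service) == "done"

async def _mark_done(order_id: str, service: str) -> None:
    _orders.setdefault(order_id, {"status": {}})["status"][service] = "done"

async def handle(order_id: str) -> str:
    # At-least-once delivery: the same message may arrive twice.
    if await _check_already_processed(order_id, "billing"):
        return "skipped"          # duplicate delivery, charge nothing
    # ... charge the customer ...
    await _mark_done(order_id, "billing")
    return "charged"

first = asyncio.run(handle("order-1"))    # "charged"
second = asyncio.run(handle("order-1"))   # "skipped" — duplicate ignored
```

Together with message_id deduplication on the namespace, this makes a redelivered event a no-op instead of a double charge.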

The _update_status helper uses Cosmos DB’s patch_item. It only touches the /status/billing path in the document, so all three Functions can write their own status concurrently without stepping on each other.

await container.patch_item(
    item=order_id,
    partition_key=order_id,
    patch_operations=[
        {"op": "replace", "path": "/status/billing", "value": status}
    ],
)
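Why path-level patches don't collide: each Function replaces only its own key under /status, server-side, so the three writes commute and nobody needs a read-modify-write cycle on the whole document. A small in-memory illustration of the commuting property (a dict playing the role of the order document):

```python
import asyncio

# The order document, as it sits in Cosmos DB right after creation.
order = {"id": "order-1",
         "status": {"inventory": "pending",
                    "billing": "pending",
                    "notification": "pending"}}

async def patch_status(service: str, value: str) -> None:
    # Mirrors {"op": "replace", "path": f"/status/{service}"}:
    # touch one key, leave the siblings alone.
    await asyncio.sleep(0)            # yield, as a real network call would
    order["status"][service] = value

async def main() -> None:
    # The three Functions finish in any order; no write clobbers another.
    await asyncio.gather(
        patch_status("inventory", "done"),
        patch_status("billing", "done"),
        patch_status("notification", "done"),
    )

asyncio.run(main())
```

Had each Function replaced the whole document instead, a stale read in one consumer could silently overwrite another consumer's status update.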

For the second demo I wanted to show what happens when Billing fails. I added a force_billing_fail property to the message that the portal can toggle on and off. When it’s true, the Function raises an exception on purpose. Service Bus retries three times, then moves the message to the Dead Letter Queue. Inventory and Notification are completely unaffected. They consumed their own copies of the event from their own subscriptions and finished fine. The order isn’t lost. Fix the issue, requeue from the DLQ, and Billing processes correctly.

That last bit is the hardest to internalize without seeing it live. A failure in one consumer doesn’t cascade. The message sits in the DLQ waiting. In a synchronous chain, a failure at step two means step three never runs and the client gets a 500. Here, the client already got a 202 and two out of three things worked fine.
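The retry-then-dead-letter lifecycle can be sketched in memory. This is an illustration of the mechanics, not the Service Bus API: a message gets a bounded number of delivery attempts, then moves to a dead-letter queue where it waits to be requeued. MAX_DELIVERIES = 3 mirrors the demo's setting (the default max delivery count on a subscription is 10):

```python
from collections import deque

MAX_DELIVERIES = 3           # the demo's retry setting; Service Bus defaults to 10

queue: deque = deque()       # billing-sub, in miniature
dlq: deque = deque()         # its dead-letter queue

def deliver(handler) -> None:
    # Pump the subscription: retry each message up to MAX_DELIVERIES,
    # then dead-letter it instead of losing it.
    while queue:
        msg = queue.popleft()
        try:
            handler(msg)
        except Exception:
            msg["delivery_count"] += 1
            target = dlq if msg["delivery_count"] >= MAX_DELIVERIES else queue
            target.append(msg)

def billing(msg: dict) -> None:
    if msg["force_billing_fail"]:
        raise RuntimeError("payment provider down")
    msg["charged"] = True

queue.append({"order_id": "order-1", "force_billing_fail": True, "delivery_count": 0})
deliver(billing)             # three failed attempts → message lands in the DLQ

# "Fix the issue" and requeue from the DLQ:
msg = dlq.popleft()
msg["force_billing_fail"] = False
msg["delivery_count"] = 0
queue.append(msg)
deliver(billing)             # now processes correctly
```

The key property is the last line of the failure branch: the message is never dropped, only parked, which is exactly what makes the requeue-and-retry demo possible.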

The full infrastructure is defined in Terraform — Service Bus namespace, topic with three subscriptions, three Function Apps, Cosmos DB, and App Insights, all wired together. Everything is in the repo if you want to read the actual code: Kerman-Sanjuan/event-driven-acl.