0055: Asynchronous Notification Processing

Date: 2025-12-04

Status: Accepted

Context

The notification-service is responsible for delivering messages via external providers (Email, SMS) and internal channels (WebSockets). These operations are inherently I/O bound, prone to variable latency, and subject to external failures (e.g., an SMTP server timeout or a rate-limited API).

If the service processed these requests synchronously, a slow third-party provider could block the calling service (e.g., iam-service waiting to send a password reset), leading to cascading failures and a poor user experience. Furthermore, if the service crashed during a synchronous call, the in-flight notification would be lost unless the caller retried.

Decision

We will implement a strictly Asynchronous, Queue-Based Architecture for the notification-service.

  1. API Behavior: The public API endpoint (POST /notify) will not attempt to send the notification immediately. Instead:

    • It will validate the request payload.
    • It will publish a NotificationJob to a durable internal message queue (NATS, which provides durability through JetStream persistence; core NATS alone delivers at most once).
    • It will immediately return a 202 Accepted HTTP status to the caller.
  2. Worker Process: A separate worker (running inside the same binary in hybrid mode, or as its own process in Kubernetes) will subscribe to the queue.

    • It consumes the NotificationJob.
    • It executes the logic to render the template and dispatch the message via the appropriate adapter.
    • It handles retries for transient failures (e.g., network timeouts).
  3. Reliability: The message queue serves as a buffer. If the worker is down or overwhelmed, jobs accumulate in the queue rather than being dropped or blocking the API.
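
The three steps above can be sketched roughly as follows. This is a minimal illustration, not the service's actual code: an in-memory queue.Queue stands in for NATS, and the names handle_notify, process_pending, TransientError, MAX_ATTEMPTS, and the failed-jobs list are all hypothetical. The fields on NotificationJob are likewise assumed, since this record does not define the job schema.

```python
import queue
from dataclasses import dataclass


@dataclass
class NotificationJob:
    # Assumed fields; the ADR does not specify the job schema.
    channel: str
    recipient: str
    template: str
    attempts: int = 0


class TransientError(Exception):
    """A failure worth retrying, such as a network timeout."""


ALLOWED_CHANNELS = {"email", "sms", "websocket"}
MAX_ATTEMPTS = 3  # illustrative value, not taken from the ADR


def handle_notify(payload: dict, jobs: queue.Queue) -> int:
    """POST /notify: validate, enqueue, and return an HTTP status code."""
    required = ("channel", "recipient", "template")
    if any(not payload.get(key) for key in required):
        return 400
    if payload["channel"] not in ALLOWED_CHANNELS:
        return 400
    jobs.put(NotificationJob(payload["channel"], payload["recipient"], payload["template"]))
    return 202  # accepted for delivery, not yet delivered


def process_pending(jobs: queue.Queue, send, failed: list) -> None:
    """Worker: drain the queue, re-enqueueing jobs that fail transiently."""
    while True:
        try:
            job = jobs.get_nowait()
        except queue.Empty:
            return
        job.attempts += 1
        try:
            send(job)  # render the template and dispatch via an adapter
        except TransientError:
            if job.attempts < MAX_ATTEMPTS:
                jobs.put(job)       # leave it in the queue for another try
            else:
                failed.append(job)  # give up after MAX_ATTEMPTS
```

The handler never calls send, so a slow provider cannot delay the 202 response. A production worker would block on a queue subscription and add a backoff delay between retries instead of draining and returning.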

Consequences

Positive

  • High Availability & Resilience: The API remains responsive even if downstream providers (SendGrid, Twilio) are down.
  • Decoupling: Callers (such as admin-bff) do not need their own retry logic for notification delivery; they submit the request and move on ("fire and forget").
  • Scalability: We can scale the API layer (to accept requests) independently from the Worker layer (to process heavy I/O).
  • Throttling: The worker can process jobs at a controlled rate to respect provider rate limits, smoothing out traffic spikes.
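
One common way to get the controlled processing rate described under Throttling is a token bucket. The sketch below is an assumption about how a worker might pace itself, not something this record prescribes; the TokenBucket class and its rate and capacity parameters are hypothetical names.

```python
import time


class TokenBucket:
    """Allow about `rate` dispatches per second, with bursts up to `capacity`."""

    def __init__(self, rate: float, capacity: int, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.tokens = float(capacity)  # start with a full bucket
        self.clock = clock             # injectable so tests can control time
        self.updated = clock()

    def try_acquire(self) -> bool:
        """Take one token if available; False means the worker should wait."""
        now = self.clock()
        elapsed = now - self.updated
        # Refill in proportion to elapsed time, never beyond capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

A worker would call try_acquire() before each dispatch and leave the job in the queue for a short time when it returns False, so a traffic spike turns into queue depth instead of provider rate-limit errors.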

Negative

  • Eventual Consistency: The caller does not know immediately whether the notification was actually delivered, only that it was accepted for delivery.
  • Complexity: Requires managing a message queue infrastructure (NATS) and implementing a worker loop.
  • Error Visibility: Delivery failures must be reported asynchronously (e.g., via logs or a status callback), as they cannot be returned in the HTTP response.
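
As a rough illustration of asynchronous failure reporting, the worker could log each outcome and optionally invoke a status callback. The function report_outcome, the job_id argument, and the callback signature are hypothetical; this record does not define a job identifier or a callback contract.

```python
import logging
from typing import Callable, Optional

logger = logging.getLogger("notification-worker")

# Assumed callback shape: (job_id, status, error detail or None).
StatusCallback = Callable[[str, str, Optional[str]], None]


def report_outcome(
    job_id: str,
    error: Optional[Exception],
    callback: Optional[StatusCallback] = None,
) -> str:
    """Log the delivery result and, if provided, notify the status callback."""
    status = "delivered" if error is None else "failed"
    detail = None if error is None else str(error)
    if error is None:
        logger.info("job %s delivered", job_id)
    else:
        logger.error("job %s failed: %s", job_id, detail)
    if callback is not None:
        try:
            callback(job_id, status, detail)
        except Exception:
            # A broken callback must not take the worker down.
            logger.exception("status callback raised for job %s", job_id)
    return status
```

Because the HTTP response has already been sent by the time this runs, the log line or callback is the only place a caller or operator can learn that a delivery failed.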