0055: Asynchronous Notification Processing
Date: 2025-12-04
Status: Accepted
Context
The notification-service is responsible for delivering messages via external providers (Email, SMS) and internal channels (WebSockets). These operations are inherently I/O bound, prone to variable latency, and subject to external failures (e.g., an SMTP server timeout or a rate-limited API).
If the service were to process these requests synchronously, a slow third-party provider could block the calling service (e.g., iam-service waiting to send a password reset), leading to cascading failures and a poor user experience. Furthermore, if the service crashes during a synchronous call, the notification might be lost.
Decision
We will implement a strictly Asynchronous, Queue-Based Architecture for the notification-service.
- API Behavior: The public API endpoint (POST /notify) will not attempt to send the notification immediately.
  - It will validate the request payload.
  - It will publish a NotificationJob to a durable internal message queue (NATS).
  - It will immediately return a 202 Accepted HTTP status to the caller.
- Worker Process: A separate worker process (running within the same binary in hybrid mode, or separately in K8s) will subscribe to the queue.
  - It consumes the NotificationJob.
  - It executes the logic to render the template and dispatch the message via the appropriate adapter.
  - It handles retries for transient failures (e.g., network timeouts).
- Reliability: The message queue serves as a buffer. If the worker is down or overwhelmed, jobs accumulate in the queue rather than being dropped or blocking the API.
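The accept-then-process flow above can be sketched roughly as follows. This is an illustration under stated assumptions, not the service's actual code: the standard-library `queue.Queue` stands in for the NATS queue (and, unlike NATS, is not durable), and the names `handle_notify`, `run_worker_once`, `TransientError`, `max_attempts`, and the fields of `NotificationJob` are hypothetical.

```python
import queue
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying, e.g. a network timeout."""


@dataclass
class NotificationJob:
    # Field names are assumptions; the ADR does not define the job schema.
    channel: str
    recipient: str
    template: str
    attempts: int = 0


def handle_notify(payload: dict, jobs: queue.Queue) -> Tuple[int, dict]:
    """Sketch of POST /notify: validate, enqueue, and return without sending."""
    required = ("channel", "recipient", "template")
    missing = [key for key in required if not payload.get(key)]
    if missing:
        return 400, {"error": "missing fields: " + ", ".join(missing)}
    jobs.put(NotificationJob(payload["channel"], payload["recipient"], payload["template"]))
    return 202, {"status": "accepted"}


def run_worker_once(
    jobs: queue.Queue,
    adapters: Dict[str, Callable[[NotificationJob], None]],
    max_attempts: int = 3,
) -> str:
    """Consume one job and dispatch it; re-enqueue it on a transient failure."""
    job = jobs.get_nowait()
    job.attempts += 1
    try:
        # Template rendering is left to the adapter in this sketch.
        adapters[job.channel](job)
        return "delivered"
    except TransientError:
        if job.attempts < max_attempts:
            jobs.put(job)  # goes back on the queue for a later attempt
            return "retrying"
        return "failed"
```

Because the handler only validates and enqueues, its latency does not depend on any provider. A real worker would loop over a queue subscription rather than polling one job at a time, and would probably add a backoff delay between attempts.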
Consequences
Positive
- High Availability & Resilience: The API remains responsive even if downstream providers (SendGrid, Twilio) are down.
- Decoupling: Callers (like admin-bff) do not need to handle retry logic for email delivery; they just "fire and forget."
- Scalability: We can scale the API layer (to accept requests) independently from the Worker layer (to process heavy I/O).
- Throttling: The worker can process jobs at a controlled rate to respect provider rate limits, smoothing out traffic spikes.
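One possible way to get the controlled processing rate mentioned under Throttling is to space dispatches at a fixed minimum interval. The sketch below assumes a single worker and a simple per-second limit; `Pacer`, `drain`, and `rate_per_sec` are hypothetical names, and real provider limits may call for a token bucket or per-provider settings instead.

```python
import time
from typing import Callable, Iterable


class Pacer:
    """Spaces successive calls to wait() at least 1/rate_per_sec seconds apart."""

    def __init__(self, rate_per_sec: float,
                 clock: Callable[[], float] = time.monotonic,
                 sleep: Callable[[float], None] = time.sleep) -> None:
        self.interval = 1.0 / rate_per_sec
        self.clock = clock
        self.sleep = sleep
        self.next_allowed = clock()

    def wait(self) -> float:
        """Block until the next dispatch is allowed; return the time slept."""
        now = self.clock()
        delay = self.next_allowed - now
        if delay > 0:
            self.sleep(delay)
            now = self.next_allowed
        self.next_allowed = now + self.interval
        return max(delay, 0.0)


def drain(jobs: Iterable, send: Callable, pacer: Pacer) -> int:
    """Send each job, pausing between sends so the rate limit is respected."""
    sent = 0
    for job in jobs:
        pacer.wait()
        send(job)
        sent += 1
    return sent
```

The clock and sleep functions are injectable so the pacing logic can be exercised without real delays. A burst of accepted requests then reaches the provider as a steady stream, while the backlog waits in the queue.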
Negative
- Eventual Consistency: The caller does not know immediately if the email was actually delivered, only that it was accepted for delivery.
- Complexity: Requires managing a message queue infrastructure (NATS) and implementing a worker loop.
- Error Visibility: Delivery failures must be reported asynchronously (e.g., via logs or a status callback), as they cannot be returned in the HTTP response.
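As a rough illustration of asynchronous failure reporting, the worker could emit a structured log record for every delivery outcome and optionally pass the same record to a status callback. The function name `report_outcome`, the event fields, and the callback signature are all assumptions; this ADR does not specify a reporting format.

```python
import json
import logging
from typing import Callable, Optional

logger = logging.getLogger("notification-worker")  # logger name is an assumption


def report_outcome(job_id: str, delivered: bool, error: Optional[str] = None,
                   callback: Optional[Callable[[dict], None]] = None) -> dict:
    """Record a delivery outcome in the logs and, if given, notify a callback."""
    event = {
        "job_id": job_id,
        "status": "delivered" if delivered else "failed",
        "error": error,
    }
    # Failures are logged at ERROR level so they stand out in log-based alerting.
    log = logger.info if delivered else logger.error
    log(json.dumps(event))
    if callback is not None:
        callback(event)
    return event
```

A caller that needs the final outcome would have to correlate on an identifier such as `job_id`, since the 202 response is returned before delivery is attempted.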