
0042: Workflow Error Handling and Compensation Strategy

Date: 2025-11-08

Status: Proposed

Context

Long-running business processes are prone to failure. An "Activity" might fail due to a transient network issue, a bug in a downstream service, or a permanent business logic failure. Furthermore, a multi-step process might succeed for several steps before a later step fails, leaving the system in an inconsistent state.

For example, in an onboarding workflow, we might successfully create a user in the iam-service but then fail to provision their subscription in the billing-service. We need a strategy to automatically handle transient errors and a pattern to gracefully roll back the successfully completed steps.

Decision

We will adopt a two-pronged strategy built on Temporal's Activity retry policies and durable workflow execution.

  1. Automatic Retries for Transient Errors: All Activities will be configured with a shared default retry policy. This policy will handle transient failures (e.g., network timeouts, deadlocks, temporary service unavailability) by automatically retrying the Activity with exponential backoff. This makes workflows resilient to common infrastructure-level issues without custom retry logic in each workflow.

  2. Saga Pattern for Business-Level Compensation: For failures that are permanent or require business-level rollback, we will implement the Saga pattern directly within the workflow definition.

    • The workflow will orchestrate a sequence of Activities (e.g., CreateUser, ProvisionSubscription, SendWelcomeEmail).
    • If a critical Activity fails permanently (i.e., it exhausts its retry policy or raises an error marked as non-retryable), the workflow's error handling logic will catch the exception.
    • The error handling block will then execute the compensating Activities for the steps that already completed, in the reverse order of the original operations (e.g., DeleteSubscription, then DisableUser).
    • Because Temporal persists workflow state, the compensation logic resumes where it left off if a worker process fails, rather than being lost.
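The two prongs can be illustrated with a minimal, SDK-independent sketch in plain Python. It does not use the Temporal SDK: run_with_retry, run_saga, and PermanentError are hypothetical names introduced here, and the in-process retry loop only stands in for what an Activity retry policy would do. The step names come from the onboarding example above.

```python
import time
from typing import Callable, List, Tuple


class PermanentError(Exception):
    """Hypothetical marker for failures that must not be retried."""


def run_with_retry(step: Callable[[], None], max_attempts: int = 3,
                   initial_interval: float = 0.01, backoff: float = 2.0) -> None:
    """Retry a step with exponential backoff; re-raise once attempts run out."""
    delay = initial_interval
    for attempt in range(1, max_attempts + 1):
        try:
            step()
            return
        except PermanentError:
            raise  # business-level failure: retrying cannot help
        except Exception:
            if attempt == max_attempts:
                raise  # retry budget exhausted: treat as permanent
            time.sleep(delay)
            delay *= backoff


def run_saga(steps: List[Tuple[Callable[[], None], Callable[[], None]]]) -> None:
    """Run (action, compensation) pairs; on failure undo completed steps in reverse."""
    compensations: List[Callable[[], None]] = []
    try:
        for action, compensation in steps:
            run_with_retry(action)
            compensations.append(compensation)  # registered only after success
    except Exception:
        for compensation in reversed(compensations):
            run_with_retry(compensation)
        raise  # the workflow still ends in failure after cleaning up


log: List[str] = []


def provision_subscription() -> None:
    # Simulated permanent failure in the billing step.
    raise PermanentError("billing rejected the plan")


onboarding = [
    (lambda: log.append("CreateUser"), lambda: log.append("DisableUser")),
    (provision_subscription, lambda: log.append("DeleteSubscription")),
    (lambda: log.append("SendWelcomeEmail"), lambda: None),
]

try:
    run_saga(onboarding)
except PermanentError:
    log.append("workflow failed")
```

In this run only CreateUser completes, so only DisableUser is executed as compensation before the failure is re-raised. Registering a compensation after its action succeeds matches the "successfully completed steps" wording above; whether to register it earlier, to cover an action that succeeded but whose result was lost, is a design choice this sketch does not settle.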

Consequences

Positive

  • Resilience: Workflows become highly resilient to transient failures automatically.
  • Consistency: The Saga pattern provides a clear and reliable way to maintain data consistency across multiple services in the event of a failure.
  • Durability: The state of the Saga is managed by Temporal, so the compensation logic is just as durable as the forward-execution logic.
  • Testability: The entire Saga, including the failure and compensation paths, can be unit-tested using the Temporal testing framework by mocking activity failures.
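The testability point can be illustrated with a small sketch of how failure and compensation paths might be exercised. It uses only Python's unittest and injects fake activities into a hypothetical onboarding_workflow function; it does not use the Temporal testing framework, whose test environment would take the place of this plain function call.

```python
import unittest
from typing import Callable, Dict, List


def onboarding_workflow(activities: Dict[str, Callable[[], None]]) -> List[str]:
    """Hypothetical workflow body: activities are injected so tests can swap them."""
    done: List[str] = []
    undo = {"CreateUser": "DisableUser", "ProvisionSubscription": "DeleteSubscription"}
    try:
        for name in ("CreateUser", "ProvisionSubscription", "SendWelcomeEmail"):
            activities[name]()
            done.append(name)
    except Exception:
        # Compensate completed steps in reverse order, then surface the failure.
        for name in reversed(done):
            if name in undo:
                activities[undo[name]]()
        raise
    return done


class OnboardingSagaTest(unittest.TestCase):
    def make_activities(self, failing: str = ""):
        """Build fake activities that record calls; one may be told to fail."""
        calls: List[str] = []

        def make(name: str) -> Callable[[], None]:
            def activity() -> None:
                calls.append(name)
                if name == failing:
                    raise RuntimeError(f"{name} failed")
            return activity

        names = ["CreateUser", "ProvisionSubscription", "SendWelcomeEmail",
                 "DisableUser", "DeleteSubscription"]
        return {n: make(n) for n in names}, calls

    def test_happy_path_runs_no_compensation(self):
        activities, calls = self.make_activities()
        onboarding_workflow(activities)
        self.assertEqual(
            calls, ["CreateUser", "ProvisionSubscription", "SendWelcomeEmail"])

    def test_email_failure_compensates_in_reverse(self):
        activities, calls = self.make_activities(failing="SendWelcomeEmail")
        with self.assertRaises(RuntimeError):
            onboarding_workflow(activities)
        self.assertEqual(calls[-2:], ["DeleteSubscription", "DisableUser"])
```

Saved to a module, these tests can be run with python -m unittest. One test per failing step keeps each compensation path explicit.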

Negative

  • Increased Workflow Complexity: The workflow definition becomes more complex as it must include try/catch blocks and calls to compensation activities.
  • Compensation Logic Required: For every action that has a side effect, a corresponding compensating action must be created, tested, and maintained. This requires careful design to ensure compensations are idempotent and reliable.
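One way to make a compensation idempotent is to key it on an idempotency key and to tolerate the target already being gone. The sketch below is an assumption-laden illustration: BillingStub, its fields, the refund counter, and the key format are all hypothetical, and a real billing-service would persist the keys rather than hold them in memory.

```python
from typing import Dict, Set


class BillingStub:
    """Hypothetical in-memory stand-in for the billing-service."""

    def __init__(self) -> None:
        self.subscriptions: Dict[str, str] = {}  # user_id -> plan
        self.applied_keys: Set[str] = set()      # idempotency keys already handled
        self.refunds_issued = 0                  # hypothetical one-time side effect

    def delete_subscription(self, user_id: str, idempotency_key: str) -> None:
        """Compensation for ProvisionSubscription; safe to call more than once."""
        if idempotency_key in self.applied_keys:
            return  # duplicate delivery (e.g. a retry): nothing left to do
        # pop() with a default tolerates the subscription never having existed.
        if self.subscriptions.pop(user_id, None) is not None:
            self.refunds_issued += 1
        self.applied_keys.add(idempotency_key)


billing = BillingStub()
billing.subscriptions["user-1"] = "pro"

# The same compensation arrives twice, as it could after a retried Activity.
billing.delete_subscription("user-1", idempotency_key="onboard-42/delete-subscription")
billing.delete_subscription("user-1", idempotency_key="onboard-42/delete-subscription")
```

The second call is a no-op, so the subscription is removed and the side effect happens once. Deriving the key from the workflow run and the step name is one possible convention, not a requirement of this ADR.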