
0039: Workflow Engine Selection

Date: 2025-11-08

Status: Proposed

Context

The Citadel platform needs to orchestrate complex, long-running business processes that span multiple microservices (e.g., tenant onboarding, order fulfillment). Implementing this orchestration logic within individual services leads to high coupling, duplicated effort, and poor visibility into the state of a process.

We need a dedicated, durable orchestration engine that can manage state, handle retries, and provide visibility for these long-running workflows. The primary candidates considered were modern, code-first workflow engines that align with our polyglot environment (Go, Python).

  • Temporal: A popular open-source platform for durable execution. Strong SDKs for Go and Python.
  • Camunda: A well-established engine, often associated with BPMN for visual modeling, but also supports code-first approaches.
  • Netflix Conductor: Another strong contender, but its ecosystem is less mature than Temporal's.

Decision

We will adopt Temporal as the core workflow engine for the workflow-service.

The workflow-service will act as a thin API layer and a host for "worker" processes that execute workflow logic defined in code using the Temporal SDK. It will not reinvent orchestration but will provide a standardized way for other Citadel services to interact with the Temporal cluster.
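As a rough illustration of the "thin API layer" idea, the sketch below shows a facade that validates a request, assigns a workflow ID, and delegates everything else to the engine. All names here (EngineClient, InMemoryEngine, WorkflowService, the allowed workflow types) are hypothetical and not part of the Temporal SDK; the in-memory engine only stands in for the cluster so the example runs without dependencies.

```python
import uuid
from dataclasses import dataclass, field
from typing import Any, Dict, Protocol


class EngineClient(Protocol):
    """Hypothetical minimal surface the facade needs from the workflow engine."""

    def start(self, workflow_type: str, workflow_id: str, payload: Dict[str, Any]) -> None: ...

    def status(self, workflow_id: str) -> str: ...


@dataclass
class InMemoryEngine:
    """Stand-in for the real cluster (assumption: used here only for illustration)."""

    runs: Dict[str, str] = field(default_factory=dict)

    def start(self, workflow_type: str, workflow_id: str, payload: Dict[str, Any]) -> None:
        self.runs[workflow_id] = "RUNNING"

    def status(self, workflow_id: str) -> str:
        return self.runs[workflow_id]


class WorkflowService:
    """Thin facade: validates input and delegates; holds no orchestration logic."""

    # Hypothetical workflow types, taken from the examples in the Context section.
    ALLOWED_TYPES = {"tenant_onboarding", "order_fulfillment"}

    def __init__(self, engine: EngineClient) -> None:
        self._engine = engine

    def start_workflow(self, workflow_type: str, payload: Dict[str, Any]) -> str:
        if workflow_type not in self.ALLOWED_TYPES:
            raise ValueError(f"unknown workflow type: {workflow_type}")
        workflow_id = f"{workflow_type}-{uuid.uuid4()}"
        self._engine.start(workflow_type, workflow_id, payload)
        return workflow_id

    def get_status(self, workflow_id: str) -> str:
        return self._engine.status(workflow_id)
```

Keeping the engine behind a small interface like this is one possible way to let other Citadel services start and inspect workflows without depending on the engine's client library directly.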

Temporal is chosen for its strong SDK support in both Go and Python, its proven scalability, its active developer community, and its code-first approach, which aligns well with our TDD principles.

Consequences

Positive

  • Durability & Reliability: Temporal provides durable, stateful workflows that are resilient to process and worker failures.
  • Developer Experience: The code-first approach allows developers to write complex orchestration logic in familiar languages (Go, Python) and test it with standard unit testing frameworks.
  • Scalability: The Temporal server is horizontally scalable, and workers can be scaled independently.
  • Observability: Temporal's web UI and APIs expose the state and history of running and completed workflows.
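The durability point can be made concrete with a toy model. The sketch below is not Temporal's implementation or API; it assumes a simplified engine that records each activity result in a history list and, after a crash, re-runs the workflow function while replaying recorded results instead of re-executing the activities. WorkflowContext, WorkerCrash, and the two activities are hypothetical names.

```python
from typing import Any, Callable, List, Optional


class WorkerCrash(Exception):
    """Simulates the worker process dying mid-workflow."""


class WorkflowContext:
    """Toy durable context: activity results are logged, then replayed on re-run."""

    def __init__(self, history: List[Any], crash_after: Optional[int] = None) -> None:
        self.history = history          # a real engine would persist this log
        self._cursor = 0                # position in the log during this run
        self._crash_after = crash_after

    def run_activity(self, fn: Callable[..., Any], *args: Any) -> Any:
        if self._cursor < len(self.history):
            result = self.history[self._cursor]   # replay a recorded result
        else:
            result = fn(*args)                    # first execution
            self.history.append(result)
            if self._crash_after == len(self.history):
                raise WorkerCrash("worker died after recording a result")
        self._cursor += 1
        return result


calls: List[str] = []  # tracks which activities actually executed


def create_tenant(name: str) -> str:
    calls.append("create_tenant")
    return f"tenant:{name}"


def provision_db(tenant_id: str) -> str:
    calls.append("provision_db")
    return f"db-for-{tenant_id}"


def onboarding(ctx: WorkflowContext, name: str) -> str:
    tenant_id = ctx.run_activity(create_tenant, name)
    return ctx.run_activity(provision_db, tenant_id)


def run_with_crash_and_recovery() -> str:
    history: List[Any] = []
    try:
        onboarding(WorkflowContext(history, crash_after=1), "acme")
    except WorkerCrash:
        pass  # a new worker picks the workflow up from the persisted history
    return onboarding(WorkflowContext(history), "acme")
```

In this model the second run returns the final result while create_tenant executes only once, because its result is read back from the history rather than recomputed.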

Negative

  • New Infrastructure Component: Adds a new, complex, and stateful component (the Temporal cluster) to our infrastructure that must be deployed, monitored, and maintained.
  • Learning Curve: Developers must learn the Temporal programming model, including concepts like workflows, activities, and determinism constraints.
  • Operational Overhead: Running a production-grade Temporal cluster, including its backing database, requires dedicated operational expertise.
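The determinism constraint mentioned under Learning Curve is easiest to see in a toy replay model. The sketch below is an illustration under simplifying assumptions, not the Temporal SDK: a hypothetical ReplayContext records each activity by name and, on replay, raises an error if the workflow code requests a different step than the one in the history. The workflow names, the coin callable, and the send activity are all invented for the example.

```python
from typing import Any, Callable, List, Tuple


class NonDeterminismError(Exception):
    """Raised when replayed code diverges from the recorded history."""


class ReplayContext:
    """Toy context that records (activity name, result) pairs and checks them on replay."""

    def __init__(self, history: List[Tuple[str, Any]]) -> None:
        self.history = history
        self._cursor = 0

    def run_activity(self, name: str, fn: Callable[..., Any], *args: Any) -> Any:
        if self._cursor < len(self.history):
            recorded_name, result = self.history[self._cursor]
            if recorded_name != name:
                raise NonDeterminismError(
                    f"history has {recorded_name!r} at step {self._cursor}, "
                    f"but the code requested {name!r}"
                )
        else:
            result = fn(*args)
            self.history.append((name, result))
        self._cursor += 1
        return result


def send(channel: str) -> str:
    return f"sent via {channel}"


def bad_workflow(ctx: ReplayContext, coin: Callable[[], bool]) -> str:
    # Branches on a value read directly in workflow code, so it is never recorded
    # and may differ between the original run and a replay.
    if coin():
        return ctx.run_activity("send_email", send, "email")
    return ctx.run_activity("send_sms", send, "sms")


def good_workflow(ctx: ReplayContext, coin: Callable[[], bool]) -> str:
    # Reads the non-deterministic value through an activity, so the result is
    # recorded once and replayed identically afterwards.
    use_email = ctx.run_activity("pick_channel", coin)
    if use_email:
        return ctx.run_activity("send_email", send, "email")
    return ctx.run_activity("send_sms", send, "sms")
```

If coin returns True on the first run and False on the replay, bad_workflow raises NonDeterminismError, while good_workflow replays the recorded choice and returns the same result both times. This is the kind of discipline developers would need to internalize: anything non-deterministic is pushed out of workflow code and into recorded steps.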