Runbook: iam-service
This document provides operational guidance for the iam-service. It is intended for on-call engineers and system administrators responsible for maintaining the service's health.
Service Overview
- Purpose: Manages Citadel-specific authorization policies (tenants, users, attributes) and acts as a Policy Information Point (PIP) for downstream services.
- Architecture: See System Overview and IAM Service Design.
- Repository: apps/iam-service
Key Dependencies
- PostgreSQL Database: Stores all policy information (tenants, users, attributes). This is the single source of truth for authorization policy.
- NATS / Kafka (Event Bus): Used to publish domain events (e.g., UserCreated, TenantCreated). An event bus outage won't cause the API to fail, because events are written through the Outbox pattern, but it will prevent downstream services from reacting to IAM events.
Monitoring & Key Metrics
- http_requests_total: Measures the rate of incoming API requests.
- http_request_duration_seconds: Tracks the latency of request handling. Spikes can indicate database performance issues.
- database_errors_total: A custom metric counting failures during database operations. Any increase requires immediate investigation.
- event_bus_errors_total: Counts failures when the Outbox worker attempts to publish events to the message broker (NATS/Kafka).
- outbox_pending_count: Tracks the number of events waiting in the outbox table. A growing number indicates the publisher is stuck or the broker is down.
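The metrics above can be checked by hand during an incident. The following is a minimal sketch, assuming the metrics are scraped by Prometheus, that http_request_duration_seconds is exported as a histogram, and that Prometheus is reachable at the hypothetical PROM_URL. None of these are confirmed by this runbook, so adjust to the actual monitoring stack.

```shell
# Sketch only: PROM_URL is a hypothetical address, not taken from this runbook.
PROM_URL="${PROM_URL:-http://localhost:9090}"

# PromQL for p99 request latency over the last 5 minutes.
# Assumes http_request_duration_seconds is a histogram (has _bucket series).
latency_p99_query() {
  echo 'histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))'
}

# PromQL for database errors added in the last 5 minutes.
db_errors_query() {
  echo 'increase(database_errors_total[5m])'
}

# PromQL for the change in outbox backlog over the last 15 minutes.
outbox_growth_query() {
  echo 'delta(outbox_pending_count[15m])'
}

# Run an instant query against the Prometheus HTTP API.
prom_query() {
  curl -sS -G "${PROM_URL}/api/v1/query" --data-urlencode "query=$1"
}

# Usage (requires a reachable Prometheus):
#   prom_query "$(latency_p99_query)"
#   prom_query "$(db_errors_query)"
```

A positive result from the outbox query over a sustained window corresponds to the "growing number" condition described for outbox_pending_count.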
Common Alerts & Troubleshooting
Alert: High Latency on API Endpoints
- Symptom: Users experience slow response times. The http_request_duration_seconds metric is high.
- Check Database: Investigate the performance of the PostgreSQL instance. Are there slow queries? Is the CPU or connection pool saturated?
    # (Example) Connect to the database and check for long-running queries
    kubectl exec -it <postgres-pod-name> -- psql -U <user> -d <db> -c "SELECT * FROM pg_stat_activity WHERE state = 'active';"
- Action: Optimize slow queries or scale the database instance if it is under-provisioned.
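The pg_stat_activity check above returns every active session, which can be noisy on a busy instance. A narrower variant, sketched here, keeps only queries that have been running longer than a threshold and sorts them by duration. The helper name long_running_sql and the 5-second default are illustrative choices, not part of the service.

```shell
# Print SQL that lists active queries running longer than N seconds, longest first.
# long_running_sql is a hypothetical helper; the default threshold is arbitrary.
long_running_sql() {
  min_seconds="${1:-5}"
  cat <<SQL
SELECT pid, usename, state, wait_event_type,
       now() - query_start AS duration,
       left(query, 80) AS query
FROM pg_stat_activity
WHERE state = 'active'
  AND now() - query_start > interval '${min_seconds} seconds'
ORDER BY duration DESC;
SQL
}

# Usage (same placeholders as the command above):
#   kubectl exec -it <postgres-pod-name> -- psql -U <user> -d <db> -c "$(long_running_sql 10)"
```

The wait_event_type column helps distinguish queries blocked on locks from queries that are doing CPU or I/O work.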
Alert: High Outbox Pending Count
- Symptom: The outbox_pending_count metric is increasing. Downstream services are not receiving updates.
- Check Event Bus Connectivity: Verify that the iam-service can reach the NATS or Kafka broker.
- Check Logs: Look for error messages related to the EventBusAdapter or OutboxWorker.
    kubectl logs -l app=iam-service | grep "failed to publish event"
- Action: Restore connectivity to the message broker. The Outbox worker will automatically retry publishing events once the connection is restored.
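One way to carry out the connectivity check is to extract the broker host and port from the configured URL and test the TCP port from inside the pod. This is a rough sketch: it assumes the broker address is a URL such as the value of EVENT_BUS_URL, that the container image ships nc (this may not be the case), and broker_host_port is a hypothetical helper.

```shell
# Print "host port" for a URL such as nats://nats.example.svc:4222.
# Handles optional user:pass@ credentials and a trailing path; IPv6 literals are not handled.
# The second argument is the fallback port (4222 is the NATS default; pass 9092 for Kafka).
broker_host_port() {
  url="$1"
  rest="${url#*://}"   # drop the scheme
  rest="${rest%%/*}"   # drop any path
  rest="${rest##*@}"   # drop any credentials
  host="${rest%%:*}"
  port="${rest##*:}"
  if [ "$port" = "$rest" ]; then
    port="${2:-4222}"  # no explicit port in the URL
  fi
  echo "$host $port"
}

# Usage (assumes nc is available in the iam-service image):
#   set -- $(broker_host_port "<value of EVENT_BUS_URL>")
#   kubectl exec deploy/iam-service -- nc -z -w 3 "$1" "$2"
```

A successful connection only shows that the port is reachable. Authentication or TLS problems will still surface in the service logs rather than here.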
Alert: Service is Crash-Looping (CrashLoopBackOff)
- Symptom: kubectl get pods shows the iam-service pod is constantly restarting.
- Check Logs for Fatal Errors: The most common cause of a startup failure is the inability to connect to a critical dependency.
    # Get logs from the previously terminated container
    kubectl logs <pod-name> --previous
- Common Causes:
  - Invalid Database Connection String: The DATABASE_URL environment variable is incorrect.
  - Invalid Event Bus Configuration: The EVENT_BUS_URL or related variables are incorrect.
- Action: Check the ConfigMap and Secret mounted by the deployment. Correct any invalid values and re-apply the deployment.
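Because a crash-looping container cannot be exec'd into, the configured values have to be read from the Secret or ConfigMap directly. The sketch below assumes DATABASE_URL is stored as a key in a Secret (the name is left as a placeholder) and that it is a PostgreSQL connection URI. Neither is confirmed here, and looks_like_postgres_url is a hypothetical helper that performs only a loose shape check.

```shell
# Return 0 if the value has the rough shape of a PostgreSQL connection URI
# (postgres:// or postgresql:// followed by an authority and a database path).
# This is a loose sanity check, not a full URI validation.
looks_like_postgres_url() {
  case "$1" in
    postgres://*/*|postgresql://*/*) return 0 ;;
    *) return 1 ;;
  esac
}

# Usage (assumes DATABASE_URL is a key in the Secret; <secret-name> is a placeholder):
#   value="$(kubectl get secret <secret-name> -o jsonpath='{.data.DATABASE_URL}' | base64 --decode)"
#   looks_like_postgres_url "$value" || echo "DATABASE_URL does not look like a PostgreSQL URI"
```

Avoid echoing the decoded value itself into shared terminals or tickets, since it usually contains credentials.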
Deployment & Rollback
- Deployment: Handled automatically by the CI/CD pipeline on merge to the main branch.
- Rollback: Use the standard Kubernetes rollback procedure if a bad deployment occurs.
    kubectl rollout undo deployment/iam-service
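In practice it helps to review the revision history before rolling back and to wait for the rollout to settle afterwards. The following sketch wraps those steps around the rollback command. The rollback and run helpers, the DRY_RUN switch, and the 120-second timeout are illustrative assumptions rather than an established procedure for this service.

```shell
DEPLOYMENT="deployment/iam-service"

# Print the command when DRY_RUN is set; otherwise execute it.
run() {
  if [ -n "${DRY_RUN:-}" ]; then echo "$*"; else "$@"; fi
}

# Roll back to the previous revision, or to a specific one if a number is given.
rollback() {
  revision="${1:-}"
  run kubectl rollout history "$DEPLOYMENT"
  if [ -n "$revision" ]; then
    run kubectl rollout undo "$DEPLOYMENT" --to-revision="$revision"
  else
    run kubectl rollout undo "$DEPLOYMENT"
  fi
  # Block until the rollback has finished or the timeout expires.
  run kubectl rollout status "$DEPLOYMENT" --timeout=120s
}

# Usage:
#   DRY_RUN=1 rollback      # show the commands without running them
#   rollback 3              # roll back to revision 3
```

Since deployment is driven by merges to the main branch, a rollback done this way may be overwritten by the next pipeline run unless the offending change is also reverted in the repository.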