
Runbook: iam-service

This document provides operational guidance for the iam-service. It is intended for on-call engineers and system administrators responsible for maintaining the service's health.

Service Overview

  • Purpose: Manages Citadel-specific authorization policies (tenants, users, attributes) and acts as a Policy Information Point (PIP) for downstream services.
  • Architecture: See System Overview and IAM Service Design.
  • Repository: apps/iam-service

Key Dependencies

  • PostgreSQL Database: Stores all policy information (tenants, users, attributes). This is the single source of truth for authorization policy.
  • NATS / Kafka (Event Bus): Used to publish domain events (e.g., UserCreated, TenantCreated). While an event bus outage won't cause the API to fail (due to the Outbox pattern), it will prevent downstream services from reacting to IAM events.

Monitoring & Key Metrics

  • http_requests_total: Measures the rate of incoming API requests.
  • http_request_duration_seconds: Tracks the latency of request handling. Spikes can indicate database performance issues.
  • database_errors_total: A custom counter of failures during database operations. Because a counter never resets while the process runs, watch its rate of change rather than its absolute value: any increase requires immediate investigation.
  • event_bus_errors_total: Counts failures when the Outbox worker attempts to publish events to the message broker (NATS/Kafka).
  • outbox_pending_count: Tracks the number of events waiting in the outbox table. A growing number indicates the publisher is stuck or the broker is down.
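
For a quick spot check of a single metric without opening a dashboard, a small helper along these lines may be enough. It is a minimal sketch: it only handles metrics exposed without labels in the Prometheus text format, and the /metrics path and port 8080 in the usage comment are assumptions, not values confirmed by this runbook.

```shell
#!/usr/bin/env bash
# Print the value of an unlabelled metric from Prometheus text-format
# input on stdin. Comment lines (# HELP, # TYPE) never match because
# their first field is "#".
metric_value() {
  local name="$1"
  awk -v m="$name" '$1 == m { print $2; exit }'
}

# Hypothetical usage (port and path are assumptions):
#   kubectl port-forward deployment/iam-service 8080:8080 &
#   curl -s http://localhost:8080/metrics | metric_value outbox_pending_count
```

Labelled series such as per-route request counters would need a PromQL query or a more careful parser.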

Common Alerts & Troubleshooting

Alert: High Latency on API Endpoints

  1. Symptom: Users experience slow response times. The http_request_duration_seconds metric is high.
  2. Check Database: Investigate the performance of the PostgreSQL instance. Are there slow queries? Is the CPU or connection pool saturated?
    # (Example) Connect to the database and list active queries, longest-running first
    kubectl exec -it <postgres-pod-name> -- psql -U <user> -d <db> -c "SELECT pid, now() - query_start AS duration, state, query FROM pg_stat_activity WHERE state = 'active' ORDER BY duration DESC;"
  3. Action: Optimize slow queries or scale the database instance if it is under-provisioned.
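
To judge connection saturation on the PostgreSQL side, one option is to compare the number of open sessions with the server's max_connections setting. The sketch below assumes the same pod, user and database as in step 2; PG_POD, PG_USER and PG_DB are hypothetical environment variables standing in for those placeholders, and the helper names are illustrative only.

```shell
#!/usr/bin/env bash
# Integer percentage of the connection limit currently in use.
conn_usage_pct() {
  local used="$1" max="$2"
  echo $(( used * 100 / max ))
}

# Run a single-value SQL query inside the database pod.
# PG_POD, PG_USER and PG_DB are hypothetical variables (assumptions).
# -A and -t make psql print the bare value without headers or alignment.
pg_scalar() {
  kubectl exec "$PG_POD" -- psql -U "$PG_USER" -d "$PG_DB" -Atc "$1"
}

# Usage:
#   used=$(pg_scalar "SELECT count(*) FROM pg_stat_activity;")
#   max=$(pg_scalar "SHOW max_connections;")
#   echo "connections in use: $(conn_usage_pct "$used" "$max")%"
```

This looks at the database server's limit, not at the service's own connection pool, so a low percentage here does not rule out pool exhaustion inside iam-service.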

Alert: High Outbox Pending Count

  1. Symptom: The outbox_pending_count metric is increasing. Downstream services are not receiving updates.
  2. Check Event Bus Connectivity: Verify that the iam-service can reach the NATS or Kafka broker.
  3. Check Logs: Look for error messages related to the EventBusAdapter or OutboxWorker.
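    # With a label selector, kubectl logs returns only the last 10 lines per pod unless --tail is set
    kubectl logs -l app=iam-service --tail=-1 | grep "failed to publish event"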
  4. Action: Restore connectivity to the message broker. The Outbox worker will automatically retry publishing events once the connection is restored.
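
After connectivity is restored, it helps to confirm that the backlog is actually draining. One way is to read outbox_pending_count twice, some seconds apart, and compare the two samples. The helper below is a minimal sketch with a hypothetical name, and it assumes the samples are plain integers.

```shell
#!/usr/bin/env bash
# Classify two integer samples of outbox_pending_count taken some time apart.
outbox_trend() {
  local first="$1" second="$2"
  if [ "$second" -gt "$first" ]; then
    echo "growing"
  elif [ "$second" -lt "$first" ]; then
    echo "draining"
  else
    echo "flat"
  fi
}

# Usage, with two readings taken about a minute apart:
#   outbox_trend 120 45
```

A "flat" result at a non-zero value still deserves a look at the worker logs, since it can mean the publisher is stuck while no new events are arriving.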

Alert: Service is Crash-Looping (CrashLoopBackOff)

  1. Symptom: kubectl get pods shows the iam-service pod is constantly restarting.
  2. Check Logs for Fatal Errors: The most common cause of a startup failure is the inability to connect to a critical dependency.
    # Get logs from the previously terminated container
    kubectl logs <pod-name> --previous
  3. Common Causes:
    • Invalid Database Connection String: The DATABASE_URL environment variable is incorrect.
    • Invalid Event Bus Configuration: The EVENT_BUS_URL or related variables are incorrect.
  4. Action: Check the ConfigMap and Secret mounted by the deployment. Correct any invalid values and re-apply the deployment.
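
Before editing anything, it can be useful to see which of the expected variables are missing or empty. The sketch below reads KEY=VALUE lines on stdin and prints the names that are absent or have no value. The helper name is hypothetical, and the usage comment assumes the variables are defined as plain env entries on the deployment; values injected through envFrom may not show up in that listing.

```shell
#!/usr/bin/env bash
# Read KEY=VALUE lines on stdin and print each requested name that is
# missing or has an empty value.
missing_vars() {
  local input name
  input="$(cat)"
  for name in "$@"; do
    printf '%s\n' "$input" | grep -Eq "^${name}=.+" || echo "$name"
  done
}

# Hypothetical usage (--resolve prints referenced Secret values to the
# terminal, so treat the output as sensitive):
#   kubectl set env deployment/iam-service --list --resolve \
#     | missing_vars DATABASE_URL EVENT_BUS_URL
```

This only detects absent or empty values. A variable that is present but points at the wrong host will still need to be checked by hand against the container logs.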

Deployment & Rollback

  • Deployment: Handled automatically by the CI/CD pipeline on merge to the main branch.
  • Rollback: Use the standard Kubernetes rollback procedure if a bad deployment occurs.
    kubectl rollout undo deployment/iam-service
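
A rollback is only complete once the previous revision is running again. The sketch below chains the undo with a wait on the rollout; the function name and the 120-second timeout are illustrative choices, not values defined by this runbook.

```shell
#!/usr/bin/env bash
# Roll a deployment back to its previous revision, then block until the
# rollout finishes or the timeout expires. Returns non-zero on failure.
rollback_and_wait() {
  local deployment="${1:-iam-service}"
  kubectl rollout undo "deployment/${deployment}" &&
    kubectl rollout status "deployment/${deployment}" --timeout=120s
}

# Usage:
#   rollback_and_wait iam-service
#   kubectl rollout history deployment/iam-service   # inspect available revisions
```

If the status check times out, the pods of the previous revision are probably failing as well, and the crash-loop steps above apply.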