Runbook: iam-service

This document provides operational guidance for the iam-service. It is intended for on-call engineers and system administrators responsible for maintaining the service's health.

Service Overview

  • Purpose: Manages Citadel-specific authorization policies (tenants, roles, user mappings) and provisions client credentials in the upstream IdP.
  • Architecture: See System Overview and IAM Service Design.
  • Repository: apps/iam-service

Key Dependencies

  • Upstream IdP (R-Auth): The service makes real-time API calls to the IdP to create, update, and delete OAuth2 clients. A failure in the IdP's admin API will directly impact client management.
  • PostgreSQL Database: Stores all policy information (tenants, users, roles, clients). This is the single source of truth for authorization policy.
  • NATS / Event Bus: Used to publish domain events (e.g., ClientCreated). While a NATS outage won't cause the service to fail, it will prevent other services from reacting to IAM events.

Monitoring & Key Metrics

  • http_requests_total{handler="/system/enrich-token"}: Measures the rate of claims enrichment calls from the API Gateway. A sudden drop to zero indicates a major problem at the gateway or with the upstream IdP.
  • http_request_duration_seconds{handler="/system/enrich-token"}: Tracks the latency of policy lookups. Spikes can indicate database performance issues.
  • idp_adapter_errors_total: A custom metric counting failures when communicating with the upstream IdP. A rising count indicates a problem with the IdP's availability or our credentials.
  • database_errors_total: A custom metric counting failures during database operations. Because the count only goes up, alert on increases rather than on the absolute value: any increase requires immediate investigation.
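
The metrics above can be spot-checked from a terminal through the Prometheus HTTP API. The following is a minimal sketch, not the team's official tooling: it assumes a Prometheus server that scrapes iam-service is reachable at PROM_URL (a hypothetical variable, defaulting to a local port-forward address), and it assumes the custom metrics are counters, so rate() and increase() apply. The function names are illustrative.

```shell
# Assumption: PROM_URL points at a Prometheus server that scrapes iam-service.
PROM_URL="${PROM_URL:-http://localhost:9090}"

# Run an instant PromQL query through the Prometheus HTTP API.
prom_query() {
  curl -sS -G "${PROM_URL}/api/v1/query" --data-urlencode "query=$1"
}

# Request rate for claims enrichment over the last 5 minutes.
enrich_token_rate() {
  prom_query 'sum(rate(http_requests_total{handler="/system/enrich-token"}[5m]))'
}

# Per-second rate of IdP adapter failures over the last 5 minutes.
idp_error_rate() {
  prom_query 'sum(rate(idp_adapter_errors_total[5m]))'
}

# Number of database failures added in the last 5 minutes.
db_error_increase() {
  prom_query 'sum(increase(database_errors_total[5m]))'
}
```

Calling idp_error_rate or db_error_increase prints the raw JSON response. A non-zero value from db_error_increase means new database failures occurred in the window.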

Common Alerts & Troubleshooting

Alert: High Latency on /system/enrich-token

  1. Symptom: Logins are slow or API requests time out. The http_request_duration_seconds metric for this handler is high.
  2. Check Database: Investigate the performance of the PostgreSQL instance. Are there slow queries? Is the CPU or connection pool saturated?
    # (Example) Connect to the database and list active queries, longest-running first
    kubectl exec -it <postgres-pod-name> -- psql -U <user> -d <db> -c "SELECT pid, now() - query_start AS duration, state, query FROM pg_stat_activity WHERE state = 'active' ORDER BY duration DESC;"
  3. Action: Optimize slow queries or scale the database instance if it is under-provisioned.
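
To answer the connection-saturation question in step 2, the connection count can be compared against the server limit. This is a sketch under the same assumptions as the example command above (PostgreSQL runs in a pod and psql is available inside it). The function name is hypothetical, and the FILTER clause requires PostgreSQL 9.4 or later.

```shell
# Usage: pg_connection_usage <postgres-pod-name> <user> <db>
# Prints total and active connections next to the configured max_connections.
pg_connection_usage() {
  kubectl exec "$1" -- psql -U "$2" -d "$3" -c \
    "SELECT count(*) AS total, count(*) FILTER (WHERE state = 'active') AS active, current_setting('max_connections') AS max_connections FROM pg_stat_activity;"
}
```

A total close to max_connections suggests the limit is nearly exhausted. On recent PostgreSQL versions the total also includes background processes, so treat it as an upper bound on client connections.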

Alert: High Error Rate on Client Management Endpoints (/clients)

  1. Symptom: The idp_adapter_errors_total metric is increasing. Administrators report being unable to create or update API clients.
  2. Check Service Logs: Look for error messages related to the RAuthAdapter or HTTP client failures.
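    # With a label selector (-l), kubectl returns only the last 10 lines per pod unless --tail is set
    kubectl logs -l app=iam-service -c iam-service --tail=-1 --since=1h | grep "failed to update client in upstream IdP"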
  3. Check IdP Connectivity: Verify that the iam-service can reach the upstream IdP's admin API. Check for network policy issues or DNS problems.
  4. Check IdP Credentials: The admin credentials used by the iam-service to talk to the IdP may have expired or been rotated. Verify the relevant Kubernetes secret is up to date.
  5. Action: If the credentials are stale, update the credentials secret and restart the iam-service pods so they load the new values.
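
Steps 4 and 5 can be sketched as two small helpers. The Secret name is not given in this runbook, so it is passed in as an argument. The function names are hypothetical, and the deployment name iam-service is taken from the rollback command in this document.

```shell
# Usage: describe_idp_secret <secret-name>
# Shows the Secret's keys and value sizes without printing the values.
describe_idp_secret() {
  kubectl describe secret "$1"
}

# Usage: restart_iam_service [timeout]
# Restarts the pods one by one, then waits for the rollout to finish.
restart_iam_service() {
  kubectl rollout restart deployment/iam-service &&
    kubectl rollout status deployment/iam-service --timeout="${1:-120s}"
}
```

After the rollout completes, idp_adapter_errors_total should stop increasing if stale credentials were the cause.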

Alert: Service is Crash-Looping (CrashLoopBackOff)

  1. Symptom: kubectl get pods shows the iam-service pod in CrashLoopBackOff with a rising restart count.
  2. Check Logs for Fatal Errors: The most common cause of a startup failure is the inability to connect to a critical dependency.
    # Get logs from the previously terminated container
    kubectl logs <pod-name> --previous
  3. Common Causes:
    • Invalid Database Connection String: The DATABASE_URL environment variable is incorrect.
    • Invalid IdP Configuration: The RAUTH_URL or other related environment variables are incorrect.
  4. Action: Check the ConfigMap and Secret mounted by the deployment. Correct any invalid values and re-apply the deployment.
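
The checks above can be sketched as follows. The function names are hypothetical and the deployment name iam-service is taken from the rollback command in this document. Variables injected through envFrom may not appear in the first listing, so treat it as a starting point rather than a complete view.

```shell
# Lists the environment variables defined on the deployment's containers,
# including references to ConfigMaps and Secrets.
show_iam_env() {
  kubectl set env deployment/iam-service --list
}

# Usage: show_last_exit <pod-name>
# Prints each container's name with the reason and exit code of its last termination.
show_last_exit() {
  kubectl get pod "$1" -o \
    jsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}{.lastState.terminated.reason}{"\t"}{.lastState.terminated.exitCode}{"\n"}{end}'
}
```

Compare the DATABASE_URL and RAUTH_URL entries against the expected values for the environment before re-applying the deployment.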

Deployment & Rollback

  • Deployment: Handled automatically by the CI/CD pipeline on merge to the main branch.
  • Rollback: Use the standard Kubernetes rollback procedure if a bad deployment occurs.
    kubectl rollout undo deployment/iam-service
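
A slightly fuller rollback sketch, assuming the same deployment name as the command above. The function names are hypothetical. It lists the recorded revisions, rolls back to a chosen revision (or to the previous one if none is given), and waits for the rollout to settle.

```shell
# Lists the revisions Kubernetes has recorded for the deployment.
iam_rollout_history() {
  kubectl rollout history deployment/iam-service
}

# Usage: iam_rollback [revision]
# Rolls back to the given revision, or to the previous one when omitted,
# then waits until the rollout completes.
iam_rollback() {
  if [ -n "${1:-}" ]; then
    kubectl rollout undo deployment/iam-service --to-revision="$1"
  else
    kubectl rollout undo deployment/iam-service
  fi && kubectl rollout status deployment/iam-service
}
```

Because deployments are triggered on merge to main, a later pipeline run may redeploy the faulty version. Reverting the offending commit as well keeps the rollback in place.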