Runbook: iam-service

This document provides operational guidance for the iam-service. It is intended for on-call engineers and system administrators responsible for maintaining the service's health.

Service Overview

Purpose: Manages Citadel-specific authorization policies (tenants, roles, user mappings) and provisions client credentials in the upstream IdP.
Architecture: See System Overview and IAM Service Design.
Repository: apps/iam-service

Key Dependencies

Upstream IdP (R-Auth): The service makes real-time API calls to the IdP to create, update, and delete OAuth2 clients. A failure in the IdP's admin API will directly impact client management.
PostgreSQL Database: Stores all policy information (tenants, users, roles, clients). This is the single source of truth for authorization policy.
NATS / Event Bus: Used to publish domain events (e.g., ClientCreated). While a NATS outage won't cause the service to fail, it will prevent other services from reacting to IAM events.

Monitoring & Key Metrics

http_requests_total{handler="/system/enrich-token"}: Measures the rate of claims enrichment calls from the API Gateway. A sudden drop to zero indicates a major problem at the gateway or with the upstream IdP.
http_request_duration_seconds{handler="/system/enrich-token"}: Tracks the latency of policy lookups. Spikes can indicate database performance issues.
idp_adapter_errors_total: A custom metric counting failures when communicating with the upstream IdP. A rising count indicates a problem with the IdP's availability or our credentials.
database_errors_total: A custom metric counting failures during database operations. Because this is a cumulative counter, any increase requires immediate investigation.
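The two request metrics above are usually read as a rate and a latency quantile. The sketch below builds the corresponding PromQL and sends it to the Prometheus HTTP query API. It assumes that http_request_duration_seconds is exported as a Prometheus histogram (with _bucket series) and that PROM_URL points at your Prometheus server. The function names and the default URL are illustrative, not part of the service.

```shell
# Hypothetical default; point PROM_URL at your Prometheus server.
PROM_URL="${PROM_URL:-http://localhost:9090}"

# PromQL: per-second rate of enrichment calls over the last 5 minutes.
enrich_token_rate_query() {
  printf '%s\n' 'sum(rate(http_requests_total{handler="/system/enrich-token"}[5m]))'
}

# PromQL: p95 latency, assuming the metric is a histogram with _bucket series.
enrich_token_p95_query() {
  printf '%s\n' 'histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket{handler="/system/enrich-token"}[5m])))'
}

# Run an instant query against the Prometheus HTTP API (/api/v1/query).
prom_query() {
  curl -sG --data-urlencode "query=$1" "$PROM_URL/api/v1/query"
}

# Usage:
#   prom_query "$(enrich_token_rate_query)"
#   prom_query "$(enrich_token_p95_query)"
```

The same pattern applies to the two error counters, for example by passing 'increase(idp_adapter_errors_total[15m])' to prom_query.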

Common Alerts & Troubleshooting

Alert: High Latency on `/system/enrich-token`

Symptom: Users experience slow login times or API requests are timing out. The http_request_duration_seconds metric for this handler is high.

Check Database: Investigate the performance of the PostgreSQL instance. Are there slow queries? Is the CPU or connection pool saturated?

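# (Example) Connect to the database and list active queries, longest-running first
kubectl exec -it <postgres-pod-name> -- psql -U <user> -d <db> -c "SELECT pid, now() - query_start AS duration, wait_event_type, left(query, 80) AS query FROM pg_stat_activity WHERE state = 'active' ORDER BY duration DESC;"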

Action: Optimize slow queries or scale the database instance if it is under-provisioned.
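To answer the connection saturation question, it can help to compare the number of open connections against the server limit. The following is a minimal sketch: PG_POD, PG_USER, and PG_DB are placeholders with hypothetical defaults, and the function name is illustrative. It shows server-side connections only; the service's own client-side pool settings are not visible here.

```shell
# Placeholders: set these to your PostgreSQL pod, user, and database.
PG_POD="${PG_POD:-postgres-0}"   # hypothetical pod name
PG_USER="${PG_USER:-postgres}"   # hypothetical user
PG_DB="${PG_DB:-iam}"            # hypothetical database name

# Count connections per state, then show the configured connection limit.
pg_connection_usage() {
  kubectl exec "$PG_POD" -- psql -U "$PG_USER" -d "$PG_DB" \
    -c "SELECT state, count(*) FROM pg_stat_activity GROUP BY state ORDER BY count(*) DESC;" \
    -c "SHOW max_connections;"
}

# Usage:
#   PG_POD=<postgres-pod-name> PG_USER=<user> PG_DB=<db> pg_connection_usage
```

If the total approaches max_connections, or many sessions sit in "idle in transaction", the database is more likely connection-bound than CPU-bound.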

Alert: High Error Rate on Client Management Endpoints (`/clients`)

Symptom: The idp_adapter_errors_total metric is increasing. Administrators report being unable to create or update API clients.

Check Service Logs: Look for error messages related to the RAuthAdapter or HTTP client failures.

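# With a label selector, kubectl logs returns only the last 10 lines per pod by default, so set --tail and --since explicitly
kubectl logs -l app=iam-service -c iam-service --tail=-1 --since=1h | grep "failed to update client in upstream IdP"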

Check IdP Connectivity: Verify that the iam-service can reach the upstream IdP's admin API. Check for network policy issues or DNS problems.
Check IdP Credentials: The admin credentials used by the iam-service to talk to the IdP may have expired or been rotated. Verify the relevant Kubernetes secret is up to date.
Action: Update the credentials secret and restart the iam-service pods.
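The credential check and restart can be sketched as follows. The deployment name iam-service is taken from the rollback command in this runbook; the secret name is not documented here, so it is passed as an argument. The function names are illustrative.

```shell
# List the keys and sizes of a secret without printing its values.
# $1 is the secret name (not specified in this runbook; look it up in the deployment).
show_secret_keys() {
  kubectl describe secret "$1"
}

# Restart the pods so they pick up the updated secret, then wait for the rollout.
restart_iam_service() {
  kubectl rollout restart deployment/iam-service &&
    kubectl rollout status deployment/iam-service --timeout=120s
}

# Usage:
#   show_secret_keys <secret-name>
#   restart_iam_service
```

Waiting on rollout status confirms that the new pods became ready with the rotated credentials before you close the alert.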

Alert: Service is Crash-Looping (CrashLoopBackOff)

Symptom: kubectl get pods shows the iam-service pod is constantly restarting.
Check Logs for Fatal Errors: The most common cause of a startup failure is the inability to connect to a critical dependency.
```
# Get logs from the previously terminated container
kubectl logs <pod-name> --previous
```
Common Causes:
- Invalid Database Connection String: The DATABASE_URL environment variable is incorrect.
- Invalid IdP Configuration: The RAUTH_URL or other related environment variables are incorrect.
Action: Check the ConfigMap and Secret mounted by the deployment. Correct any invalid values and re-apply the deployment.
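Before editing configuration, it can be worth confirming why the container last exited. This is a small sketch using the pod status fields; it assumes the iam-service container is the first container in the pod, and the function name is illustrative.

```shell
# Print the reason and exit code of the last terminated container in a pod.
# Assumes the container of interest is the first one listed in the pod status.
last_termination() {
  kubectl get pod "$1" \
    -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}{" "}{.status.containerStatuses[0].lastState.terminated.exitCode}{"\n"}'
}

# Usage:
#   last_termination <pod-name>
```

A reason of OOMKilled points at the memory limit rather than configuration, while Error with a non-zero exit code is consistent with the application exiting on a bad DATABASE_URL or RAUTH_URL.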

Deployment & Rollback

Deployment: Handled automatically by the CI/CD pipeline on merge to the main branch.
Rollback: Use the standard Kubernetes rollback procedure if a bad deployment occurs.
```
kubectl rollout undo deployment/iam-service
```
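By default, the undo command returns to the immediately previous revision. If that revision is also bad, a specific revision can be targeted instead. The sketch below wraps the relevant kubectl rollout subcommands; the function names and the 120-second timeout are illustrative choices.

```shell
# Show the revision history so you can pick a known-good revision.
rollout_history() {
  kubectl rollout history deployment/iam-service
}

# Roll back to a specific revision ($1) and wait for the rollout to finish.
rollback_to() {
  kubectl rollout undo deployment/iam-service --to-revision="$1" &&
    kubectl rollout status deployment/iam-service --timeout=120s
}

# Usage:
#   rollout_history
#   rollback_to <revision-number>
```

Since deployments are triggered by merges to main, the next merge will likely redeploy the faulty change unless it is also reverted in the repository.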
