Runbook: iam-service
This document provides operational guidance for the iam-service. It is intended for on-call engineers and system administrators responsible for maintaining the service's health.
Service Overview
- Purpose: Manages Citadel-specific authorization policies (tenants, roles, user mappings) and provisions client credentials in the upstream IdP.
- Architecture: See System Overview and IAM Service Design.
- Repository: apps/iam-service
Key Dependencies
- Upstream IdP (R-Auth): The service makes real-time API calls to the IdP to create, update, and delete OAuth2 clients. A failure in the IdP's admin API will directly impact client management.
- PostgreSQL Database: Stores all policy information (tenants, users, roles, clients). This is the single source of truth for authorization policy.
- NATS / Event Bus: Used to publish domain events (e.g., ClientCreated). While a NATS outage won't cause the service to fail, it will prevent other services from reacting to IAM events.
Monitoring & Key Metrics
- http_requests_total{handler="/system/enrich-token"}: Measures the rate of claims enrichment calls from the API Gateway. A sudden drop to zero indicates a major problem at the gateway or with the upstream IdP.
- http_request_duration_seconds{handler="/system/enrich-token"}: Tracks the latency of policy lookups. Spikes can indicate database performance issues.
- idp_adapter_errors_total: A custom metric counting failures when communicating with the upstream IdP. A rising count indicates a problem with the IdP's availability or our credentials.
- database_errors_total: A custom metric counting failures during database operations. Any value greater than zero requires immediate investigation.
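The metrics above can be turned into quick ad hoc checks. The following is a minimal sketch, not the team's actual dashboards or alert rules: the helper names are hypothetical, and the p95 query assumes http_request_duration_seconds is exported as a Prometheus histogram (so a _bucket series exists). The functions only print PromQL expressions, which can be pasted into whatever query UI is in use.

```shell
# Hypothetical helpers that print PromQL expressions for the metrics above.
HANDLER='/system/enrich-token'

# Request rate for the enrichment handler over a window (default 5m).
enrich_rate_query() {
  local window="${1:-5m}"
  printf 'sum(rate(http_requests_total{handler="%s"}[%s]))\n' "$HANDLER" "$window"
}

# p95 latency for the enrichment handler over a window (default 5m).
# Assumption: the duration metric is a histogram with a _bucket series.
enrich_p95_query() {
  local window="${1:-5m}"
  printf 'histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket{handler="%s"}[%s])))\n' "$HANDLER" "$window"
}

# Number of new errors recorded by a counter over a window (default 15m).
error_increase_query() {
  local metric="$1" window="${2:-15m}"
  printf 'sum(increase(%s[%s]))\n' "$metric" "$window"
}
```

For example, `error_increase_query idp_adapter_errors_total 30m` prints a query for the number of IdP adapter failures in the last 30 minutes. Because both error metrics are counters, looking at the increase over a window is usually more informative than the raw value.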
Common Alerts & Troubleshooting
Alert: High Latency on /system/enrich-token
- Symptom: Users experience slow login times or API requests are timing out. The http_request_duration_seconds metric for this handler is high.
- Check Database: Investigate the performance of the PostgreSQL instance. Are there slow queries? Is the CPU or connection pool saturated?
    # (Example) Connect to the database and check for long-running queries
    kubectl exec -it <postgres-pod-name> -- psql -U <user> -d <db> -c "SELECT * FROM pg_stat_activity WHERE state = 'active';"
- Action: Optimize slow queries or scale the database instance if it is under-provisioned.
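The database check above lists every active session, which can be noisy on a busy instance. As a sketch of one way to narrow it down, the hypothetical helper below prints a query that keeps only statements running longer than a threshold and sorts them longest first. It uses standard pg_stat_activity columns; the function name and the 5-second default are assumptions, not part of the service.

```shell
# Print SQL that lists active queries running longer than N seconds
# (default 5), longest first. Rejects non-numeric thresholds.
long_running_sql() {
  local secs="${1:-5}"
  case "$secs" in
    ''|*[!0-9]*)
      echo "threshold must be a whole number of seconds" >&2
      return 1
      ;;
  esac
  printf "SELECT pid, usename, now() - query_start AS duration, wait_event_type, left(query, 80) AS query FROM pg_stat_activity WHERE state = 'active' AND now() - query_start > interval '%s seconds' ORDER BY duration DESC;\n" "$secs"
}

# Usage (placeholders as in the example above):
#   kubectl exec -it <postgres-pod-name> -- psql -U <user> -d <db> -c "$(long_running_sql 10)"
```

A non-empty wait_event_type on the slow rows suggests the query is waiting on a lock or I/O rather than burning CPU, which helps decide between optimizing the query and scaling the instance.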
Alert: High Error Rate on Client Management Endpoints (/clients)
- Symptom: The idp_adapter_errors_total metric is increasing. Administrators report being unable to create or update API clients.
- Check Service Logs: Look for error messages related to the RAuthAdapter or HTTP client failures.
    kubectl logs -l app=iam-service -c iam-service | grep "failed to update client in upstream IdP"
- Check IdP Connectivity: Verify that the iam-service can reach the upstream IdP's admin API. Check for network policy issues or DNS problems.
- Check IdP Credentials: The admin credentials used by the iam-service to talk to the IdP may have expired or been rotated. Verify the relevant Kubernetes secret is up to date.
- Action: Update the credentials secret and restart the iam-service pods.
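The log check and restart steps above can be sketched as two small helpers. This is an illustration under assumptions: the function names and the KUBECTL override variable are hypothetical, and the deployment name deployment/iam-service is taken from the rollback command elsewhere in this runbook. Note that when a label selector (-l) is used, kubectl logs returns only the last 10 lines per pod unless --tail is set, so the usage line passes --tail=-1.

```shell
# Allow the kubectl binary to be overridden (hypothetical convenience).
KUBECTL="${KUBECTL:-kubectl}"

# Count log lines on stdin that report an upstream IdP update failure.
# grep -c exits non-zero when the count is 0, so the exit code is masked.
count_idp_failures() {
  grep -c "failed to update client in upstream IdP" || true
}

# Restart the pods after the credentials secret has been updated,
# then wait for the rollout to finish.
restart_iam_service() {
  "$KUBECTL" rollout restart deployment/iam-service &&
    "$KUBECTL" rollout status deployment/iam-service --timeout=120s
}

# Usage:
#   kubectl logs -l app=iam-service -c iam-service --tail=-1 --since=15m | count_idp_failures
#   restart_iam_service
```

Running the count before and after the restart gives a quick signal on whether the credential update resolved the failures.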
Alert: Service is Crash-Looping (CrashLoopBackOff)
- Symptom: kubectl get pods shows the iam-service pod is constantly restarting.
- Check Logs for Fatal Errors: The most common cause of a startup failure is the inability to connect to a critical dependency.
    # Get logs from the previously terminated container
    kubectl logs <pod-name> --previous
- Common Causes:
  - Invalid Database Connection String: The DATABASE_URL environment variable is incorrect.
  - Invalid IdP Configuration: The RAUTH_URL or other related environment variables are incorrect.
- Action: Check the ConfigMap and Secret mounted by the deployment. Correct any invalid values and re-apply the deployment.
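A crash-looping pod cannot be inspected with kubectl exec, so the configuration values usually have to be read from the ConfigMap or Secret directly. The sketch below shows a basic sanity check on their shape. It rests on assumptions that should be confirmed against the service: that DATABASE_URL is a PostgreSQL connection URI (postgres:// or postgresql://) and that RAUTH_URL is an HTTP(S) URL. The function names, the secret name, and the key layout in the usage line are hypothetical.

```shell
# Return 0 if the value looks like a PostgreSQL connection URI.
looks_like_postgres_url() {
  case "$1" in
    postgres://?*|postgresql://?*) return 0 ;;
    *) return 1 ;;
  esac
}

# Return 0 if the value looks like an HTTP or HTTPS URL.
looks_like_http_url() {
  case "$1" in
    http://?*|https://?*) return 0 ;;
    *) return 1 ;;
  esac
}

# Usage (secret name and key are placeholders; base64 -d assumes GNU coreutils):
#   value="$(kubectl get secret <secret-name> -o jsonpath='{.data.DATABASE_URL}' | base64 -d)"
#   looks_like_postgres_url "$value" || echo "DATABASE_URL does not look like a PostgreSQL URI"
```

These checks only catch malformed or empty values. A well-formed URL that points at the wrong host or uses stale credentials will still pass, so the container logs remain the primary source of truth.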
Deployment & Rollback
- Deployment: Handled automatically by the CI/CD pipeline on merge to the main branch.
- Rollback: Use the standard Kubernetes rollback procedure if a bad deployment occurs.
    kubectl rollout undo deployment/iam-service
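The rollback command above returns to the previous revision. When the last known-good release is further back, a specific revision can be targeted. The following sketch wraps both cases and waits for the rollout to complete. The function name, the KUBECTL override variable, and the 120-second timeout are assumptions for illustration.

```shell
# Allow the kubectl binary to be overridden (hypothetical convenience).
KUBECTL="${KUBECTL:-kubectl}"

# Roll back deployment/iam-service and wait for the rollout to finish.
# With no argument, return to the previous revision; with a revision
# number, return to that specific revision.
rollback_iam_service() {
  local revision="${1:-}"
  if [ -n "$revision" ]; then
    "$KUBECTL" rollout undo deployment/iam-service --to-revision="$revision" || return 1
  else
    "$KUBECTL" rollout undo deployment/iam-service || return 1
  fi
  "$KUBECTL" rollout status deployment/iam-service --timeout=120s
}

# Usage:
#   kubectl rollout history deployment/iam-service   # list available revisions
#   rollback_iam_service                             # previous revision
#   rollback_iam_service 7                           # a specific revision
```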