Monitoring & Logging
Comprehensive observability for Cerberus IAM.
Overview
Cerberus IAM provides structured logging, audit trails, and health check endpoints for complete observability.
Logging
Structured Logging
Cerberus IAM uses Pino for high-performance, structured JSON logging.
Log Format:
{
"level": 30,
"time": 1704564290000,
"pid": 12345,
"hostname": "api-server-1",
"reqId": "req-abc123",
"req": {
"method": "POST",
"url": "/v1/auth/login",
"headers": {
"user-agent": "Mozilla/5.0...",
"X-Org-Domain": "acme-corp"
}
},
"res": {
"statusCode": 200
},
"responseTime": 145,
"msg": "request completed"
}
Log Levels
Configure via LOG_LEVEL environment variable:
| Level | Code | Usage |
|---|---|---|
| fatal | 60 | Application is unusable |
| error | 50 | Error events that may still allow the application to continue |
| warn | 40 | Potentially harmful situations |
| info | 30 | Informational messages (default) |
| debug | 20 | Detailed information for debugging |
| trace | 10 | Very detailed information |
- Production: Use info or warn
- Development: Use debug
- Troubleshooting: Use trace temporarily
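As an illustration, a minimal sketch of how the configured level might be fed into the logger at startup; the fallback logic here is an assumption, not the actual Cerberus IAM bootstrap code:
import pino from 'pino';

// Assumed fallback: use LOG_LEVEL when it names a valid level, otherwise default to info.
const allowedLevels = ['fatal', 'error', 'warn', 'info', 'debug', 'trace'];
const level = allowedLevels.includes(process.env.LOG_LEVEL ?? '')
  ? (process.env.LOG_LEVEL as string)
  : 'info';

export const logger = pino({ level });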
Request Logging
All HTTP requests are logged with:
{
"req": {
"id": "req-abc123",
"method": "GET",
"url": "/v1/me/profile",
"headers": { ... },
"remoteAddress": "192.168.1.100",
"remotePort": 54321
},
"res": {
"statusCode": 200,
"headers": { ... }
},
"responseTime": 42,
"msg": "request completed"
}
Redacted fields:
- authorization header
- cookie header
- password fields
- clientSecret fields
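A sketch of how this redaction can be expressed with pino's redact option; the exact paths are assumptions based on the fields listed above:
import pino from 'pino';

// Paths here are illustrative; they mirror the redacted fields listed above.
// censor replaces the matched values in the emitted log line.
export const logger = pino({
  redact: {
    paths: [
      'req.headers.authorization',
      'req.headers.cookie',
      '*.password',
      '*.clientSecret',
    ],
    censor: '[REDACTED]',
  },
});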
Error Logging
Errors are logged with stack traces included in non-production environments:
{
"level": 50,
"err": {
"type": "Error",
"message": "Database connection failed",
"stack": "Error: Database connection failed\\n at..."
},
"msg": "request error"
}
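A sketch of one way to include the stack only outside production, using a custom err serializer on top of pino's standard one; the NODE_ENV check is an assumption:
import pino from 'pino';

// Wrap the standard error serializer and omit the stack trace in production.
const errSerializer = (err: Error) => {
  const { stack, ...rest } = pino.stdSerializers.err(err);
  return process.env.NODE_ENV === 'production' ? rest : { ...rest, stack };
};

export const logger = pino({
  serializers: { err: errSerializer },
});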
Remote Log Export
Export logs to external systems (SIEM, log aggregation):
LOG_REMOTE_URL=https://logs.example.com/ingest
LOG_REMOTE_API_KEY=your-api-key
LOG_REMOTE_BATCH_SIZE=50
LOG_REMOTE_FLUSH_INTERVAL_MS=5000
Supported targets:
- Elasticsearch
- Splunk
- Datadog
- New Relic
- CloudWatch Logs
- Google Cloud Logging
- Custom HTTP endpoints
Implementation:
import pino from 'pino';
const logger = pino({
level: config.logLevel,
transport: config.logRemoteUrl
? {
target: 'pino-http-send',
options: {
url: config.logRemoteUrl,
headers: {
Authorization: `Bearer ${config.logRemoteApiKey}`,
},
batchSize: config.logRemoteBatchSize,
interval: config.logRemoteFlushIntervalMs,
},
}
: undefined,
});
Audit Logging
Audit Events
Comprehensive audit trail for security-relevant events:
Event Categories:
- auth - Authentication events (login, logout, MFA)
- user - User lifecycle (create, update, delete)
- client - OAuth client management
- permission - Role and permission changes
- system - System events
Event Types:
user.login
user.logout
user.login.failed
user.create
user.update
user.delete
user.mfa.enrolled
user.mfa.verified
user.password.reset
token.issued
token.refreshed
token.revoked
token.introspected
client.create
client.update
client.revoke
role.create
role.update
role.permission.added
Audit Log Schema
{
id: string;
organisationId: string;
userId?: string;
clientId?: string;
eventType: string;
eventCategory: string;
action: 'create' | 'read' | 'update' | 'delete' | 'login' | 'logout';
resourceType?: string;
resourceId?: string;
ipAddress: string;
userAgent: string;
success: boolean;
errorMessage?: string;
metadata?: Record<string, any>;
createdAt: Date;
}
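For illustration, a sketch of a helper that records a login event against this schema; writeAuditLog and the way request details are passed in are hypothetical, not the actual Cerberus IAM implementation, and optional fields are omitted for brevity:
import { randomUUID } from 'node:crypto';

interface AuditLogEntry {
  id: string;
  organisationId: string;
  userId?: string;
  eventType: string;
  eventCategory: string;
  action: 'create' | 'read' | 'update' | 'delete' | 'login' | 'logout';
  ipAddress: string;
  userAgent: string;
  success: boolean;
  createdAt: Date;
}

// Hypothetical persistence call, e.g. an INSERT into the audit log table.
declare function writeAuditLog(entry: AuditLogEntry): Promise<void>;

// Record a successful login as an auditable event.
export async function auditLogin(
  organisationId: string,
  userId: string,
  ipAddress: string,
  userAgent: string,
): Promise<void> {
  await writeAuditLog({
    id: randomUUID(),
    organisationId,
    userId,
    eventType: 'user.login',
    eventCategory: 'auth',
    action: 'login',
    ipAddress,
    userAgent,
    success: true,
    createdAt: new Date(),
  });
}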
Query Audit Logs
Endpoint: GET /v1/admin/audit-logs
curl "https://auth.example.com/v1/admin/audit-logs?eventType=user.login&startDate=2024-01-01&limit=100" \
-H "X-Org-Domain: acme-corp" \
-H "Cookie: cerb_sid=..."Filters:
- userId - Filter by user
- clientId - Filter by OAuth client
- eventType - Filter by event type
- eventCategory - Filter by category
- action - Filter by action
- startDate / endDate - Date range
- success - Filter by success or failure
- limit / offset - Pagination
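The same query issued from TypeScript, as a sketch; a global fetch is assumed to be available and cookie handling is simplified:
// Build the filter query string and fetch one page of audit logs.
const params = new URLSearchParams({
  eventType: 'user.login',
  startDate: '2024-01-01',
  limit: '100',
  offset: '0',
});

const response = await fetch(`https://auth.example.com/v1/admin/audit-logs?${params}`, {
  headers: {
    'X-Org-Domain': 'acme-corp',
    Cookie: 'cerb_sid=...',
  },
});
const { auditLogs, total } = await response.json();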
Response:
{
"auditLogs": [
{
"id": "log-uuid",
"eventType": "user.login",
"eventCategory": "auth",
"action": "login",
"userId": "user-uuid",
"ipAddress": "192.168.1.100",
"userAgent": "Mozilla/5.0...",
"success": true,
"createdAt": "2024-01-15T10:00:00Z"
}
],
"total": 1250,
"limit": 100,
"offset": 0
}
Health Checks
Health Endpoint
Endpoint: GET /health
Response:
{
"status": "ok",
"timestamp": "2024-01-15T10:00:00.000Z"
}
Status Codes:
- 200 - Healthy
- 503 - Unhealthy (future: database checks, etc.)
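A minimal sketch of the handler behind this endpoint, assuming an Express-style app:
import express from 'express';

const app = express();

// Liveness: confirm the process is up and able to respond.
app.get('/health', (_req, res) => {
  res.status(200).json({
    status: 'ok',
    timestamp: new Date().toISOString(),
  });
});

// A future readiness variant could additionally ping the database and
// return 503 when a dependency check fails.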
Load Balancer Configuration
# AWS ALB Target Group
HealthCheck:
  Protocol: HTTP
  Path: /health
  Port: 4000
  Interval: 30
  Timeout: 5
  HealthyThreshold: 2
  UnhealthyThreshold: 3
  Matcher: 200
Kubernetes Probes
livenessProbe:
  httpGet:
    path: /health
    port: 4000
  initialDelaySeconds: 30
  periodSeconds: 30
  timeoutSeconds: 5
  failureThreshold: 3
readinessProbe:
  httpGet:
    path: /health
    port: 4000
  initialDelaySeconds: 10
  periodSeconds: 10
  timeoutSeconds: 5
  failureThreshold: 2
Metrics
Key Metrics to Track
Request Metrics:
- Request rate (requests/second)
- Request latency (p50, p95, p99)
- Error rate (4xx, 5xx)
- Status code distribution
Authentication Metrics:
- Login success/failure rate
- MFA verification rate
- Password reset requests
- Session creation rate
- Session expiration rate
OAuth2 Metrics:
- Token issuance rate
- Token refresh rate
- Token revocation rate
- Authorization code usage
- Client usage breakdown
Database Metrics:
- Connection pool usage
- Query duration
- Active connections
- Slow queries (>1s)
- Deadlocks
System Metrics:
- CPU usage
- Memory usage
- Disk I/O
- Network I/O
- Uptime
Exporting Metrics
Prometheus (Future)
import promClient from 'prom-client';
const register = new promClient.Registry();
// HTTP request duration
const httpDuration = new promClient.Histogram({
name: 'http_request_duration_seconds',
help: 'Duration of HTTP requests in seconds',
labelNames: ['method', 'route', 'status_code'],
buckets: [0.1, 0.5, 1, 2, 5],
registers: [register],
});
// Active sessions
const activeSessions = new promClient.Gauge({
name: 'cerberus_active_sessions',
help: 'Number of active user sessions',
registers: [register],
});
// Metrics endpoint
app.get('/metrics', async (req, res) => {
res.set('Content-Type', register.contentType);
res.end(await register.metrics());
});
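The histogram above only produces data once requests are observed into it; a sketch of Express-style middleware that does so, reusing app and httpDuration from the example above (the route labelling is a simplification):
// Time every request and record it against httpDuration (defined above).
app.use((req, res, next) => {
  const endTimer = httpDuration.startTimer();
  res.on('finish', () => {
    endTimer({
      method: req.method,
      route: req.route?.path ?? req.path,
      status_code: String(res.statusCode),
    });
  });
  next();
});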
StatsD
import StatsD from 'node-statsd';
const statsd = new StatsD({
host: 'statsd.internal',
port: 8125,
prefix: 'cerberus.',
});
// Track login
statsd.increment('auth.login.success');
statsd.timing('auth.login.duration', responseTime);
Monitoring Tools
Datadog
import tracer from 'dd-trace';
tracer.init({
service: 'cerberus-iam',
env: 'production',
version: '1.0.0',
});
New Relic
require('newrelic');
AWS CloudWatch
import { CloudWatchClient, PutMetricDataCommand } from '@aws-sdk/client-cloudwatch';
const cloudwatch = new CloudWatchClient({ region: 'us-east-1' });
await cloudwatch.send(
new PutMetricDataCommand({
Namespace: 'CerberusIAM',
MetricData: [
{
MetricName: 'LoginSuccess',
Value: 1,
Unit: 'Count',
Timestamp: new Date(),
},
],
}),
);
Alerting
Alert Rules
Critical:
- Health check failures
- Error rate >5%
- Database connection failures
- Disk space <10%
Warning:
- Error rate >1%
- Latency p95 >1s
- Memory usage >80%
- Failed login rate spike
Info:
- Deployment events
- Database migration completion
- Key rotation events
Alert Channels
- Email - For all alerts
- Slack - For critical alerts
- PagerDuty - For production incidents
- SMS - For critical after-hours
Example Alert (Prometheus)
groups:
  - name: cerberus_alerts
    rules:
      - alert: HighErrorRate
        expr: sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: 'High error rate detected'
          description: 'Error rate is {{ $value }} (>5%)'
      - alert: HighLatency
        expr: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le)) > 1
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: 'High latency detected'
          description: 'P95 latency is {{ $value }}s (>1s)'
Dashboards
Grafana Dashboard
{
"dashboard": {
"title": "Cerberus IAM",
"panels": [
{
"title": "Request Rate",
"targets": [
{
"expr": "rate(http_requests_total[5m])"
}
]
},
{
"title": "Latency (P95)",
"targets": [
{
"expr": "histogram_quantile(0.95, http_request_duration_seconds)"
}
]
},
{
"title": "Error Rate",
"targets": [
{
"expr": "rate(http_requests_total{status=~\"5..\"}[5m])"
}
]
}
]
}
}
Key Visualizations
Request Overview
- Request rate timeline
- Status code distribution
- Top endpoints by traffic
Performance
- Latency percentiles (p50, p95, p99)
- Response time heatmap
- Slow endpoint list
Errors
- Error rate timeline
- Error type breakdown
- Error traces
Authentication
- Login success/failure rate
- Active sessions
- MFA usage
Database
- Connection pool usage
- Query duration
- Slow queries
System
- CPU and memory usage
- Disk I/O
- Network throughput
Tracing (Future)
Distributed tracing with OpenTelemetry:
import { NodeTracerProvider } from '@opentelemetry/sdk-trace-node';
import { registerInstrumentations } from '@opentelemetry/instrumentation';
import { HttpInstrumentation } from '@opentelemetry/instrumentation-http';
import { ExpressInstrumentation } from '@opentelemetry/instrumentation-express';
const provider = new NodeTracerProvider();
provider.register();
registerInstrumentations({
instrumentations: [new HttpInstrumentation(), new ExpressInstrumentation()],
});
Log Analysis
Common Queries
Failed logins:
cat logs.json | jq 'select(.msg == "login failed")'
Slow requests:
cat logs.json | jq 'select(.responseTime > 1000)'
Errors:
cat logs.json | jq 'select(.level == 50)'  # error level
User activity:
cat logs.json | jq 'select(.userId == "user-uuid")'
Best Practices
- Use structured logging - JSON logs are easier to parse
- Include correlation IDs - Track requests across services (see the sketch after this list)
- Redact sensitive data - Never log passwords, tokens, secrets
- Set appropriate log levels - Avoid excessive logging
- Monitor key metrics - Don't rely on logs alone
- Set up alerts - Proactive monitoring beats reactive debugging
- Review logs regularly - Identify patterns and anomalies
- Retain logs appropriately - Balance compliance and storage costs
- Test monitoring - Verify alerts fire correctly
- Document runbooks - Response procedures for common issues
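As referenced in the correlation-ID item above, a sketch of propagating request IDs with pino-http's genReqId hook; the header name and fallback are assumptions:
import { randomUUID } from 'node:crypto';
import pinoHttp from 'pino-http';

// Reuse an upstream X-Request-Id when present so the same ID follows the
// request across services; otherwise generate a fresh UUID.
export const httpLogger = pinoHttp({
  genReqId: (req) => (req.headers['x-request-id'] as string) ?? randomUUID(),
});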
Next Steps
- Production Checklist - Production deployment
- Architecture - System architecture
- Security - Security architecture