Observability — See Everything in Production
The three pillars of observability — Logs, Metrics, and Traces — plus health checks, alerting, and error budgets. Powered by Kodelyth.
When to Use
- Setting up logging for a new service
- Adding tracing to an API or background job
- Designing a metrics + alerting strategy
- Debugging a production issue you can't reproduce locally
- Setting up health check endpoints
- Building SLOs and error budgets
The Three Pillars
Logs → What happened (events, errors, audit trail)
Metrics → How much / how often (counts, rates, durations)
Traces → Why it was slow (request flow across services)
You need all three. Metrics tell you *something is wrong*. Traces tell you *where*. Logs tell you *why*.
Structured Logging
Never use console.log in production
// BAD: Unstructured — impossible to query, filter, or alert on
console.log("User logged in: " + userId)
console.log("Error: " + error.message)
// GOOD: Structured — every field is queryable
logger.info("user.login", {
userId,
email: user.email,
ip: request.ip,
userAgent: request.headers["user-agent"],
durationMs: Date.now() - startTime,
})
logger.error("payment.charge.failed", {
userId,
orderId,
amount,
currency,
errorCode: error.code,
errorMessage: error.message,
stripeErrorType: error.type,
})
Log Levels — Use the Right Level
logger.debug("cache.hit", { key, ttlRemaining }) // Dev only, verbose
logger.info("order.created", { orderId, amount }) // Normal business events
logger.warn("rate.limit.approaching", { userId, count, limit }) // Potential problem
logger.error("db.query.failed", { query, error }) // Needs attention
logger.fatal("service.crashed", { reason }) // Immediate action required
Rule: INFO is for things you'd want to search in production. DEBUG is noise you turn on temporarily.
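To make the rule concrete, here is a minimal sketch of how level filtering and bound context work in a structured logger. It is not Pino or any other real library: `createLogger`, `LEVELS`, and the `sink` parameter are hypothetical names used only for illustration. Each call becomes one JSON line, and calls below the configured threshold are dropped, which is why DEBUG calls can stay in the code and be switched on temporarily.

```typescript
// Hypothetical minimal structured logger, for illustration only (not Pino's API).
const LEVELS = { debug: 10, info: 20, warn: 30, error: 40, fatal: 50 } as const
type Level = keyof typeof LEVELS
type Fields = Record<string, unknown>

interface Logger {
  log(level: Level, event: string, fields?: Fields): void
  child(extra: Fields): Logger
}

function createLogger(
  minLevel: Level,
  sink: (line: string) => void = console.log,
  bound: Fields = {},
): Logger {
  return {
    log(level, event, fields = {}) {
      // Drop anything below the configured threshold.
      if (LEVELS[level] < LEVELS[minLevel]) return
      sink(JSON.stringify({ level, event, time: new Date().toISOString(), ...bound, ...fields }))
    },
    // A child logger adds context (for example a request ID) to every line it writes.
    child(extra) {
      return createLogger(minLevel, sink, { ...bound, ...extra })
    },
  }
}

const logger = createLogger('info')
logger.log('debug', 'cache.hit', { key: 'user:42' }) // dropped: below the 'info' threshold
logger.log('info', 'order.created', { orderId: 'ord_1', amount: 4200 }) // emitted as one JSON line
```

Raising or lowering `minLevel` (for example from an environment variable) is the only change needed to turn DEBUG output on for an investigation.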
TypeScript Logging Setup (Pino, a fast, low-overhead Node.js logger)
// lib/logger.ts
import pino from 'pino'
export const logger = pino({
level: process.env.LOG_LEVEL ?? 'info',
base: {
service: 'api',
version: process.env.npm_package_version,
env: process.env.NODE_ENV,
},
// Pretty print in dev, JSON in production
transport: process.env.NODE_ENV === 'development'
? { target: 'pino-pretty', options: { colorize: true } }
: undefined,
})
// Child logger with request context
export function requestLogger(requestId: string, userId?: string) {
return logger.child({ requestId, userId })
}
Python Logging Setup (structlog)
# lib/logging.py
import structlog
import logging
structlog.configure(
processors=[
structlog.stdlib.filter_by_level,
structlog.stdlib.add_logger_name,
structlog.stdlib.add_log_level,
structlog.processors.TimeStamper(fmt="iso"),
structlog.processors.StackInfoRenderer(),
structlog.processors.format_exc_info,
structlog.processors.JSONRenderer(),
],
wrapper_class=structlog.stdlib.BoundLogger,
context_class=dict,
logger_factory=structlog.stdlib.LoggerFactory(),
)
log = structlog.get_logger()
Usage:
log.info("order.created", order_id=order_id, amount=amount, user_id=user_id)
log.error("payment.failed", error=str(e), order_id=order_id)
Metrics
Naming Convention
# Format: namespace_subsystem_name_unit
http_requests_total # counter
http_request_duration_seconds # histogram
db_connections_active # gauge
cache_hits_total # counter
cache_misses_total # counter
queue_depth_messages # gauge
payment_amount_dollars # histogram
Three Metric Types
// Counter — only goes up (requests, errors, events)
const requestsTotal = new Counter({
name: 'http_requests_total',
help: 'Total HTTP requests',
labelNames: ['method', 'route', 'status_code'],
})
requestsTotal.inc({ method: 'POST', route: '/api/orders', status_code: '201' })
// Gauge — goes up and down (active connections, queue depth, memory)
const activeConnections = new Gauge({
name: 'db_connections_active',
help: 'Active database connections',
})
activeConnections.set(pool.totalCount)
// Histogram — distribution of values (latency, request size)
const requestDuration = new Histogram({
name: 'http_request_duration_seconds',
help: 'HTTP request duration',
labelNames: ['method', 'route'],
buckets: [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5],
})
const end = requestDuration.startTimer({ method, route })
// ... handle request ...
end() // records duration automatically
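A histogram does not store every observation, which is why the bucket boundaries matter. The following is a minimal sketch of the idea, assuming Prometheus-style cumulative buckets: each observation increments every bucket whose upper bound is at least the observed value, and a quantile is later estimated from the bucket counts. `SketchHistogram` is a hypothetical class written for illustration, not the prom-client implementation, and it returns a bucket's upper bound rather than interpolating inside the bucket as PromQL's `histogram_quantile` does.

```typescript
// Hypothetical illustration of cumulative histogram buckets (not prom-client's code).
class SketchHistogram {
  private readonly upperBounds: number[]
  private readonly counts: number[] // counts[i] = observations <= upperBounds[i]
  private total = 0

  constructor(upperBounds: number[]) {
    this.upperBounds = [...upperBounds].sort((a, b) => a - b)
    this.counts = this.upperBounds.map(() => 0)
  }

  observe(value: number): void {
    this.total += 1
    this.upperBounds.forEach((bound, i) => {
      if (value <= bound) this.counts[i] += 1
    })
  }

  // Smallest bucket bound covering at least fraction `q` of all observations.
  // Infinity means the quantile lies beyond the largest finite bucket.
  quantileUpperBound(q: number): number {
    if (this.total === 0) return NaN
    const rank = q * this.total
    const i = this.counts.findIndex(count => count >= rank)
    return i === -1 ? Infinity : this.upperBounds[i]
  }
}

const latency = new SketchHistogram([0.1, 0.5, 1, 2.5])
for (const seconds of [0.05, 0.2, 0.3, 0.4, 3]) latency.observe(seconds)
console.log(latency.quantileUpperBound(0.5)) // 0.5
console.log(latency.quantileUpperBound(0.99)) // Infinity: the 3s request exceeds every bucket
```

The second result shows the practical consequence: if no bucket boundary sits near the latency you care about (for example your SLO threshold), the histogram cannot tell you much about it.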
The Four Golden Signals (Google SRE)
1. Latency → How long do requests take? (p50, p95, p99)
2. Traffic → How many requests/sec?
3. Errors → What % of requests fail?
4. Saturation → How full is the system? (CPU, memory, queue depth)
Alert on these four before anything else.
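As a rough illustration of what the four signals measure, the sketch below computes them from a window of request samples. Everything here is an assumption made for the example: the `RequestSample` shape, the `goldenSignals` function, the nearest-rank percentile method, and the use of queue depth over queue capacity as the saturation measure. In practice these numbers usually come from your metrics backend rather than from application code.

```typescript
// Hypothetical sample shape; a real system would derive these from metrics.
interface RequestSample {
  durationMs: number
  statusCode: number
}

// Nearest-rank percentile over an ascending-sorted array.
function percentile(sortedAsc: number[], p: number): number {
  if (sortedAsc.length === 0) return NaN
  const rank = Math.ceil((p / 100) * sortedAsc.length)
  return sortedAsc[Math.max(rank, 1) - 1]
}

function goldenSignals(
  samples: RequestSample[],
  windowSeconds: number,
  queueDepth: number,
  queueCapacity: number,
) {
  const durations = samples.map(s => s.durationMs).sort((a, b) => a - b)
  const failures = samples.filter(s => s.statusCode >= 500).length
  return {
    // Latency: how long requests take
    latencyMs: {
      p50: percentile(durations, 50),
      p95: percentile(durations, 95),
      p99: percentile(durations, 99),
    },
    // Traffic: requests per second over the window
    trafficPerSecond: samples.length / windowSeconds,
    // Errors: fraction of requests that failed with a 5xx status
    errorRate: samples.length === 0 ? 0 : failures / samples.length,
    // Saturation: how full one bounded resource is (0 = empty, 1 = full)
    saturation: queueDepth / queueCapacity,
  }
}
```

Percentiles are used for latency instead of an average because a small number of very slow requests can hide behind a healthy-looking mean.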
Distributed Tracing (OpenTelemetry)
// setup/tracing.ts
import { NodeSDK } from '@opentelemetry/sdk-node'
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http'
import { Resource } from '@opentelemetry/resources'
import { SEMRESATTRS_SERVICE_NAME } from '@opentelemetry/semantic-conventions'
const sdk = new NodeSDK({
resource: new Resource({
[SEMRESATTRS_SERVICE_NAME]: 'payment-service',
}),
traceExporter: new OTLPTraceExporter({
url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT,
}),
})
sdk.start() // call before importing anything else
// Usage — spans are created automatically for HTTP requests
// Add custom spans for important operations:
import { trace, SpanStatusCode } from '@opentelemetry/api'
const tracer = trace.getTracer('payment-service')
async function processPayment(orderId: string) {
const span = tracer.startSpan('payment.process')
span.setAttributes({ 'order.id': orderId })
try {
const result = await chargeCard(orderId)
span.setStatus({ code: SpanStatusCode.OK })
return result
} catch (error) {
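// In strict TypeScript the caught value is `unknown`, so narrow it before reading .message
span.setStatus({ code: SpanStatusCode.ERROR, message: error instanceof Error ? error.message : String(error) })
span.recordException(error instanceof Error ? error : String(error))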
throw error
} finally {
span.end()
}
}
Health Check Endpoints
Basic Structure
// Every service must expose /health and /ready
// /health → is the process alive? (used by load balancer)
// /ready → is the process ready to serve traffic? (used by k8s)
app.get('/health', (req, res) => {
res.json({ status: 'ok', uptime: process.uptime() })
})
app.get('/ready', async (req, res) => {
const checks = await Promise.allSettled([
checkDatabase(),
checkRedis(),
checkExternalApi(),
])
const results = {
database: checks[0].status === 'fulfilled' ? 'ok' : 'error',
redis: checks[1].status === 'fulfilled' ? 'ok' : 'error',
external: checks[2].status === 'fulfilled' ? 'ok' : 'error',
}
const allHealthy = Object.values(results).every(v => v === 'ok')
res.status(allHealthy ? 200 : 503).json({
status: allHealthy ? 'ready' : 'degraded',
checks: results,
timestamp: new Date().toISOString(),
})
})
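One gap in the readiness handler above is that a dependency which hangs, rather than fails, would make `/ready` hang as well. The sketch below shows one way to bound each check with a timeout. `withTimeout` and `runChecks` are hypothetical helpers introduced for this example, and the timeout value is an assumption to tune against your probe settings.

```typescript
// Rejects if `work` has not settled within `ms` milliseconds.
async function withTimeout<T>(work: Promise<T>, ms: number, label: string): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error(`${label} timed out after ${ms}ms`)), ms)
  })
  try {
    return await Promise.race([work, timeout])
  } finally {
    clearTimeout(timer) // avoid leaving a pending timer behind once the race is decided
  }
}

// Runs every named check in parallel; a check that throws or times out counts as 'error'.
async function runChecks(
  checks: Record<string, () => Promise<unknown>>,
  timeoutMs: number,
) {
  const names = Object.keys(checks)
  const settled = await Promise.allSettled(
    names.map(checkName => withTimeout(checks[checkName](), timeoutMs, checkName)),
  )
  const results: Record<string, 'ok' | 'error'> = {}
  settled.forEach((outcome, i) => {
    results[names[i]] = outcome.status === 'fulfilled' ? 'ok' : 'error'
  })
  const healthy = Object.values(results).every(v => v === 'ok')
  return { healthy, results }
}

// Usage inside the /ready handler (checkDatabase and checkRedis are the functions used above):
// const { healthy, results } = await runChecks({ database: checkDatabase, redis: checkRedis }, 2000)
// res.status(healthy ? 200 : 503).json({ status: healthy ? 'ready' : 'degraded', checks: results })
```

Keeping the timeout shorter than the orchestrator's probe timeout means the endpoint reports a degraded state itself instead of being treated as unresponsive.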
Alerting — What to Alert On
Alert Design Rules
1. Alert on SYMPTOMS, not causes
BAD: Alert when CPU > 80%
GOOD: Alert when p99 latency > 2s (the symptom users feel)
2. Every alert must be actionable
BAD: "High error rate" with no runbook
GOOD: "Error rate > 1% — check Datadog dashboard, likely DB issue"
3. Alert on SLOs, not arbitrary thresholds
BAD: Alert when error rate > 5% (where does 5% come from?)
GOOD: Alert when error budget burn rate > 2x (based on SLO math)
4. Avoid alert fatigue
BAD: 50 alerts, most noise
GOOD: 5 alerts, all critical, all actionable
The Four Alerts Every Service Needs
# 1. High error rate
alert: HighErrorRate
expr: sum(rate(http_requests_total{status_code=~"5.."}[5m])) /
sum(rate(http_requests_total[5m])) > 0.01
for: 2m
annotations:
summary: "Error rate above 1% for 2 minutes"
runbook: "https://runbooks.example.com/high-error-rate"
# 2. High latency
alert: HighLatency
expr: histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m])) > 2
for: 5m
annotations:
summary: "p99 latency above 2s for 5 minutes"
# 3. Service down
alert: ServiceDown
expr: up{job="my-service"} == 0
for: 1m
annotations:
summary: "Service has been down for 1 minute"
# 4. High queue depth
alert: QueueBacklog
expr: queue_depth_messages > 10000
for: 10m
annotations:
summary: "Queue depth above 10k for 10 minutes — consumers may be stuck"
Error Budgets & SLOs
SLO: Service Level Objective — the target you're trying to hit
SLA: Service Level Agreement — the contract with consequences
SLI: Service Level Indicator — the measurement
Example:
SLO: 99.9% of requests succeed within 500ms over 30 days
SLI: (successful_requests_under_500ms / total_requests) over 30 days
Error budget: 0.1% of requests can fail, roughly equivalent to 43 minutes of full downtime in a 30-day month
// Error budget calculation
const sloTarget = 0.999 // 99.9%
const errorBudget = 1 - sloTarget // 0.001 = 0.1%
const minutesPerMonth = 30 * 24 * 60 // 43,200 minutes
const allowedDowntime = errorBudget * minutesPerMonth // 43.2 minutes
// Burn rate alert: if you're consuming budget 2x faster than allowed
const burnRateThreshold = 2
// Alert if current_error_rate > burn_rate * (1 - slo_target)
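The burn-rate comparison in the last comment can be sketched as a few small functions. The names `burnRate`, `daysToExhaustion`, and `shouldAlert` are hypothetical, and the request counts in the example are made up; the arithmetic follows directly from the definitions above.

```typescript
// Burn rate = observed failure rate divided by the failure rate the SLO allows.
// 1 means the budget lasts exactly the SLO window; 2 means it is gone in half the window.
function burnRate(sloTarget: number, failedRequests: number, totalRequests: number): number {
  if (totalRequests === 0) return 0
  const allowedFailureRate = 1 - sloTarget
  return failedRequests / totalRequests / allowedFailureRate
}

// Days until a full budget is exhausted if the current burn rate is sustained.
function daysToExhaustion(rate: number, windowDays = 30): number {
  return rate > 0 ? windowDays / rate : Infinity
}

// Same rule as the comment above: alert when the budget burns faster than the threshold.
function shouldAlert(rate: number, threshold = 2): boolean {
  return rate > threshold
}

// Example (illustrative numbers): 50 failures in 10,000 requests against a 99.9% SLO.
const currentRate = burnRate(0.999, 50, 10_000) // about 5, subject to floating-point rounding
console.log({
  burnRate: currentRate,
  alert: shouldAlert(currentRate),
  daysLeft: daysToExhaustion(currentRate),
})
```

A sustained burn rate of 2 exhausts a 30-day budget in 15 days, which is why a threshold of 2 works as an early warning rather than an emergency page.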
Correlation IDs — Connect Logs Across Services
// Middleware: generate or propagate request ID
app.use((req, res, next) => {
const requestId = req.headers['x-request-id'] as string
?? crypto.randomUUID()
// Set on response so clients can reference it in support tickets
res.setHeader('x-request-id', requestId)
// Make available for the duration of the request
req.requestId = requestId
req.log = logger.child({ requestId })
next()
})
// Pass downstream to every service call
async function callPaymentService(orderId: string, requestId: string) {
return fetch('https://payment-service/charge', {
method: 'POST', // fetch rejects a request body on GET/HEAD, so the method must be explicit
headers: {
'x-request-id': requestId, // ← propagate to child services
'Content-Type': 'application/json',
},
body: JSON.stringify({ orderId }),
})
}
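Passing `requestId` through every function signature becomes tedious as the call graph grows. One alternative, sketched here under the assumption that the service runs on Node.js, is to keep the ID in `AsyncLocalStorage` from `node:async_hooks` so that any code running inside the request can read it. `withRequestContext` and `outboundHeaders` are hypothetical helper names for this example.

```typescript
import { AsyncLocalStorage } from 'node:async_hooks'
import { randomUUID } from 'node:crypto'

interface RequestContext {
  requestId: string
}

const requestContext = new AsyncLocalStorage<RequestContext>()

// Runs `handler` with a request ID visible to everything it calls, including async work.
// Reuses the incoming ID when present, otherwise generates a new one.
function withRequestContext<T>(incomingId: string | undefined, handler: () => T): T {
  const requestId = incomingId && incomingId.length > 0 ? incomingId : randomUUID()
  return requestContext.run({ requestId }, handler)
}

// Builds headers for an outbound call, adding the current request ID if there is one.
function outboundHeaders(extra: Record<string, string> = {}): Record<string, string> {
  const ctx = requestContext.getStore()
  return ctx ? { ...extra, 'x-request-id': ctx.requestId } : { ...extra }
}

// Inside a request, downstream calls pick up the ID without it being passed explicitly.
withRequestContext('req-123', () => {
  console.log(outboundHeaders({ 'Content-Type': 'application/json' }))
})
```

In the middleware above, the equivalent wiring would be to call `next()` inside `withRequestContext(...)`, so that route handlers and the functions they call all see the same ID.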
> Powered by Kodelyth — you can't fix what you can't see.