Observability — See Everything in Production

The three pillars of observability — Logs, Metrics, and Traces — plus health checks, alerting, and error budgets. Powered by Kodelyth.

When to Use

Setting up logging for a new service
Adding tracing to an API or background job
Designing a metrics + alerting strategy
Debugging a production issue you can't reproduce locally
Setting up health check endpoints
Building SLOs and error budgets

The Three Pillars

Logs    → What happened (events, errors, audit trail)
Metrics → How much / how often (counts, rates, durations)
Traces  → Why it was slow (request flow across services)

You need all three. Metrics tell you *something is wrong*. Traces tell you *where*. Logs tell you *why*.

Structured Logging

Never use console.log in production

// BAD: Unstructured — impossible to query, filter, or alert on
console.log("User logged in: " + userId)
console.log("Error: " + error.message)
// GOOD: Structured — every field is queryable
logger.info("user.login", {
  userId,
  email: user.email,
  ip: request.ip,
  userAgent: request.headers["user-agent"],
  durationMs: Date.now() - startTime,
})logger.error("payment.charge.failed", {
  userId,
  orderId,
  amount,
  currency,
  errorCode: error.code,
  errorMessage: error.message,
  stripeErrorType: error.type,
})

Log Levels — Use the Right Level

logger.debug("cache.hit", { key, ttlRemaining })   // Dev only, verbose
logger.info("order.created", { orderId, amount })  // Normal business events
logger.warn("rate.limit.approaching", { userId, count, limit }) // Potential problem
logger.error("db.query.failed", { query, error })  // Needs attention
logger.fatal("service.crashed", { reason })        // Immediate action required

Rule: INFO is for things you'd want to search in production. DEBUG is noise you turn on temporarily.

TypeScript Logging Setup (Pino — fastest Node.js logger)

// lib/logger.ts
import pino from 'pino'
export const logger = pino({
  level: process.env.LOG_LEVEL ?? 'info',
  base: {
    service: 'api',
    version: process.env.npm_package_version,
    env: process.env.NODE_ENV,
  },
  // Pretty print in dev, JSON in production
  transport: process.env.NODE_ENV === 'development'
    ? { target: 'pino-pretty', options: { colorize: true } }
    : undefined,
})// Child logger with request context
export function requestLogger(requestId: string, userId?: string) {
  return logger.child({ requestId, userId })
}

Python Logging Setup (structlog)

# lib/logging.py
import structlog
import logging
structlog.configure(
    processors=[
        structlog.stdlib.filter_by_level,
        structlog.stdlib.add_logger_name,
        structlog.stdlib.add_log_level,
        structlog.processors.TimeStamper(fmt="iso"),
        structlog.processors.StackInfoRenderer(),
        structlog.processors.format_exc_info,
        structlog.processors.JSONRenderer(),
    ],
    wrapper_class=structlog.stdlib.BoundLogger,
    context_class=dict,
    logger_factory=structlog.stdlib.LoggerFactory(),
)
log = structlog.get_logger()
Usage:
log.info("order.created", order_id=order_id, amount=amount, user_id=user_id)
log.error("payment.failed", error=str(e), order_id=order_id)

Metrics

Naming Convention

# Format: namespace_subsystem_name_unit
http_requests_total              # counter
http_request_duration_seconds    # histogram
db_connections_active            # gauge
cache_hits_total                 # counter
cache_misses_total               # counter
queue_depth_messages             # gauge
payment_amount_dollars           # histogram

Three Metric Types

// Counter — only goes up (requests, errors, events)
const requestsTotal = new Counter({
  name: 'http_requests_total',
  help: 'Total HTTP requests',
  labelNames: ['method', 'route', 'status_code'],
})
requestsTotal.inc({ method: 'POST', route: '/api/orders', status_code: '201' })
// Gauge — goes up and down (active connections, queue depth, memory)
const activeConnections = new Gauge({
  name: 'db_connections_active',
  help: 'Active database connections',
})
activeConnections.set(pool.totalCount)// Histogram — distribution of values (latency, request size)
const requestDuration = new Histogram({
  name: 'http_request_duration_seconds',
  help: 'HTTP request duration',
  labelNames: ['method', 'route'],
  buckets: [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5],
})
const end = requestDuration.startTimer({ method, route })
// ... handle request ...
end()  // records duration automatically

The Four Golden Signals (Google SRE)

1. Latency    → How long do requests take? (p50, p95, p99)
Traffic    → How many requests/sec?
Errors     → What % of requests fail?
Saturation → How full is the system? (CPU, memory, queue depth)

Alert on these four before anything else.

Distributed Tracing (OpenTelemetry)

// setup/tracing.ts
import { NodeSDK } from '@opentelemetry/sdk-node'
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http'
import { Resource } from '@opentelemetry/resources'
import { SEMRESATTRS_SERVICE_NAME } from '@opentelemetry/semantic-conventions'
const sdk = new NodeSDK({
  resource: new Resource({
    [SEMRESATTRS_SERVICE_NAME]: 'payment-service',
  }),
  traceExporter: new OTLPTraceExporter({
    url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT,
  }),
})
sdk.start()  // call before importing anything else
// Usage — spans are created automatically for HTTP requests
// Add custom spans for important operations:
import { trace } from '@opentelemetry/api'
const tracer = trace.getTracer('payment-service')
async function processPayment(orderId: string) {
  const span = tracer.startSpan('payment.process')
  span.setAttributes({ 'order.id': orderId })  try {
    const result = await chargeCard(orderId)
    span.setStatus({ code: SpanStatusCode.OK })
    return result
  } catch (error) {
    span.setStatus({ code: SpanStatusCode.ERROR, message: error.message })
    span.recordException(error)
    throw error
  } finally {
    span.end()
  }
}

Health Check Endpoints

Basic Structure

// Every service must expose /health and /ready
// /health  → is the process alive? (used by load balancer)
// /ready   → is the process ready to serve traffic? (used by k8s)
app.get('/health', (req, res) => {
  res.json({ status: 'ok', uptime: process.uptime() })
})
app.get('/ready', async (req, res) => {
  const checks = await Promise.allSettled([
    checkDatabase(),
    checkRedis(),
    checkExternalApi(),
  ])
  const results = {
    database: checks[0].status === 'fulfilled' ? 'ok' : 'error',
    redis:    checks[1].status === 'fulfilled' ? 'ok' : 'error',
    external: checks[2].status === 'fulfilled' ? 'ok' : 'error',
  }
  const allHealthy = Object.values(results).every(v => v === 'ok')  res.status(allHealthy ? 200 : 503).json({
    status: allHealthy ? 'ready' : 'degraded',
    checks: results,
    timestamp: new Date().toISOString(),
  })
})

Alerting — What to Alert On

Alert Design Rules

1. Alert on SYMPTOMS, not causes
   BAD:  Alert when CPU > 80%
   GOOD: Alert when p99 latency > 2s (the symptom users feel)
Every alert must be actionable
   BAD:  "High error rate" with no runbook
   GOOD: "Error rate > 1% — check Datadog dashboard, likely DB issue"
Alert on SLOs, not arbitrary thresholds
   BAD:  Alert when error rate > 5% (where does 5% come from?)
   GOOD: Alert when error budget burn rate > 2x (based on SLO math)
Avoid alert fatigue
   BAD:  50 alerts, most noise
   GOOD: 5 alerts, all critical, all actionable

The Four Alerts Every Service Needs

# 1. High error rate
alert: HighErrorRate
expr: rate(http_requests_total{status_code=~"5.."}[5m]) /
      rate(http_requests_total[5m]) > 0.01
for: 2m
annotations:
  summary: "Error rate above 1% for 2 minutes"
  runbook: "https://runbooks.example.com/high-error-rate"
2. High latency
alert: HighLatency
expr: histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m])) > 2
for: 5m
annotations:
  summary: "p99 latency above 2s for 5 minutes"
3. Service down
alert: ServiceDown
expr: up{job="my-service"} == 0
for: 1m
annotations:
  summary: "Service has been down for 1 minute"
4. High queue depth
alert: QueueBacklog
expr: queue_depth_messages > 10000
for: 10m
annotations:
  summary: "Queue depth above 10k for 10 minutes — consumers may be stuck"

Error Budgets & SLOs

SLO: Service Level Objective — the target you're trying to hit SLA: Service Level Agreement — the contract with consequences SLI: Service Level Indicator — the measurement

Example: SLO: 99.9% of requests succeed within 500ms over 30 days SLI: (successful_requests_under_500ms / total_requests) over 30 days Error budget: 0.1% of requests can fail = ~43 minutes of downtime/month

// Error budget calculation
const sloTarget = 0.999           // 99.9%
const errorBudget = 1 - sloTarget // 0.001 = 0.1%
const minutesPerMonth = 30 * 24 * 60  // 43,200 minutes
const allowedDowntime = errorBudget * minutesPerMonth  // 43.2 minutes// Burn rate alert: if you're consuming budget 2x faster than allowed
const burnRateThreshold = 2
// Alert if current_error_rate > burn_rate * (1 - slo_target)

Correlation IDs — Connect Logs Across Services

// Middleware: generate or propagate request ID
app.use((req, res, next) => {
  const requestId = req.headers['x-request-id'] as string
    ?? crypto.randomUUID()
  // Set on response so clients can reference it in support tickets
  res.setHeader('x-request-id', requestId)
  // Make available for the duration of the request
  req.requestId = requestId
  req.log = logger.child({ requestId })
  next()
})// Pass downstream to every service call
async function callPaymentService(orderId: string, requestId: string) {
  return fetch('https://payment-service/charge', {
    headers: {
      'x-request-id': requestId,  // ← propagate to child services
      'Content-Type': 'application/json',
    },
    body: JSON.stringify({ orderId }),
  })
}