I Built a Full-Stack LLM Observability Platform for a Founding Engineer Take-Home

Every inference logged without blocking the hot path: SSE streaming, fire-and-forget ingest, and hourly rollups in PostgreSQL.

Sr Software Engineer - All things web and mobile

Repository: github.com/Subramanyarao11/llm-observe

This project started as a take-home assignment for a Founding Engineer role — I treated it as a chance to build something I'd actually want in production, not a minimum viable submission.

When you ship an LLM-powered feature, the hard part is not calling the API — it is knowing what happened afterward. How long did inference take? How many tokens were consumed? Did the call fail silently? What did the user actually send, and what came back? At scale, synchronous logging inside the request path becomes a bottleneck; storing full prompts creates privacy risk; and ad-hoc provider SDK metrics are inconsistent across OpenAI, Anthropic, and Gemini.

LLM Observe is my answer to that problem: a full-stack observability platform that captures every inference call through an asynchronous pipeline, redacts sensitive data before persistence, and surfaces operational metrics in a real-time dashboard. It runs locally via Docker Compose and deploys to a self-hosted Kubernetes cluster.

This post walks through why I built it the way I did — architecture, tradeoffs, and the details that matter in production.


The Problem: Observability Without Slowing Down Inference

LLM applications have a distinctive observability profile:

  1. Latency-sensitive path — Users expect streaming tokens. Any blocking I/O on the hot path degrades UX.

  2. High cardinality metadata — Provider, model, token counts, time-to-first-token (TTFT), error codes, session IDs.

  3. Privacy constraints — Prompts and completions may contain PII or secrets. You cannot dump raw text into a log table.

  4. Multi-provider reality — OpenAI, Anthropic, and Gemini expose usage data differently, especially during streaming.

The design goal was straightforward: capture rich inference metadata without ever blocking the chat response, and never store raw secrets or unredacted PII.


Architecture at a Glance

End-to-end flow

  1. The user sends a message in the React chat UI. The API streams tokens back via Server-Sent Events (SSE).

  2. The API delegates LLM calls to a shared SDK wrapper with provider-specific adapters.

  3. On every call — success or failure — the SDK fire-and-forgets structured metadata to POST /api/ingest.

  4. The API validates the payload with Zod, enqueues a job to BullMQ, and returns immediately.

  5. A dedicated worker process redacts PII, upserts the log by logId, and updates hourly analytics rollups.

  6. The dashboard reads aggregated metrics from PostgreSQL.

The chat path and the logging path are fully decoupled. A slow database write never delays a streaming response.
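To make step 1 concrete, an SSE response is a stream of text frames: each payload line is sent as a `data:` field and a blank line ends the event. The sketch below shows only that framing. `formatSseEvent` and the `token` event name are hypothetical and not taken from the repository.

```typescript
// Hypothetical helper: serialize one Server-Sent Events frame.
// Each payload line gets its own "data:" field; a blank line terminates the event.
function formatSseEvent(data: string, event?: string): string {
  const lines: string[] = [];
  if (event !== undefined) {
    lines.push(`event: ${event}`);
  }
  for (const line of data.split("\n")) {
    lines.push(`data: ${line}`);
  }
  return lines.join("\n") + "\n\n";
}

// One streamed token, wrapped as a named event.
const frame = formatSseEvent(JSON.stringify({ token: "Hello" }), "token");
```

Because the stream only flows from server to browser, this framing is all the chat path needs; nothing in it touches the logging path.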


Monorepo Structure

The project is a pnpm + Turborepo monorepo, with the API, worker, and web app as separate applications alongside a set of shared packages.

Splitting the API and worker into separate deployable units was intentional. In production, you want to scale log ingestion independently from chat traffic. An API pod under load should not compete with batch log writes for CPU.

Shared packages (sdk, pii, types, db) keep contracts consistent. The ingest payload shape is defined once in @llm-observe/types and validated at the API boundary.


The SDK: Interceptor Pattern for Free Observability

The heart of the system is @llm-observe/sdk. Instead of asking every developer to remember to log inference calls, the SDK wraps provider adapters and emits metadata automatically.

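// Simplified flow in LLMClient
let status = "success";
try {
  result = await this.adapters[options.provider].chat(options);
} catch (err) {
  status = "error"; // record the failure, then rethrow so the caller still sees it
  throw err;
} finally {
  // finally runs on both paths, so every call emits exactly one log
  this.emitLog({ logId, status, latencyMs, inputPreview, outputPreview, ... });
}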

Key behaviors:

  • Streaming token capture: For SSE chat, the SDK accumulates output and reads usage from provider-specific stream events (OpenAI usage, Anthropic message_delta, Gemini metadata). This fixed an early bug where token counts showed zero on the dashboard for streamed responses.

  • Preview sanitization at source: sanitizePreview() truncates to 500 characters and runs PII redaction before the payload leaves the SDK.

  • Fire-and-forget ingest: emitLog() POSTs to /api/ingest without awaiting a response on the critical path. Failures are swallowed with a console warning, not thrown into the chat flow.

  • Provider adapters: OpenAI, Anthropic, and Gemini each implement a common LLMAdapter interface, normalizing request/response handling.
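The streaming token capture described in the first bullet can be sketched as a reducer over stream events. The event shapes here are simplified versions of what the providers' streaming APIs return, the `provider` tag is added for the sketch, and `accumulateUsage` is a hypothetical name rather than the SDK's actual function.

```typescript
interface Usage {
  promptTokens: number;
  completionTokens: number;
  totalTokens: number;
}

// Simplified event shapes; real SDK events carry many more fields.
type StreamEvent =
  | { provider: "openai"; usage: { prompt_tokens: number; completion_tokens: number } | null }
  | { provider: "anthropic"; type: "message_start"; message: { usage: { input_tokens: number } } }
  | { provider: "anthropic"; type: "message_delta"; usage: { output_tokens: number } }
  | { provider: "gemini"; usageMetadata?: { promptTokenCount: number; candidatesTokenCount: number } };

function accumulateUsage(events: StreamEvent[]): Usage {
  let promptTokens = 0;
  let completionTokens = 0;
  for (const e of events) {
    switch (e.provider) {
      case "openai":
        // Usage is null on content chunks and populated on the final chunk.
        if (e.usage) {
          promptTokens = e.usage.prompt_tokens;
          completionTokens = e.usage.completion_tokens;
        }
        break;
      case "anthropic":
        if (e.type === "message_start") {
          promptTokens = e.message.usage.input_tokens;
        } else {
          // message_delta reports a cumulative output count, so overwrite rather than add.
          completionTokens = e.usage.output_tokens;
        }
        break;
      case "gemini":
        // The latest chunk that carries usageMetadata wins.
        if (e.usageMetadata) {
          promptTokens = e.usageMetadata.promptTokenCount;
          completionTokens = e.usageMetadata.candidatesTokenCount;
        }
        break;
    }
  }
  return { promptTokens, completionTokens, totalTokens: promptTokens + completionTokens };
}
```

If no event in the stream carries usage, the function returns zeros, which is the same symptom as the dashboard bug mentioned above. For OpenAI, to my knowledge, usage only appears in the stream when the request opts in via `stream_options.include_usage`.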

Any application that imports this SDK gets observability without additional instrumentation. That is the same pattern used by APM agents and HTTP middleware — push cross-cutting concerns into a shared layer.
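A minimal sketch of the fire-and-forget part, assuming the transport is injected as a function: in the real SDK this would presumably be an HTTP POST to /api/ingest, but `Sender`, `observedCall`, and the reduced `LogRecord` shape here are hypothetical and exist only for illustration.

```typescript
interface LogRecord {
  logId: string;
  status: "success" | "error";
  latencyMs: number;
}

// Stand-in for the HTTP POST to /api/ingest.
type Sender = (record: LogRecord) => Promise<void>;

// Returns synchronously and never throws: failures only produce a warning.
function emitLog(
  record: LogRecord,
  send: Sender,
  warn: (message: string) => void = console.warn,
): void {
  Promise.resolve()
    .then(() => send(record))
    .catch((err: unknown) => warn(`ingest failed for ${record.logId}: ${String(err)}`));
}

// Wraps any provider call; the log is emitted whether the call resolves or rejects.
async function observedCall<T>(logId: string, call: () => Promise<T>, send: Sender): Promise<T> {
  const startedAt = Date.now();
  let status: LogRecord["status"] = "success";
  try {
    return await call();
  } catch (err) {
    status = "error";
    throw err;
  } finally {
    // Not awaited, so a slow or failing ingest endpoint cannot delay the caller.
    emitLog({ logId, status, latencyMs: Date.now() - startedAt }, send);
  }
}
```

Starting the send inside `Promise.resolve().then(...)` means even a sender that throws synchronously ends up in the `catch`, so nothing from the logging path can leak into the chat flow.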


Async Ingestion with BullMQ

Synchronous writes to PostgreSQL on every LLM call would add 10–50ms (or worse) to an already latency-sensitive path. BullMQ on Redis solves this cleanly.

Ingest service responsibilities:

  • Validate payload with Zod (reject malformed data early)

  • Enqueue to the inference-logs queue

  • Return 202 Accepted immediately
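The three ingest steps can be sketched as one handler. This is an illustration under assumptions: a hand-written validator stands in for the Zod schema, `JobQueue` is a hypothetical interface in place of the BullMQ queue, the payload is reduced to four fields, and the `INVALID_PAYLOAD` code and `inference-log` job name are made up.

```typescript
const PROVIDERS = ["openai", "anthropic", "gemini"] as const;
type Provider = (typeof PROVIDERS)[number];

interface IngestPayload {
  logId: string;
  provider: Provider;
  status: "success" | "error";
  latencyMs: number;
}

// Hypothetical stand-in for the BullMQ queue named inference-logs.
interface JobQueue {
  add(name: string, data: IngestPayload): Promise<void>;
}

type IngestResponse =
  | { status: 202 }
  | { status: 400; error: { code: string; message: string } };

// Returns the typed payload, or a message describing the first problem found.
function parsePayload(body: unknown): IngestPayload | string {
  if (typeof body !== "object" || body === null) return "body must be an object";
  const b = body as Record<string, unknown>;
  if (typeof b.logId !== "string" || b.logId.length === 0) return "logId must be a non-empty string";
  if (!PROVIDERS.includes(b.provider as Provider)) return "provider is not supported";
  if (b.status !== "success" && b.status !== "error") return "status must be success or error";
  if (typeof b.latencyMs !== "number" || !Number.isFinite(b.latencyMs) || b.latencyMs < 0) {
    return "latencyMs must be a non-negative number";
  }
  return {
    logId: b.logId,
    provider: b.provider as Provider,
    status: b.status as IngestPayload["status"],
    latencyMs: b.latencyMs,
  };
}

async function handleIngest(body: unknown, queue: JobQueue): Promise<IngestResponse> {
  const parsed = parsePayload(body);
  if (typeof parsed === "string") {
    return { status: 400, error: { code: "INVALID_PAYLOAD", message: parsed } };
  }
  await queue.add("inference-log", parsed); // only the enqueue is awaited, never the DB write
  return { status: 202 };
}
```

The handler awaits the enqueue but nothing downstream of it, so the response time depends on Redis rather than on PostgreSQL.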

Worker responsibilities:

  • Consume jobs with configurable concurrency

  • Apply a second pass of PII redaction (defense in depth)

  • Upsert by logId — retries do not create duplicate rows

  • Update AnalyticsHourlyRollup for fast dashboard queries

  • Emit internal events for real-time subscribers
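To illustrate why the upsert by logId makes retries safe, here is a sketch with an in-memory `Map` standing in for the PostgreSQL table. `InMemoryLogStore` and its return value are hypothetical; the real worker writes through the database layer.

```typescript
interface InferenceLog {
  logId: string;
  provider: string;
  status: "success" | "error";
  latencyMs: number;
}

// In-memory stand-in for the inference log table, keyed by logId.
class InMemoryLogStore {
  private rows = new Map<string, InferenceLog>();

  // Inserts or overwrites the row for this logId.
  // Returns true only when the logId was not seen before.
  upsert(log: InferenceLog): boolean {
    const isNew = !this.rows.has(log.logId);
    this.rows.set(log.logId, log);
    return isNew;
  }

  get size(): number {
    return this.rows.size;
  }
}

const store = new InMemoryLogStore();
const job: InferenceLog = { logId: "log-1", provider: "openai", status: "success", latencyMs: 420 };
const firstAttempt = store.upsert(job); // new row
const retryAttempt = store.upsert(job); // same logId, row is overwritten
```

A retried job leaves the row count unchanged. Returning whether the row was new is an assumption of this sketch: it gives the caller a way to avoid incrementing rollup counters twice for the same logId.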

BullMQ retry policy: 3 attempts with exponential backoff starting at 1 second. Jobs that exhaust their attempts land in BullMQ's failed set, which acts as the dead-letter store here; they are visible in Bull Board at /api/admin/queues and via GET /api/admin/failed-jobs.
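As a worked example of that policy, the sketch below writes the job options in the shape BullMQ accepts and computes the resulting delays. It assumes BullMQ's built-in exponential strategy with no jitter, where the wait before retry n is the base delay times 2^(n - 1); `retryDelaysMs` is a hypothetical helper, not part of BullMQ.

```typescript
// Per-job retry options matching the policy described above.
const retryOptions = {
  attempts: 3,
  backoff: { type: "exponential", delay: 1000 },
};

// Wait before each retry. With `attempts` total tries there are `attempts - 1` retries.
function retryDelaysMs(attempts: number, baseDelayMs: number): number[] {
  const delays: number[] = [];
  for (let retry = 1; retry < attempts; retry++) {
    delays.push(baseDelayMs * Math.pow(2, retry - 1));
  }
  return delays;
}

const delays = retryDelaysMs(retryOptions.attempts, retryOptions.backoff.delay);
// delays is [1000, 2000]: one immediate try, then retries after 1 s and 2 s
```

Under these assumptions a job that keeps failing is given up on roughly three seconds after its first attempt, plus processing time.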

Why not Kafka?

For this scale, Redis + BullMQ is simpler to operate, sufficient for moderate throughput, and runs in Docker Compose as a single Redis container, with no Kafka brokers (or, on older Kafka versions, a ZooKeeper ensemble) to manage. Kafka would make sense at millions of events per day across regions; BullMQ is the right default for a self-hosted observability stack.


Data Model

PostgreSQL holds three core entities:

Conversations and messages

Conversations are first-class objects with provider, model, status, and auto-generated titles (derived from the first user message, truncated to 50 characters). Messages store the full chat content for the UI; inference logs store previews only.

Inference logs (append-only)

Each row represents one LLM call:

| Field | Purpose |
| --- | --- |
| latencyMs, ttftMs | End-to-end latency and time-to-first-token |
| promptTokens, completionTokens, totalTokens | Usage tracking |
| inputPreview, outputPreview | Truncated, redacted text (max 500 chars) |
| status, errorCode, errorMessage | Failure diagnostics |
| metadata | Provider-specific JSON extras |

Indexes on (created_at, provider, status) keep dashboard queries fast.

Analytics hourly rollups

Raw log scans do not scale for dashboard aggregates. A materialized rollup table keyed by (hour, provider) stores pre-aggregated counts, token totals, and latency sums. The worker updates this on every successful write, so the dashboard reads O(hours × providers) instead of O(logs).
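The rollup update can be sketched with a `Map` standing in for the rollup table. The column names and the `applyToRollup` helper are assumptions for illustration; only the (hour, provider) key and the idea of storing counts, token totals, and latency sums come from the design above.

```typescript
interface RollupRow {
  hour: string; // start of the UTC hour, ISO 8601
  provider: string;
  requestCount: number;
  errorCount: number;
  totalTokens: number;
  latencySumMs: number;
}

interface LogEvent {
  createdAt: Date;
  provider: string;
  status: "success" | "error";
  totalTokens: number;
  latencyMs: number;
}

// Truncate a timestamp to the start of its UTC hour.
function hourBucket(d: Date): string {
  const t = new Date(d.getTime());
  t.setUTCMinutes(0, 0, 0);
  return t.toISOString();
}

// Fold one log into the (hour, provider) row, creating the row if needed.
function applyToRollup(rollups: Map<string, RollupRow>, e: LogEvent): void {
  const hour = hourBucket(e.createdAt);
  const key = `${hour}|${e.provider}`;
  const row = rollups.get(key) ?? {
    hour,
    provider: e.provider,
    requestCount: 0,
    errorCount: 0,
    totalTokens: 0,
    latencySumMs: 0,
  };
  row.requestCount += 1;
  if (e.status === "error") row.errorCount += 1;
  row.totalTokens += e.totalTokens;
  row.latencySumMs += e.latencyMs;
  rollups.set(key, row);
}

// Averages are derived at read time from the stored sum and count.
function averageLatencyMs(row: RollupRow): number {
  return row.requestCount === 0 ? 0 : row.latencySumMs / row.requestCount;
}
```

Storing a latency sum instead of an average is what lets the dashboard combine several hours correctly: sums and counts add up across rows, while averages do not.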


Frontend: Chat, Dashboard, and Ops UX

The React app (Vite + TanStack Query) has three main surfaces:

Dashboard — Total requests, average latency, success rate, token usage, provider breakdown, recent errors, CSV export. Metrics refresh after streaming chat completes.

Chat — SSE streaming with provider/model selection, markdown rendering, auto-scroll, and conversation title generation.

Conversations list — Paginated with page numbers, rows-per-page control, and API metadata (totalPages, hasNext, hasPrevious).
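The pagination metadata is simple to derive from a row count. The sketch below assumes 1-based page numbers and reports zero pages for an empty list; `buildPageMeta` and the `page`, `pageSize`, and `totalItems` names are hypothetical, while `totalPages`, `hasNext`, and `hasPrevious` are the fields named above.

```typescript
interface PageMeta {
  page: number;
  pageSize: number;
  totalItems: number;
  totalPages: number;
  hasNext: boolean;
  hasPrevious: boolean;
}

// Assumes 1-based page numbers and a positive pageSize.
function buildPageMeta(page: number, pageSize: number, totalItems: number): PageMeta {
  const totalPages = Math.ceil(totalItems / pageSize);
  return {
    page,
    pageSize,
    totalItems,
    totalPages,
    hasNext: page < totalPages,
    hasPrevious: page > 1,
  };
}

// 45 conversations at 20 per page: three pages, and page 2 has neighbours on both sides.
const meta = buildPageMeta(2, 20, 45);
```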

Structured API errors ({ code, message }) are parsed consistently and shown as toasts — no raw stack traces in the UI.


Operational Tooling

Production observability is not just a dashboard. The platform includes:

| Feature | Implementation |
| --- | --- |
| Deep health checks | GET /api/health (Postgres and Redis status + latency) |
| Queue monitoring | Bull Board UI + failed-jobs admin API |
| Rate limiting | Per-session on /api/chat, per-IP on /api/ingest |
| OpenTelemetry | Spans across chat → ingest → worker (withSpan helper) |
| Idempotent ingest | Upsert by logId in worker |
| Secret redaction | PII package strips API keys, emails, SSNs, cards, bearer tokens |

The PII package applies regex-based redaction for emails, phone numbers, SSNs, credit cards, IP addresses, and common API key formats (OpenAI sk-, Anthropic sk-ant-, Google AIza, bearer tokens). Redaction runs in the SDK and again in the worker.
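A reduced sketch of regex-based redaction, covering only a few of the categories listed above. The patterns, the `[REDACTED_*]` placeholder format, and the `redact` name are illustrative assumptions; the actual package's rules are not shown in this post and are likely more thorough.

```typescript
// Illustrative rules only; each match is replaced by a labelled placeholder.
const REDACTION_RULES: Array<{ label: string; pattern: RegExp }> = [
  { label: "API_KEY", pattern: /\bsk-[A-Za-z0-9_-]{16,}/g }, // covers sk- and sk-ant- prefixes
  { label: "API_KEY", pattern: /\bAIza[A-Za-z0-9_-]{30,}/g },
  { label: "BEARER", pattern: /\bBearer\s+[A-Za-z0-9._~+\/-]+=*/gi },
  { label: "EMAIL", pattern: /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g },
  { label: "SSN", pattern: /\b\d{3}-\d{2}-\d{4}\b/g },
];

function redact(text: string): string {
  return REDACTION_RULES.reduce(
    (acc, rule) => acc.replace(rule.pattern, `[REDACTED_${rule.label}]`),
    text,
  );
}

// Redact first, then truncate, so a secret cut in half at the limit cannot slip past a pattern.
function sanitizePreview(text: string, maxChars = 500): string {
  return redact(text).slice(0, maxChars);
}

const preview = sanitizePreview(
  "Contact jane@example.com, SSN 123-45-6789, key sk-abcdefghijklmnopqrstuvwx",
);
// preview is "Contact [REDACTED_EMAIL], SSN [REDACTED_SSN], key [REDACTED_API_KEY]"
```

Regex redaction is best-effort: it catches well-known formats but not free-form secrets, which is one reason to run it in both the SDK and the worker and to store only truncated previews.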


Deployment: Docker Compose and Kubernetes

Docker Compose (local dev and demo)

One command brings up Postgres, Redis, API, worker, web, and a migration init container:

cp .env.example .env
docker compose up --build

| Service | URL |
| --- | --- |
| Web UI | http://localhost:5173 |
| API | http://localhost:3001/api |
| Bull Board | http://localhost:3001/api/admin/queues |

Kubernetes (self-hosted)

Full manifests deploy the same stack to minikube (or any standard cluster):

  • StatefulSets — Postgres (10Gi PVC), Redis

  • Job — One-shot Prisma schema migration before app pods start

  • Deployments — API (2 replicas), worker, web (nginx)

  • Ingress — Routes /api → API, / → web, with SSE-friendly timeouts (proxy-buffering: off, 3600s read timeout)

  • HPA — API (2–10 replicas) and worker (1–20 replicas) on CPU

A single script handles the full bootstrap:

./scripts/k8s-minikube-deploy.sh
minikube tunnel   # separate terminal
open http://llm-observe.local/

The ingress preserves the /api prefix (NestJS expects /api/health, not /health). That was a subtle bug caught during first deploy — nginx rewrite rules can silently strip path prefixes.


Key Design Decisions

| Decision | Rationale |
| --- | --- |
| BullMQ over sync writes | Ingestion never blocks chat |
| Separate worker process | Independent scaling and failure isolation |
| SSE over WebSockets | LLM streaming is unidirectional; SSE is simpler |
| Postgres over ClickHouse | Simpler ops; hourly rollups keep dashboards fast at moderate volume |
| SDK interceptor | Observability is automatic for all SDK consumers |
| Double PII redaction | SDK + worker = defense in depth |
| logId upsert | Idempotent retries without duplicate logs |
| Fastify over Express | Lower overhead for NestJS; better SSE performance |
| Monorepo | Shared types prevent API/SDK drift |

What I Would Add Next

  • KEDA — Scale workers on queue depth, not just CPU

  • ClickHouse — Long-retention analytics at high volume

  • OTLP export — Ship traces to Jaeger or Grafana Tempo (ConsoleSpanExporter works today)

  • Vault / External Secrets — Replace plain-text k8s secrets in production


Closing Thoughts

LLM Observe is not a wrapper around a provider API — it is an observability pipeline. The interesting engineering is in the decoupling: streaming chat on the hot path, async ingestion on the warm path, PII-safe storage on the cold path, and pre-aggregated rollups for the read path.

If you are building LLM features and treating logging as an afterthought, you will eventually need something like this. Starting with an SDK interceptor and an async queue is a small upfront cost that pays off the first time you need to debug a latency spike, audit token spend, or explain why a user's prompt never got a response.

The full source, Docker setup, and Kubernetes manifests are on GitHub:

github.com/Subramanyarao11/llm-observe


Stack: React · Vite · NestJS · Fastify · BullMQ · Redis · PostgreSQL · Prisma · OpenTelemetry · Docker · Kubernetes · Turborepo · pnpm