I Built a Full-Stack LLM Observability Platform for a Founding Engineer Take-Home
Every inference logged without blocking the hot path: SSE streaming, fire-and-forget ingest, and hourly rollups in PostgreSQL.
Repository: github.com/Subramanyarao11/llm-observe
This project started as a take-home assignment for a Founding Engineer role — I treated it as a chance to build something I'd actually want in production, not a minimum viable submission.
When you ship an LLM-powered feature, the hard part is not calling the API — it is knowing what happened afterward. How long did inference take? How many tokens were consumed? Did the call fail silently? What did the user actually send, and what came back? At scale, synchronous logging inside the request path becomes a bottleneck; storing full prompts creates privacy risk; and ad-hoc provider SDK metrics are inconsistent across OpenAI, Anthropic, and Gemini.
LLM Observe is my answer to that problem: a full-stack observability platform that captures every inference call through an asynchronous pipeline, redacts sensitive data before persistence, and surfaces operational metrics in a real-time dashboard. It runs locally via Docker Compose and deploys to a self-hosted Kubernetes cluster.
This post walks through why I built it the way I did — architecture, tradeoffs, and the details that matter in production.
The Problem: Observability Without Slowing Down Inference
LLM applications have a distinctive observability profile:
Latency-sensitive path — Users expect streaming tokens. Any blocking I/O on the hot path degrades UX.
High cardinality metadata — Provider, model, token counts, time-to-first-token (TTFT), error codes, session IDs.
Privacy constraints — Prompts and completions may contain PII or secrets. You cannot dump raw text into a log table.
Multi-provider reality — OpenAI, Anthropic, and Gemini expose usage data differently, especially during streaming.
The design goal was straightforward: capture rich inference metadata without ever blocking the chat response, and never store raw secrets or unredacted PII.
Architecture at a Glance
End-to-end flow
The user sends a message in the React chat UI. The API streams tokens back via Server-Sent Events (SSE).
The API delegates LLM calls to a shared SDK wrapper with provider-specific adapters.
On every call, success or failure, the SDK fire-and-forgets structured metadata to POST /api/ingest.
The API validates the payload with Zod, enqueues a job to BullMQ, and returns immediately.
A dedicated worker process redacts PII, upserts the log by logId, and updates hourly analytics rollups.
The dashboard reads aggregated metrics from PostgreSQL.
The chat path and the logging path are fully decoupled. A slow database write never delays a streaming response.
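For reference, the SSE leg of this flow relies on a plain-text framing: each event is a set of `field: value` lines terminated by a blank line. A minimal sketch, assuming one JSON payload per token chunk (the helper names and the payload shape are illustrative and not taken from the repository):

```typescript
// Hypothetical helpers; names and the JSON payload shape are assumptions.

/** Encode one payload as a Server-Sent Events frame. */
function formatSseEvent(data: string, eventName?: string): string {
  const lines: string[] = [];
  if (eventName !== undefined) lines.push(`event: ${eventName}`);
  // Each line of the payload needs its own "data:" field.
  for (const line of data.split("\n")) lines.push(`data: ${line}`);
  // A blank line terminates the event.
  return lines.join("\n") + "\n\n";
}

/** Decode the data payloads from a buffer of complete SSE frames. */
function parseSseData(buffer: string): string[] {
  return buffer
    .split("\n\n")
    .filter((frame) => frame.trim().length > 0)
    .map((frame) =>
      frame
        .split("\n")
        .filter((line) => line.startsWith("data: "))
        .map((line) => line.slice("data: ".length))
        .join("\n"),
    );
}

// Two token chunks followed by a named terminal event.
const wire =
  formatSseEvent(JSON.stringify({ token: "Hel" })) +
  formatSseEvent(JSON.stringify({ token: "lo" })) +
  formatSseEvent("[DONE]", "end");

const payloads = parseSseData(wire);
```

The parser only handles the `data: ` form with a space after the colon, which is what the encoder above emits; a browser `EventSource` does this parsing for you.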
Monorepo Structure
The project is a pnpm + Turborepo monorepo containing the API, worker, and web apps alongside a set of shared packages.
Splitting the API and worker into separate deployable units was intentional. In production, you want to scale log ingestion independently from chat traffic. An API pod under load should not compete with batch log writes for CPU.
Shared packages (sdk, pii, types, db) keep contracts consistent. The ingest payload shape is defined once in @llm-observe/types and validated at the API boundary.
The SDK: Interceptor Pattern for Free Observability
The heart of the system is @llm-observe/sdk. Instead of asking every developer to remember to log inference calls, the SDK wraps provider adapters and emits metadata automatically.
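// Simplified flow in LLMClient
try {
  result = await this.adapters[options.provider].chat(options);
} catch (err) {
  status = "error"; // record the failure for the log
  throw err; // rethrow so the caller still sees the original error
} finally {
  // Runs on success and failure alike; emitLog is fire-and-forget (not awaited)
  this.emitLog({ logId, latencyMs, inputPreview, outputPreview, ... });
}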
Key behaviors:
Streaming token capture: For SSE chat, the SDK accumulates output and reads usage from provider-specific stream events (OpenAI usage, Anthropic message_delta, Gemini metadata). This fixed an early bug where token counts showed zero on the dashboard for streamed responses.
Preview sanitization at source: sanitizePreview() truncates to 500 characters and runs PII redaction before the payload leaves the SDK.
Fire-and-forget ingest: emitLog() POSTs to /api/ingest without awaiting a response on the critical path. Failures are swallowed with a console warning, not thrown into the chat flow.
Provider adapters: OpenAI, Anthropic, and Gemini each implement a common LLMAdapter interface, normalizing request/response handling.
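To make the streaming token capture concrete, here is a rough sketch of folding provider-specific stream events into one usage record. The event field names follow the providers' public streaming formats as I understand them; the `collectUsage` function and its types are hypothetical and not the SDK's actual code.

```typescript
// Sketch only: type and function names are hypothetical.

type Provider = "openai" | "anthropic" | "gemini";

interface TokenUsage {
  promptTokens: number;
  completionTokens: number;
  totalTokens: number;
}

// Loosely typed stream events; a real adapter would use the provider SDK types.
type StreamEvent = Record<string, any>;

/** Fold a sequence of stream events into one normalized usage record. */
function collectUsage(provider: Provider, events: StreamEvent[]): TokenUsage {
  let promptTokens = 0;
  let completionTokens = 0;

  for (const ev of events) {
    if (provider === "openai" && ev.usage) {
      // OpenAI: the final chunk carries `usage` when usage reporting is enabled.
      promptTokens = ev.usage.prompt_tokens ?? promptTokens;
      completionTokens = ev.usage.completion_tokens ?? completionTokens;
    } else if (provider === "anthropic") {
      // Anthropic: input tokens arrive on message_start, output on message_delta.
      if (ev.type === "message_start") {
        promptTokens = ev.message?.usage?.input_tokens ?? promptTokens;
      }
      if (ev.type === "message_delta") {
        completionTokens = ev.usage?.output_tokens ?? completionTokens;
      }
    } else if (provider === "gemini" && ev.usageMetadata) {
      // Gemini: chunks may carry usageMetadata; the last one seen wins.
      promptTokens = ev.usageMetadata.promptTokenCount ?? promptTokens;
      completionTokens = ev.usageMetadata.candidatesTokenCount ?? completionTokens;
    }
  }

  return { promptTokens, completionTokens, totalTokens: promptTokens + completionTokens };
}

const anthropicUsage = collectUsage("anthropic", [
  { type: "message_start", message: { usage: { input_tokens: 12, output_tokens: 1 } } },
  { type: "content_block_delta", delta: { type: "text_delta", text: "Hi" } },
  { type: "message_delta", delta: { stop_reason: "end_turn" }, usage: { output_tokens: 7 } },
]);
```

If no usage event ever arrives, the counts stay at zero, which is the symptom the zero-token dashboard bug would produce.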
Any application that imports this SDK gets observability without additional instrumentation. That is the same pattern used by APM agents and HTTP middleware — push cross-cutting concerns into a shared layer.
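The interceptor idea can be sketched in a provider-agnostic way. This is a simplified illustration, assuming a `Transport` function that stands in for the HTTP POST to /api/ingest; `observe`, `emitLog` as written here, and the log shape are hypothetical rather than the SDK's real signatures.

```typescript
// Hypothetical sketch: `Transport`, `emitLog`, and `observe` are illustrative.

interface InferenceLog {
  logId: string;
  status: "success" | "error";
  latencyMs: number;
}

// Stand-in for the HTTP POST to /api/ingest.
type Transport = (log: InferenceLog) => Promise<void>;

/** Send a log without awaiting it; delivery failures only produce a warning. */
function emitLog(
  send: Transport,
  log: InferenceLog,
  warn: (msg: string, err: unknown) => void = console.warn,
): void {
  try {
    // Deliberately not awaited: the caller continues immediately.
    void send(log).catch((err) => warn("ingest failed:", err));
  } catch (err) {
    // Covers transports that throw synchronously.
    warn("ingest failed:", err);
  }
}

/** Wrap any async call so a log is emitted on success and on failure. */
async function observe<T>(
  send: Transport,
  logId: string,
  call: () => Promise<T>,
): Promise<T> {
  const startedAt = Date.now();
  let status: InferenceLog["status"] = "success";
  try {
    return await call();
  } catch (err) {
    status = "error";
    throw err; // the caller still sees the original error
  } finally {
    emitLog(send, { logId, status, latencyMs: Date.now() - startedAt });
  }
}
```

A caller would write something like `observe(send, logId, () => adapter.chat(options))`; whether the transport succeeds or fails has no effect on the value or error the caller receives.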
Async Ingestion with BullMQ
Synchronous writes to PostgreSQL on every LLM call would add 10–50ms (or worse) to an already latency-sensitive path. BullMQ on Redis solves this cleanly.
Ingest service responsibilities:
Validate payload with Zod (reject malformed data early)
Enqueue to the inference-logs queue
Return 202 Accepted immediately
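The three ingest steps can be sketched without any framework. In this illustration a hand-written type guard stands in for the Zod schema and a plain interface stands in for the BullMQ queue; the payload fields, the job name, and the error code are assumptions made for the example.

```typescript
// Sketch under assumptions: `parseIngestPayload`, `JobQueue`, and the payload
// fields are hypothetical stand-ins for the Zod schema and BullMQ queue.

interface IngestPayload {
  logId: string;
  provider: "openai" | "anthropic" | "gemini";
  latencyMs: number;
}

interface JobQueue<T> {
  add(jobName: string, data: T): void;
}

interface HttpResult {
  statusCode: number;
  body: { code: string; message: string } | { accepted: true };
}

const PROVIDERS = ["openai", "anthropic", "gemini"];

/** Hand-written validation; the real service uses a Zod schema for this step. */
function parseIngestPayload(input: unknown): IngestPayload | null {
  if (typeof input !== "object" || input === null) return null;
  const p = input as Record<string, unknown>;
  if (typeof p.logId !== "string" || p.logId.length === 0) return null;
  if (typeof p.provider !== "string" || !PROVIDERS.includes(p.provider)) return null;
  if (typeof p.latencyMs !== "number" || !Number.isFinite(p.latencyMs) || p.latencyMs < 0) {
    return null;
  }
  return {
    logId: p.logId,
    provider: p.provider as IngestPayload["provider"],
    latencyMs: p.latencyMs,
  };
}

/** Validate, enqueue, and acknowledge without touching the database. */
function handleIngest(queue: JobQueue<IngestPayload>, input: unknown): HttpResult {
  const payload = parseIngestPayload(input);
  if (payload === null) {
    return {
      statusCode: 400,
      body: { code: "INVALID_PAYLOAD", message: "Malformed ingest payload" },
    };
  }
  queue.add("inference-log", payload);
  return { statusCode: 202, body: { accepted: true } };
}
```

The handler returns 202 rather than 201 because nothing has been persisted yet; the row only exists once the worker has processed the job.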
Worker responsibilities:
Consume jobs with configurable concurrency
Apply a second pass of PII redaction (defense in depth)
Upsert by logId, so retries do not create duplicate rows
Update AnalyticsHourlyRollup for fast dashboard queries
Emit internal events for real-time subscribers
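As an illustration of the idempotent write path, the sketch below uses in-memory Maps in place of the Postgres tables. All names (`processJob`, `LogRow`, `RollupRow`) are hypothetical. It also makes one assumption the text does not state: the rollup is only incremented when the log row is new, so a retried job cannot be counted twice.

```typescript
// Sketch only: in-memory Maps stand in for the Postgres tables.

interface LogRow {
  logId: string;
  provider: string;
  status: "success" | "error";
  latencyMs: number;
  totalTokens: number;
  createdAt: Date;
}

interface RollupRow {
  requestCount: number;
  errorCount: number;
  latencySumMs: number;
  tokenSum: number;
}

const logs = new Map<string, LogRow>();
const rollups = new Map<string, RollupRow>();

/** Truncate a timestamp to the start of its UTC hour. */
function hourBucket(date: Date): string {
  const d = new Date(date.getTime());
  d.setUTCMinutes(0, 0, 0);
  return d.toISOString();
}

/** Upsert the log by logId; only a first-time insert touches the rollup. */
function processJob(job: LogRow): "inserted" | "updated" {
  const isNew = !logs.has(job.logId);
  logs.set(job.logId, job); // same key on retry, so no duplicate rows

  if (isNew) {
    const key = `${hourBucket(job.createdAt)}|${job.provider}`;
    const row = rollups.get(key) ?? {
      requestCount: 0,
      errorCount: 0,
      latencySumMs: 0,
      tokenSum: 0,
    };
    row.requestCount += 1;
    row.errorCount += job.status === "error" ? 1 : 0;
    row.latencySumMs += job.latencyMs;
    row.tokenSum += job.totalTokens;
    rollups.set(key, row);
  }
  return isNew ? "inserted" : "updated";
}
```

In a real database the upsert and the rollup increment would need to share a transaction for this guarantee to hold under concurrent workers.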
BullMQ retry policy: 3 attempts with exponential backoff starting at 1 second. Jobs that exhaust their attempts land in BullMQ's failed set, which serves as the dead-letter queue here; they are visible in Bull Board at /api/admin/queues and via GET /api/admin/failed-jobs.
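Expressed as job options, that policy looks roughly like the object below. `attempts` and `backoff` are BullMQ job options; the queue wiring is omitted, and `retryDelaysMs` is a hypothetical helper that shows the resulting schedule under the assumption that the delay doubles on each retry.

```typescript
// Plain object only; in the real worker this would be passed to BullMQ.
const defaultJobOptions = {
  attempts: 3, // 1 initial try + up to 2 retries
  backoff: { type: "exponential", delay: 1000 },
};

/**
 * Delays before each retry, assuming the delay doubles on every retry and
 * starts at the configured base delay.
 */
function retryDelaysMs(attempts: number, baseDelayMs: number): number[] {
  const delays: number[] = [];
  for (let retry = 1; retry < attempts; retry++) {
    delays.push(baseDelayMs * 2 ** (retry - 1));
  }
  return delays;
}

const retrySchedule = retryDelaysMs(
  defaultJobOptions.attempts,
  defaultJobOptions.backoff.delay,
);
// retrySchedule is [1000, 2000]: a job is given up on about 3 seconds
// (plus processing time) after its first failure.
```

Three attempts means only two waits, so a transient Postgres outage longer than a few seconds will push jobs into the failed set.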
Why not Kafka?
For this scale, Redis + BullMQ is simpler to operate, sufficient for moderate throughput, and runs in Docker Compose as a single Redis container, with no broker cluster (or, on older Kafka versions, ZooKeeper ensemble) to manage. Kafka would make sense at millions of events per day across regions; BullMQ is the right default for a self-hosted observability stack.
Data Model
PostgreSQL holds three core entities:
Conversations and messages
Conversations are first-class objects with provider, model, status, and auto-generated titles (derived from the first user message, truncated to 50 characters). Messages store the full chat content for the UI; inference logs store previews only.
Inference logs (append-only)
Each row represents one LLM call:
| Field | Purpose |
|---|---|
| latencyMs, ttftMs | End-to-end and time-to-first-token |
| promptTokens, completionTokens, totalTokens | Usage tracking |
| inputPreview, outputPreview | Truncated, redacted text (max 500 chars) |
| status, errorCode, errorMessage | Failure diagnostics |
| metadata | Provider-specific JSON extras |
Indexes on (created_at, provider, status) keep dashboard queries fast.
Analytics hourly rollups
Raw log scans do not scale for dashboard aggregates. A materialized rollup table keyed by (hour, provider) stores pre-aggregated counts, token totals, and latency sums. The worker updates this on every successful write, so the dashboard reads O(hours × providers) instead of O(logs).
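On the read side, storing sums rather than averages is what lets the dashboard combine hours correctly. A small sketch, assuming a hypothetical row shape (the real rollup columns may differ):

```typescript
// Hypothetical row shape and helper name.

interface HourlyRollup {
  hour: string; // ISO timestamp truncated to the hour
  provider: string;
  requestCount: number;
  errorCount: number;
  latencySumMs: number;
  totalTokens: number;
}

interface DashboardSummary {
  totalRequests: number;
  avgLatencyMs: number;
  successRate: number;
  totalTokens: number;
}

/** Derive dashboard totals from rollup rows without scanning raw logs. */
function summarize(rows: HourlyRollup[]): DashboardSummary {
  let requests = 0;
  let errors = 0;
  let latencySum = 0;
  let tokens = 0;

  for (const row of rows) {
    requests += row.requestCount;
    errors += row.errorCount;
    latencySum += row.latencySumMs;
    tokens += row.totalTokens;
  }

  return {
    totalRequests: requests,
    // Divide the summed latency by the summed count; averaging per-hour
    // averages would over-weight quiet hours.
    avgLatencyMs: requests === 0 ? 0 : latencySum / requests,
    successRate: requests === 0 ? 1 : (requests - errors) / requests,
    totalTokens: tokens,
  };
}
```

The loop touches one row per (hour, provider) pair, so a 30-day window with three providers is at most 2,160 rows regardless of traffic.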
Frontend: Chat, Dashboard, and Ops UX
The React app (Vite + TanStack Query) has three main surfaces:
Dashboard — Total requests, average latency, success rate, token usage, provider breakdown, recent errors, CSV export. Metrics refresh after streaming chat completes.
Chat — SSE streaming with provider/model selection, markdown rendering, auto-scroll, and conversation title generation.
Conversations list — Paginated with page numbers, rows-per-page control, and API metadata (totalPages, hasNext, hasPrevious).
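The pagination metadata is simple arithmetic over the total row count. A minimal sketch, assuming 1-based page numbers and a hypothetical `buildPageMeta` helper (only the three field names totalPages, hasNext, and hasPrevious come from the API described above):

```typescript
// Hypothetical helper; assumes 1-based page numbers.

interface PageMeta {
  page: number;
  pageSize: number;
  totalItems: number;
  totalPages: number;
  hasNext: boolean;
  hasPrevious: boolean;
}

function buildPageMeta(totalItems: number, page: number, pageSize: number): PageMeta {
  if (pageSize <= 0) throw new RangeError("pageSize must be positive");
  // An empty result set still reports one (empty) page.
  const totalPages = Math.max(1, Math.ceil(totalItems / pageSize));
  // Clamp out-of-range page numbers instead of failing.
  const current = Math.min(Math.max(1, page), totalPages);
  return {
    page: current,
    pageSize,
    totalItems,
    totalPages,
    hasNext: current < totalPages,
    hasPrevious: current > 1,
  };
}

const secondPage = buildPageMeta(45, 2, 20);
// secondPage.totalPages is 3; hasNext and hasPrevious are both true
```

Returning these flags from the API keeps the page-number controls free of their own off-by-one logic.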
Structured API errors ({ code, message }) are parsed consistently and shown as toasts — no raw stack traces in the UI.
Operational Tooling
Production observability is not just a dashboard. The platform includes:
| Feature | Implementation |
|---|---|
| Deep health checks | GET /api/health — Postgres and Redis status + latency |
| Queue monitoring | Bull Board UI + failed-jobs admin API |
| Rate limiting | Per-session on /api/chat, per-IP on /api/ingest |
| OpenTelemetry | Spans across chat → ingest → worker (withSpan helper) |
| Idempotent ingest | Upsert by logId in worker |
| Secret redaction | PII package strips API keys, emails, SSNs, cards, bearer tokens |
The PII package applies regex-based redaction for emails, phone numbers, SSNs, credit cards, IP addresses, and common API key formats (OpenAI sk-, Anthropic sk-ant-, Google AIza, bearer tokens). Redaction runs in the SDK and again in the worker.
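To give a feel for regex-based redaction, here is a reduced sketch covering a subset of those categories. The patterns are illustrative only and are not the actual @llm-observe/pii rules; the `sanitizePreview` body is likewise a guess at the behavior described earlier, and it redacts before truncating so a secret cannot be cut in half and slip past its pattern.

```typescript
// Illustrative patterns only; regex redaction will miss some formats.

const REDACTION_RULES: Array<{ label: string; pattern: RegExp }> = [
  // Specific key formats run before generic ones (sk-ant- before sk-).
  { label: "ANTHROPIC_KEY", pattern: /sk-ant-[A-Za-z0-9_-]{10,}/g },
  { label: "OPENAI_KEY", pattern: /sk-[A-Za-z0-9_-]{10,}/g },
  { label: "GOOGLE_KEY", pattern: /AIza[A-Za-z0-9_-]{20,}/g },
  { label: "BEARER_TOKEN", pattern: /Bearer\s+[A-Za-z0-9._~+\/-]+=*/g },
  { label: "EMAIL", pattern: /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g },
  { label: "SSN", pattern: /\b\d{3}-\d{2}-\d{4}\b/g },
];

/** Replace every match of every rule with a labeled placeholder. */
function redact(text: string): string {
  return REDACTION_RULES.reduce(
    (out, rule) => out.replace(rule.pattern, `[REDACTED_${rule.label}]`),
    text,
  );
}

/** Redact first, then truncate to the preview length. */
function sanitizePreview(text: string, maxChars = 500): string {
  return redact(text).slice(0, maxChars);
}

const preview = sanitizePreview("Email jane.doe@example.com, SSN 123-45-6789.");
// preview is "Email [REDACTED_EMAIL], SSN [REDACTED_SSN]."
```

Because the function is pure and idempotent on already-redacted text, running it once in the SDK and again in the worker is safe.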
Deployment: Docker Compose and Kubernetes
Docker Compose (local dev and demo)
One command brings up Postgres, Redis, API, worker, web, and a migration init container:
cp .env.example .env
docker compose up --build
| Service | URL |
|---|---|
| Web UI | http://localhost:5173 |
| API | http://localhost:3001/api |
| Bull Board | http://localhost:3001/api/admin/queues |
Kubernetes (self-hosted)
Full manifests deploy the same stack to minikube (or any standard cluster):
StatefulSets: Postgres (10Gi PVC), Redis
Job: One-shot Prisma schema migration before app pods start
Deployments: API (2 replicas), worker, web (nginx)
Ingress: Routes /api → API, / → web, with SSE-friendly timeouts (proxy-buffering: off, 3600s read timeout)
HPA: API (2–10 replicas) and worker (1–20 replicas) on CPU
A single script handles the full bootstrap:
./scripts/k8s-minikube-deploy.sh
minikube tunnel # separate terminal
open http://llm-observe.local/
The ingress preserves the /api prefix (NestJS expects /api/health, not /health). That was a subtle bug caught during first deploy — nginx rewrite rules can silently strip path prefixes.
Key Design Decisions
| Decision | Rationale |
|---|---|
| BullMQ over sync writes | Ingestion never blocks chat |
| Separate worker process | Independent scaling and failure isolation |
| SSE over WebSockets | LLM streaming is unidirectional; SSE is simpler |
| Postgres over ClickHouse | Simpler ops; hourly rollups keep dashboards fast at moderate volume |
| SDK interceptor | Observability is automatic for all SDK consumers |
| Double PII redaction | SDK + worker = defense in depth |
| logId upsert | Idempotent retries without duplicate logs |
| Fastify over Express | Lower per-request overhead as the NestJS HTTP adapter; better SSE performance |
| Monorepo | Shared types prevent API/SDK drift |
What I Would Add Next
KEDA — Scale workers on queue depth, not just CPU
ClickHouse — Long-retention analytics at high volume
OTLP export — Ship traces to Jaeger or Grafana Tempo (ConsoleSpanExporter works today)
Vault / External Secrets — Replace plain-text k8s secrets in production
Closing Thoughts
LLM Observe is not a wrapper around a provider API — it is an observability pipeline. The interesting engineering is in the decoupling: streaming chat on the hot path, async ingestion on the warm path, PII-safe storage on the cold path, and pre-aggregated rollups for the read path.
If you are building LLM features and treating logging as an afterthought, you will eventually need something like this. Starting with an SDK interceptor and an async queue is a small upfront cost that pays off the first time you need to debug a latency spike, audit token spend, or explain why a user's prompt never got a response.
The full source, Docker setup, and Kubernetes manifests are on GitHub:
github.com/Subramanyarao11/llm-observe
Stack: React · Vite · NestJS · Fastify · BullMQ · Redis · PostgreSQL · Prisma · OpenTelemetry · Docker · Kubernetes · Turborepo · pnpm



