Full-Stack LLM Observability Platform: Take-Home Build Guide

Repository: github.com/Subramanyarao11/llm-observe

This project started as a take-home assignment for a Founding Engineer role — I treated it as a chance to build something I'd actually want in production, not a minimum viable submission.

When you ship an LLM-powered feature, the hard part is not calling the API — it is knowing what happened afterward. How long did inference take? How many tokens were consumed? Did the call fail silently? What did the user actually send, and what came back? At scale, synchronous logging inside the request path becomes a bottleneck; storing full prompts creates privacy risk; and ad-hoc provider SDK metrics are inconsistent across OpenAI, Anthropic, and Gemini.

LLM Observe is my answer to that problem: a full-stack observability platform that captures every inference call through an asynchronous pipeline, redacts sensitive data before persistence, and surfaces operational metrics in a real-time dashboard. It runs locally via Docker Compose and deploys to a self-hosted Kubernetes cluster.

This post walks through why I built it the way I did — architecture, tradeoffs, and the details that matter in production.

The Problem: Observability Without Slowing Down Inference

LLM applications have a distinctive observability profile:

Latency-sensitive path — Users expect streaming tokens. Any blocking I/O on the hot path degrades UX.
High cardinality metadata — Provider, model, token counts, time-to-first-token (TTFT), error codes, session IDs.
Privacy constraints — Prompts and completions may contain PII or secrets. You cannot dump raw text into a log table.
Multi-provider reality — OpenAI, Anthropic, and Gemini expose usage data differently, especially during streaming.

The design goal was straightforward: capture rich inference metadata without ever blocking the chat response, and never store raw secrets or unredacted PII.

Architecture at a Glance

End-to-end flow

The user sends a message in the React chat UI. The API streams tokens back via Server-Sent Events (SSE).
The API delegates LLM calls to a shared SDK wrapper with provider-specific adapters.
On every call — success or failure — the SDK fire-and-forgets structured metadata to POST /api/ingest.
The API validates the payload with Zod, enqueues a job to BullMQ, and returns immediately.
A dedicated worker process redacts PII, upserts the log by logId, and updates hourly analytics rollups.
The dashboard reads aggregated metrics from PostgreSQL.

The chat path and the logging path are fully decoupled. A slow database write never delays a streaming response.
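The first step of the flow depends on Server-Sent Events framing. As a rough illustration of the wire format only, the sketch below encodes and decodes SSE frames; `encodeSseEvent` and `decodeSseBuffer` are hypothetical helper names, not functions taken from the repository.

```typescript
// Encode one Server-Sent Events frame. Each line of the payload gets its own
// "data:" field, and a blank line terminates the event.
function encodeSseEvent(data: string, event?: string): string {
  const lines: string[] = [];
  if (event !== undefined) lines.push(`event: ${event}`);
  for (const line of data.split(/\r\n|\r|\n/)) lines.push(`data: ${line}`);
  return lines.join("\n") + "\n\n";
}

// Client-side counterpart: split a buffer into complete events and return
// the incomplete tail so it can be prepended to the next network chunk.
function decodeSseBuffer(buffer: string): {
  events: { event: string; data: string }[];
  rest: string;
} {
  const parts = buffer.split("\n\n");
  const rest = parts.pop() ?? "";
  const events = parts
    .filter((part) => part !== "")
    .map((part) => {
      let event = "message"; // default event name when none is given
      const data: string[] = [];
      for (const line of part.split("\n")) {
        if (line.startsWith("event:")) event = line.slice(6).trim();
        else if (line.startsWith("data:")) data.push(line.slice(5).replace(/^ /, ""));
      }
      return { event, data: data.join("\n") };
    });
  return { events, rest };
}
```

Keeping the unparsed tail matters because a network chunk can end in the middle of a frame; the decoder only emits events that have been terminated by a blank line.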

Monorepo Structure

The project is a pnpm + Turborepo monorepo with three deployable apps (API, worker, web) and a set of shared packages.

Splitting the API and worker into separate deployable units was intentional. In production, you want to scale log ingestion independently from chat traffic. An API pod under load should not compete with batch log writes for CPU.

Shared packages (sdk, pii, types, db) keep contracts consistent. The ingest payload shape is defined once in @llm-observe/types and validated at the API boundary.
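To make the "defined once" idea concrete, here is a sketch of what a shared payload type and a boundary check might look like. The field names follow the inference log columns described later in this post; the `IngestPayload` name, the status values, and the hand-written `isIngestPayload` guard are illustrative assumptions (the project itself validates with Zod, and its actual definitions may differ).

```typescript
// Hypothetical shape of the ingest payload. Field names follow the inference
// log columns in this post; the type names and status values are assumed.
type InferenceStatus = "success" | "error";

interface IngestPayload {
  logId: string;
  provider: string;
  model: string;
  status: InferenceStatus;
  latencyMs: number;
  ttftMs?: number;
  promptTokens?: number;
  completionTokens?: number;
  totalTokens?: number;
  inputPreview: string;
  outputPreview: string;
  errorCode?: string;
  errorMessage?: string;
  metadata?: Record<string, unknown>;
}

// Hand-rolled runtime guard standing in for the Zod schema at the API boundary.
function isIngestPayload(value: unknown): value is IngestPayload {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  const isOptionalCount = (x: unknown): boolean =>
    x === undefined || (typeof x === "number" && x >= 0);
  return (
    typeof v.logId === "string" && v.logId.length > 0 &&
    typeof v.provider === "string" &&
    typeof v.model === "string" &&
    (v.status === "success" || v.status === "error") &&
    typeof v.latencyMs === "number" && v.latencyMs >= 0 &&
    typeof v.inputPreview === "string" && v.inputPreview.length <= 500 &&
    typeof v.outputPreview === "string" && v.outputPreview.length <= 500 &&
    isOptionalCount(v.ttftMs) &&
    isOptionalCount(v.promptTokens) &&
    isOptionalCount(v.completionTokens) &&
    isOptionalCount(v.totalTokens)
  );
}
```

Because both the SDK and the API import the same type, a renamed field shows up as a compile error on both sides rather than as a silent 400 at runtime.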

The SDK: Interceptor Pattern for Free Observability

The heart of the system is @llm-observe/sdk. Instead of asking every developer to remember to log inference calls, the SDK wraps provider adapters and emits metadata automatically.

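// Simplified flow in LLMClient
const startedAt = Date.now();
let status = "success";
let result;
try {
  result = await this.adapters[options.provider].chat(options);
} catch (err) {
  status = "error"; // record the failure, then let the caller see it
  throw err;
} finally {
  // Runs on both success and failure, so every call produces one log entry
  const latencyMs = Date.now() - startedAt;
  this.emitLog({ logId, status, latencyMs, inputPreview, outputPreview, ... });
}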

Key behaviors:

Streaming token capture — For SSE chat, the SDK accumulates output and reads usage from provider-specific stream events (OpenAI usage, Anthropic message_delta, Gemini metadata). This fixed an early bug where token counts showed zero on the dashboard for streamed responses.
Preview sanitization at source — sanitizePreview() truncates to 500 characters and runs PII redaction before the payload leaves the SDK.
Fire-and-forget ingest — emitLog() POSTs to /api/ingest without awaiting a response on the critical path. Failures are swallowed with a console warning, not thrown into the chat flow.
Provider adapters — OpenAI, Anthropic, and Gemini each implement a common LLMAdapter interface, normalizing request/response handling.
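The streaming capture behavior can be sketched as a pass-through generator. This is a minimal illustration under assumptions: the normalized `StreamEvent` shape, `captureStream`, and the `onDone` callback are hypothetical names, and each real adapter would have to map its provider's events onto something like them.

```typescript
// Hypothetical normalized stream events. Each provider adapter would map its
// own events (OpenAI usage, Anthropic message_delta, Gemini metadata) to these.
type StreamEvent =
  | { type: "text"; text: string }
  | { type: "usage"; promptTokens: number; completionTokens: number };

interface StreamSummary {
  output: string;
  ttftMs: number | null;
  promptTokens: number;
  completionTokens: number;
  totalTokens: number;
}

// Forwards text chunks to the caller as they arrive while accumulating the
// data the log entry needs once the stream ends (or is abandoned).
async function* captureStream(
  events: AsyncIterable<StreamEvent>,
  onDone: (summary: StreamSummary) => void,
  now: () => number = Date.now,
): AsyncGenerator<string> {
  const startedAt = now();
  let output = "";
  let ttftMs: number | null = null;
  let promptTokens = 0;
  let completionTokens = 0;
  try {
    for await (const event of events) {
      if (event.type === "text") {
        if (ttftMs === null) ttftMs = now() - startedAt; // first token seen
        output += event.text;
        yield event.text;
      } else {
        // Usage tends to arrive late in the stream; keep the latest values.
        promptTokens = event.promptTokens;
        completionTokens = event.completionTokens;
      }
    }
  } finally {
    onDone({
      output,
      ttftMs,
      promptTokens,
      completionTokens,
      totalTokens: promptTokens + completionTokens,
    });
  }
}
```

Reporting from the `finally` block means a summary is produced even when the stream throws or the consumer stops reading early, so partial responses still show up in the logs.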

Any application that imports this SDK gets observability without additional instrumentation. That is the same pattern used by APM agents and HTTP middleware — push cross-cutting concerns into a shared layer.
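A self-contained sketch of the interceptor idea follows. It is not the SDK's actual code: `ObservedClient`, `LogSink`, and the trimmed-down option types are hypothetical, and the counter-based `logId` is a placeholder for whatever ID scheme the real SDK uses.

```typescript
// Hypothetical minimal versions of the SDK types; the real interfaces are richer.
interface ChatOptions { provider: string; model: string; prompt: string }
interface ChatResult { text: string }
interface LLMAdapter { chat(options: ChatOptions): Promise<ChatResult> }
interface LogEntry {
  logId: string;
  provider: string;
  model: string;
  status: "success" | "error";
  latencyMs: number;
  errorMessage?: string;
}
type LogSink = (entry: LogEntry) => Promise<void>; // e.g. a POST to /api/ingest

class ObservedClient {
  private nextId = 0;
  private readonly adapters: Record<string, LLMAdapter>;
  private readonly sink: LogSink;
  private readonly warn: (message: string) => void;

  constructor(
    adapters: Record<string, LLMAdapter>,
    sink: LogSink,
    warn: (message: string) => void = () => {},
  ) {
    this.adapters = adapters;
    this.sink = sink;
    this.warn = warn;
  }

  async chat(options: ChatOptions): Promise<ChatResult> {
    const startedAt = Date.now();
    let status: LogEntry["status"] = "success";
    let errorMessage: string | undefined;
    try {
      const adapter = this.adapters[options.provider];
      if (!adapter) throw new Error(`Unknown provider: ${options.provider}`);
      return await adapter.chat(options);
    } catch (err) {
      status = "error";
      errorMessage = err instanceof Error ? err.message : String(err);
      throw err; // the caller still sees the provider failure
    } finally {
      this.emitLog({
        logId: `log-${++this.nextId}`, // placeholder ID for illustration
        provider: options.provider,
        model: options.model,
        status,
        latencyMs: Date.now() - startedAt,
        errorMessage,
      });
    }
  }

  // Fire-and-forget: the promise is deliberately not awaited, and a failing
  // sink only produces a warning instead of breaking the chat call.
  private emitLog(entry: LogEntry): void {
    try {
      void this.sink(entry).catch((e) => this.warn(`log emit failed: ${String(e)}`));
    } catch (e) {
      this.warn(`log emit failed: ${String(e)}`);
    }
  }
}
```

Callers only ever use `chat()`; the logging side effect lives entirely in the wrapper, which is what makes the observability automatic for every consumer.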

Async Ingestion with BullMQ

Synchronous writes to PostgreSQL on every LLM call would add 10–50ms (or worse) to an already latency-sensitive path. BullMQ on Redis solves this cleanly.

Ingest service responsibilities:

Validate payload with Zod (reject malformed data early)
Enqueue to the inference-logs queue
Return 202 Accepted immediately
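The three ingest steps can be sketched without any framework. In this illustration the `JobQueue` interface stands in for the small part of a BullMQ queue that is needed, `parseIngestBody` stands in for the Zod schema, and the `INVALID_PAYLOAD` error code and `inference-log` job name are made up for the example.

```typescript
// Stand-in for the subset of a queue client used here; in the real service
// this role is played by a BullMQ queue backed by Redis.
interface JobQueue<T> {
  add(name: string, data: T): Promise<unknown>;
}

interface IngestBody { logId: string; provider: string; latencyMs: number }
interface HttpResult { status: number; body: Record<string, unknown> }

// Hand-written check standing in for the Zod schema.
function parseIngestBody(raw: unknown): IngestBody | null {
  if (typeof raw !== "object" || raw === null) return null;
  const r = raw as Record<string, unknown>;
  if (typeof r.logId !== "string" || r.logId === "") return null;
  if (typeof r.provider !== "string") return null;
  if (typeof r.latencyMs !== "number" || !(r.latencyMs >= 0)) return null;
  return { logId: r.logId, provider: r.provider, latencyMs: r.latencyMs };
}

async function handleIngest(
  raw: unknown,
  queue: JobQueue<IngestBody>,
): Promise<HttpResult> {
  const payload = parseIngestBody(raw);
  if (payload === null) {
    // Reject malformed data before it ever reaches the queue.
    return {
      status: 400,
      body: { code: "INVALID_PAYLOAD", message: "Malformed ingest payload" },
    };
  }
  // Only the enqueue happens during the request; persistence is the worker's job.
  await queue.add("inference-log", payload);
  return { status: 202, body: { accepted: true, logId: payload.logId } };
}
```

The 202 status is the honest answer here: the payload has been accepted for processing, but nothing has been written to PostgreSQL yet.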

Worker responsibilities:

Consume jobs with configurable concurrency
Apply a second pass of PII redaction (defense in depth)
Upsert by logId — retries do not create duplicate rows
Update AnalyticsHourlyRollup for fast dashboard queries
Emit internal events for real-time subscribers
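The idempotency guarantee comes from keying writes on `logId`. The sketch below uses an in-memory `Map` as a stand-in for the log table so it can run on its own; `LogStore` and `processLogJob` are hypothetical names, and with Prisma the same idea would typically map onto an `upsert` call.

```typescript
interface LogJob {
  logId: string;
  provider: string;
  inputPreview: string;
  outputPreview: string;
  latencyMs: number;
}

// In-memory stand-in for the inference log table, keyed by logId.
class LogStore {
  private rows = new Map<string, LogJob>();

  // Insert or overwrite: a retried job replaces its earlier row instead of
  // adding a second one.
  upsert(row: LogJob): "created" | "updated" {
    const outcome = this.rows.has(row.logId) ? "updated" : "created";
    this.rows.set(row.logId, row);
    return outcome;
  }

  get(logId: string): LogJob | undefined {
    return this.rows.get(logId);
  }

  count(): number {
    return this.rows.size;
  }
}

function processLogJob(
  job: LogJob,
  store: LogStore,
  redact: (text: string) => string,
): "created" | "updated" {
  const clean: LogJob = {
    ...job,
    inputPreview: redact(job.inputPreview), // second redaction pass
    outputPreview: redact(job.outputPreview),
  };
  return store.upsert(clean);
}
```

Running the same job twice leaves exactly one row, which is the property that makes queue retries safe.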

BullMQ retry policy: 3 attempts with exponential backoff starting at 1 second. Jobs that exhaust their attempts land in BullMQ's failed set, which serves as the dead-letter queue here; they are visible in Bull Board at /api/admin/queues and via GET /api/admin/failed-jobs.
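As a worked example of that policy, the sketch below writes the retry settings in the shape of BullMQ job options and computes the resulting wait times. `retryDelaysMs` is a hypothetical helper, and it assumes the built-in exponential strategy doubles the base delay after each failed attempt.

```typescript
// Retry settings in the shape of BullMQ job options.
const retryOptions = {
  attempts: 3,
  backoff: { type: "exponential" as const, delay: 1000 },
};

// Wait before each retry, assuming baseDelayMs * 2^(failedAttempts - 1).
// With N attempts there are at most N - 1 retries.
function retryDelaysMs(attempts: number, baseDelayMs: number): number[] {
  const delays: number[] = [];
  for (let failed = 1; failed < attempts; failed++) {
    delays.push(baseDelayMs * Math.pow(2, failed - 1));
  }
  return delays;
}

const delays = retryDelaysMs(retryOptions.attempts, retryOptions.backoff.delay);
// delays is [1000, 2000]
```

So under this assumption a job is tried immediately, again after about 1 second, and a final time about 2 seconds after that, before it is moved to the failed set.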

Why not Kafka?

For this scale, Redis + BullMQ is simpler to operate, sufficient for moderate throughput, and runs in Docker Compose as a single Redis container rather than a separate broker cluster (plus ZooKeeper on older Kafka versions). Kafka would make sense at millions of events per day across regions; BullMQ is the right default for a self-hosted observability stack.

Data Model

PostgreSQL holds three groups of tables:

Conversations and messages

Conversations are first-class objects with provider, model, status, and auto-generated titles (derived from the first user message, truncated to 50 characters). Messages store the full chat content for the UI; inference logs store previews only.

Inference logs (append-only)

Each row represents one LLM call:

| Field | Purpose |
| --- | --- |
| `latencyMs`, `ttftMs` | End-to-end and time-to-first-token |
| `promptTokens`, `completionTokens`, `totalTokens` | Usage tracking |
| `inputPreview`, `outputPreview` | Truncated, redacted text (max 500 chars) |
| `status`, `errorCode`, `errorMessage` | Failure diagnostics |
| `metadata` | Provider-specific JSON extras |

Indexes on (created_at, provider, status) keep dashboard queries fast.

Analytics hourly rollups

Raw log scans do not scale for dashboard aggregates. A materialized rollup table keyed by (hour, provider) stores pre-aggregated counts, token totals, and latency sums. The worker updates this on every successful write, so the dashboard reads O(hours × providers) instead of O(logs).
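Here is a small in-memory sketch of the rollup idea. `HourlyRollups`, `hourBucket`, and the field names on `Rollup` are illustrative assumptions, not the actual AnalyticsHourlyRollup schema.

```typescript
interface Rollup {
  requestCount: number;
  errorCount: number;
  totalTokens: number;
  latencySumMs: number;
}

interface LoggedCall {
  createdAt: Date;
  provider: string;
  status: "success" | "error";
  totalTokens: number;
  latencyMs: number;
}

// Truncate a timestamp to the start of its UTC hour.
function hourBucket(at: Date): string {
  const d = new Date(at.getTime());
  d.setUTCMinutes(0, 0, 0);
  return d.toISOString();
}

class HourlyRollups {
  private buckets = new Map<string, Rollup>();

  // Called once per persisted log; touches exactly one (hour, provider) bucket.
  record(log: LoggedCall): void {
    const key = `${hourBucket(log.createdAt)}|${log.provider}`;
    const r = this.buckets.get(key) ??
      { requestCount: 0, errorCount: 0, totalTokens: 0, latencySumMs: 0 };
    r.requestCount += 1;
    if (log.status === "error") r.errorCount += 1;
    r.totalTokens += log.totalTokens;
    r.latencySumMs += log.latencyMs;
    this.buckets.set(key, r);
  }

  // Dashboard-style read: iterates buckets, never raw logs.
  averageLatencyMs(): number {
    let count = 0;
    let sum = 0;
    for (const r of this.buckets.values()) {
      count += r.requestCount;
      sum += r.latencySumMs;
    }
    return count === 0 ? 0 : sum / count;
  }

  size(): number {
    return this.buckets.size;
  }
}
```

Storing latency sums and counts instead of averages is what lets buckets be combined: an average over several hours is the total sum divided by the total count, whereas averaging per-hour averages would weight quiet and busy hours equally.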

Frontend: Chat, Dashboard, and Ops UX

The React app (Vite + TanStack Query) has three main surfaces:

Dashboard — Total requests, average latency, success rate, token usage, provider breakdown, recent errors, CSV export. Metrics refresh after streaming chat completes.

Chat — SSE streaming with provider/model selection, markdown rendering, auto-scroll, and conversation title generation.

Conversations list — Paginated with page numbers, rows-per-page control, and API metadata (totalPages, hasNext, hasPrevious).
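The pagination metadata is simple arithmetic, sketched below. `buildPageMeta` is a hypothetical helper; the sketch assumes 1-based page numbers and treats an empty list as having a single (empty) page, neither of which is confirmed by the source.

```typescript
interface PageMeta {
  page: number;
  pageSize: number;
  totalItems: number;
  totalPages: number;
  hasNext: boolean;
  hasPrevious: boolean;
}

// Derive list metadata from the requested page and the total row count.
function buildPageMeta(page: number, pageSize: number, totalItems: number): PageMeta {
  if (!Number.isInteger(page) || page < 1) {
    throw new RangeError("page must be an integer >= 1");
  }
  if (!Number.isInteger(pageSize) || pageSize < 1) {
    throw new RangeError("pageSize must be an integer >= 1");
  }
  // Assumption: an empty result set still counts as one page.
  const totalPages = Math.max(1, Math.ceil(totalItems / pageSize));
  return {
    page,
    pageSize,
    totalItems,
    totalPages,
    hasNext: page < totalPages,
    hasPrevious: page > 1,
  };
}

// Row offset for the database query behind a given page.
function pageOffset(page: number, pageSize: number): number {
  return (page - 1) * pageSize;
}
```

For example, 35 conversations at 10 rows per page give 4 pages; page 2 starts at offset 10 and has both a next and a previous page.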

Structured API errors ({ code, message }) are parsed consistently and shown as toasts — no raw stack traces in the UI.

Operational Tooling

Production observability is not just a dashboard. The platform includes:

| Feature | Implementation |
| --- | --- |
| Deep health checks | `GET /api/health`: Postgres and Redis status + latency |
| Queue monitoring | Bull Board UI + failed-jobs admin API |
| Rate limiting | Per-session on `/api/chat`, per-IP on `/api/ingest` |
| OpenTelemetry | Spans across chat → ingest → worker (`withSpan` helper) |
| Idempotent ingest | Upsert by `logId` in worker |
| Secret redaction | PII package strips API keys, emails, SSNs, cards, bearer tokens |

The PII package applies regex-based redaction for emails, phone numbers, SSNs, credit cards, IP addresses, and common API key formats (OpenAI sk-, Anthropic sk-ant-, Google AIza, bearer tokens). Redaction runs in the SDK and again in the worker.
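A reduced sketch of regex-based redaction is shown below. The patterns are illustrative and cover only a subset of the categories listed above; they are not the package's actual rules, and the `[REDACTED_…]` placeholder format is an assumption. The sketch also assumes redaction runs before truncation, which the source does not specify.

```typescript
// Illustrative patterns only; real rules need broader coverage and testing.
// Order matters: the Anthropic prefix must be tried before the generic sk- rule.
const REDACTION_RULES: { label: string; pattern: RegExp }[] = [
  { label: "ANTHROPIC_KEY", pattern: /\bsk-ant-[A-Za-z0-9_-]{10,}/g },
  { label: "OPENAI_KEY", pattern: /\bsk-[A-Za-z0-9_-]{20,}/g },
  { label: "GOOGLE_KEY", pattern: /\bAIza[A-Za-z0-9_-]{35}/g },
  { label: "BEARER_TOKEN", pattern: /\bBearer\s+[A-Za-z0-9._~+\/-]+=*/g },
  { label: "EMAIL", pattern: /\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b/g },
  { label: "SSN", pattern: /\b\d{3}-\d{2}-\d{4}\b/g },
];

// Apply every rule in turn, replacing matches with a labeled placeholder.
function redact(text: string): string {
  return REDACTION_RULES.reduce(
    (acc, rule) => acc.replace(rule.pattern, `[REDACTED_${rule.label}]`),
    text,
  );
}

// Redact first, then truncate, so a secret cut in half at the 500-character
// boundary cannot slip past the patterns.
function sanitizePreview(text: string, maxLength = 500): string {
  return redact(text).slice(0, maxLength);
}
```

Regex redaction is best-effort by nature: it catches well-known formats but not free-form secrets, which is one reason to run it at two independent points in the pipeline.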

Deployment: Docker Compose and Kubernetes

Docker Compose (local dev and demo)

One command brings up Postgres, Redis, API, worker, web, and a migration init container:

cp .env.example .env
docker compose up --build

| Service | URL |
| --- | --- |
| Web UI | http://localhost:5173 |
| API | http://localhost:3001/api |
| Bull Board | http://localhost:3001/api/admin/queues |

Kubernetes (self-hosted)

Full manifests deploy the same stack to minikube (or any standard cluster):

StatefulSets — Postgres (10Gi PVC), Redis
Job — One-shot Prisma schema migration before app pods start
Deployments — API (2 replicas), worker, web (nginx)
Ingress — Routes /api → API, / → web, with SSE-friendly timeouts (proxy-buffering: off, 3600s read timeout)
HPA — API (2–10 replicas) and worker (1–20 replicas) on CPU

A single script handles the full bootstrap:

./scripts/k8s-minikube-deploy.sh
minikube tunnel   # separate terminal
open http://llm-observe.local/

The ingress preserves the /api prefix (NestJS expects /api/health, not /health). That was a subtle bug caught during first deploy — nginx rewrite rules can silently strip path prefixes.

Key Design Decisions

| Decision | Rationale |
| --- | --- |
| BullMQ over sync writes | Ingestion never blocks chat |
| Separate worker process | Independent scaling and failure isolation |
| SSE over WebSockets | LLM streaming is unidirectional; SSE is simpler |
| Postgres over ClickHouse | Simpler ops; hourly rollups keep dashboards fast at moderate volume |
| SDK interceptor | Observability is automatic for all SDK consumers |
| Double PII redaction | SDK + worker = defense in depth |
| logId upsert | Idempotent retries without duplicate logs |
| Fastify over Express | Lower overhead for NestJS; better SSE performance |
| Monorepo | Shared types prevent API/SDK drift |

What I Would Add Next

KEDA — Scale workers on queue depth, not just CPU
ClickHouse — Long-retention analytics at high volume
OTLP export — Ship traces to Jaeger or Grafana Tempo (ConsoleSpanExporter works today)
Vault / External Secrets — Replace plain-text k8s secrets in production

Closing Thoughts

LLM Observe is not a wrapper around a provider API — it is an observability pipeline. The interesting engineering is in the decoupling: streaming chat on the hot path, async ingestion on the warm path, PII-safe storage on the cold path, and pre-aggregated rollups for the read path.

If you are building LLM features and treating logging as an afterthought, you will eventually need something like this. Starting with an SDK interceptor and an async queue is a small upfront cost that pays off the first time you need to debug a latency spike, audit token spend, or explain why a user's prompt never got a response.

The full source, Docker setup, and Kubernetes manifests are on GitHub:

github.com/Subramanyarao11/llm-observe

Stack: React · Vite · NestJS · Fastify · BullMQ · Redis · PostgreSQL · Prisma · OpenTelemetry · Docker · Kubernetes · Turborepo · pnpm
