01. Why This Plane Exists
The observability plane answers three questions about every agent run. What did it actually do? Did it work? And what did it cost? Traditional health checks confirm only that the server returned a 200 status. They say nothing about whether the model answered the user correctly. That gap, between a transport success and a real semantic failure, is exactly what observability closes. Here is how it works. Every step of one user action drops a small, detailed receipt, and those receipts get stapled into a single story we call a run tree. From that tree you trace the whole journey, from the request through every backend call to the model's reply. Tracing is on by default here, so it is never something you have to remember to switch on. The payoff is honesty. Without it you might see green lights everywhere while the agent quietly hands users wrong answers. The cost report and the admin dashboard both feed on this same data, so one plane turns guesswork into a clear, walkable trail.
02. The System Design View
Observability here is built in three stacked layers, and each one covers a blind spot the others leave open. The base layer is native run tracing, and it captures every model call with no extra instrumentation. Its limit is that it cannot see the steps the graph takes between those calls. The second layer, OpenTelemetry, fills that gap by carrying one whole distributed trace. It stitches the web request, the database query, and each graph step together with the model call into a single backend story. By default those traces flow to the LangSmith ingest endpoint with zero configuration. An operator who wants more control can override that default and send the same spans to a different collector through environment variables. So the real trade-off is simplicity against control. Do nothing and it all works. Reach for the dials only if you must. Every span also carries the model name and a run identifier. That identifier and its link let you jump straight to the matching run, which saves real minutes while debugging. The model name is what drives the automatic cost math. The same pattern runs in the web application and in all the workers, so every trace looks consistent. That consistency is what makes the design both cheap and genuinely powerful.
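The override described above can be sketched in a few lines. This is only an illustration, not the project's actual wiring. OTEL_EXPORTER_OTLP_ENDPOINT is the standard OpenTelemetry variable for an OTLP collector address, while the default URL below is a placeholder standing in for the LangSmith ingest endpoint, and the function name is hypothetical.

```python
import os

# Placeholder for the LangSmith ingest endpoint; the real URL is not given
# in this document, so treat this value as an assumption.
DEFAULT_ENDPOINT = "https://langsmith.example/otel"


def resolve_exporter_endpoint(env=None):
    """Return the collector endpoint, preferring an operator override."""
    env = os.environ if env is None else env
    # Standard OpenTelemetry variable for the OTLP exporter target.
    override = env.get("OTEL_EXPORTER_OTLP_ENDPOINT", "").strip()
    return override or DEFAULT_ENDPOINT
```

With nothing set, the function falls back to the default, which is the "do nothing and it all works" path. Setting the variable is the dial an operator reaches for.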
03. Two Tracing Layers
Tracing actually lives in two cooperating layers, and they watch the request from two very different heights. The first layer sits in the TypeScript frontend, where it opens a span for every graph call and forwards the key correlation headers down to the backend. The second layer lives inside the Python workers, and it records each language model call together with its model name, its token count, and its latency. Both layers are strict no-operations whenever the required environment variables are unset or tracing is switched off, so in development and in testing they add effectively no overhead. Each captured value is also trimmed to a fixed maximum length, which keeps a single chatty payload from ballooning your storage bill. Working as a pair, the two layers paint one complete picture of a request, from the user's first action right through to the model's final reply. The frontend layer sees the high-level shape of the graph. The worker layer captures the fine-grained detail of each model interaction. The honest trade-off is this. You get deep insight only while tracing is enabled, and the moment it is disabled the system goes quiet and ships nothing. That deliberate quiet is what keeps insight and efficiency in balance.
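A minimal sketch of the worker-side behavior might look like the following. The variable names TRACING_ENABLED and TRACING_API_KEY, the 256 character cap, and the list used as a sink are all assumptions made for illustration; the document does not name the real ones.

```python
import os

MAX_ATTR_LEN = 256  # hypothetical cap; the real limit is not stated


def tracing_enabled(env=None):
    """Tracing is active only when it is switched on and a key is present."""
    env = os.environ if env is None else env
    switched_on = env.get("TRACING_ENABLED", "").lower() == "true"
    return switched_on and bool(env.get("TRACING_API_KEY"))


def truncate(value, limit=MAX_ATTR_LEN):
    """Trim any captured value to a fixed maximum length."""
    text = str(value)
    return text if len(text) <= limit else text[:limit]


def record_llm_call(sink, model, tokens, latency_ms, prompt, env=None):
    """Append one model-call record, or do nothing when tracing is off."""
    if not tracing_enabled(env):
        return  # strict no-op: nothing is built, nothing is shipped
    sink.append({
        "model": model,
        "tokens": tokens,
        "latency_ms": latency_ms,
        "prompt": truncate(prompt),
    })
```

The early return is the whole point of the design. When the variables are missing, no record is even constructed.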
04. Fully In LangSmith
When tracing is on and you have named no other destination, the system quietly elects LangSmith as the one backend for OpenTelemetry. There is a subtle duplication problem it solves first. Native run tracing and the trace pipeline could each report the same call. So the system forces every run to emit purely as a span and pushes it through the OpenTelemetry pipeline alone. That single path is what stops any run from being counted twice. The exporter aims at the LangSmith ingestion endpoint with the correct headers, and the result is exactly one distributed trace per user action. An operator who wants out can disable tracing or point that exporter elsewhere. The upside of the default is real. All your data lands under one roof with no setup. The cost is just as real. You are gently locked into LangSmith until you change the configuration by hand. So the trade-off is simplicity against flexibility. You get one clean view today, and you can retarget the exporter the day your team prefers another backend.
05. Linking Traces To Runs
Every graph response leaves carrying three correlation headers. The middleware stamps a trace identifier, a run identifier, and a direct link to the tracing tool. Those three together are how you cross-link one distributed trace with one specific run. The interesting choice is where this stamping happens. It runs as pure ASGI (Asynchronous Server Gateway Interface) middleware, mutating the response headers at the very earliest point, which lets it deliberately skip the heavier base HTTP middleware. Why bother? Staying at that low level keeps the trace context inside the request's own scope. Every read is also wrapped in a safety net, so a telemetry hiccup can never break a real user response.
On the way back, the TypeScript client reads those headers onto its own span and tries to inject the run identifiers from the current run tree. This part is strictly best effort. If the run tree is missing, the worker simply becomes its own root, which is both common and perfectly acceptable. The payoff for you is direct. From a slow page you click the deep link and land on the exact LangSmith run behind it.
The trade-off lives in that self-rooting fallback. When the run tree headers are absent, you lose the parent link, and yet that degraded mode is expected and harmless. Because the whole path is a strict no-operation when tracing is disabled, a response served with tracing off comes out byte for byte identical to one that never passed through the middleware at all.
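The header stamping can be sketched as pure ASGI middleware using only the standard library. Everything named here is illustrative: the class name, the get_context callable that supplies header values, and the header names in the usage note are assumptions, not the project's real identifiers. The demo application and the collect_response helper exist only to exercise the sketch.

```python
import asyncio


class CorrelationHeaderMiddleware:
    """Pure ASGI middleware that appends correlation headers, failing open."""

    def __init__(self, app, get_context):
        self.app = app
        # get_context is a hypothetical callable returning {header_name: value}.
        self.get_context = get_context

    async def __call__(self, scope, receive, send):
        if scope["type"] != "http":
            await self.app(scope, receive, send)
            return

        async def send_with_headers(message):
            if message["type"] == "http.response.start":
                try:
                    extra = [
                        (name.encode("latin-1"), value.encode("latin-1"))
                        for name, value in self.get_context().items()
                        if value
                    ]
                    if extra:
                        headers = list(message.get("headers", [])) + extra
                        message = {**message, "headers": headers}
                except Exception:
                    # Telemetry must never break a user response.
                    pass
            await send(message)

        await self.app(scope, receive, send_with_headers)


async def demo_app(scope, receive, send):
    """Tiny stand-in application used only for the demonstration."""
    await send({
        "type": "http.response.start",
        "status": 200,
        "headers": [(b"content-type", b"text/plain")],
    })
    await send({"type": "http.response.body", "body": b"ok"})


def collect_response(get_context):
    """Run one request through the middleware and return the sent messages."""
    sent = []

    async def send(message):
        sent.append(message)

    async def receive():
        return {"type": "http.request", "body": b"", "more_body": False}

    app = CorrelationHeaderMiddleware(demo_app, get_context)
    asyncio.run(app({"type": "http"}, receive, send))
    return sent
```

Calling collect_response with a function that returns, say, a trace identifier and a run identifier yields a response start message carrying those extra headers. Passing a function that returns an empty mapping, or one that raises, leaves the original message untouched, which is the fail-open, no-operation behavior described above.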
06. Cost And Feature Attribution
Every graph run records what it cost in two complementary ways, and the redundancy is the whole point. The first way stamps the active telemetry span with token counts, the model used, and the dollar figure. It leans on the standard generation attribute names. Then it layers on the platform's own dimensions for graph, feature, and product vertical. The result is spend you can slice by pillar with a single filter. The second way writes a plain row into a database table, and that row stands on its own with the tracing backend out of the picture. So the two methods divide the labor cleanly. The span annotation gives you rich, ad hoc slicing inside tools like Grafana. The database log is the always-on safety net. It is best effort and it never blocks the graph run, so a write failure there quietly drops the row rather than crashing anything. There is even a guard that suppresses the row whenever a run made zero model calls. Between them, your cost history is always there, whether you run a full tracing stack or just one humble database.
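The two recording paths can be sketched side by side. The gen_ai attribute keys follow the OpenTelemetry generative AI semantic conventions as I understand them; the app-prefixed keys, the run_costs table, and both function names are hypothetical. An in-memory dictionary and SQLite stand in for the real span and database.

```python
import sqlite3


def span_cost_attributes(model, input_tokens, output_tokens, cost_usd,
                         graph, feature, vertical):
    """Build the attributes that would be stamped on the active span."""
    return {
        "gen_ai.request.model": model,
        "gen_ai.usage.input_tokens": input_tokens,
        "gen_ai.usage.output_tokens": output_tokens,
        # Platform dimensions below use assumed key names.
        "app.cost_usd": cost_usd,
        "app.graph": graph,
        "app.feature": feature,
        "app.vertical": vertical,
    }


def log_cost_row(conn, attrs, model_calls):
    """Best-effort write of one cost row; returns True only if a row landed."""
    if model_calls == 0:
        return False  # guard: runs without model calls leave no row
    try:
        conn.execute(
            "INSERT INTO run_costs (model, input_tokens, output_tokens, "
            "cost_usd, graph, feature, vertical) VALUES (?, ?, ?, ?, ?, ?, ?)",
            (
                attrs["gen_ai.request.model"],
                attrs["gen_ai.usage.input_tokens"],
                attrs["gen_ai.usage.output_tokens"],
                attrs["app.cost_usd"],
                attrs["app.graph"],
                attrs["app.feature"],
                attrs["app.vertical"],
            ),
        )
        conn.commit()
        return True
    except sqlite3.Error:
        return False  # drop the row quietly; never fail the graph run
```

The database path swallows its own errors on purpose. A missing table or a locked file costs you one row, never a user response.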
07. Burn Rate Dashboards
A service level objective, or SLO, simply states a success rate and a latency goal we promise to hold. The clever part is that each objective is keyed over the very span types our tracing already emits, so the dashboard needs no fresh instrumentation. On top of those objectives sits the error budget burn rate dashboard, and it watches several time windows at once to catch trouble fast without drowning anyone in pages.
The fastest window spans one hour. Its threshold is set deliberately high, at 14.4 times the allowed error rate, which makes a severe outage trip an alarm almost immediately. A slower six-hour window uses a gentler threshold of 2.4, and it is the one that catches a quiet, gradual decay before it turns critical. A third window stretches across three days with a threshold of 1.0, flagging the slow budget leaks that deserve a ticket rather than a midnight page.
Each window gets its own dashboard panel for the fast, the slow, and the ticket burn rates, and an overview table lists every objective beside its target, its observed rate, and its remaining error budget. The thresholds are tuned so a page fires only when the fast and slow windows breach together for the same objective. That paired condition is the real win, because a brief single-window spike can never page anyone on its own, and only burn that is both sharp and sustained ever wakes the team.
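The arithmetic behind those panels is short enough to sketch. Burn rate here is the observed error rate divided by the error budget, where the budget is one minus the objective's target. The function names are hypothetical, and treating the three-day window as a ticket trigger on its own is an assumption drawn from the description above.

```python
FAST_THRESHOLD = 14.4    # one-hour window
SLOW_THRESHOLD = 2.4     # six-hour window
TICKET_THRESHOLD = 1.0   # three-day window


def burn_rate(errors, total, slo_target):
    """Observed error rate expressed as a multiple of the allowed error rate."""
    if total == 0:
        return 0.0
    budget = 1.0 - slo_target
    return (errors / total) / budget


def classify(fast, slow, ticket):
    """Page only when fast and slow breach together; otherwise maybe a ticket."""
    if fast >= FAST_THRESHOLD and slow >= SLOW_THRESHOLD:
        return "page"
    if ticket >= TICKET_THRESHOLD:
        return "ticket"
    return "ok"
```

As a worked case, an objective with a 99.9 percent target allows one failure per thousand requests. Fifteen failures in a thousand is a burn rate of roughly fifteen, which clears the fast threshold, but it only pages if the six-hour window is also at or above 2.4.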
08. Datasets And The Eval Gate
The offline evaluation gate is built on versioned golden datasets, and each core graph owns its own seeded fixtures. Sitting in front of those fixtures is a regression scorer. It judges open-ended answers against the signals we expect to see. Every example is tagged with a pass flag set to true, and that flag is the baseline a change has to keep meeting. The clever constraint is that the whole gate runs from a single script. It behaves identically on your laptop and inside continuous integration, so the result you see locally is the result the build pipeline sees. Any edit to a prompt or a model has to clear this gate before it ships. Because the test set is hermetic and database-free, nobody needs extra infrastructure to run it.
The fixtures mirror real selling scenarios such as cold outreach and follow-up messages. Each one declares the phrases an answer must contain and the phrases it must never contain. The scorer flags anything that drifts outside those rails. That is how broken outputs get caught here, on the workbench, long before a single user ever meets them. One script. One verdict. No separate environments to babysit. When the gate stays green the change is safe to deploy, and when it goes red the offending example points straight at the problem.
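A scorer of this kind can be sketched with plain substring checks. The fixture field names, the case-insensitive matching, and the generate callable that stands in for the graph under test are all assumptions; the real scorer may weigh signals differently.

```python
def score_example(answer, must_contain, must_not_contain):
    """Check one answer against its required and forbidden phrases."""
    text = answer.lower()
    missing = [p for p in must_contain if p.lower() not in text]
    forbidden = [p for p in must_not_contain if p.lower() in text]
    return {
        "passed": not missing and not forbidden,
        "missing": missing,
        "forbidden": forbidden,
    }


def run_gate(examples, generate):
    """Return the examples whose verdict differs from their expected pass flag."""
    failures = []
    for example in examples:
        result = score_example(
            generate(example["input"]),
            example["must_contain"],
            example["must_not_contain"],
        )
        if result["passed"] != example["expected_pass"]:
            failures.append((example["id"], result))
    return failures
```

An empty list from run_gate is the green verdict. Anything else names the offending example and shows which phrase was missing or forbidden.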
09. Privacy And Sampling
To keep costs in check, the system has to decide which traces are worth recording and which it can let go. The default strategy is head sampling, which reads a fixed ratio from an environment variable and makes its keep or drop call right at the start of a trace. That timing is what makes it cheap, and it is also its weakness, because a decision made up front can throw away a trace that only fails much later. A richer alternative, tail sampling, would buffer the entire trace and decide once the outcome is known, but that path is simply not wired into this codebase. A third option keeps everything carrying important tags, though a single leaky tag could send costs through the roof.
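Head sampling can be sketched as a ratio read from the environment plus a decision made from the trace identifier alone. The variable name TRACE_SAMPLE_RATIO and both function names are hypothetical. Comparing the low 64 bits of the identifier against a bound is one common way to make the call deterministic; it is an assumption that this codebase does the same.

```python
import os


def sample_ratio(env=None, var="TRACE_SAMPLE_RATIO", default=1.0):
    """Read the keep ratio from the environment, clamped to [0, 1]."""
    env = os.environ if env is None else env
    try:
        ratio = float(env.get(var, default))
    except ValueError:
        return default
    return min(1.0, max(0.0, ratio))


def keep_trace(trace_id, ratio):
    """Decide at the start of a trace, using only its identifier."""
    bound = int(ratio * (1 << 64))
    low_bits = trace_id & ((1 << 64) - 1)
    return low_bits < bound
```

Because the decision depends only on the identifier, every service that sees the same trace reaches the same verdict. It also shows the weakness plainly: nothing about the eventual outcome is available when the call is made.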
Whatever the sampling verdict, telemetry always fails open. When something goes wrong the offending span is dropped, yet the graph response still completes in full, so a tracing error can never crash a real run. Every span that does survive carries cost attributes like the model name and the token totals, all stamped on from the graph's terminal node. The system also records the LangSmith run identifier so that human feedback arriving months later can still be pinned to the original call.
Per-assistant overrides round out the picture. A high-stakes assistant such as email compose is pinned to always-on, while a bulk assistant flips down to a low ratio, and today that tuning is enforced by hand through environment variables. The codebase owns head sampling outright, whereas tail and tag-based retention remain configuration choices made out of band. The one absolute rule never bends, though. Tracing may lose a span, but the user always gets their answer.
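One way to picture the per-assistant tuning is a lookup that prefers an assistant-specific variable and falls back to the global ratio. The naming scheme, the assistant names in the usage note, and the fallback of keeping everything when nothing is set are assumptions made for this sketch only.

```python
def ratio_for(assistant, env, base_var="TRACE_SAMPLE_RATIO"):
    """Resolve a sampling ratio, preferring an assistant-specific override."""
    specific = f"{base_var}_{assistant.upper().replace('-', '_')}"
    for var in (specific, base_var):
        raw = env.get(var)
        if raw is None:
            continue
        try:
            return min(1.0, max(0.0, float(raw)))
        except ValueError:
            continue  # ignore a malformed value and try the next source
    return 1.0  # assumed fallback: keep every trace
```

With a global ratio of 0.1 and TRACE_SAMPLE_RATIO_EMAIL_COMPOSE set to 1.0, the email compose assistant stays always-on while every other assistant samples at ten percent.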