LangSmith Observability
🔭 One pane for what the agents did, whether it worked, and what it cost — built up from native run tracing to the "fully in LangSmith" distributed trace, cost attribution, SLOs, and evals.
LangSmith is the observability plane of the agentic-sales platform — the one place that answers three operational questions for every agent run: what did the agents actually do, did it work, and what did it cost? Native run tracing is on by default; on top of it the platform now routes its whole distributed trace into LangSmith as a single pane.
This page is a reference, built from the ground up: first the two tracing layers (native LangChain runs plus the explicit agent/tool spans for loops LangChain can't see), then the "fully in LangSmith" OpenTelemetry default that makes LangSmith the OTLP backend, then the trace↔run cross-link, per-feature cost attribution, the SLO / burn-rate dashboards, and finally the datasets + ≥0.80 eval gate. For the runtime those traces describe, see the LangGraph primer; for the broader observability story, the observability deep-dive.
Each piece leads with a plain-language ELI5, then the system-design detail, then the real code it comes from. Every claim is grounded in the actual agentic-sales backend source — otel_setup.py, langsmith_setup.py, cost_telemetry.py, tracing_headers.py, and run_evals.py — not paraphrased from memory.
Explain it like I'm 5
Imagine every sales workflow is a worker who does a job, then drops a detailed receipt in a shared box. The receipt says which steps ran, what the model was asked and answered, how long it took, whether it failed, and how much it cost. LangSmith is that box plus a viewer: you open one run and see the whole story as a tree. Two switches matter. The first (LANGSMITH_TRACING) turns the receipts on — it's on by default. The second makes sure that all the receipts from one user action — the web request, the database reads, the graph, and every model call — get stapled together into one story instead of scattered into separate piles. That stapling is what "fully in LangSmith" does.
The system-design view
Tracing rides on two independent switches, both safe-by-default. LANGSMITH_TRACING=true (the default) turns on LangChain's native run tracing, which auto-captures every ChatOpenAI call — prompts, completions, tokens, model, latency — and ships it to the agentic-sales project (default_project() in langsmith_setup.py). That alone gives a per-call view but leaves gaps: the platform's JSON-router agent loops dispatch tools outside any LangChain call, so native tracing can't see the tool step. agent_run_span and tool_call_span close that gap by emitting explicit chain and tool runs, so an operator sees the full LLM↔tool loop as one nested tree.
The second layer is OpenTelemetry. Historically OTel was a separate, opt-in path; now otel_setup._apply_langsmith_otel_defaults() makes LangSmith the OTLP backend automatically. When LangSmith is configured and no explicit OTLP endpoint is set, it points the exporter at LangSmith's OTLP ingest, forces LANGSMITH_OTEL_ENABLED=true so LangChain co-emits its runs as OTel spans, and defaults LANGSMITH_OTEL_ONLY=true (via setdefault, so an explicit operator value wins) so those runs travel the OTel path only. Without that second flag, native export and OTel export would both land in the same project and double-record every LLM run. The result is a single distributed trace: web HTTP → graphql.<op> → db.* → langgraph.<assistant> → ChatOpenAI, all under one trace_id, all in LangSmith. The same derivation is mirrored on the web runtime in instrumentation.ts.
On top of the trace, two more planes hang off the same span attributes. Cost attribution (cost_telemetry.py) stamps every graph run with GenAI-semantic-convention token/model attributes plus agentic_sales.* dimensions (graph, feature, vertical, status, cost) so spend is sliceable by product pillar. SLOs (otel_setup.py) define success-rate and latency objectives over those spans with multi-window burn-rate alerts. And the offline eval gate (run_evals.py) holds every prompt/model change to a ≥0.80 accuracy bar, with golden datasets versioned in LangSmith.
The design trades single-vendor lock-in for a layered, override-friendly default: LangSmith is the zero-config destination, but pointing OTEL_EXPORTER_OTLP_ENDPOINT at a collector fans the same spans out to Tempo/Honeycomb while native LangSmith export stays on. Everything is a strict no-op when neither LangSmith nor an OTLP endpoint is configured.
In depth, piece by piece
Each piece: the plain-language take, the system-design detail, then the real code.
Two tracing layers — native runs and explicit spans
In plain terms. LangChain already photographs every model call for free. But when the agent decides which tool to run with plain Python (not a model call), there's no photo of that decision. So the platform adds its own camera around the whole "think → act → observe" loop and around each tool call, so the album shows the full story, not just the model snapshots.
System design. LANGSMITH_TRACING=true enables LangChain auto-instrumentation, which captures every ChatOpenAI.ainvoke as an llm run. The agentic-sales agent loops (e.g. agentic_search_graph._tool_loop) parse {"tool", "args"} JSON and execute Python directly, which is invisible to that auto-instrumentation. langsmith_setup.agent_run_span wraps the entire loop as one parent chain run whose outputs carry the final answer, step count, accumulated tokens, and cost; tool_call_span wraps each dispatch as a child tool run capturing the args, the result (or error), the attempt number, and latency_ms. Both are strict no-ops when LANGSMITH_TRACING isn't true or LANGSMITH_API_KEY is missing, and both truncate any single value to LANGSMITH_TRACE_VALUE_CAP chars (default 4000) to bound payload size and PII surface.
with agent_run_span(
    "agent_run:agentic_search",
    metadata={"question": q},
    tags=["provider:deepseek-deep", "max_turns:8"],
    vertical="saas",
) as run:
    for turn in range(max_turns):
        # tool and args come from the parsed {"tool", "args"} JSON (parsing elided)
        with tool_call_span(tool, args, attempt=turn + 1) as finish:
            try:
                observation = execute_tool(tool, args, root)
                finish(result=observation)
            except Exception as exc:
                finish(error=exc)  # record the failure on the tool run, then re-raise
                raise
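To make the span contract concrete, here is a minimal sketch of how a tool span with a finish callback, a value cap, and a no-op guard could be built with the standard library alone. It is not the code in langsmith_setup.py: tool_call_span_sketch, RECORDED_SPANS, and _cap are hypothetical names, and the in-memory list stands in for the LangSmith client.

```python
import os
import time
from contextlib import contextmanager
from typing import Any, Callable, Dict, Iterator, List, Optional

# Hypothetical stand-in for the LangSmith client: finished spans land here.
RECORDED_SPANS: List[Dict[str, Any]] = []


def _cap(value: Any, limit: int) -> str:
    """Stringify a value and truncate it to at most `limit` characters."""
    text = str(value)
    return text if len(text) <= limit else text[:limit]


def _tracing_enabled() -> bool:
    """Documented guard: the tracing flag is true AND an API key is present."""
    return (
        os.environ.get("LANGSMITH_TRACING", "").lower() == "true"
        and bool(os.environ.get("LANGSMITH_API_KEY"))
    )


@contextmanager
def tool_call_span_sketch(
    tool: str, args: Dict[str, Any], attempt: int = 1
) -> Iterator[Callable[..., None]]:
    """Yield a finish(result=..., error=...) callback; record one span on exit."""
    if not _tracing_enabled():
        yield lambda **_kwargs: None  # no-op: nothing is measured or recorded
        return

    limit = int(os.environ.get("LANGSMITH_TRACE_VALUE_CAP", "4000"))
    outcome: Dict[str, Any] = {}

    def finish(result: Any = None, error: Optional[BaseException] = None) -> None:
        if error is not None:
            outcome["error"] = _cap(repr(error), limit)
        else:
            outcome["result"] = _cap(result, limit)

    started = time.perf_counter()
    try:
        yield finish
    finally:
        # Runs on success and on failure, so latency is always captured.
        RECORDED_SPANS.append({
            "run_type": "tool",
            "name": tool,
            "args": _cap(args, limit),
            "attempt": attempt,
            "latency_ms": round((time.perf_counter() - started) * 1000, 3),
            **outcome,
        })
```

Recording in the finally clause is the design point: a tool that raises still produces a span with its error and latency, and the exception continues to propagate to the caller.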
Fully in LangSmith — the OpenTelemetry default
In plain terms. Without this, you'd get two albums: LangSmith for the model calls, and some other tool for the web/database steps — with no way to flip from a slow page to the exact run that caused it. This makes LangSmith the single album: every step of one user action lands in the same trace, and you didn't have to set a single flag to get it.
System design. init_otel() calls _apply_langsmith_otel_defaults() before it checks whether tracing is enabled. When LANGSMITH_TRACING=true, LANGSMITH_API_KEY is set, and no explicit OTEL_EXPORTER_OTLP_ENDPOINT is configured, it (1) forces LANGSMITH_OTEL_ENABLED=true so LangChain co-emits runs as OTel spans under the propagated trace context; (2) defaults LANGSMITH_OTEL_ONLY=true so those runs travel the OTel path only, avoiding the double-record that targeting LangSmith with both native and OTel export would cause; and (3) sets the traces endpoint to ${LANGSMITH_ENDPOINT}/otel/v1/traces plus x-api-key / Langsmith-Project headers. An operator who wants out sets LANGSMITH_OTEL_ENABLED=false (keeps native tracing, drops OTel); one who wants to fan out elsewhere sets OTEL_EXPORTER_OTLP_ENDPOINT explicitly (native LangSmith export stays on, OTel goes to the other backend, no double-record since the sinks differ). The langsmith SDK detects the process-global TracerProvider that init_otel sets at import time and exports through it. Metrics stay off on this path because LangSmith ingests traces only.
os.environ["LANGSMITH_OTEL_ENABLED"] = "true"
# No explicit endpoint -> LangSmith is the single OTLP backend. Dedupe by
# routing LangChain runs through OTel only, then point the exporter at it.
os.environ.setdefault("LANGSMITH_OTEL_ONLY", "true")
os.environ["OTEL_EXPORTER_OTLP_TRACES_ENDPOINT"] = _langsmith_otel_traces_endpoint()
os.environ["OTEL_EXPORTER_OTLP_HEADERS"] = f"x-api-key={api_key},Langsmith-Project={project}"
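The decision rules above can be sketched as a pure function over an environment mapping, which makes the three cases (LangSmith only, explicit collector, opt-out) easy to test. This is an illustration, not the body of _apply_langsmith_otel_defaults(): derive_otel_defaults is a hypothetical name, the fallback endpoint and project values are assumptions, and so is the ordering in which the opt-out is checked.

```python
from typing import Dict, Mapping

# Assumed fallbacks for illustration; the real values live in the backend source.
ASSUMED_LANGSMITH_ENDPOINT = "https://api.smith.langchain.com"
ASSUMED_PROJECT = "agentic-sales"


def derive_otel_defaults(env: Mapping[str, str]) -> Dict[str, str]:
    """Return the env overrides the documented rules imply, without mutating env."""
    langsmith_on = (
        env.get("LANGSMITH_TRACING", "").lower() == "true"
        and bool(env.get("LANGSMITH_API_KEY"))
    )
    if not langsmith_on:
        return {}  # LangSmith not configured: nothing to derive
    if env.get("LANGSMITH_OTEL_ENABLED", "").lower() == "false":
        return {}  # assumed opt-out check: keep native tracing, drop OTel

    overrides = {"LANGSMITH_OTEL_ENABLED": "true"}
    if env.get("OTEL_EXPORTER_OTLP_ENDPOINT"):
        # Explicit collector: OTel goes there, native LangSmith export stays on.
        return overrides

    base = env.get("LANGSMITH_ENDPOINT", ASSUMED_LANGSMITH_ENDPOINT).rstrip("/")
    project = env.get("LANGSMITH_PROJECT", ASSUMED_PROJECT)
    if "LANGSMITH_OTEL_ONLY" not in env:  # setdefault semantics
        overrides["LANGSMITH_OTEL_ONLY"] = "true"
    overrides["OTEL_EXPORTER_OTLP_TRACES_ENDPOINT"] = f"{base}/otel/v1/traces"
    overrides["OTEL_EXPORTER_OTLP_HEADERS"] = (
        f"x-api-key={env['LANGSMITH_API_KEY']},Langsmith-Project={project}"
    )
    return overrides
```

Keeping the derivation free of side effects is a choice made for the sketch; the excerpt above writes to os.environ directly.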
The trace ↔ run cross-link
In plain terms. Two ID stickers ride back on every backend response: one points at the distributed trace, one is a deep link straight to the LangSmith run. So from any web request you can jump to its run, and from a run back to its trace — no hunting.
System design. TraceLinkHeadersMiddleware (tracing_headers.py) is a pure-ASGI middleware on the LangGraph entrypoint that stamps up to three response headers on every /runs/wait response: x-trace-id (the active OTel span's 32-hex trace id, reliable whenever OTel is armed), and the best-effort x-langsmith-run-id + x-langsmith-run-url (the LangSmith root run id from get_current_run_tree(), plus a deep link, present only when a run tree is live in the request context). The TypeScript client reads them back onto its langgraph.<assistant_id> span as langsmith.run_id, langsmith.run_url, and peer.trace_id. The middleware is a strict no-op when neither subsystem is active (the response is byte-identical), and every read is wrapped so a telemetry failure can never break a graph response.
def build_link_headers() -> dict[str, str]:
    headers: dict[str, str] = {}
    trace_id = _otel_trace_id_hex()  # active OTel span -> 032x hex
    if trace_id:
        headers["x-trace-id"] = trace_id
    run_id = _langsmith_root_run_id()  # get_current_run_tree().trace_id
    if run_id:
        headers["x-langsmith-run-id"] = run_id
        url = _langsmith_run_url(run_id)  # deep link into the LangSmith UI
        if url:
            headers["x-langsmith-run-url"] = url
    return headers
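For readers unfamiliar with pure-ASGI middleware, the following sketch shows the general technique of appending headers by wrapping the send callable. It is not the TraceLinkHeadersMiddleware source: the class name and the injected build_headers callable are hypothetical, and it applies to every HTTP response rather than only /runs/wait.

```python
import asyncio
from typing import Any, Awaitable, Callable, Dict

Scope = Dict[str, Any]
Message = Dict[str, Any]
Receive = Callable[[], Awaitable[Message]]
Send = Callable[[Message], Awaitable[None]]


class LinkHeadersMiddlewareSketch:
    """Append link headers to the response start message; never break the response."""

    def __init__(
        self,
        app: Callable[[Scope, Receive, Send], Awaitable[None]],
        build_headers: Callable[[], Dict[str, str]],
    ) -> None:
        self.app = app
        self.build_headers = build_headers

    async def __call__(self, scope: Scope, receive: Receive, send: Send) -> None:
        if scope.get("type") != "http":
            # Lifespan and websocket traffic passes through untouched.
            await self.app(scope, receive, send)
            return

        async def send_with_links(message: Message) -> None:
            if message.get("type") == "http.response.start":
                try:
                    extra = self.build_headers()
                except Exception:
                    extra = {}  # telemetry failure: forward the response as-is
                if extra:
                    headers = list(message.get("headers", []))
                    headers.extend(
                        (key.encode("latin-1"), value.encode("latin-1"))
                        for key, value in extra.items()
                    )
                    message = {**message, "headers": headers}
            await send(message)

        await self.app(scope, receive, send_with_links)
```

When build_headers returns an empty dict, the original message object is forwarded unchanged, which is one way to get the byte-identical behaviour the text describes.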
Cost & feature attribution
In plain terms. Every run's receipt also records the bill: how many tokens, which model, how much money — tagged with which part of the business it was for (lead-gen, outreach, research…). So "what are we spending per pillar?" is one filter, not a spreadsheet exercise. And it works even with no fancy backend, because the same numbers also print to the logs.
System design. Every graph terminal node calls graph_meta(graph=…, totals=compute_totals(…)), which routes through cost_telemetry.annotate_current_span() to stamp the active OTel span (the FastAPI server span, so cost rides the request trace). Attributes follow the GenAI semantic conventions where one exists (gen_ai.usage.{input,output,total}_tokens, gen_ai.request.model) and use the agentic_sales.* namespace otherwise: graph, feature (the product pillar: lead_gen, email_outreach, research_authoring, job_application, sales_tech_intel, or uncategorized), vertical, status (ok/error; many graphs return a logical failure with HTTP 200, so this is the only way to slice failed runs and the cost wasted on them), cost_usd, llm_latency_ms, llm_calls, and llm_retries (only when non-zero). The call is safe with tracing off: get_current_span() returns the invalid sentinel, whose set_attribute is a no-op. Independently, record_graph_cost emits one structured INFO line per run via the agentic_sales.cost logger, the only cost sink that works with no OTLP endpoint and no LangSmith, and scripts/cost_dashboard.py rolls either source up into a spend-by-pillar table.
attrs = {
    "agentic_sales.graph": graph,
    "agentic_sales.feature": feature or feature_for_graph(graph),  # product pillar
    "agentic_sales.vertical": vertical or "",  # never null
    "agentic_sales.status": "error" if error else "ok",
}
if model:
    attrs["gen_ai.request.model"] = model
attrs["gen_ai.usage.input_tokens"] = int(in_tok)
attrs["gen_ai.usage.output_tokens"] = int(out_tok)
attrs["agentic_sales.cost_usd"] = float(cost)  # group_by feature in the OTLP backend
# Always-on cost log (works with no OTLP + no LangSmith):
# cost graph=<g> feature=<pillar> vertical=<v> status=<ok|error> \
# cost_usd=<n> tokens=<n> calls=<n> model=<m>
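As a rough illustration of the log-based rollup, the sketch below parses key=value cost lines in the format shown in the comment above and sums cost_usd per feature. It is not scripts/cost_dashboard.py: spend_by_pillar is a hypothetical name, and it assumes each log record sits on one line with no spaces inside values.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable

# key=value pairs as they appear in the documented cost log line format.
_PAIR = re.compile(r"(\w+)=(\S*)")


def spend_by_pillar(lines: Iterable[str]) -> Dict[str, float]:
    """Sum cost_usd per feature across cost log lines; skip unparseable lines."""
    totals: Dict[str, float] = defaultdict(float)
    for line in lines:
        fields = dict(_PAIR.findall(line))
        if "cost_usd" not in fields:
            continue  # not a cost line
        try:
            cost = float(fields["cost_usd"])
        except ValueError:
            continue  # malformed number: ignore rather than fail the rollup
        # An empty or missing feature falls back to the documented catch-all.
        totals[fields.get("feature") or "uncategorized"] += cost
    return dict(totals)
```

Filtering the same lines on status=error before summing would give the cost wasted on failed runs, since that field is present on every line.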
SLOs & burn-rate dashboards
In plain terms. Spans aren't just a story — they're a heartbeat. The platform sets target reliability and speed for each layer, then watches two clocks: a fast one that screams within an hour if things break hard, and a slow one that catches a quiet leak over six. It only pages when both agree, so you're not woken by noise.
System design. otel_setup.SLO_DEFINITIONS defines success-rate and latency SLOs over the span types the plane already emits — GraphQL operations (≥99.5% success), LangGraph runs (≥98%, using agentic_sales.status=ok), and DeepSeek LLM calls (≥99%, plus p95/p99 latency bars). BURN_RATE_WINDOWS pairs a fast 1-hour window (burn-rate ≥ 14.4×) with a slow 6-hour window (≥ 6.0×) and a 3-day ticket window (≥ 1.0×), following the Google SRE multi-window approach: fire only when the fast and slow windows both breach, cutting false positives. slo_burn_rate_spec() returns the whole spec (SLOs + windows + vendor-neutral Grafana/Honeycomb panel hints) as a plain dict, with an active flag that guards on OTEL_EXPORTER_OTLP_ENDPOINT so the panels are only meaningful with an OTLP backend. No new instrumentation — these are window queries over existing span attributes.
SLO_DEFINITIONS = [
    {"slo_id": "graphql_success_rate", "target": 0.995, "metric": "success_rate"},
    {"slo_id": "langgraph_success_rate", "target": 0.98, "metric": "success_rate"},
    {"slo_id": "llm_call_success_rate", "target": 0.99, "metric": "success_rate"},
    # + graphql_latency_p99, langgraph_latency_p95, llm_call_latency_p95
]
# Multi-window AND: page only when the 1h fast window (>=14.4x) AND the 6h slow
# window (>=6.0x) both breach the same SLO.
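The burn-rate arithmetic behind those thresholds is short enough to sketch. Burn rate is the observed error rate divided by the error budget (1 minus the target), so a value of 1.0 means the budget is being spent exactly on schedule. The function names and the good/total dict shape below are assumptions for illustration; only the 14.4 and 6.0 thresholds come from the text.

```python
from typing import Dict

FAST_WINDOW_BURN = 14.4  # 1-hour window threshold quoted above
SLOW_WINDOW_BURN = 6.0   # 6-hour window threshold quoted above


def burn_rate(good: int, total: int, target: float) -> float:
    """Observed error rate divided by the error budget (1 - target)."""
    if total == 0:
        return 0.0  # no traffic consumes no budget
    error_rate = 1.0 - good / total
    return error_rate / (1.0 - target)


def should_page(fast: Dict[str, int], slow: Dict[str, int], target: float) -> bool:
    """Multi-window AND: page only when both windows breach their thresholds."""
    return (
        burn_rate(fast["good"], fast["total"], target) >= FAST_WINDOW_BURN
        and burn_rate(slow["good"], slow["total"], target) >= SLOW_WINDOW_BURN
    )
```

For the LangGraph SLO (target 0.98, budget 0.02), a 1-hour window with 40% failed runs burns at roughly 20×, which breaches the fast threshold; the page still waits on the 6-hour window breaching 6.0×.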
Datasets & the ≥0.80 eval gate
In plain terms. Before any prompt or model change ships, it has to pass an open-book exam graded against saved "right answers." If the class average drops below 80%, the change is blocked. The exams (datasets) live in LangSmith so they're versioned and shared.
System design. scripts/run_evals.py maps eval coverage for every *_graph.py and enforces a ≥0.80 aggregate DeepEval gate (AGGREGATE_PASS = 0.80) — run via pnpm test:eval (uv run python scripts/run_evals.py --gate) after any LLM/prompt change. Golden datasets and prompt-version regression triggers are registered in eval/langsmith_datasets.py (e.g. the agentic-sales:campaign:final_response dataset backing the compose_touch prompt), so a prompt edit that regresses against the saved reference is caught. Batch/DB graphs that aren't single-row-invokable are covered as delegates. This is the "Eval-First" guarantee from the optimization strategy: every prompt/model change is tested against the ≥0.80 bar before it can ship.
pnpm test:eval
# → cd backend && uv run python scripts/run_evals.py --gate
# AGGREGATE_PASS = 0.80 — the aggregate DeepEval accuracy bar every change must clear
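The gate itself reduces to a threshold check that turns into a process exit code. The sketch below assumes the aggregate is the unweighted mean of per-graph scores, which the text does not confirm; gate_exit_code and the score mapping are hypothetical, and only AGGREGATE_PASS = 0.80 comes from run_evals.py.

```python
from statistics import fmean
from typing import Mapping

AGGREGATE_PASS = 0.80  # the bar quoted above


def gate_exit_code(scores: Mapping[str, float], threshold: float = AGGREGATE_PASS) -> int:
    """Return 0 when the mean score clears the threshold, 1 otherwise."""
    if not scores:
        return 1  # no results are treated as a failure in this sketch
    return 0 if fmean(scores.values()) >= threshold else 1
```

A CI step would pass the result to sys.exit, so a regression below the bar fails the pipeline rather than printing a warning.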
Privacy & sampling
In plain terms. Traces can contain real lead and contact text, so the platform caps how much of any value it ships and lets you sample down the routine traffic — while always keeping the traces that errored or ran slow, because those are the ones worth reading.
System design. Every tool span value is truncated to LANGSMITH_TRACE_VALUE_CAP (default 4000) chars; docs/PRIVACY.md notes LangSmith traces carry PII, so the cap bounds the leak surface. On the OTel side, head sampling is configured by OTEL_TRACES_SAMPLER (parentbased_traceidratio at 5% is the prod recommendation), and an in-process _TailSamplingExporter always promotes error and slow spans regardless of the head decision, so reducing volume never drops the interesting traces. The OTel sampling path is a no-op when the OTLP endpoint is unset.
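The interplay between head sampling and tail promotion can be sketched as a single decision function. This is an illustration of the idea, not the _TailSamplingExporter implementation: the function names are hypothetical, the 2000 ms slow threshold is an assumed value, and the ratio check is a simplified version of trace-id ratio sampling.

```python
def head_sampled(trace_id: int, ratio: float) -> bool:
    """Deterministic ratio sampling on the low 64 bits of the trace id."""
    bound = int(ratio * (1 << 64))
    return (trace_id & ((1 << 64) - 1)) < bound


def should_export(
    trace_id: int,
    *,
    is_error: bool,
    duration_ms: float,
    ratio: float = 0.05,      # the 5% prod recommendation from the text
    slow_ms: float = 2000.0,  # assumed threshold, not from the source
) -> bool:
    """Keep error and slow spans always; otherwise defer to the head decision."""
    if is_error or duration_ms >= slow_ms:
        return True
    return head_sampled(trace_id, ratio)
```

Because the head decision depends only on the trace id, every service that sees the same trace reaches the same verdict, while the error and latency checks override it for the spans worth reading.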
See also
- LangGraph primer — the runtime these traces describe: graphs, nodes, state, durable execution.
- Observability deep-dive — the distributed run tree that ties a single user action across the TS→Python hop.
- Evaluation & Feedback — the eval and feedback loop in its own right.
- Read the full transcript · listen to the audio guide.