Function calling has emerged as the foundational mechanism through which large language models interact with external systems, transforming LLMs from text generators into capable agents. This article examines the design of function calling APIs across major providers, the role of JSON Schema in tool definitions, parallel execution strategies, sandboxing considerations, and patterns for building reliable tool pipelines that handle errors gracefully and scale to complex workflows.
Before dedicated function calling APIs existed, developers resorted to prompt engineering: instructing the model to output JSON in a particular format, then parsing the result with fragile regex or string matching. This approach was error-prone, with models frequently producing malformed output, hallucinating function names, or embedding function calls within conversational text that resisted reliable extraction (see also Article 10: Structured Output for how constrained decoding solves the output formatting problem more broadly).
OpenAI's introduction of function calling in June 2023 marked a turning point. By moving tool definitions into a structured API parameter, the model could be trained to produce tool invocations as structured objects rather than freeform text. This architectural decision -- separating the tool invocation channel from the conversational channel -- solved the parsing problem and opened the door to reliable agent systems.
At the API level, function calling involves three phases: the developer sends tool definitions alongside the user's message; the model responds with zero or more structured tool invocations instead of (or alongside) plain text; and the runtime executes those invocations and returns the results so the model can produce its next response.
This creates a multi-turn conversation pattern where the model, runtime, and external systems collaborate:
from openai import OpenAI

client = OpenAI()

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get current weather for a location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {
                        "type": "string",
                        "description": "City and state, e.g. San Francisco, CA"
                    },
                    "unit": {
                        "type": "string",
                        "enum": ["celsius", "fahrenheit"]
                    }
                },
                "required": ["location"]
            }
        }
    }
]

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "What's the weather in Tokyo?"}],
    tools=tools
)

# Model returns: tool_calls=[{id: "call_abc", function: {name: "get_weather", arguments: '{"location": "Tokyo, Japan"}'}}]
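Completing the loop -- a minimal sketch of the second and third phases, assuming a local get_weather implementation:

import json

tool_call = response.choices[0].message.tool_calls[0]
args = json.loads(tool_call.function.arguments)
weather = get_weather(**args)  # your implementation of the tool

# Send the result back so the model can produce the final answer
messages = [
    {"role": "user", "content": "What's the weather in Tokyo?"},
    response.choices[0].message,  # the assistant turn containing tool_calls
    {"role": "tool", "tool_call_id": tool_call.id, "content": json.dumps(weather)},
]
final = client.chat.completions.create(model="gpt-4o", messages=messages, tools=tools)
print(final.choices[0].message.content)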
OpenAI's implementation introduced the tools parameter (replacing the earlier functions parameter) with support for multiple tool types. Key design decisions include:
- tool_choice: Controls whether the model must call a tool ("required"), can choose ("auto"), or must not ("none"). The {"type": "function", "function": {"name": "specific_function"}} form forces a specific tool.
- Strict mode: When "strict": true is set on a function definition, the model's output is guaranteed to conform to the provided JSON Schema via constrained decoding.

Strict mode deserves particular attention. Without it, models occasionally produce arguments that violate the schema -- missing required fields, wrong types, or extra properties. With strict mode, OpenAI uses constrained decoding (likely a context-free grammar or finite automaton approach) to ensure every generated token leads to valid JSON conforming to the schema. The tradeoff is a slight increase in first-token latency and restrictions on supported schema features (no oneOf, limited recursion).
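A sketch of the earlier get_weather definition adapted for strict mode. Note that strict schemas must list every property in "required" and set "additionalProperties" to false:

strict_tool = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "strict": True,  # enable constrained decoding
        "parameters": {
            "type": "object",
            "properties": {
                "location": {"type": "string"},
                "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]}
            },
            # Strict mode requires all properties to appear in "required"
            # and additionalProperties to be false
            "required": ["location", "unit"],
            "additionalProperties": False
        }
    }
}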
Anthropic's tool use API follows a similar pattern but with notable differences:
import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    tools=[{
        "name": "get_stock_price",
        "description": "Gets the current stock price for a given ticker symbol.",
        "input_schema": {
            "type": "object",
            "properties": {
                "ticker": {
                    "type": "string",
                    "description": "The stock ticker symbol, e.g. AAPL"
                }
            },
            "required": ["ticker"]
        }
    }],
    messages=[{"role": "user", "content": "What's Apple's stock price?"}]
)
Key differences from OpenAI:
- input_schema instead of parameters as the key for the JSON Schema definition.
- Tool calls arrive as tool_use content blocks within the assistant message, alongside optional text blocks. This allows the model to explain its reasoning before or after tool invocations.
- Results are sent back as user messages containing tool_result content blocks with a matching tool_use_id.

Gemini's function calling API introduces additional concepts:

- Mode control via function_calling_config: AUTO, ANY (must call at least one function), and NONE.
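For example, forcing Gemini to call at least one function -- a sketch using the google-generativeai SDK, where get_weather_declaration stands in for a function declaration and exact field names may vary by SDK version:

import google.generativeai as genai

model = genai.GenerativeModel("gemini-1.5-pro", tools=[get_weather_declaration])
response = model.generate_content(
    "What's the weather in Tokyo?",
    # ANY: the model must call at least one of the provided functions
    tool_config={"function_calling_config": {"mode": "ANY"}},
)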
JSON Schema (drafts 2020-12 and earlier) has become the de facto standard for defining tool interfaces. This choice is pragmatic: JSON Schema is widely adopted, has good tooling support, and maps naturally to the structured output that LLMs need to produce.
The quality of tool definitions directly impacts the model's ability to use them correctly. Several principles emerge from practice:
Descriptive field names and descriptions matter enormously. The model uses these as semantic cues. Compare:
{
  "name": "q",
  "description": "Query parameter",
  "type": "string"
}
versus:
{
  "name": "search_query",
  "description": "The search query string. Supports boolean operators (AND, OR, NOT) and phrase matching with quotes. Example: 'machine learning AND \"neural networks\"'",
  "type": "string"
}
The second definition gives the model enough context to construct effective queries without additional prompting.
Enums constrain the output space. When a parameter has a known set of valid values, using "enum" prevents hallucination of invalid options and enables constrained decoding optimizations:
{
  "name": "priority",
  "type": "string",
  "enum": ["low", "medium", "high", "critical"],
  "description": "Task priority level"
}
Nested objects model complex inputs. Real-world tools often need structured inputs:
{
  "name": "create_calendar_event",
  "parameters": {
    "type": "object",
    "properties": {
      "title": {"type": "string"},
      "time_range": {
        "type": "object",
        "properties": {
          "start": {"type": "string", "format": "date-time"},
          "end": {"type": "string", "format": "date-time"}
        },
        "required": ["start", "end"]
      },
      "attendees": {
        "type": "array",
        "items": {"type": "string", "format": "email"}
      }
    },
    "required": ["title", "time_range"]
  }
}
Not all JSON Schema features are supported uniformly. $ref for recursive schemas, oneOf/anyOf for union types, and patternProperties have inconsistent support. When strict mode is required, developers must often flatten schemas and use simpler constructs.
A common pattern for handling polymorphic inputs is the discriminated union:
{
  "type": "object",
  "properties": {
    "action_type": {"type": "string", "enum": ["send_email", "create_task"]},
    "email_to": {"type": "string"},
    "email_body": {"type": "string"},
    "task_title": {"type": "string"},
    "task_due_date": {"type": "string"}
  },
  "required": ["action_type"]
}
This is less elegant than a proper union type but works reliably across providers.
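Because the flat schema cannot express "email_to is required when action_type is send_email", a runtime check typically backs it up. A sketch, with hypothetical field groupings:

# Fields the flat schema cannot conditionally require (hypothetical grouping)
REQUIRED_BY_ACTION = {
    "send_email": ["email_to", "email_body"],
    "create_task": ["task_title"],
}

def missing_fields(args: dict) -> list[str]:
    """Return any fields required by the chosen action_type but absent."""
    needed = REQUIRED_BY_ACTION[args["action_type"]]
    return [field for field in needed if not args.get(field)]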
When a user request requires multiple independent pieces of information, sequential tool calls waste time and tokens. Parallel function calling allows the model to emit multiple tool calls in a single response:
# Model response might contain:
# tool_calls = [
#   {id: "call_1", function: {name: "get_weather", arguments: '{"location": "Tokyo"}'}},
#   {id: "call_2", function: {name: "get_weather", arguments: '{"location": "London"}'}},
#   {id: "call_3", function: {name: "get_exchange_rate", arguments: '{"from": "JPY", "to": "GBP"}'}}
# ]
import asyncio
import json

async def execute_tool_calls(tool_calls):
    # tool_registry maps tool names to async callables
    tasks = []
    for call in tool_calls:
        func = tool_registry[call.function.name]
        args = json.loads(call.function.arguments)
        tasks.append(asyncio.create_task(func(**args)))
    return await asyncio.gather(*tasks)
The runtime executes all three calls concurrently, then sends all results back in a single message. This reduces the number of LLM round-trips and can significantly decrease end-to-end latency.
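Continuing the OpenAI example, each result goes back as its own tool message in a single follow-up request -- a sketch, assuming messages holds the conversation so far:

assistant_message = response.choices[0].message
results = await execute_tool_calls(assistant_message.tool_calls)

messages.append(assistant_message)
for call, result in zip(assistant_message.tool_calls, results):
    messages.append({
        "role": "tool",
        "tool_call_id": call.id,
        "content": json.dumps(result),
    })
followup = client.chat.completions.create(model="gpt-4o", messages=messages, tools=tools)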
However, parallel calling introduces ordering considerations. The model must determine which calls are independent (can be parallelized) versus dependent (must be sequential). In practice, models handle this well for obvious cases but can sometimes attempt to parallelize calls that have implicit dependencies.
When function calls interact with external systems, security becomes paramount. Several approaches to sandboxing exist:
The simplest approach is restricting which tools are available based on the context:
def get_tools_for_context(user_role, context):
    base_tools = [search_tool, calculator_tool]
    if user_role == "admin":
        base_tools.extend([delete_tool, modify_tool])
    if context == "readonly":
        base_tools = [t for t in base_tools if t.metadata.get("safe", False)]
    return base_tools
For high-stakes operations, require user confirmation before execution (see Article 12: Adversarial Prompting for why this is especially critical when user inputs may contain injected tool instructions):
async def execute_with_confirmation(tool_call, user_session):
    risk_level = assess_risk(tool_call)
    if risk_level == "high":
        approved = await user_session.request_confirmation(
            f"Allow {tool_call.function.name} with args {tool_call.function.arguments}?"
        )
        if not approved:
            return {"error": "User denied execution", "tool_call_id": tool_call.id}
    return await execute(tool_call)
For code execution tools, running in isolated containers (e.g., E2B, Docker, gVisor) prevents the agent from affecting the host system:
import e2b

# Illustrative; the exact e2b SDK surface (package name, Sandbox methods)
# varies by version
sandbox = e2b.Sandbox()
result = sandbox.run_code(generated_code, language="python", timeout=30)
Robust tool pipelines must handle failures at every level. A taxonomy of tool call errors includes validation errors, where the model's arguments violate the schema, and execution errors, where the tool itself fails at runtime.
Validation errors: the model produces arguments that don't match the schema. With strict mode this is prevented; without it, validate before execution:
import json
import jsonschema

def validate_and_execute(tool_call, tool_definitions):
    tool_def = tool_definitions[tool_call.function.name]
    args = json.loads(tool_call.function.arguments)
    try:
        jsonschema.validate(args, tool_def["parameters"])
    except jsonschema.ValidationError as e:
        return {
            "error": f"Invalid arguments: {e.message}",
            "tool_call_id": tool_call.id
        }
    return execute(tool_call)
Execution errors: the function itself fails (API timeout, rate limit, invalid input that passed schema validation):
import time

def execute_with_retry(func, args, max_retries=3):
    for attempt in range(max_retries):
        try:
            result = func(**args)
            return {"success": True, "data": result}
        except RateLimitError:  # from the underlying client library
            time.sleep(2 ** attempt)  # exponential backoff
        except Exception as e:
            return {"success": False, "error": str(e)}
    return {"success": False, "error": "Max retries exceeded"}
A critical pattern is returning error information to the model so it can self-correct:
tool_result_message = {
    "role": "tool",
    "tool_call_id": tool_call.id,
    "content": json.dumps({
        "error": "FileNotFoundError: /data/report.csv not found",
        "suggestion": "Available files: /data/report_2024.csv, /data/report_2023.csv"
    })
}
# The model can then retry with the correct filename
This error-retry loop is one of the most powerful patterns in agent systems and forms the backbone of the ReAct loop discussed in Article 26: Agent Architectures. Research from UC Berkeley (Patil et al., "Gorilla: Large Language Model Connected with Massive APIs," 2023) shows that models can effectively recover from errors when given clear error messages and contextual hints.
As tool calls accumulate, the conversation context grows. Each tool call adds the assistant's invocation and the tool's result to the message history. For complex workflows with many tools, this can quickly consume the context window.
For tools that return large payloads, summarize before injecting:
def inject_tool_result(raw_result, max_tokens=500):
    result_str = json.dumps(raw_result)
    # estimate_tokens is an assumed tokenizer-based helper
    if estimate_tokens(result_str) > max_tokens:
        # Truncate or summarize
        if isinstance(raw_result, list):
            return json.dumps({
                "count": len(raw_result),
                "first_5": raw_result[:5],
                "note": f"Showing 5 of {len(raw_result)} results"
            })
        return result_str[:max_tokens * 4]  # rough chars-per-token estimate
    return result_str
For long-running agents, maintain only recent tool interactions in the context while summarizing earlier ones:
def manage_tool_history(messages, max_tool_pairs=10):
    # extract_tool_call_result_pairs, summarize_tool_interactions, and
    # flatten are assumed helpers for pairing calls with their results
    # and compressing older interactions
    tool_pairs = extract_tool_call_result_pairs(messages)
    if len(tool_pairs) > max_tool_pairs:
        old_pairs = tool_pairs[:-max_tool_pairs]
        summary = summarize_tool_interactions(old_pairs)
        messages = [messages[0]]  # keep the system message
        messages.append({"role": "user", "content": f"Previous tool interactions summary: {summary}"})
        messages.extend(flatten(tool_pairs[-max_tool_pairs:]))
    return messages
Production tool systems require more than just the API call. A robust pipeline includes a central tool registry, middleware for cross-cutting concerns, structured logging, and cost controls:
import json

class ToolRegistry:
    def __init__(self):
        self._tools = {}
        self._middleware = []

    def register(self, name, func, schema, metadata=None):
        self._tools[name] = {
            "function": func,
            "schema": schema,
            "metadata": metadata or {}
        }

    def add_middleware(self, middleware_fn):
        self._middleware.append(middleware_fn)

    async def execute(self, tool_call):
        tool = self._tools.get(tool_call.function.name)
        if not tool:
            return {"error": f"Unknown tool: {tool_call.function.name}"}
        context = {"tool_call": tool_call, "tool": tool}
        for middleware in self._middleware:
            context = await middleware(context)
            if context.get("short_circuit"):
                return context["result"]
        args = json.loads(tool_call.function.arguments)
        return await tool["function"](**args)
Logging every tool call with timing, inputs, outputs, and errors is essential for debugging agent behavior:
import logging
import time

logger = logging.getLogger(__name__)

async def logging_middleware(context):
    tool_call = context["tool_call"]
    logger.info(f"Tool call: {tool_call.function.name}", extra={
        "tool_name": tool_call.function.name,
        "arguments": tool_call.function.arguments,
        "call_id": tool_call.id
    })
    return context

async def timing_middleware(context):
    context["start_time"] = time.time()
    return context
Tools that call paid APIs need rate limiting and cost tracking:
class CostTracker:
    def __init__(self, budget_per_session=1.0):
        self.total_cost = 0.0
        self.budget = budget_per_session

    async def cost_middleware(self, context):
        tool = context["tool"]
        estimated_cost = tool["metadata"].get("cost_per_call", 0.0)
        if self.total_cost + estimated_cost > self.budget:
            context["short_circuit"] = True
            context["result"] = {"error": "Budget exceeded for this session"}
        else:
            self.total_cost += estimated_cost  # record spend as calls proceed
        return context
While function calling APIs solved the problem of how a model invokes tools, each provider implemented its own schema format, transport mechanism, and tool lifecycle. Anthropic's Model Context Protocol (MCP), introduced in late 2024, addresses this fragmentation by defining an open standard for how AI applications discover, connect to, and interact with external tools and data sources.
MCP follows a client-server architecture inspired by the Language Server Protocol (LSP) from the IDE world. An MCP server exposes a set of capabilities -- tools, resources (readable data), and prompts (reusable templates) -- through a standardized JSON-RPC 2.0 transport. An MCP client (typically an AI application or agent framework) connects to one or more servers and presents their capabilities to the model as available tools.
The key insight is the separation of concerns: tool implementation lives in the MCP server, tool selection and invocation are handled by the model through the client, and the protocol layer handles discovery, schema negotiation, and transport.
# Example: MCP server exposing a database query tool
# (a sketch using the Python SDK's high-level FastMCP API; decorator
# names vary across SDK versions)
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("database-server")

@mcp.tool()
async def query_database(sql: str, database: str = "production") -> str:
    """Execute a read-only SQL query against the specified database.

    Args:
        sql: The SQL query to execute (SELECT only)
        database: Target database name
    """
    # The SDK generates the tool's JSON Schema from type hints and docstring;
    # db_pool and format_results are application-level helpers
    result = await db_pool.execute(sql, database=database)
    return format_results(result)

# Transport: stdio for local, SSE/HTTP for remote
mcp.run(transport="stdio")
Before MCP, integrating a tool required writing provider-specific adapters. A Slack integration for OpenAI's API looked different from one for Anthropic's API, even though the underlying capability was identical. MCP makes tools portable: a single MCP server for Slack works with any MCP-compatible client, whether that client uses Claude, GPT, or a local model.
MCP servers have been built for databases (Postgres, SQLite), file systems, version control (GitHub, GitLab), communication platforms (Slack, email), and development tools (Docker, Kubernetes). This emerging ecosystem means that agent builders can compose capabilities from pre-built servers rather than implementing each integration from scratch -- a pattern that mirrors how microservices compose into larger systems, as explored in Article 26: Agent Architectures.
Unlike static tool definitions passed in the API's tools parameter, MCP supports dynamic discovery. A client can query a server for its available tools at runtime, and servers can add or remove tools based on context. This is particularly powerful for enterprise environments where available tools change based on user permissions, deployment environment, or time of day.
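A sketch of runtime discovery using the official Python SDK's client interface (the server command is hypothetical):

import asyncio
from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

async def discover_tools():
    params = StdioServerParameters(command="python", args=["database_server.py"])
    async with stdio_client(params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            listing = await session.list_tools()  # query the server at runtime
            for tool in listing.tools:
                print(tool.name, "-", tool.description)

asyncio.run(discover_tools())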
Early function calling demos featured three to five tools. Production agent systems routinely expose fifty or more. At this scale, including every tool definition in every request becomes impractical: it consumes context window tokens, increases latency, confuses the model with irrelevant options, and degrades selection accuracy.
The most effective approach borrows from RAG: embed tool descriptions into a vector store and retrieve only the relevant tools for each query.
import numpy as np
from sentence_transformers import SentenceTransformer

class ToolSelector:
    def __init__(self, tools: list[dict]):
        self.encoder = SentenceTransformer("all-MiniLM-L6-v2")
        self.tools = tools
        descriptions = [
            f"{t['name']}: {t['description']}" for t in tools
        ]
        # Normalize so the dot product below is cosine similarity
        self.embeddings = self.encoder.encode(descriptions, normalize_embeddings=True)

    def select(self, query: str, top_k: int = 8) -> list[dict]:
        query_emb = self.encoder.encode([query], normalize_embeddings=True)
        scores = np.dot(self.embeddings, query_emb.T).flatten()
        top_indices = np.argsort(scores)[-top_k:][::-1]
        return [self.tools[i] for i in top_indices]

# Usage: only pass relevant tools to the model
selector = ToolSelector(all_tools)
relevant_tools = selector.select(user_message)
response = client.chat.completions.create(
    model="gpt-4o",
    messages=messages,
    tools=relevant_tools
)
This pattern reduces context usage while improving selection accuracy: models perform better when choosing from eight relevant tools than from sixty mostly-irrelevant ones.
An alternative to retrieval is a two-stage selection process. First, the model selects a tool category, then a second call with only that category's tools selects the specific tool:
categories = [
    {"name": "communication", "description": "Email, Slack, SMS, notifications"},
    {"name": "data_analysis", "description": "SQL queries, data visualization, statistics"},
    {"name": "file_management", "description": "Read, write, search, organize files"},
]

# Stage 1: select category (select_category wraps a lightweight model call)
category = select_category(user_query, categories)

# Stage 2: select and call specific tool from that category
# (tool_registry and call_model_with_tools are application helpers)
category_tools = tool_registry.get_tools(category)
response = call_model_with_tools(messages, category_tools)
This mirrors how humans navigate complex tool sets -- you open the right application before looking for the specific feature.
Context-based filtering reduces the tool set without semantic search. Tools can be tagged with metadata (required permissions, applicable domains, preconditions) and filtered before inclusion:
def filter_tools(all_tools, context):
    return [
        t for t in all_tools
        if context.user_role in t["metadata"].get("allowed_roles", ["all"])
        and t["metadata"].get("enabled", True)
        and context.environment in t["metadata"].get("environments", ["all"])
    ]
Evaluating tool selection accuracy at scale requires systematic benchmarks, as discussed in Article 30: Agent Evaluation.
Function calling traditionally means invoking structured APIs with typed parameters. Computer Use -- pioneered by Anthropic with Claude's computer use capability in 2024 -- introduces a fundamentally different class of tool interaction: the model controls a graphical user interface through screenshots, mouse clicks, and keyboard input.
Instead of calling a search_database(query) function, the model sees a screenshot of a desktop, identifies the browser's address bar, clicks on it, types a URL, reads the rendered page, and extracts information. The tool interface is minimal:
# Anthropic's computer use tool definition
computer_tool = {
    "type": "computer_20241022",
    "name": "computer",
    "display_width_px": 1024,
    "display_height_px": 768
}

# The model issues actions like:
# {"action": "left_click", "coordinate": [512, 384]}
# {"action": "type", "text": "quarterly revenue report"}
# {"action": "screenshot"} -- returns current screen state
# {"action": "key", "text": "Return"}
Computer Use is not a replacement for API-based function calling -- it is significantly slower and less reliable for tasks where a structured API exists. Its value lies in bridging the gap for applications that lack APIs: legacy enterprise software, desktop applications, web interfaces without public APIs, and workflows that span multiple GUI applications.
The interaction loop follows the same observe-act pattern as other agent architectures (see Article 26: Agent Architectures), but the observation is a screenshot and the action is a physical input event rather than a function call. This makes error detection harder -- the model must visually confirm that its action had the intended effect, introducing a new class of failure modes that pure API tools avoid.
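A sketch of one iteration of that loop against Anthropic's beta API, assuming an Anthropic client as in the earlier example; execute_action and take_screenshot_b64 are hypothetical local helpers:

response = client.beta.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    tools=[computer_tool],
    betas=["computer-use-2024-10-22"],
    messages=messages,
)
for block in response.content:
    if block.type == "tool_use":
        execute_action(block.input)             # replay the click/type/key event
        screenshot_b64 = take_screenshot_b64()  # observe the new screen state
        messages.append({
            "role": "user",
            "content": [{
                "type": "tool_result",
                "tool_use_id": block.id,
                "content": [{"type": "image", "source": {
                    "type": "base64", "media_type": "image/png",
                    "data": screenshot_b64}}],
            }],
        })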
GUI-based tool use amplifies security concerns because the model has broader access than any single API would grant. A model controlling a desktop could navigate to unintended applications, execute system commands through a terminal, or interact with sensitive interfaces. Sandboxing via virtual machines or containers, restricting the accessible screen region, and maintaining human-in-the-loop confirmation for destructive actions are essential safeguards (see Article 12: Adversarial Prompting for related attack vectors).
Standard function calling follows a request-response pattern: the model emits a complete tool call, the runtime executes it, and the result is returned. For long-running tools -- database queries over large datasets, multi-step API workflows, file processing -- the user sees nothing until execution completes. Streaming tool calls address this gap.
Most providers support streaming the assistant's response token-by-token, including the tool call arguments. This lets the client display the tool call as it forms:
stream = client.chat.completions.create(
    model="gpt-4o",
    messages=messages,
    tools=tools,
    stream=True
)

tool_call_chunks = {}
for chunk in stream:
    delta = chunk.choices[0].delta
    if delta.tool_calls:
        for tc in delta.tool_calls:
            idx = tc.index
            if idx not in tool_call_chunks:
                tool_call_chunks[idx] = {"name": "", "arguments": ""}
            if tc.function.name:
                tool_call_chunks[idx]["name"] += tc.function.name
            if tc.function.arguments:
                tool_call_chunks[idx]["arguments"] += tc.function.arguments
            # Display partial arguments to the user in real time
            # (display_partial_tool_call is an assumed UI helper)
            display_partial_tool_call(tool_call_chunks[idx])
For tools that produce incremental output (streaming database rows, reading a large file, multi-page web scraping), the runtime can stream partial results back to the user while accumulating the full result for the model:
import json

async def execute_streaming_tool(tool_call, user_stream):
    args = json.loads(tool_call.function.arguments)
    accumulated = []
    # streaming_query is an assumed async generator yielding partial results
    async for partial_result in streaming_query(**args):
        accumulated.append(partial_result)
        # Stream partial results to the user's UI immediately
        await user_stream.send({"partial": partial_result, "tool_call_id": tool_call.id})
    # Return the full result to the model for reasoning
    return {"complete": True, "data": accumulated}
This dual-channel approach -- streaming to the user for responsiveness while batching for the model -- significantly improves the perceived performance of tool-heavy agents.
Understanding how models are trained for function calling illuminates why they sometimes fail and how to work around limitations. Base language models have no inherent concept of tool invocation; this capability is added through supervised fine-tuning on curated datasets of tool-use conversations.
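A hypothetical sketch of what a single example in such a dataset looks like, using an OpenAI-style message format for illustration:

training_example = {
    "messages": [
        {"role": "user", "content": "What's the weather in Tokyo?"},
        {"role": "assistant", "tool_calls": [{
            "id": "call_1",
            "type": "function",
            "function": {"name": "get_weather",
                         "arguments": '{"location": "Tokyo, Japan"}'},
        }]},
        {"role": "tool", "tool_call_id": "call_1", "content": '{"temp_c": 18}'},
        {"role": "assistant", "content": "It's currently 18°C in Tokyo."},
    ]
}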
Several significant efforts have produced open training datasets and fine-tuned models for function calling:
Gorilla (Patil et al., 2023) trained a LLaMA-based model on API documentation from Torch Hub, TensorFlow Hub, and Hugging Face. Its key contribution was demonstrating that fine-tuning on API docs with a retrieval system drastically reduced hallucination of non-existent API parameters. Gorilla's insight that retrieval-augmented training improves tool accuracy directly influenced the retrieval-augmented tool selection patterns discussed earlier in this article.
ToolLLM (Qin et al., 2023) scaled to over 16,000 real-world REST APIs from RapidAPI. The researchers used ChatGPT to generate multi-tool usage scenarios, then trained on execution traces that included error recovery. ToolLLM demonstrated that training on realistic tool-use trajectories -- complete sequences of tool calls including failures and retries -- produced models far more robust than those trained only on single successful calls.
Glaive took a synthetic data approach, generating millions of function calling examples across diverse schemas to fine-tune models that generalize well to unseen tool definitions. This showed that schema diversity in training data matters more than volume.
These training approaches explain common failure modes. Models struggle with tools whose schemas differ significantly from patterns seen in training: unusual parameter names, deeply nested objects, or unconventional description formats. When a model repeatedly misuses a tool, the fix is often to redesign the schema to match patterns the model has been trained on -- shorter descriptions, flatter parameter structures, and conventional naming.
Models also exhibit training-data biases in tool selection. A model trained primarily on single-tool examples may under-utilize parallel calling. One trained on ReAct-style traces may add unnecessary reasoning steps before straightforward tool calls. Recognizing these biases helps when designing evaluation suites for tool-use accuracy, as discussed in Article 30: Agent Evaluation.
The function calling interface is deceptively simple -- a few API parameters and a JSON Schema. But building reliable, secure, and efficient tool pipelines on top of it requires careful engineering across validation, execution, error handling, and observability. Function calling is the connective tissue of agent systems: it bridges the gap between the model's reasoning (discussed in Article 26: Agent Architectures) and the structured outputs (discussed in Article 10: Structured Output) that make those systems reliable.