
MCP Performance: What Slow Tools Cost You

MCPBundles · 7 min read

Our bundle search tool was taking 12 seconds at P95. Users would ask Claude to "find the Slack integration," watch nothing happen, then ask again. Claude would make the same call twice, wait 24 seconds total, and users would close the tab thinking our service was down.

The tool worked perfectly. It just worked slowly. And in the world of AI assistants, slow might as well be broken.

Why Performance Matters More Than You Think

When you test an MCP tool manually, 3-4 seconds feels fine. You run it, see results, move on. But when Claude is using your tools, that 3-4 seconds compounds:

  1. Claude decides which tool to use (1-2s of inference)
  2. Your tool executes (3-4s)
  3. Claude reads the response and generates an answer (2-3s)

Total: 6-9 seconds for one tool call. If Claude needs to call two tools to answer a question, you're at 12-18 seconds. Users start wondering if something broke.

We learned this the hard way when our bundle search averaged 2.3 seconds but hit 12 seconds at P95. Most requests felt okay, but enough were slow that users lost confidence in the whole system.

Set Latency Budgets Before You Optimize

Before you start caching and optimizing, decide what "fast enough" means:

Local stdio servers: < 500ms for simple tools, < 2s for complex ones. Users expect local tools to feel instant.

Remote HTTP servers: < 1s for simple lookups, < 3s for searches or mutations. Network overhead is already ~100-200ms, so every bit counts.

Our budgets:

  • Search tools: P50 < 800ms, P95 < 2s
  • Get-by-ID tools: P50 < 300ms, P95 < 800ms
  • Create/update tools: P50 < 1s, P95 < 3s

If you're consistently hitting your P95 budget, that becomes your new P50. Don't set aspirational numbers—set ones you can actually meet.
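One lightweight way to keep budgets honest is to encode them next to your instrumentation and compare measured percentiles against them in CI or a nightly job. A minimal sketch (the budget table and percentile helper below are illustrative, not from our codebase):

# Hypothetical latency budgets in milliseconds: (P50, P95) per tool.
BUDGETS_MS = {
    "search_bundles": (800, 2000),
    "get_bundle": (300, 800),
    "create_bundle": (1000, 3000),
}

def percentile(samples_ms: list[float], q: float) -> float:
    """Nearest-rank percentile over a list of latency samples."""
    ordered = sorted(samples_ms)
    index = max(0, int(round(q * len(ordered))) - 1)
    return ordered[index]

def check_budget(tool: str, samples_ms: list[float]) -> list[str]:
    """Return human-readable budget violations for one tool."""
    p50_budget, p95_budget = BUDGETS_MS[tool]
    violations = []
    if percentile(samples_ms, 0.50) > p50_budget:
        violations.append(f"{tool}: P50 over budget ({p50_budget}ms)")
    if percentile(samples_ms, 0.95) > p95_budget:
        violations.append(f"{tool}: P95 over budget ({p95_budget}ms)")
    return violations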

Add Observability First, Optimize Second

You can't fix what you can't measure. Before we optimized anything, we instrumented our tools with three things:

1. Correlation IDs

Every tool call gets a unique ID that flows through all logs:

import logging
import time
import uuid

logger = logging.getLogger(__name__)

@mcp.tool(description="Search bundles")
async def search_bundles(query: str) -> dict:
    request_id = str(uuid.uuid4())[:8]  # Short ID for readability
    start = time.time()
    logger.info(f"[{request_id}] Starting search: {query}")

    # ... your code ...

    elapsed = (time.time() - start) * 1000
    logger.info(f"[{request_id}] Completed in {elapsed:.0f}ms")
    return results

Now when something goes wrong, we can grep logs for that request ID and see the entire flow.

2. Timing breakdowns

Don't just log total time—log each step:

import time

@mcp.tool(description="Search bundles")
async def search_bundles(query: str) -> dict:
    request_id = str(uuid.uuid4())[:8]
    start = time.time()

    # Database query
    db_start = time.time()
    raw_results = await db.search(query)
    db_elapsed = (time.time() - db_start) * 1000
    logger.info(f"[{request_id}] DB query: {db_elapsed:.0f}ms")

    # Filter and rank
    filter_start = time.time()
    filtered = apply_filters(raw_results)
    filter_elapsed = (time.time() - filter_start) * 1000
    logger.info(f"[{request_id}] Filtering: {filter_elapsed:.0f}ms")

    # Format response
    format_start = time.time()
    results = format_results(filtered)
    format_elapsed = (time.time() - format_start) * 1000
    logger.info(f"[{request_id}] Formatting: {format_elapsed:.0f}ms")

    total_elapsed = (time.time() - start) * 1000
    logger.info(f"[{request_id}] Total: {total_elapsed:.0f}ms")

    return results

This told us our DB query was 200ms, filtering was 50ms, but formatting took 1.8 seconds because we were serializing huge JSON objects. Fixed the formatting, cut latency by 80%.
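The per-step stopwatch code gets repetitive fast. One way to tidy it up (a sketch, not from the original post) is a small context manager that logs each step with the correlation ID:

import time
from contextlib import contextmanager

@contextmanager
def timed(request_id: str, step: str):
    """Log how long the wrapped block took, tagged with the correlation ID."""
    step_start = time.time()
    try:
        yield
    finally:
        elapsed_ms = (time.time() - step_start) * 1000
        logger.info(f"[{request_id}] {step}: {elapsed_ms:.0f}ms")

# Hypothetical usage inside the tool:
#     with timed(request_id, "DB query"):
#         raw_results = await db.search(query)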

3. Success/failure tracking

Count everything:

from collections import defaultdict

# Simple in-memory counters (use Prometheus/StatsD in production)
METRICS = defaultdict(int)

@mcp.tool(description="Search bundles")
async def search_bundles(query: str) -> dict:
    try:
        results = await do_search(query)
        METRICS["search_success"] += 1
        METRICS[f"search_results_{len(results)}"] += 1
        return results
    except TimeoutError:
        METRICS["search_timeout"] += 1
        logger.error(f"Search timed out: {query}")
        raise
    except Exception as e:
        METRICS["search_error"] += 1
        logger.error(f"Search failed: {e}")
        raise

Now we know our error rate, timeout frequency, and result distribution.
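With the counters in place, a periodic log line (or a /metrics endpoint) can surface the error rate directly. A small sketch using the METRICS dictionary above:

def log_search_health() -> None:
    """Log error and timeout rates derived from the in-memory counters."""
    successes = METRICS["search_success"]
    timeouts = METRICS["search_timeout"]
    errors = METRICS["search_error"]
    total = successes + timeouts + errors
    if total == 0:
        return
    logger.info(
        f"search health: {total} calls, "
        f"{errors / total:.1%} errors, {timeouts / total:.1%} timeouts"
    )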

What Actually Made Things Faster

After instrumenting everything, we found our bottlenecks and fixed them:

1. Return IDs, not objects

Our original search returned full bundle objects—100KB responses with descriptions, tools, provider details, everything. Claude didn't need all that to answer "find the Slack integration."

Before:

return {
    "bundles": [full_bundle_dict(bundle) for bundle in results]  # ~100KB of nested objects
}

After:

return {
    "count": len(results),
    "bundle_ids": [b.id for b in results],
    "summaries": [{
        "id": b.id,
        "name": b.name,
        "description": b.description[:100]  # Just the first 100 chars
    } for b in results]
}  # ~8KB

Cut response size by 92%, serialization time from 1.8s to 200ms.

2. Cache stable data

Bundle metadata doesn't change often. We added a 5-minute cache:

from functools import lru_cache
import time

@lru_cache(maxsize=1000)
def get_bundle_metadata(bundle_id: str, cache_key: int):
    # cache_key is timestamp // 300 (5-minute buckets), so entries expire naturally
    return fetch_from_db(bundle_id)

@mcp.tool(description="Get bundle details")
async def get_bundle(bundle_id: str) -> dict:
    cache_key = int(time.time() // 300)
    return get_bundle_metadata(bundle_id, cache_key)

Reduced database load by 60% and cut latency for repeated lookups from 300ms to 5ms.

3. Add timeouts everywhere

We had a tool that called an external API without a timeout. When that API was slow, our tool would hang for 30+ seconds.

import asyncio

@mcp.tool(description="Check bundle health")
async def check_health(bundle_id: str) -> dict:
    try:
        # 5-second timeout
        result = await asyncio.wait_for(
            external_api.check(bundle_id),
            timeout=5.0
        )
        return result
    except asyncio.TimeoutError:
        logger.warning(f"Health check timed out: {bundle_id}")
        return {"status": "timeout", "message": "Health check timed out"}

Now slow calls fail fast with a useful error instead of hanging.

4. Consolidate list endpoints

We had list_bundles, list_providers, list_tools, search_bundles—four tools that all did similar things. Claude would pick the wrong one, get bad results, then try another.

We replaced all four with one powerful search:

from typing import Literal

@mcp.tool(description="Search for bundles, providers, or tools by name or description")
async def search(
    query: str,
    type: Literal["bundle", "provider", "tool", "all"] = "all",
    limit: int = 20
) -> dict:
    # One smart search endpoint
    return smart_search(query, type, limit)

Claude uses the right tool every time now, and we only maintain one code path.
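The post doesn't show smart_search itself; a minimal sketch, assuming a simple in-memory index keyed by entity type, might look like this:

# INDEX is an assumed {type: [record dicts]} structure; swap in your real store.
INDEX: dict[str, list[dict]] = {"bundle": [], "provider": [], "tool": []}

def smart_search(query: str, type: str, limit: int) -> dict:
    """Match query against names and descriptions across one or all entity types."""
    types = list(INDEX) if type == "all" else [type]
    needle = query.lower()
    matches = []
    for t in types:
        for item in INDEX[t]:
            if needle in item["name"].lower() or needle in item["description"].lower():
                matches.append({"type": t, "id": item["id"], "name": item["name"]})
    matches = matches[:limit]
    return {"count": len(matches), "results": matches}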

When Things Go Wrong

Retries with exponential backoff:

import asyncio
import random

async def call_with_retry(func, max_attempts=3):
    for attempt in range(max_attempts):
        try:
            return await func()
        except Exception as e:
            if attempt == max_attempts - 1:
                raise
            wait = (2 ** attempt) + random.random()  # Exponential backoff + jitter
            logger.warning(f"Attempt {attempt + 1} failed ({e}), retrying in {wait:.1f}s")
            await asyncio.sleep(wait)
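For example, the database search from earlier could be wrapped like this (hypothetical usage, assuming you're inside an async tool function):

results = await call_with_retry(lambda: db.search(query))  # Up to 3 attempts with backoff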

Graceful degradation:

When our database is slow, we return cached results with a warning:

@mcp.tool(description="Search bundles")
async def search_bundles(query: str) -> dict:
    try:
        results = await db.search(query)
        return {"results": results}
    except TimeoutError:
        logger.warning("DB timeout, returning cached results")
        results = cache.get(f"search:{query}", [])
        return {
            "results": results,
            "warning": "Using cached results due to high load"
        }

Claude can still answer the question, and users know why the data might be stale.

What We Track in Production

Our production MCP server exposes Prometheus metrics (see the sketch after this list):

  • mcp_tool_calls_total{tool="search_bundles", status="success"} - Total calls by tool and status
  • mcp_tool_duration_seconds{tool="search_bundles", quantile="0.95"} - Latency percentiles
  • mcp_tool_payload_bytes{tool="search_bundles", direction="response"} - Response sizes
  • mcp_tool_errors_total{tool="search_bundles", error="timeout"} - Error types
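If you're using the official Python client, these might be registered roughly like this (a sketch assuming prometheus_client; label names mirror the list above):

from prometheus_client import Counter, Histogram

TOOL_CALLS = Counter(
    "mcp_tool_calls_total", "Total tool calls", ["tool", "status"]
)
TOOL_DURATION = Histogram(
    "mcp_tool_duration_seconds", "Tool call latency", ["tool"]
)
TOOL_ERRORS = Counter(
    "mcp_tool_errors_total", "Tool errors by type", ["tool", "error"]
)

# Hypothetical usage inside a tool:
#     with TOOL_DURATION.labels(tool="search_bundles").time():
#         results = await do_search(query)
#     TOOL_CALLS.labels(tool="search_bundles", status="success").inc()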

We alert when:

  • P95 latency > 3s for any tool
  • Error rate > 5% over 5 minutes
  • Any tool timeout rate > 1%

These alerts caught a database deadlock at 2am that would have broken our service for morning users.

Key Takeaways

  • Slow tools feel broken—users give up after 10-15 seconds
  • Set realistic budgets before optimizing—< 2s for most tools is achievable
  • Instrument everything with correlation IDs, timing breakdowns, and counters
  • Return IDs, not full objects—cut response sizes by 80-90%
  • Cache stable data with short TTLs—5 minute caches are safe for most metadata
  • Add timeouts to all external calls—fail fast beats hanging
  • Consolidate similar tools into one powerful search—reduces confusion
  • Monitor in production—you can't fix problems you don't see

Fast tools get used. Slow tools get blamed. Make yours fast.