
MCP Performance: What Slow Tools Cost You

MCPBundles · 7 min read

Our bundle search tool was taking 12 seconds at P95. Users would ask Claude to "find the Slack integration," watch nothing happen, then ask again. Claude would make the same call twice, wait 24 seconds total, and users would close the tab thinking our service was down.

The tool worked perfectly. It just worked slowly. And in the world of AI assistants, slow might as well be broken.

Why Performance Matters More Than You Think

When you test an MCP tool manually, 3-4 seconds feels fine. You run it, see results, move on. But when Claude is using your tools, that 3-4 seconds compounds:

  1. Claude decides which tool to use (1-2s of inference)
  2. Your tool executes (3-4s)
  3. Claude reads the response and generates an answer (2-3s)

Total: 6-9 seconds for one tool call. If Claude needs to call two tools to answer a question, you're at 12-18 seconds. Users start wondering if something broke.

We learned this the hard way when our bundle search averaged 2.3 seconds but hit 12 seconds at P95. Most requests felt okay, but enough were slow that users lost confidence in the whole system.

Set Latency Budgets Before You Optimize

Before you start caching and optimizing, decide what "fast enough" means:

Local stdio servers: < 500ms for simple tools, < 2s for complex ones. Users expect local tools to feel instant.

Remote HTTP servers: < 1s for simple lookups, < 3s for searches or mutations. Network overhead is already ~100-200ms, so every bit counts.

Our budgets:

  • Search tools: P50 < 800ms, P95 < 2s
  • Get-by-ID tools: P50 < 300ms, P95 < 800ms
  • Create/update tools: P50 < 1s, P95 < 3s

If you're consistently hitting your P95 budget, that becomes your new P50. Don't set aspirational numbers—set ones you can actually meet.
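One lightweight way to keep budgets honest is to encode them next to your instrumentation and compare measured percentiles against them in CI or a nightly job. A minimal sketch (the budget table and percentile helper below are illustrative, not from our codebase):

# Hypothetical latency budgets in milliseconds: (P50, P95) per tool.
BUDGETS_MS = {
    "search_bundles": (800, 2000),
    "get_bundle": (300, 800),
    "create_bundle": (1000, 3000),
}

def percentile(samples_ms: list[float], q: float) -> float:
    """Nearest-rank percentile over a list of latency samples."""
    ordered = sorted(samples_ms)
    index = max(0, int(round(q * len(ordered))) - 1)
    return ordered[index]

def check_budget(tool: str, samples_ms: list[float]) -> list[str]:
    """Return human-readable budget violations for one tool."""
    p50_budget, p95_budget = BUDGETS_MS[tool]
    violations = []
    if percentile(samples_ms, 0.50) > p50_budget:
        violations.append(f"{tool}: P50 over budget ({p50_budget}ms)")
    if percentile(samples_ms, 0.95) > p95_budget:
        violations.append(f"{tool}: P95 over budget ({p95_budget}ms)")
    return violations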

Add Observability First, Optimize Second

You can't fix what you can't measure. Before we optimized anything, we instrumented our tools with three things:

1. Correlation IDs

Every tool call gets a unique ID that flows through all logs:

import logging
import time
import uuid

logger = logging.getLogger(__name__)

@mcp.tool(description="Search bundles")
async def search_bundles(query: str) -> dict:
    request_id = str(uuid.uuid4())[:8]  # Short ID for readability
    start = time.time()
    logger.info(f"[{request_id}] Starting search: {query}")

    # ... your code ...

    elapsed = (time.time() - start) * 1000
    logger.info(f"[{request_id}] Completed in {elapsed:.0f}ms")
    return results

Now when something goes wrong, we can grep logs for that request ID and see the entire flow.

2. Timing breakdowns

Don't just log total time—log each step:

import time

@mcp.tool(description="Search bundles")
async def search_bundles(query: str) -> dict:
    request_id = str(uuid.uuid4())[:8]
    start = time.time()

    # Database query
    db_start = time.time()
    raw_results = await db.search(query)
    db_elapsed = (time.time() - db_start) * 1000
    logger.info(f"[{request_id}] DB query: {db_elapsed:.0f}ms")

    # Filter and rank
    filter_start = time.time()
    filtered = apply_filters(raw_results)
    filter_elapsed = (time.time() - filter_start) * 1000
    logger.info(f"[{request_id}] Filtering: {filter_elapsed:.0f}ms")

    # Format response
    format_start = time.time()
    results = format_results(filtered)
    format_elapsed = (time.time() - format_start) * 1000
    logger.info(f"[{request_id}] Formatting: {format_elapsed:.0f}ms")

    total_elapsed = (time.time() - start) * 1000
    logger.info(f"[{request_id}] Total: {total_elapsed:.0f}ms")

    return results

This told us our DB query was 200ms, filtering was 50ms, but formatting took 1.8 seconds because we were serializing huge JSON objects. Fixed the formatting, cut latency by 80%.
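The per-step stopwatch code gets repetitive fast. One way to tidy it up (a sketch, not from the original post) is a small context manager that logs each step with the correlation ID:

import time
from contextlib import contextmanager

@contextmanager
def timed(request_id: str, step: str):
    """Log how long the wrapped block took, tagged with the correlation ID."""
    step_start = time.time()
    try:
        yield
    finally:
        elapsed_ms = (time.time() - step_start) * 1000
        logger.info(f"[{request_id}] {step}: {elapsed_ms:.0f}ms")

# Hypothetical usage inside the tool:
#     with timed(request_id, "DB query"):
#         raw_results = await db.search(query)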

3. Success/failure tracking

Count everything:

from collections import defaultdict

# Simple in-memory counters (use Prometheus/StatsD in production)
METRICS = defaultdict(int)

@mcp.tool(description="Search bundles")
async def search_bundles(query: str) -> dict:
    try:
        results = await do_search(query)
        METRICS["search_success"] += 1
        METRICS[f"search_results_{len(results)}"] += 1
        return results
    except TimeoutError:
        METRICS["search_timeout"] += 1
        logger.error(f"Search timed out: {query}")
        raise
    except Exception as e:
        METRICS["search_error"] += 1
        logger.error(f"Search failed: {e}")
        raise

Now we know our error rate, timeout frequency, and result distribution.
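With the counters in place, a periodic log line (or a /metrics endpoint) can surface the error rate directly. A small sketch using the METRICS dictionary above:

def log_search_health() -> None:
    """Log error and timeout rates derived from the in-memory counters."""
    successes = METRICS["search_success"]
    timeouts = METRICS["search_timeout"]
    errors = METRICS["search_error"]
    total = successes + timeouts + errors
    if total == 0:
        return
    logger.info(
        f"search health: {total} calls, "
        f"{errors / total:.1%} errors, {timeouts / total:.1%} timeouts"
    )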

What Actually Made Things Faster

After instrumenting everything, we found our bottlenecks and fixed them:

1. Return IDs, not objects

Our original search returned full bundle objects—100KB responses with descriptions, tools, provider details, everything. Claude didn't need all that to answer "find the Slack integration."

Before:

return {
    "bundles": [full_bundle_dict(bundle) for bundle in results]  # ~100KB of nested objects
}

After:

return {
    "count": len(results),
    "bundle_ids": [b.id for b in results],
    "summaries": [{
        "id": b.id,
        "name": b.name,
        "description": b.description[:100]  # Just the first 100 chars
    } for b in results]
}  # ~8KB

Cut response size by 92%, serialization time from 1.8s to 200ms.

2. Cache stable data

Bundle metadata doesn't change often. We added a 5-minute cache:

from functools import lru_cache
import time

@lru_cache(maxsize=1000)
def get_bundle_metadata(bundle_id: str, cache_key: int):
    # cache_key is timestamp // 300 (5-minute buckets), so entries expire naturally
    return fetch_from_db(bundle_id)

@mcp.tool(description="Get bundle details")
async def get_bundle(bundle_id: str) -> dict:
    cache_key = int(time.time() // 300)
    return get_bundle_metadata(bundle_id, cache_key)

Reduced database load by 60% and cut latency for repeated lookups from 300ms to 5ms.

3. Add timeouts everywhere

We had a tool that called an external API without a timeout. When that API was slow, our tool would hang for 30+ seconds.

import asyncio

@mcp.tool(description="Check bundle health")
async def check_health(bundle_id: str) -> dict:
    try:
        # 5-second timeout
        result = await asyncio.wait_for(
            external_api.check(bundle_id),
            timeout=5.0
        )
        return result
    except asyncio.TimeoutError:
        logger.warning(f"Health check timed out: {bundle_id}")
        return {"status": "timeout", "message": "Health check timed out"}

Now slow calls fail fast with a useful error instead of hanging.

4. Consolidate list endpoints

We had list_bundles, list_providers, list_tools, search_bundles—four tools that all did similar things. Claude would pick the wrong one, get bad results, then try another.

We replaced all four with one powerful search:

from typing import Literal

@mcp.tool(description="Search for bundles, providers, or tools by name or description")
async def search(
    query: str,
    type: Literal["bundle", "provider", "tool", "all"] = "all",
    limit: int = 20
) -> dict:
    # One smart search endpoint
    return smart_search(query, type, limit)

Claude uses the right tool every time now, and we only maintain one code path.
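The post doesn't show smart_search itself; a minimal sketch, assuming a simple in-memory index keyed by entity type, might look like this:

# INDEX is an assumed {type: [record dicts]} structure; swap in your real store.
INDEX: dict[str, list[dict]] = {"bundle": [], "provider": [], "tool": []}

def smart_search(query: str, type: str, limit: int) -> dict:
    """Match query against names and descriptions across one or all entity types."""
    types = list(INDEX) if type == "all" else [type]
    needle = query.lower()
    matches = []
    for t in types:
        for item in INDEX[t]:
            if needle in item["name"].lower() or needle in item["description"].lower():
                matches.append({"type": t, "id": item["id"], "name": item["name"]})
    matches = matches[:limit]
    return {"count": len(matches), "results": matches}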

When Things Go Wrong

Retries with exponential backoff:

import asyncio
import random

async def call_with_retry(func, max_attempts=3):
    for attempt in range(max_attempts):
        try:
            return await func()
        except Exception as e:
            if attempt == max_attempts - 1:
                raise
            wait = (2 ** attempt) + random.random()  # Exponential backoff + jitter
            logger.warning(f"Attempt {attempt + 1} failed ({e}), retrying in {wait:.1f}s")
            await asyncio.sleep(wait)
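For example, the database search from earlier could be wrapped like this (hypothetical usage, assuming you're inside an async tool function):

results = await call_with_retry(lambda: db.search(query))  # Up to 3 attempts with backoff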

Graceful degradation:

When our database is slow, we return cached results with a warning:

@mcp.tool(description="Search bundles")
async def search_bundles(query: str) -> dict:
    try:
        results = await db.search(query)
        return {"results": results}
    except TimeoutError:
        logger.warning("DB timeout, returning cached results")
        results = cache.get(f"search:{query}", [])
        return {
            "results": results,
            "warning": "Using cached results due to high load"
        }

Claude can still answer the question, and users know why the data might be stale.

What We Track in Production

Our production MCP server exposes Prometheus metrics (see the sketch after this list):

  • mcp_tool_calls_total{tool="search_bundles", status="success"} - Total calls by tool and status
  • mcp_tool_duration_seconds{tool="search_bundles", quantile="0.95"} - Latency percentiles
  • mcp_tool_payload_bytes{tool="search_bundles", direction="response"} - Response sizes
  • mcp_tool_errors_total{tool="search_bundles", error="timeout"} - Error types
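If you're using the official Python client, these might be registered roughly like this (a sketch assuming prometheus_client; label names mirror the list above):

from prometheus_client import Counter, Histogram

TOOL_CALLS = Counter(
    "mcp_tool_calls_total", "Total tool calls", ["tool", "status"]
)
TOOL_DURATION = Histogram(
    "mcp_tool_duration_seconds", "Tool call latency", ["tool"]
)
TOOL_ERRORS = Counter(
    "mcp_tool_errors_total", "Tool errors by type", ["tool", "error"]
)

# Hypothetical usage inside a tool:
#     with TOOL_DURATION.labels(tool="search_bundles").time():
#         results = await do_search(query)
#     TOOL_CALLS.labels(tool="search_bundles", status="success").inc()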

We alert when:

  • P95 latency > 3s for any tool
  • Error rate > 5% over 5 minutes
  • Any tool timeout rate > 1%

These alerts caught a database deadlock at 2am that would have broken our service for morning users.

Key Takeaways

  • Slow tools feel broken—users give up after 10-15 seconds
  • Set realistic budgets before optimizing—< 2s for most tools is achievable
  • Instrument everything with correlation IDs, timing breakdowns, and counters
  • Return IDs, not full objects—cut response sizes by 80-90%
  • Cache stable data with short TTLs—5 minute caches are safe for most metadata
  • Add timeouts to all external calls—fail fast beats hanging
  • Consolidate similar tools into one powerful search—reduces confusion
  • Monitor in production—you can't fix problems you don't see

Fast tools get used. Slow tools get blamed. Make yours fast.