Cache the prompt, not the response - why most LLM caching fails

Your LLM API bills are eating your budget because you are caching the wrong thing. Most teams cache responses when they should cache prompts. Anthropic's prompt caching cuts costs by up to 90% and reduces latency by 85% by reusing processed context instead of reprocessing it every time.
Key takeaways

  • Prompt caching reduces costs more than response caching - Anthropic's prompt caching delivers up to 90% cost reduction and 85% latency reduction for long prompts
  • Semantic similarity beats exact matching - Redis-based semantic caches achieve 61-69% hit rates with positive hit rates exceeding 97%
  • Multi-tier caching changes everything - Combining semantic caching, prefix caching, and full inference can reduce costs by 80% or more versus naive implementation
  • Cache invalidation is simpler than you think - Time-based expiration handles most cases, with content-triggered updates for the rest

Your LLM API bill doubled last month. Again.

You added caching weeks ago, but it barely made a dent. Here’s what’s happening: you’re probably caching responses when you should be caching prompts. The economics are completely different, and most teams get this backwards when implementing LLM caching strategies.

Why response caching misses the point

When I talk to teams struggling with LLM costs, they’ve usually built some version of response caching. User asks a question, you hash it, check if you’ve seen it before, serve the cached answer. Makes sense, right?

Wrong place to optimize.

The problem is that exact question matching gives you terrible hit rates. Someone asks “How do I reset my password?” and you cache the response. Next person asks “What’s the password reset process?” - different question, cache miss, full API call. Your cache hit rate sits around 30% if you’re lucky.

Recent research found that over 30% of LLM queries are semantically similar - meaning you’re reprocessing the same context over and over. The expensive part isn’t generating the answer. It’s the model processing your system instructions, reference documents, and context every single time.

That’s what you cache.

How semantic caching changes the math

Here’s where LLM caching strategies get interesting. Instead of exact string matching, you use embeddings to find semantically similar prompts.

User query comes in. Convert it to an embedding. Search your cache for similar embeddings. If you find a match above your threshold - usually 0.85 to 0.95 cosine similarity - you’ve got a hit. The model has already processed similar context, so you reuse that work.

GPTCache pioneered this approach as an open-source tool, integrating with LangChain and LlamaIndex. But it has limitations - the default SQLite backend struggles in production, and its fixed 0.8 similarity threshold doesn’t generalize well across different use cases. Newer alternatives like GenerativeCache run about 9x faster and adaptively vary thresholds for different content types. MeanCache adds privacy-preserving federated learning and produces fewer false hits.

The technical implementation is straightforward. You need three things: an embedding model to convert queries to vectors, a vector store to search those embeddings fast, and a threshold to decide what counts as similar enough. Redis-based semantic caching can reduce API calls by up to 68.8%, with cache hit rates of 61-69% and positive hit rates exceeding 97%.
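Those three ingredients can be sketched in a few dozen lines. This is a minimal illustration, not a production implementation: the `embed` function below is a toy bag-of-words stand-in for a real embedding model, and the linear scan stands in for a proper vector store like Redis.

```python
import math

def embed(text):
    # Toy stand-in for a real embedding model (e.g. a sentence-transformer):
    # a bag-of-words count vector over a tiny fixed vocabulary.
    vocab = ["password", "reset", "process", "how", "billing", "invoice"]
    words = text.lower().split()
    return [words.count(w) for w in vocab]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

class SemanticCache:
    def __init__(self, threshold=0.85):
        self.threshold = threshold
        self.entries = []  # list of (embedding, response); a vector store in production

    def get(self, query):
        qv = embed(query)
        best, best_score = None, 0.0
        for vec, response in self.entries:
            score = cosine(qv, vec)
            if score > best_score:
                best, best_score = response, score
        # Only serve the cached response above the similarity threshold.
        return best if best_score >= self.threshold else None

    def put(self, query, response):
        self.entries.append((embed(query), response))

cache = SemanticCache(threshold=0.85)
cache.put("how do i reset my password", "Go to Settings > Security > Reset.")
hit = cache.get("password reset how")            # similar wording -> cache hit
miss = cache.get("where is my billing invoice")  # unrelated -> cache miss
```

Swapping the toy `embed` for a real embedding model and the list scan for an approximate nearest-neighbor index is the whole production upgrade path; the threshold logic stays the same.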

The cost breakdown that gets budget approval

Let me show you the numbers that make finance teams pay attention.

Without caching, every API call to something like Claude processes your full prompt. System instructions, reference docs, conversation history, the works. You pay full price for every token, every time.

Anthropic’s prompt caching delivers up to 90% cost reduction and 85% latency reduction for long prompts. OpenAI now offers automatic caching with 50% cost reduction enabled by default. You pay a modest premium to write to the cache, but reads cost substantially less than reprocessing the same tokens.
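With Anthropic's API, opting in is a matter of marking your stable context with a `cache_control` block, per their documented prompt caching feature. The sketch below builds the request payload only (no network call); the model name and document are placeholders.

```python
# Shape of an Anthropic Messages API request using prompt caching: a
# `cache_control` marker on the last stable system block tells the API to
# cache everything up to that point. Later requests with an identical
# prefix read it back at a discount instead of reprocessing it.
LONG_REFERENCE_DOC = "(imagine thousands of tokens of product docs here)"

def build_request(user_question):
    return {
        "model": "claude-sonnet-4-5",  # placeholder; any cache-capable model
        "max_tokens": 1024,
        "system": [
            {"type": "text", "text": "You are a support assistant."},
            {
                "type": "text",
                "text": LONG_REFERENCE_DOC,
                # Marks the end of the cacheable prefix.
                "cache_control": {"type": "ephemeral"},
            },
        ],
        # Only the user turn varies between requests, so only it is
        # processed at full price on a cache hit.
        "messages": [{"role": "user", "content": user_question}],
    }

req = build_request("How do I reset my password?")
```

The key design point: put everything static (instructions, reference docs, examples) before the marker, and everything variable after it.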

Do the math on a typical RAG application. You’ve got system instructions, reference documents, and maybe some examples. That’s your static context - same for every query. Cache that once, reuse it hundreds of times. The recommended approach in 2025-2026 is multi-tier architecture: semantic cache first, then prefix cache, then full inference. Combined savings can exceed 80% versus naive implementation.
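The math above is easy to verify with a back-of-envelope model. The multipliers below are the commonly cited ratios for Anthropic-style prompt caching (roughly a 25% premium to write the cache, roughly 90% off to read it); treat both the rates and the assumption that the prefix stays warm between requests as simplifications, and check current pricing.

```python
# Back-of-envelope cost model: static prefix cached once, read many times.
BASE = 3.00 / 1_000_000   # $ per input token (illustrative rate)
WRITE_MULT, READ_MULT = 1.25, 0.10  # assumed cache write/read multipliers

def run_cost(static_tokens, dynamic_tokens, queries, cached):
    if not cached:
        # Every query reprocesses the full prompt at base price.
        return queries * (static_tokens + dynamic_tokens) * BASE
    write = static_tokens * BASE * WRITE_MULT              # first request writes the cache
    reads = (queries - 1) * static_tokens * BASE * READ_MULT  # later requests read it
    dynamic = queries * dynamic_tokens * BASE              # variable part, full price
    return write + reads + dynamic

# 10k tokens of static context, 200 tokens of user query, 1,000 queries.
uncached = run_cost(10_000, 200, 1_000, cached=False)
cached = run_cost(10_000, 200, 1_000, cached=True)
savings = 1 - cached / uncached
```

Under these assumptions the cached run comes in well over 80% cheaper, which is where the "savings can exceed 80%" figure comes from: the static prefix dominates the token count, and it only gets paid for in full once.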

Character.ai demonstrated this at scale. They built caching into their infrastructure and scaled to 30,000 messages per second. Their secret wasn’t exotic optimization. It was recognizing that most prompts share 80% of their content.

A chat application with stable system prompts, consistent document retrieval, and repetitive user questions can cache 70% or more of input tokens through prefix caching while semantic caching handles 30% of queries outright. Commercial solutions like Portkey claim 99% accuracy with approximately 20% hit rate, while RAG application hit rates range from 18% to 60%.

Cache invalidation without overthinking it

Everyone quotes the “two hard problems in computer science” joke about cache invalidation. For LLM caching strategies specifically, it’s simpler than you think.

Most cached responses can use time-based expiration. Set a reasonable TTL - maybe 5 minutes for rapidly changing data, an hour for relatively stable content. Research on LLM caching confirms that TTL-based freshness struggles with rapidly changing data, but works well for stable contexts like documentation or reference material - so keep TTLs short where content shifts quickly.

For content that changes on events rather than time, you need content-triggered invalidation. Document gets updated? Clear cache entries that reference it. Model version changes? Flush everything and start fresh. The open challenges in this space include handling context-dependent multi-turn interactions and the computational cost of embedding-similarity matching at scale.
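Both policies fit in one small structure: each cache entry carries an expiry time plus the IDs of the documents it depends on. This is a sketch of the idea, not a production cache; the class and method names are illustrative.

```python
import time

class InvalidatingCache:
    """Sketch: time-based expiry plus document-triggered invalidation."""

    def __init__(self):
        self.entries = {}  # key -> (response, expires_at, doc_ids)

    def put(self, key, response, ttl_seconds, doc_ids=()):
        self.entries[key] = (response, time.time() + ttl_seconds, set(doc_ids))

    def get(self, key):
        entry = self.entries.get(key)
        if entry is None:
            return None
        response, expires_at, _ = entry
        if time.time() >= expires_at:   # TTL expired: evict and miss
            del self.entries[key]
            return None
        return response

    def invalidate_doc(self, doc_id):
        # Document changed: drop every cached answer that referenced it.
        stale = [k for k, (_, _, docs) in self.entries.items() if doc_id in docs]
        for k in stale:
            del self.entries[k]

cache = InvalidatingCache()
cache.put("q1", "answer about pricing", ttl_seconds=3600, doc_ids=["pricing.md"])
cache.put("q2", "answer about setup", ttl_seconds=3600, doc_ids=["setup.md"])
cache.invalidate_doc("pricing.md")  # only the pricing answer is evicted
```

A model version bump is just the degenerate case: invalidate everything rather than a single document's entries.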

The trick is monitoring your hit rates and adjusting thresholds. Start conservative - maybe 0.90 similarity for a cache hit. If you see too many misses, lower it to 0.85. If you see complaints about irrelevant responses, raise it to 0.95. Tools like GenerativeCache adaptively vary thresholds for different content types, but manual tuning works fine when you’re starting out.

What works in production

I’ve looked at enough implementations to see patterns in what succeeds versus what fails.

The teams seeing real results use multi-tier approaches. They combine exact matching for identical queries, semantic matching for similar questions, and fall back to full API calls when nothing matches. Cloud providers have caught on - Microsoft offers Azure Cosmos DB for semantic caching, Google has Vertex AI with Vector Search, and AWS provides Titan embedding with MemoryDB.
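The cascade described above is a short piece of control flow. In this sketch, `semantic_lookup` and `call_model` are hypothetical injected callables standing in for your semantic cache tier and your LLM client; only the exact-match tier is implemented inline.

```python
import hashlib

def exact_key(prompt):
    # Tier 1 keys on a hash of the exact prompt text.
    return hashlib.sha256(prompt.encode()).hexdigest()

def answer(prompt, exact_cache, semantic_lookup, call_model):
    """Tier 1: exact match. Tier 2: semantic match. Tier 3: full inference."""
    key = exact_key(prompt)
    if key in exact_cache:            # tier 1: identical query seen before
        return exact_cache[key], "exact"
    cached = semantic_lookup(prompt)  # tier 2: similar-enough query
    if cached is not None:
        return cached, "semantic"
    response = call_model(prompt)     # tier 3: nothing matched, pay full price
    exact_cache[key] = response       # populate tier 1 for next time
    return response, "full"

exact = {}
no_semantic = lambda p: None          # stub: semantic tier always misses
model = lambda p: "hello!"            # stub: fake LLM call
resp1, tier1 = answer("hi", exact, no_semantic, model)  # first time: full call
resp2, tier2 = answer("hi", exact, no_semantic, model)  # repeat: exact hit
```

Returning which tier served the response is deliberate: it is exactly the per-tier hit-rate signal the next paragraph says you should be tracking.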

They also track the right metrics. Cache hit rate matters, but so does cost per query and user-perceived latency. Helicone, which has processed over 2 billion LLM interactions, reports that built-in caching typically reduces API costs by 20-30%. Their proxy-based integration adds only 50-80ms average latency.

What doesn’t work: trying to cache everything. Some queries are truly unique. Some contexts change so fast that caching adds complexity without benefit. The teams doing this well identify their repetitive traffic - over 30% of LLM queries are semantically similar - and focus there.

Multi-tier caching with semantic caching, prefix caching, and full inference can reduce costs by 80% or more. For applications with stable contexts like documentation assistants or customer support bots, the savings are even higher - Anthropic’s approach delivers up to 90% cost reduction for long prompts.

Start small. Pick your highest-volume endpoint. Add semantic caching. Measure for a week. You’ll know pretty quickly if this is worth expanding.

The pattern that emerges is simple: prompt caching is the highest ROI optimization for LLM applications. Not prompt engineering. Not model fine-tuning. Not switching providers. Caching what you’re already sending anyway.

Most teams leave this money on the table because they overthink it. The infrastructure is mature - every major provider now offers some form of caching, from Anthropic’s 90% cost reduction to OpenAI’s automatic 50% savings. The implementation is straightforward. You just need to cache the right thing.

About the Author

Amit Kothari is an experienced consultant, advisor, coach, and educator specializing in AI and operations for executives and their companies. With 25+ years of experience and as the founder of Tallyfy (raised $3.6m), he helps mid-size companies identify, plan, and implement practical AI solutions that actually work. Originally British and now based in St. Louis, MO, Amit combines deep technical expertise with real-world business understanding.

Disclaimer: The content in this article represents personal opinions based on extensive research and practical experience. While every effort has been made to ensure accuracy through data analysis and source verification, this should not be considered professional advice. Always consult with qualified professionals for decisions specific to your situation.