Cache the prompt, not the response - why most LLM caching fails

Your LLM API bills are eating your budget because you are caching the wrong thing. Most teams cache responses when they should cache prompts. Anthropic's prompt caching cuts costs by up to 90% and reduces latency by 85% by reusing processed context instead of reprocessing it every time.
Key takeaways

  • Prompt caching reduces costs more than response caching - Anthropic's prompt caching delivers up to 90% cost reduction and 85% latency reduction for long prompts
  • Semantic similarity beats exact matching - Redis-based semantic caches achieve 61-69% hit rates with positive hit rates exceeding 97%
  • Multi-tier caching changes everything - Combining semantic caching, prefix caching, and full inference can reduce costs by 80% or more versus naive implementation
  • Cache invalidation is simpler than you think - Time-based expiration handles most cases, with content-triggered updates for the rest

Your LLM API bill doubled last month. Again.

You added caching weeks ago, but it barely made a dent. Here’s what’s happening: you’re probably caching responses when you should be caching prompts. The economics are completely different, and most teams get this backwards when implementing LLM caching strategies.

Why response caching misses the point

When I talk to teams struggling with LLM costs, they’ve usually built some version of response caching. User asks a question, you hash it, check if you’ve seen it before, serve the cached answer. Makes sense, right?

Wrong place to optimize.

The problem is that exact question matching gives you terrible hit rates. Someone asks “How do I reset my password?” and you cache the response. Next person asks “What’s the password reset process?” - different question, cache miss, full API call. Your cache hit rate sits around 30% if you’re lucky.

Recent research found that over 30% of LLM queries are semantically similar - meaning you’re reprocessing the same context over and over. The expensive part isn’t generating the answer. It’s the model processing your system instructions, reference documents, and context every single time.

That’s what you cache.

How semantic caching changes the math

Here’s where LLM caching strategies get interesting. Instead of exact string matching, you use embeddings to find semantically similar prompts.

User query comes in. Convert it to an embedding. Search your cache for similar embeddings. If you find a match above your threshold - usually 0.85 to 0.95 cosine similarity - you’ve got a hit. The model has already processed similar context, so you reuse that work.

GPTCache pioneered this approach as an open-source tool, integrating with LangChain and LlamaIndex. But it has limitations - the default SQLite backend struggles in production, and its fixed 0.8 similarity threshold doesn’t generalize well across different use cases. Newer alternatives like GenerativeCache run about 9x faster and adaptively vary thresholds for different content types. MeanCache adds privacy-preserving federated learning and produces fewer false hits.

The technical implementation is straightforward. You need three things: an embedding model to convert queries to vectors, a vector store to search those embeddings fast, and a threshold to decide what counts as similar enough. Redis-based semantic caching can reduce API calls by up to 68.8%, with cache hit rates of 61-69% and positive hit rates exceeding 97%.
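Those three ingredients can be sketched in a few dozen lines. This is a minimal illustration, not a production implementation: the `embed` function below is a toy bag-of-words stand-in for a real embedding model, and the linear scan stands in for a proper vector store like Redis.

```python
import math

def embed(text):
    # Toy stand-in for a real embedding model (e.g. a sentence-transformer):
    # a bag-of-words count vector over a tiny fixed vocabulary.
    vocab = ["password", "reset", "process", "how", "billing", "invoice"]
    words = text.lower().split()
    return [words.count(w) for w in vocab]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

class SemanticCache:
    def __init__(self, threshold=0.85):
        self.threshold = threshold
        self.entries = []  # list of (embedding, response); a vector store in production

    def get(self, query):
        qv = embed(query)
        best, best_score = None, 0.0
        for vec, response in self.entries:
            score = cosine(qv, vec)
            if score > best_score:
                best, best_score = response, score
        # Only serve the cached response above the similarity threshold.
        return best if best_score >= self.threshold else None

    def put(self, query, response):
        self.entries.append((embed(query), response))

cache = SemanticCache(threshold=0.85)
cache.put("how do i reset my password", "Go to Settings > Security > Reset.")
hit = cache.get("password reset how")            # similar wording -> cache hit
miss = cache.get("where is my billing invoice")  # unrelated -> cache miss
```

Swapping the toy `embed` for a real embedding model and the list scan for an approximate nearest-neighbor index is the whole production upgrade path; the threshold logic stays the same.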

The cost breakdown that gets budget approval

Let me show you the numbers that make finance teams pay attention.

Without caching, every API call to something like Claude processes your full prompt. System instructions, reference docs, conversation history, the works. You pay full price for every token, every time.

Anthropic’s prompt caching delivers up to 90% cost reduction and 85% latency reduction for long prompts. OpenAI now offers automatic caching with 50% cost reduction enabled by default. You pay a modest premium to write to the cache, but reads cost substantially less than reprocessing the same tokens.
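With Anthropic's API, opting in is a matter of marking your stable context with a `cache_control` block, per their documented prompt caching feature. The sketch below builds the request payload only (no network call); the model name and document are placeholders.

```python
# Shape of an Anthropic Messages API request using prompt caching: a
# `cache_control` marker on the last stable system block tells the API to
# cache everything up to that point. Later requests with an identical
# prefix read it back at a discount instead of reprocessing it.
LONG_REFERENCE_DOC = "(imagine thousands of tokens of product docs here)"

def build_request(user_question):
    return {
        "model": "claude-sonnet-4-5",  # placeholder; any cache-capable model
        "max_tokens": 1024,
        "system": [
            {"type": "text", "text": "You are a support assistant."},
            {
                "type": "text",
                "text": LONG_REFERENCE_DOC,
                # Marks the end of the cacheable prefix.
                "cache_control": {"type": "ephemeral"},
            },
        ],
        # Only the user turn varies between requests, so only it is
        # processed at full price on a cache hit.
        "messages": [{"role": "user", "content": user_question}],
    }

req = build_request("How do I reset my password?")
```

The key design point: put everything static (instructions, reference docs, examples) before the marker, and everything variable after it.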

Do the math on a typical RAG application. You’ve got system instructions, reference documents, and maybe some examples. That’s your static context - same for every query. Cache that once, reuse it hundreds of times. The recommended approach in 2025-2026 is multi-tier architecture: semantic cache first, then prefix cache, then full inference. Combined savings can exceed 80% versus naive implementation.
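The math above is easy to verify with a back-of-envelope model. The multipliers below are the commonly cited ratios for Anthropic-style prompt caching (roughly a 25% premium to write the cache, roughly 90% off to read it); treat both the rates and the assumption that the prefix stays warm between requests as simplifications, and check current pricing.

```python
# Back-of-envelope cost model: static prefix cached once, read many times.
BASE = 3.00 / 1_000_000   # $ per input token (illustrative rate)
WRITE_MULT, READ_MULT = 1.25, 0.10  # assumed cache write/read multipliers

def run_cost(static_tokens, dynamic_tokens, queries, cached):
    if not cached:
        # Every query reprocesses the full prompt at base price.
        return queries * (static_tokens + dynamic_tokens) * BASE
    write = static_tokens * BASE * WRITE_MULT              # first request writes the cache
    reads = (queries - 1) * static_tokens * BASE * READ_MULT  # later requests read it
    dynamic = queries * dynamic_tokens * BASE              # variable part, full price
    return write + reads + dynamic

# 10k tokens of static context, 200 tokens of user query, 1,000 queries.
uncached = run_cost(10_000, 200, 1_000, cached=False)
cached = run_cost(10_000, 200, 1_000, cached=True)
savings = 1 - cached / uncached
```

Under these assumptions the cached run comes in well over 80% cheaper, which is where the "savings can exceed 80%" figure comes from: the static prefix dominates the token count, and it only gets paid for in full once.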

Character.ai demonstrated this at scale. They built caching into their infrastructure and scaled to 30,000 messages per second. Their secret wasn’t exotic optimization. It was recognizing that most prompts share 80% of their content.

A chat application with stable system prompts, consistent document retrieval, and repetitive user questions can cache 70% or more of input tokens through prefix caching while semantic caching handles 30% of queries outright. Commercial solutions like Portkey claim 99% accuracy with approximately 20% hit rate, while RAG application hit rates range from 18% to 60%.

Cache invalidation without overthinking it

Everyone quotes the “two hard problems in computer science” joke about cache invalidation. For LLM caching strategies specifically, it’s simpler than you think.

Most cached responses can use time-based expiration. Set a reasonable TTL - maybe 5 minutes for rapidly changing data, an hour for relatively stable content. Research on LLM caching confirms that TTL-based freshness struggles with rapidly changing data, but works well for stable contexts like documentation or reference material - so keep TTLs short where content shifts quickly.

For content that changes on events rather than time, you need content-triggered invalidation. Document gets updated? Clear cache entries that reference it. Model version changes? Flush everything and start fresh. The open challenges in this space include handling context-dependent multi-turn interactions and the computational cost of embedding-similarity matching at scale.
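Both policies fit in one small structure: each cache entry carries an expiry time plus the IDs of the documents it depends on. This is a sketch of the idea, not a production cache; the class and method names are illustrative.

```python
import time

class InvalidatingCache:
    """Sketch: time-based expiry plus document-triggered invalidation."""

    def __init__(self):
        self.entries = {}  # key -> (response, expires_at, doc_ids)

    def put(self, key, response, ttl_seconds, doc_ids=()):
        self.entries[key] = (response, time.time() + ttl_seconds, set(doc_ids))

    def get(self, key):
        entry = self.entries.get(key)
        if entry is None:
            return None
        response, expires_at, _ = entry
        if time.time() >= expires_at:   # TTL expired: evict and miss
            del self.entries[key]
            return None
        return response

    def invalidate_doc(self, doc_id):
        # Document changed: drop every cached answer that referenced it.
        stale = [k for k, (_, _, docs) in self.entries.items() if doc_id in docs]
        for k in stale:
            del self.entries[k]

cache = InvalidatingCache()
cache.put("q1", "answer about pricing", ttl_seconds=3600, doc_ids=["pricing.md"])
cache.put("q2", "answer about setup", ttl_seconds=3600, doc_ids=["setup.md"])
cache.invalidate_doc("pricing.md")  # only the pricing answer is evicted
```

A model version bump is just the degenerate case: invalidate everything rather than a single document's entries.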

The trick is monitoring your hit rates and adjusting thresholds. Start conservative - maybe 0.90 similarity for a cache hit. If you see too many misses, lower it to 0.85. If you see complaints about irrelevant responses, raise it to 0.95. Tools like GenerativeCache adaptively vary thresholds for different content types, but manual tuning works fine when you’re starting out.

What works in production

I’ve looked at enough implementations to see patterns in what succeeds versus what fails.

The teams seeing real results use multi-tier approaches. They combine exact matching for identical queries, semantic matching for similar questions, and fall back to full API calls when nothing matches. Cloud providers have caught on - Microsoft offers Azure Cosmos DB for semantic caching, Google has Vertex AI with Vector Search, and AWS provides Titan embedding with MemoryDB.
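The cascade described above is a short piece of control flow. In this sketch, `semantic_lookup` and `call_model` are hypothetical injected callables standing in for your semantic cache tier and your LLM client; only the exact-match tier is implemented inline.

```python
import hashlib

def exact_key(prompt):
    # Tier 1 keys on a hash of the exact prompt text.
    return hashlib.sha256(prompt.encode()).hexdigest()

def answer(prompt, exact_cache, semantic_lookup, call_model):
    """Tier 1: exact match. Tier 2: semantic match. Tier 3: full inference."""
    key = exact_key(prompt)
    if key in exact_cache:            # tier 1: identical query seen before
        return exact_cache[key], "exact"
    cached = semantic_lookup(prompt)  # tier 2: similar-enough query
    if cached is not None:
        return cached, "semantic"
    response = call_model(prompt)     # tier 3: nothing matched, pay full price
    exact_cache[key] = response       # populate tier 1 for next time
    return response, "full"

exact = {}
no_semantic = lambda p: None          # stub: semantic tier always misses
model = lambda p: "hello!"            # stub: fake LLM call
resp1, tier1 = answer("hi", exact, no_semantic, model)  # first time: full call
resp2, tier2 = answer("hi", exact, no_semantic, model)  # repeat: exact hit
```

Returning which tier served the response is deliberate: it is exactly the per-tier hit-rate signal the next paragraph says you should be tracking.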

They also track the right metrics. Cache hit rate matters, but so does cost per query and user-perceived latency. Helicone, which has processed over 2 billion LLM interactions, reports that built-in caching typically reduces API costs by 20-30%. Their proxy-based integration adds only 50-80ms average latency.

What doesn’t work: trying to cache everything. Some queries are truly unique. Some contexts change so fast that caching adds complexity without benefit. The teams doing this well identify their repetitive traffic - over 30% of LLM queries are semantically similar - and focus there.

Multi-tier caching with semantic caching, prefix caching, and full inference can reduce costs by 80% or more. For applications with stable contexts like documentation assistants or customer support bots, the savings are even higher - Anthropic’s approach delivers up to 90% cost reduction for long prompts.

Start small. Pick your highest-volume endpoint. Add semantic caching. Measure for a week. You’ll know pretty quickly if this is worth expanding.

The pattern that emerges is simple: prompt caching is the highest ROI optimization for LLM applications. Not prompt engineering. Not model fine-tuning. Not switching providers. Caching what you’re already sending anyway.

Most teams leave this money on the table because they overthink it. The infrastructure is mature - every major provider now offers some form of caching, from Anthropic’s 90% cost reduction to OpenAI’s automatic 50% savings. The implementation is straightforward. You just need to cache the right thing.

About the Author

Amit Kothari is an experienced consultant, advisor, coach, and educator specializing in AI and operations for executives and their companies. With 25+ years of experience and as the founder of Tallyfy (raised $3.6m), he helps mid-size companies identify, plan, and implement practical AI solutions that actually work. Originally British and now based in St. Louis, MO, Amit combines deep technical expertise with real-world business understanding.

Disclaimer: The content in this article represents personal opinions based on extensive research and practical experience. While every effort has been made to ensure accuracy through data analysis and source verification, this should not be considered professional advice. Always consult with qualified professionals for decisions specific to your situation.