Why Your Semantic Cache Is Returning Wrong Answers

LemonData · March 5, 2026

#semantic-cache #embeddings #llm-infrastructure #production-debugging

A user reported that our translation plugin was returning the same cached result for every request, regardless of input. We investigated and found something worse: 95% of all semantic cache hits across our platform were false positives. 199 different translation requests, 198 unique request bodies, one cached response served to all of them.

The Bug Report

The report was simple: "I disabled semantic cache, but every translation returns the same result."

Three request IDs, three different translation segments, identical cached responses. The request bodies ranged from 1,564 to 8,676 bytes. The cached response ID was the same across all of them: chatcmpl-DG6J03nhdvcF7Ek0C8rJkjh7lN9pF.

First suspicion: the user's cache settings weren't being applied. That turned out to be a separate data-source sync bug (the admin panel wrote to one table, the API gateway read from another). But fixing that only solved half the problem. Even with cache enabled and working correctly, the semantic cache was matching requests that should never match.

The Production Data

We pulled 24 hours of cache hit data from ClickHouse. The numbers were bad.

| Model | Total requests | Cache hits | Unique requests | Unique responses | Hit rate |
|---|---|---|---|---|---|
| gpt-4.1-nano | 200 | 199 | 198 | 1 | 99.5% |
| glm-4.6-thinking | 100 | 38 | 13 | 1 | 38% |
| gpt-5-nano | 31 | 29 | 28 | 2 | 93.5% |
| gpt-oss-120b | 18 | 17 | 17 | 1 | 94.4% |
| qwen3-vl-flash | 17 | 16 | 16 | 1 | 94.1% |

198 unique translation requests, all returning the same single cached response. That's not a cache. That's a broken function that returns a constant.

Every affected model shared two traits: all requests came from a single user, and all used a fixed system prompt template with varying user content.

Why Embeddings Fail on Structured Input

The translation plugin sends requests like this:

System: "Act as a translation API. Output a single raw JSON object only.
         Input: {"targetLanguage":"<lang>","title":"...","segments":[...]}"

User:   {"targetLanguage":"zh","title":"Product Page",
         "description":"Translate product descriptions",
         "tone":"formal",
         "segments":[{"text":"actual varying content here"}]}

The system prompt is identical across all requests. The user message is a JSON object where targetLanguage, title, description, and tone are fixed. Only segments[].text changes.

When our semantic cache extracts text for embedding, it concatenates the system prompt and user message. The fixed template accounts for roughly 80% of the text. The embedding model (all-mpnet-base-v2, 768 dimensions) compresses this into a vector where the template structure dominates. The actual translation content barely moves the needle.

Result: cosine similarity between "translate 'Hello world'" and "translate 'The quarterly financial report shows a 15% increase in revenue'" exceeds 0.95. Our threshold is 0.95. Every translation request matches the first cached entry.
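The dilution effect is easy to reproduce without any embedding model at all. A toy illustration using bag-of-words count vectors (purely illustrative; the production numbers above come from a real 768-dimensional model, not this sketch): when a large shared template dominates the token counts, two requests with completely different payloads still score high on cosine similarity.

```typescript
// Cosine similarity between two count vectors.
function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Count occurrences of each vocabulary word in a token list.
function bagOfWords(tokens: string[], vocab: string[]): number[] {
  return vocab.map((w) => tokens.filter((t) => t === w).length);
}

// 80 template tokens shared by both requests, 5 payload tokens each,
// payloads fully disjoint -- mimicking a fixed prompt wrapper.
const template = Array.from({ length: 80 }, (_, i) => `tpl${i}`);
const reqA = [...template, "hello", "world", "greeting", "short", "text"];
const reqB = [...template, "quarterly", "report", "revenue", "increase", "figures"];

const vocab = Array.from(new Set([...reqA, ...reqB]));
const sim = cosine(bagOfWords(reqA, vocab), bagOfWords(reqB, vocab));
console.log(sim.toFixed(3)); // 0.941 -- payloads share no words, yet similarity clears 0.9
```

With 80 shared tokens out of 85, the similarity is 80/85 ≈ 0.941 even though the actual content overlaps not at all. Real embeddings aren't raw word counts, but the same dominance mechanism applies.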

Digging through the logs, we found three ways this breaks:

The translation plugin is the worst offender. Fixed JSON keys and values drown out the actual translation segments. gpt-4.1-nano and gpt-5-nano both hit this.

A context summarization assistant had a different flavor of the same problem. Its system prompt was so long that user content (ranging from 5KB to 47KB) barely registered in the embedding. That's how glm-4.6-thinking ended up returning the same summary for every conversation.

The third pattern was subtler. For gpt-oss-120b and qwen3-vl-flash, the first 500 characters of every request were byte-for-byte identical. The varying content came after, but the embedding was already dominated by the shared prefix.

What the Research Says

This isn't a novel problem. Recent papers have quantified it.

UC Berkeley's vCache project found that correct and incorrect cache hits have "highly overlapping similarity distributions." The optimal threshold varies from 0.71 to 1.0 across different cached entries. No single number works. Their fix: learn a separate threshold per cache entry, which cut error rates by 6x while doubling hit rates. (vCache, 2025)

It gets worse when you mix query types. A category-aware caching study showed that a 0.80 threshold produces 15% false matches on code queries (sort_ascending vs sort_descending), while the same threshold misses valid paraphrases in conversational queries. One threshold, two failure modes. (Category-Aware Semantic Caching, 2025)

Banks hit this too. An InfoQ case study documented a RAG system where "Can I skip my loan payment this month" matched "What happens if I miss a loan payment" at 88.7% similarity. Different intent, same cached answer. They started at a 99% false positive rate and needed four rounds of optimization to get down to 3.8%. (InfoQ Banking Case Study, 2025)

The deeper issue: embeddings measure whether two prompts are semantically similar, not whether the same response can answer both. That gap is where false cache hits live. (Efficient Prompt Caching via Embedding Similarity, 2024)

Every paper we found agrees on one thing: embedding similarity alone isn't enough. You need a verification layer.

The Two-Layer Fix

We built two defenses. The first strips template noise before embedding. The second verifies hits after matching.

Layer 2: Content Extraction for Embeddings

Before generating an embedding, we now detect structured input (JSON) and extract only the meaningful, variable content.

The logic:

  1. Check if the message content starts with { or [
  2. If it parses as JSON, recursively collect all string leaf values
  3. Filter out short values (20 characters or fewer) since they're typically config fields like "zh", "formal", or "Product Page"
  4. If the extracted text is too short or empty, fall back to the original text
function extractContentForEmbedding(text: string): string {
  const extracted = tryExtractJsonContent(text);
  return extracted && extracted.length > 20 ? extracted : text;
}
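The helper isn't shown above; here's a minimal sketch of what `tryExtractJsonContent` could look like, following the four steps listed. The recursive leaf traversal and the 20-character cutoff are the essential parts; the actual implementation may differ in details.

```typescript
// Sketch: parse structured input and keep only long string leaf values.
// Returns null when the input isn't JSON or yields no meaningful content,
// so the caller can fall back to the original text.
function tryExtractJsonContent(text: string): string | null {
  const trimmed = text.trim();
  // Step 1: quick structural check before paying for a parse.
  if (!trimmed.startsWith("{") && !trimmed.startsWith("[")) return null;

  let parsed: unknown;
  try {
    parsed = JSON.parse(trimmed);
  } catch {
    return null; // looked like JSON but wasn't
  }

  // Step 2: recursively collect string leaf values.
  const leaves: string[] = [];
  const collect = (node: unknown): void => {
    if (typeof node === "string") {
      // Step 3: drop short values -- config fields like "zh" or "formal".
      if (node.length > 20) leaves.push(node);
    } else if (Array.isArray(node)) {
      node.forEach(collect);
    } else if (node !== null && typeof node === "object") {
      Object.values(node as Record<string, unknown>).forEach(collect);
    }
  };
  collect(parsed);

  // Step 4: signal "nothing useful" so the caller falls back.
  return leaves.length > 0 ? leaves.join(" ") : null;
}
```

Running it on the translation request shown earlier would keep only the `segments[].text` payload and drop `"zh"`, `"formal"`, and `"Product Page"`.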

This applies to both the system prompt and user message. For the translation plugin, the embedding now represents "Hello world" instead of a 2KB JSON blob. For the summarization assistant, it pulls the actual conversation out of the template wrapper.

The 20-character threshold was chosen empirically:

  • "zh" (2 chars): filtered. Config value.
  • "formal" (6 chars): filtered. Config value.
  • "Product Page" (12 chars): filtered. Template field.
  • "Translate product descriptions" (30 chars): kept. Meaningful content.
  • "The quarterly financial report..." (40+ chars): kept. Actual translation content.

Layer 3: Fingerprint Verification

After a semantic cache hit, we compare a hash of the current request's extracted text against the hash stored in the cached entry. If they don't match, the hit is rejected.

// On cache write
entry.metadata.textHash = fnv1aHash(extractedText);

// On cache read, after finding a similarity match
if (entry.metadata.textHash !== undefined) {
  if (entry.metadata.textHash !== fnv1aHash(currentExtractedText)) {
    // False positive: semantically similar but different content
    metrics.recordFingerprintRejection();
    return null;
  }
}

The hash uses the extracted text (post-Layer 2), not the raw input. Two requests with different template wrappers but identical actual content still match. Different content, different hash, rejected.

Old cache entries without textHash skip verification (backward compatible). They expire naturally via TTL.

We use FNV-1a (32-bit) for the hash: fast, deterministic, and for verifying a single cache hit, a roughly 1-in-4-billion random collision probability is more than acceptable.
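For reference, 32-bit FNV-1a is only a few lines. A sketch (hashing UTF-16 code units, which only matters insofar as write and read paths must agree; the production version may hash bytes instead):

```typescript
// 32-bit FNV-1a: XOR each input unit into the hash, then multiply by
// the FNV prime (16777619), expressed as shift-adds to stay exact
// within JavaScript's 32-bit integer operations.
function fnv1aHash(text: string): number {
  let hash = 0x811c9dc5; // FNV-1a 32-bit offset basis
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = (hash + ((hash << 1) + (hash << 4) + (hash << 7) + (hash << 8) + (hash << 24))) >>> 0;
  }
  return hash >>> 0;
}
```

Any stable, cheap hash works here; FNV-1a is a common pick because it's branch-free and needs no lookup tables.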

Why Not Just Raise the Threshold?

Our threshold is already 0.95. Raising it doesn't help. The problem is that structurally similar inputs produce similarity scores above 0.95 no matter what the actual content says.

vCache's data backs this up: the similarity distributions of correct and incorrect hits overlap so much that no single cutoff separates them. Push the threshold to 0.99 and you'll kill legitimate cache hits for paraphrases without eliminating the false positives from template-heavy requests.

Fix the input, verify the output. Don't fiddle with the threshold.

Results

With both layers deployed:

| Metric | Before | After |
|---|---|---|
| gpt-4.1-nano false positives | 198/199 | 0 |
| False positive share of all cache hits | ~95% | <5% |
| Legitimate cache hit rate | Unchanged | Unchanged |
| Added latency per request | 0 | <1ms (JSON parse + FNV hash) |

Layer 2 alone would have fixed the translation plugin. Layer 3 is the safety net for cases where JSON extraction doesn't fully separate the content, or for structured inputs that aren't JSON.

Takeaways

If you're running a semantic cache in production:

  1. Monitor response diversity. If a model shows a near-100% cache hit rate with a single unique response, you have a false positive problem. In ClickHouse:

     SELECT model,
            uniqExact(substring(response_body, 1, 200)) AS unique_responses,
            count() AS total
     FROM request_logs
     WHERE cache_hit = true
     GROUP BY model

  2. Structured input kills naive embedding. Any request with a fixed template (JSON APIs, system prompt wrappers, form-filling tasks) will produce artificially high similarity scores. Preprocess before embedding.

  3. A verification layer is not optional. Every production semantic cache in the research literature has one. The question is whether you use a lightweight hash check, a cross-encoder reranker, or a full LLM verification call. Pick based on your latency budget.

  4. Global thresholds are a compromise, not a solution. Different query types need different thresholds. If you can't do per-category or per-entry thresholds, at least add input preprocessing to normalize the embedding quality across categories.

Semantic caching can cut 30-70% of LLM API costs. But without input preprocessing and hit verification, you're serving stale answers and calling it a performance win.


LemonData provides unified access to 300+ AI models with built-in caching, routing, and cost optimization. Try it free with $1 credit.
