    RAG vs. Context Stuffing: Why selective retrieval is more efficient and reliable than dumping all data into the prompt

    By Naveed Ahmad | 24/02/2026 | 15 Mins Read


    Large context windows have dramatically increased how much information modern language models can process in a single prompt. With models capable of handling hundreds of thousands, or even millions, of tokens, it's easy to assume that Retrieval-Augmented Generation (RAG) is no longer necessary. If you can fit an entire codebase or documentation library into the context window, why build a retrieval pipeline at all?

    The key distinction is that a context window defines how much the model can see, while RAG determines what the model should see. A large window increases capacity, but it doesn't improve relevance. RAG filters and selects the most important information before it reaches the model, improving signal-to-noise ratio, efficiency, and reliability. The two approaches solve different problems and aren't substitutes for one another.

    In this article, we compare both strategies directly. Using the OpenAI API, we evaluate Retrieval-Augmented Generation against brute-force context stuffing on the same documentation corpus. We measure token usage, latency, and cost, and show how burying critical information inside large prompts can affect model performance. The results highlight why large context windows complement RAG rather than replace it.

    Installing the dependencies

    import os
    import time
    import textwrap
    import numpy as np
    import tiktoken
    from openai import OpenAI
    
    from getpass import getpass
    os.environ["OPENAI_API_KEY"] = getpass('Enter OpenAI API Key: ')
    client = OpenAI()

    We use text-embedding-3-small as the embedding model to convert documents and queries into vector representations for efficient semantic retrieval. For generation and reasoning, we use gpt-4o, with token accounting handled via its corresponding tiktoken encoding to accurately measure context size and cost.

    EMBED_MODEL = "text-embedding-3-small"
    CHAT_MODEL  = "gpt-4o"
    ENC         = tiktoken.encoding_for_model("gpt-4o")

    Creating the document corpus

    This corpus serves as the retrieval source for our benchmark. In the RAG setup, embeddings are generated for each document and relevant chunks are retrieved based on semantic similarity. In the context-stuffing setup, the entire corpus is injected into the prompt. Because the documents contain specific numeric clauses (e.g., time limits, rate caps, refund windows), they are well-suited for testing retrieval accuracy, signal density, and the "Lost in the Middle" effect under large-context conditions.

    The corpus consists of 10 structured policy documents totaling roughly 650 tokens, with each document ranging between 54 and 83 tokens. This size keeps the dataset manageable while still reflecting the variety and density of a realistic enterprise documentation set.

    Although relatively small, the corpus includes tightly packed numerical clauses, conditional rules, and compliance statements, making it suitable for evaluating retrieval precision, reasoning accuracy, and token efficiency. It provides a controlled environment to compare RAG-based selective retrieval against full context stuffing without introducing external noise.

    def count_tokens(text: str) -> int:
        return len(ENC.encode(text))
    DOCS = [
        {
            "id": 1, "title": "Refund Policy",
            "content": (
                "Customers may request a full refund within 30 days of purchase. "
                "Refunds are processed within 5-7 business days to the original payment method. "
                "Digital products are non-refundable once the download link has been accessed. "
                "Subscription cancellations stop future charges but do not trigger automatic refunds "
                "for the current billing cycle unless the cancellation is made within 48 hours of renewal."
            )
        },
        {
            "id": 2, "title": "Shipping Information",
            "content": (
                "Standard shipping takes 5-7 business days. Express shipping delivers in 2-3 business days. "
                "Orders over $50 qualify for free standard shipping within the continental US. "
                "International shipping is available to 30 countries and takes 10-21 business days. "
                "Tracking numbers are emailed within 24 hours of dispatch."
            )
        },
        {
            "id": 3, "title": "Account Security",
            "content": (
                "Two-factor authentication (2FA) can be enabled from the Security tab in account settings. "
                "Passwords must be at least 12 characters and include one uppercase letter, one number, "
                "and one special character. Active sessions expire after 30 days of inactivity. "
                "Suspicious login attempts trigger an automatic account lock and a reset email."
            )
        },
        {
            "id": 4, "title": "API Rate Limits",
            "content": (
                "Free tier: 100 requests per day, max 10 requests per minute. "
                "Pro tier: 10 000 requests per day, max 200 requests per minute. "
                "Enterprise tier: unlimited requests, burst up to 1 000 per minute. "
                "All responses include X-RateLimit-Remaining and X-RateLimit-Reset headers. "
                "Exceeding limits returns HTTP 429 with a Retry-After header."
            )
        },
        {
            "id": 5, "title": "Data Privacy & GDPR",
            "content": (
                "All user data is encrypted at rest using AES-256 and in transit using TLS 1.3. "
                "We never sell or rent personal data to third parties. "
                "The platform is fully GDPR and CCPA compliant. "
                "Data deletion requests are processed within 72 hours. "
                "Users can export all their data in JSON or CSV format from the Privacy section."
            )
        },
        {
            "id": 6, "title": "Billing & Subscription Cycles",
            "content": (
                "Subscriptions renew automatically on the same calendar day each month. "
                "Annual plans offer a 20 % discount compared to monthly billing. "
                "Invoices are sent via email 3 days before each renewal. "
                "Failed payments retry three times over 7 days before the account is downgraded."
            )
        },
        {
            "id": 7, "title": "Supported File Formats",
            "content": (
                "Supported upload formats: PDF, DOCX, XLSX, PPTX, PNG, JPG, WebP, MP4, MOV. "
                "Maximum individual file size is 100 MB. "
                "Batch uploads support up to 50 files simultaneously. "
                "Files are virus-scanned on upload and quarantined if threats are detected."
            )
        },
        {
            "id": 8, "title": "Compliance Certifications",
            "content": (
                "The platform holds SOC 2 Type II certification, renewed annually. "
                "ISO 27001 compliance is maintained with quarterly internal audits. "
                "A HIPAA Business Associate Agreement (BAA) is available for healthcare customers on the Enterprise plan. "
                "PCI-DSS Level 1 compliance covers all payment processing flows."
            )
        },
        {
            "id": 9, "title": "SLA & Uptime Guarantees",
            "content": (
                "Enterprise SLA guarantees 99.9 % monthly uptime (≤ 43 minutes downtime/month). "
                "Scheduled maintenance windows occur every Sunday between 02:00-04:00 UTC. "
                "Unplanned incidents are communicated via status.example.com within 15 minutes. "
                "SLA breaches are compensated with service credits applied to the next invoice."
            )
        },
        {
            "id": 10, "title": "Cancellation Policy",
            "content": (
                "Users can cancel at any time from the Subscription tab in account settings. "
                "Annual plan holders receive a pro-rated refund for unused months if cancelled within 30 days of renewal. "
                "Cancellation takes effect at the end of the current billing period; access continues until then. "
                "Re-activation within 90 days of cancellation restores all historical data."
            )
        },
    ]
    total_tokens = sum(count_tokens(d["content"]) for d in DOCS)
    print(f"Corpus: {len(DOCS)} documents | {total_tokens} tokens total\n")
    for d in DOCS:
        print(f"  [{d['id']:02d}] {d['title']:<35} ({count_tokens(d['content'])} tokens)")

    Building the Embedding Index

    We generate vector embeddings for all 10 documents using the text-embedding-3-small model and store them in a NumPy array. Each document is converted into a 1,536-dimensional float32 vector, producing an index with shape (10, 1536).

    The full indexing step completes in 1.82 seconds, demonstrating how lightweight semantic indexing is at this scale. This vector matrix now acts as our retrieval layer, enabling fast similarity search throughout the RAG workflow instead of scanning raw text at inference time.

    def embed_texts(texts: list[str]) -> np.ndarray:
        """Call the OpenAI Embeddings API and return an (N, 1536) float32 array."""
        response = client.embeddings.create(model=EMBED_MODEL, input=texts)
        return np.array([item.embedding for item in response.data], dtype=np.float32)
    
    print("Building index ... ", end="", flush=True)
    t0 = time.perf_counter()
    
    corpus_texts = [d["content"] for d in DOCS]
    index        = embed_texts(corpus_texts)          # shape: (10, 1536)
    
    elapsed = time.perf_counter() - t0
    print(f"done in {elapsed:.2f}s  |  index shape: {index.shape}")

    Retrieval and Prompt Helpers

    The functions below implement the full comparison pipeline between RAG and context stuffing.

    • retrieve() embeds the user query, computes cosine similarity via a dot product against the precomputed index, and returns the top-k most relevant documents with similarity scores. Because text-embedding-3-small outputs unit-norm vectors, the dot product directly represents cosine similarity, keeping retrieval both simple and efficient.
    • build_rag_prompt() constructs a focused prompt using only the retrieved chunks, ensuring high signal density and minimal irrelevant context.
    • build_stuffed_prompt() constructs a brute-force prompt by injecting the entire corpus into the context, simulating the "just use the whole window" approach.
    • call_llm() sends the prompt to gpt-4o, measures latency, and captures token usage, allowing us to directly compare cost, speed, and efficiency between the two strategies.

    Together, these helpers create a controlled environment to benchmark retrieval precision versus raw context capacity.
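    The unit-norm shortcut behind retrieve() is easy to sanity-check offline. Below is a minimal sketch with random stand-in vectors (no API call; the 1536 dimension simply mirrors text-embedding-3-small's output size):

    ```python
    import numpy as np

    # Random stand-ins for embedding vectors (illustrative only).
    rng = np.random.default_rng(0)
    a = rng.normal(size=1536)
    b = rng.normal(size=1536)

    # Normalise to unit length, as the embedding API already does for us.
    a_unit = a / np.linalg.norm(a)
    b_unit = b / np.linalg.norm(b)

    # For unit vectors, the plain dot product equals cosine similarity.
    dot    = float(a_unit @ b_unit)
    cosine = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    print(abs(dot - cosine) < 1e-12)  # True
    ```

    This is why retrieve() can skip an explicit normalisation step: the expensive part (embedding) is done once at index time, and each query costs only a single matrix-vector product.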

    def retrieve(query: str, k: int = 3) -> list[dict]:
        """
        Embed the query, compute cosine similarity against the index,
        and return the top-k document dicts with their scores.
    
        text-embedding-3-small returns unit-norm vectors, so the dot product
        IS cosine similarity -- no extra normalisation needed.
        """
        q_vec   = embed_texts([query])[0]             # shape: (1536,)
        scores  = index @ q_vec                       # dot product = cosine similarity
        top_idx = np.argsort(scores)[::-1][:k]        # top-k indices, highest first
        return [{"doc": DOCS[i], "score": float(scores[i])} for i in top_idx]
    
    
    def build_rag_prompt(query: str, chunks: list[dict]) -> str:
        """Build a focused prompt from only the retrieved chunks."""
        context_parts = [
            f"[Source: {c['doc']['title']}]\n{c['doc']['content']}"
            for c in chunks
        ]
        context = "\n\n---\n\n".join(context_parts)
        return (
            f"You are a helpful support assistant. "
            f"Answer the question below using the provided context. "
            f"Be specific and direct.\n\n"
            f"CONTEXT:\n{context}\n\n"
            f"QUESTION: {query}"
        )
    
    
    def build_stuffed_prompt(query: str) -> str:
        """Build a prompt that dumps the entire corpus into the context."""
        context_parts = [
            f"[Source: {d['title']}]\n{d['content']}"
            for d in DOCS
        ]
        context = "\n\n---\n\n".join(context_parts)
        return (
            f"You are a helpful support assistant. "
            f"Answer the question below using the provided context. "
            f"Be specific and direct.\n\n"
            f"CONTEXT:\n{context}\n\n"
            f"QUESTION: {query}"
        )
    
    
    def call_llm(prompt: str) -> tuple[str, float, int, int]:
        """Returns (answer, latency_ms, input_tokens, output_tokens)."""
        t0  = time.perf_counter()
        res = client.chat.completions.create(
            model       = CHAT_MODEL,
            messages    = [{"role": "user", "content": prompt}],
            temperature = 0,
        )
        latency_ms = (time.perf_counter() - t0) * 1000
        answer     = res.choices[0].message.content.strip()
        return answer, latency_ms, res.usage.prompt_tokens, res.usage.completion_tokens

    Comparing the Approaches

    This block runs a direct, side-by-side comparison between Retrieval-Augmented Generation (RAG) and brute-force context stuffing using the same user query. In the RAG approach, the system first retrieves the top three most relevant documents based on semantic similarity, builds a focused prompt using only those chunks, and then sends that condensed context to the model. It also prints similarity scores, token counts, and latency, allowing us to observe how much context is actually required to answer the question effectively.

    In contrast, the context-stuffing approach constructs a prompt that includes all 10 documents, regardless of relevance, and sends the entire corpus to the model. By measuring input tokens, output tokens, and response time for both methods under identical conditions, we isolate the architectural difference between selective retrieval and brute-force loading. This makes the trade-offs in efficiency, cost, and performance concrete rather than theoretical.

    QUERY = "How do I request a refund and how long does it take"
    DIVIDER = "─" * 65
    
    print(f"\n{'='*65}")
    print(f"  QUERY: {QUERY}")
    print(f"{'='*65}\n")
    
    # ── Approach 1: RAG ──────────────────────────────────────────────────────────
    print("[ APPROACH 1 ]  RAG  (retrieve, then reason)")
    print(DIVIDER)
    
    chunks     = retrieve(QUERY, k=3)
    rag_prompt = build_rag_prompt(QUERY, chunks)
    
    print(f"Top-{len(chunks)} retrieved chunks:")
    for c in chunks:
        preview = c["doc"]["content"][:75].replace("\n", " ")
        print(f"  • {c['doc']['title']:<40}  similarity: {c['score']:.4f}")
        print(f'    "{preview}..."')
    
    print(f"\nTotal tokens being sent to LLM: {count_tokens(rag_prompt)}\n")
    
    rag_answer, rag_latency, rag_in, rag_out = call_llm(rag_prompt)
    
    print(f"Answer:\n{textwrap.fill(rag_answer, 65)}")
    print(f"\nTokens  →  input: {rag_in:>6,}  |  output: {rag_out:>4,}  |  total: {rag_in+rag_out:>6,}")
    print(f"Latency →  {rag_latency:,.0f} ms\n")
    
    
    # ── Approach 2: Context Stuffing ─────────────────────────────────────────────
    print("[ APPROACH 2 ]  Context Stuffing  (dump everything, then reason)")
    print(DIVIDER)
    
    stuffed_prompt = build_stuffed_prompt(QUERY)
    print(f"Sending all {len(DOCS)} documents ({count_tokens(stuffed_prompt):,} tokens) to the LLM ...\n")
    
    stuff_answer, stuff_latency, stuff_in, stuff_out = call_llm(stuffed_prompt)
    
    print(f"Answer:\n{textwrap.fill(stuff_answer, 65)}")
    print(f"\nTokens  →  input: {stuff_in:>6,}  |  output: {stuff_out:>4,}  |  total: {stuff_in+stuff_out:>6,}")
    print(f"Latency →  {stuff_latency:,.0f} ms\n")

    The results show that both approaches produce a correct and nearly identical answer, but the efficiency profile is very different.

    With RAG, only the three most relevant documents were retrieved, resulting in 278 tokens sent to the model (285 actual prompt tokens). Total token usage was 347, and the response latency was 783 ms. The retrieved chunks clearly prioritized the Refund Policy, which directly contains the answer, while the remaining two documents were secondary matches based on semantic similarity.

    With context stuffing, all 10 documents were injected into the prompt, increasing the input size to 775 tokens and total usage to 834 tokens. Latency nearly doubled to 1,518 ms. Despite processing more than twice the input tokens, the model produced essentially the same answer.

    The key takeaway is not that stuffing fails (it works at small scale) but that it is inefficient. RAG achieved the same output with less than half the tokens and roughly half the latency. As corpus size grows from 10 documents to thousands, this gap compounds dramatically. What looks harmless at ~775 tokens becomes prohibitively expensive and slow at 500k+ tokens. This is the economic and architectural argument for retrieval: optimize signal before reasoning.
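    To make the scaling argument concrete, here is a rough back-of-the-envelope projection. It reuses the figures measured above (285 prompt tokens for RAG, 775 tokens for stuffing 10 documents) and the $2.50-per-1M-input-token rate assumed later in the summary; the retrieval side stays flat because top-k is fixed, while stuffing grows linearly with corpus size. The numbers are illustrative arithmetic, not benchmarks:

    ```python
    # Rough projection of per-query input tokens and cost as the corpus grows.
    COST_PER_1M    = 2.5          # USD per 1M input tokens (assumed gpt-4o rate)
    RAG_TOKENS     = 285          # measured RAG prompt size (fixed top-k)
    TOKENS_PER_DOC = 775 / 10     # average stuffed tokens per document, measured

    for n_docs in (10, 1_000, 100_000):
        stuffed_tokens = n_docs * TOKENS_PER_DOC
        stuffed_cost   = stuffed_tokens / 1_000_000 * COST_PER_1M
        rag_cost       = RAG_TOKENS / 1_000_000 * COST_PER_1M
        print(f"{n_docs:>7,} docs | stuffing: {stuffed_tokens:>12,.0f} tokens "
              f"(${stuffed_cost:.4f}/query) | RAG: {RAG_TOKENS} tokens "
              f"(${rag_cost:.6f}/query)")
    ```

    At 100,000 documents, stuffing would need roughly 7.75M input tokens per query (well beyond any current context window), while the RAG prompt stays at a few hundred tokens.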

    token_ratio   = stuff_in / rag_in
    latency_ratio = stuff_latency / rag_latency
    
    COST_PER_1M = 2.5  # USD per 1M input tokens
    rag_cost    = (rag_in   / 1_000_000) * COST_PER_1M
    stuff_cost  = (stuff_in / 1_000_000) * COST_PER_1M
    
    print(f"\n{'='*65}")
    print(f"  HEAD-TO-HEAD SUMMARY")
    print(f"{'='*65}")
    print(f"  {'Metric':<30} {'RAG':>10}   {'Stuffing':>10}")
    print(f"  {DIVIDER}")
    print(f"  {'Input tokens':<30} {rag_in:>10,}   {stuff_in:>10,}")
    print(f"  {'Output tokens':<30} {rag_out:>10,}   {stuff_out:>10,}")
    print(f"  {'Latency (ms)':<30} {rag_latency:>10,.0f}   {stuff_latency:>10,.0f}")
    print(f"  {'Cost per call (USD)':<30} ${rag_cost:>9.6f}   ${stuff_cost:>9.6f}")
    print(f"  {DIVIDER}")
    print(f"  {'Token multiplier':<30} {'1x':>10}   {token_ratio:>9.1f}x")
    print(f"  {'Latency multiplier':<30} {'1x':>10}   {latency_ratio:>9.1f}x")
    print(f"  {'Cost multiplier':<30} {'1x':>10}   {token_ratio:>9.1f}x")
    print(f"{'='*65}")

    The head-to-head comparison makes the trade-offs explicit. Context stuffing required 2.7× more input tokens, nearly 2× the latency, and 2.7× the cost per call, while producing essentially the same answer as RAG. The output token count remained comparable, meaning the extra expense came solely from unnecessary context.

    The Lost in the Middle Effect

    To demonstrate the "Lost in the Middle" effect, we create a controlled setup where a critical policy update (the needle) states that Enterprise customers with an active HIPAA BAA are entitled to a 90-day refund window instead of the standard 30 days. This clause directly answers the query but is deliberately buried inside roughly 800 tokens of irrelevant filler text designed to simulate a bloated, overstuffed prompt. By asking, "What is the refund window for Enterprise customers with a HIPAA BAA?", we can test whether the model reliably extracts the buried clause when it is surrounded by noise, illustrating how a large context alone doesn't guarantee accurate attention or retrieval.

    NEEDLE = (
        "POLICY UPDATE: Enterprise customers with an active HIPAA BAA "
        "are entitled to a 90-day full refund window, not the standard 30-day window."
    )
    
    # ~800 tokens of irrelevant padding to simulate a bloated document
    FILLER = (
        "This section covers general platform guidelines and best practices. "
        "Users should ensure they read all documentation before proceeding. "
        "Configuration options may differ depending on the selected plan tier. "
        "Please refer to the onboarding guide for step-by-step instructions. "
        "Support is available 24/7 via chat, email, and phone for Enterprise users. "
    ) * 30
    NEEDLE_QUERY = "What is the refund window for Enterprise customers with a HIPAA BAA?"
    def run_lost_in_middle():
        print(f"\n{'='*65}")
        print("  'LOST IN THE MIDDLE' EXPERIMENT")
        print(f"{'='*65}")
        print(f"Query : {NEEDLE_QUERY}")
        print(f'Needle: "{NEEDLE[:65]}..."\n')
    
        # Scenario A: Focused (simulates RAG retrieval)
        prompt_a = (
            f"You are a helpful support assistant. "
            f"Answer the question using the context below.\n\n"
            f"CONTEXT:\n{NEEDLE}\n\n"
            f"QUESTION: {NEEDLE_QUERY}"
        )
    
        # Scenario B: Buried (simulates stuffing -- needle is in the middle of noise)
        buried   = f"{FILLER}\n\n{NEEDLE}\n\n{FILLER}"
        prompt_b = (
            f"You are a helpful support assistant. "
            f"Answer the question using the context below.\n\n"
            f"CONTEXT:\n{buried}\n\n"
            f"QUESTION: {NEEDLE_QUERY}"
        )
    
        print(f"[ A ] Focused context  ({count_tokens(prompt_a):,} input tokens)")
        ans_a, _, _, _ = call_llm(prompt_a)
        print(f"Answer: {ans_a}\n")
    
        print(f"[ B ] Needle buried in filler  ({count_tokens(prompt_b):,} input tokens)")
        ans_b, _, _, _ = call_llm(prompt_b)
        print(f"Answer: {ans_b}\n")
    
        print("─" * 65)
    
    run_lost_in_middle()

    In this experiment, both setups return the correct answer (90 days), but the difference in context size is striking. The focused version required only 67 input tokens, delivering the correct response with minimal context. In contrast, the stuffed version required 3,729 input tokens, over 55× more input, to arrive at the same answer.

    At this scale, the model was still able to locate the buried clause. However, the result highlights an important principle: correctness alone is not the metric; efficiency and reliability are. As context size increases further, attention diffusion, latency, and cost compound, and retrieval precision becomes increasingly critical. The experiment shows that large context windows can still succeed, but they do so at dramatically higher computational expense and with greater risk as documents grow longer and more complex.
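    A natural extension is to vary where the needle sits, since the "Lost in the Middle" observation is that accuracy degrades most when key facts land mid-context rather than at the edges. The sketch below only builds the three context variants; the strings are illustrative stand-ins (the real experiment would reuse the NEEDLE and FILLER defined above and pass each variant through call_llm()):

    ```python
    # Stand-in needle and filler strings (illustrative only).
    NEEDLE = "POLICY UPDATE: the refund window is 90 days for HIPAA BAA customers."
    FILLER = "General platform guidelines apply to every plan tier. " * 60

    def place_needle(position: str) -> str:
        """Return a context string with the needle at the start, middle, or end."""
        if position == "start":
            return f"{NEEDLE}\n\n{FILLER}"
        if position == "middle":
            half = len(FILLER) // 2
            return f"{FILLER[:half]}\n\n{NEEDLE}\n\n{FILLER[half:]}"
        return f"{FILLER}\n\n{NEEDLE}"  # "end"

    for pos in ("start", "middle", "end"):
        ctx = place_needle(pos)
        # The needle is always present; only its position in the context changes.
        print(f"{pos:>6}: {len(ctx):,} chars, needle starts at index {ctx.find(NEEDLE):,}")
    ```

    Scoring each position separately (did the answer contain "90 days"?) would turn the single A/B comparison above into a positional accuracy curve.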


    I'm a Civil Engineering graduate (2022) from Jamia Millia Islamia, New Delhi, and I have a keen interest in Data Science, especially Neural Networks and their application in various areas.


