Standard RAG has a ceiling. You embed documents, build an index, retrieve top-k results, and stuff them into a prompt. It works until it doesn't. When your queries require multi-hop reasoning, cross-document synthesis, or retrieval refinement based on partial answers, basic RAG fails silently. It returns confident responses built on incomplete context.
Agentic RAG solves this by giving an agent control over the retrieval process. The agent decides what to search for, evaluates whether results are sufficient, and iterates until it has enough information to answer. This flexibility comes with costs: more compute, more latency, and new failure modes.
This guide covers when agentic RAG makes sense, how to architect it, and how to prevent the runaway retrieval loops that burn through your budget.
What Agentic RAG Actually Is
Basic RAG follows a fixed pipeline:
Query → Embed → Retrieve → Rank → Generate
Every query goes through the same steps. The system retrieves the same number of documents regardless of query complexity. A simple factual question and a complex analytical question receive identical treatment.
Agentic RAG adds a decision layer:
Query → Agent → [Search, Evaluate, Refine, Search Again...] → Generate
The agent controls the retrieval loop. It decides:
- What queries to run against the vector store
- Whether retrieved documents are relevant
- Whether it has enough information to answer
- What follow-up queries might fill gaps
This transforms retrieval from a single operation into a directed search process. The agent can decompose complex questions, chase references across documents, and recognize when it lacks sufficient information.
The core components:
- Retrieval tools — Vector search, keyword search, structured queries, or combinations
- Evaluation logic — Mechanisms to assess relevance and completeness
- Query reformulation — Ability to generate new search queries based on partial results
- Termination criteria — Rules that stop the search loop
Strip away the complexity, and agentic RAG is RAG with a feedback loop. The agent observes retrieval results and adjusts its strategy.
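To make those components concrete, here is a minimal sketch of how they might be wired together as one interface. Every name here (`AgenticRetriever`, `Evaluation`, and so on) is illustrative rather than taken from any particular framework:

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Evaluation:
    relevant: bool                                    # evaluation logic: on-topic?
    gaps: list[str] = field(default_factory=list)     # what is still missing

@dataclass
class AgenticRetriever:
    tools: dict[str, Callable[[str], list[str]]]      # retrieval tools, by name
    evaluate: Callable[[list[str], str], Evaluation]  # relevance/completeness check
    reformulate: Callable[[list[str]], str]           # new query from identified gaps
    max_iterations: int = 5                           # hard termination criterion
```

The three patterns below are different loop policies layered on top of exactly this kind of interface.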
When Traditional RAG Breaks Down
Basic RAG works for direct questions with answers contained in single documents. It fails in predictable ways.
Multi-hop Questions
"What was the revenue impact of the pricing change we made after the competitor launched their enterprise tier?"
This question requires finding three pieces of information: the competitor launch, your pricing response, and the revenue data from that period. Basic RAG retrieves documents similar to the full query, which rarely surfaces all three. You get partial context and incomplete answers.
Comparison Queries
"How does our Q3 approach to customer retention differ from Q1?"
Two separate time periods need retrieval. Basic RAG blends them into one query, returning documents that mention retention without temporal specificity. The LLM hallucinates differences or admits it doesn't know.
Evolving Context
"Based on what you found about the infrastructure outage, what were the follow-up actions?"
The answer depends on what was retrieved for the first part. Basic RAG can't adapt its second retrieval based on the first. You need sequential, dependent retrieval.
Sparse Information Distribution
Some answers span many documents, each containing a small piece. Technical specifications spread across wikis, design docs, and meeting notes. Basic RAG's top-k retrieval misses the long tail of relevant documents.
Ambiguous Queries
"Tell me about the project status."
Which project? Basic RAG retrieves documents matching "project status" without disambiguation. An agent can identify the ambiguity, search for active projects, and either clarify or make a reasonable assumption.
If your questions fit neatly into "find the document that answers this," basic RAG works. If your questions require reasoning about what to retrieve, you need the agentic approach.
Agentic RAG Architecture Patterns
Three patterns cover most production use cases. Start simple and add complexity only when needed.
Pattern 1: Iterative Retrieval
The simplest agentic pattern. The agent retrieves, evaluates, and retrieves again until satisfied.
```python
sufficient_context = False
all_documents = []
while not sufficient_context:
    documents = retrieve(current_query)
    all_documents.extend(documents)  # keep earlier results across iterations
    evaluation = assess_relevance(all_documents, original_question)
    if evaluation.has_gaps:
        current_query = reformulate_query(evaluation.gaps)
    else:
        sufficient_context = True
```
When to use: Questions that might need refinement but don't require complex decomposition. Customer support, knowledge base queries, documentation lookup.
Implementation notes:
- Cap iterations (3-5 is typical)
- Track retrieved document IDs to avoid duplicates (see the sketch after this list)
- Use the evaluation step to identify specific gaps, not just "need more"
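A sketch of the deduplication note above, assuming a hypothetical `vector_search` function that returns documents carrying stable `id` fields; swap in whatever your vector store actually provides:

```python
seen_ids: set[str] = set()

def retrieve_new(query: str, k: int = 5) -> list[dict]:
    """Retrieve top-k results, skipping documents already seen this session."""
    candidates = vector_search(query, k=2 * k)  # over-fetch to offset dedup losses
    fresh = [doc for doc in candidates if doc["id"] not in seen_ids][:k]
    seen_ids.update(doc["id"] for doc in fresh)
    return fresh
```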
Pattern 2: Query Decomposition
The agent breaks complex questions into sub-questions, retrieves for each, then synthesizes.
```python
sub_questions = decompose(original_question)
contexts = []
for sub_q in sub_questions:
    docs = retrieve(sub_q)
    contexts.append(summarize(docs, sub_q))
final_answer = synthesize(contexts, original_question)
```
When to use: Multi-part questions, comparison queries, questions requiring aggregation across topics.
Implementation notes:
- Decomposition quality determines everything. Bad sub-questions cascade into bad answers.
- Summarize sub-contexts before synthesis to fit context windows
- Consider parallel retrieval for independent sub-questions
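For that last note, independent sub-questions parallelize cleanly. A sketch with `asyncio`, assuming the `retrieve` and `summarize` calls from the pattern above are ordinary blocking functions:

```python
import asyncio

async def context_for(sub_q: str) -> str:
    # Run the blocking retrieval and summarization off the event loop.
    docs = await asyncio.to_thread(retrieve, sub_q)
    return await asyncio.to_thread(summarize, docs, sub_q)

async def gather_contexts(sub_questions: list[str]) -> list[str]:
    # Independent sub-questions run concurrently instead of back-to-back.
    return await asyncio.gather(*(context_for(q) for q in sub_questions))
```

From synchronous code, call it with `asyncio.run(gather_contexts(sub_questions))`.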
Pattern 3: Retrieval-Augmented Reasoning
The agent interleaves retrieval with reasoning steps. Each reasoning step can trigger new retrieval based on conclusions reached.
```python
complete = False
while not complete:
    action = agent.decide_next_action(context)
    if action.type == "retrieve":
        new_docs = retrieve(action.query)
        context.add(new_docs)
    elif action.type == "reason":
        conclusion = reason(context, action.focus)
        context.add(conclusion)
    elif action.type == "answer":
        complete = True
```
When to use: Research tasks, complex analysis, questions requiring chains of inference.
Implementation notes:
- This pattern uses the most compute
- Reasoning steps should be explicit and logged (sketched after this list)
- Requires careful termination conditions to prevent infinite loops
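On the logging note, one structured record per agent action is enough to make a misbehaving loop auditable after the fact. A minimal sketch:

```python
import json
import logging

log = logging.getLogger("agentic_rag")

def record_step(step: int, action_type: str, detail: str) -> None:
    # One greppable line per action: retrieve, reason, or answer.
    log.info(json.dumps({"step": step, "action": action_type, "detail": detail}))
```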
Hybrid Approaches
Production systems often combine patterns. A common setup:
- Attempt basic RAG first
- If confidence is low, decompose the question
- Retrieve for sub-questions with iterative refinement
- Synthesize with reasoning steps
This keeps costs low for simple queries while handling complex ones appropriately.
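A sketch of that routing logic, reusing the function names from the patterns above. The `generate_with_confidence` helper and the 0.8 threshold are illustrative assumptions; how you score confidence (retrieval scores, a verifier model) is a design choice this sketch leaves open:

```python
def answer(question: str) -> str:
    # Step 1: try the cheap path first.
    docs = retrieve(question)
    draft, confidence = generate_with_confidence(docs, question)
    if confidence >= 0.8:
        return draft

    # Steps 2-4: fall back to the agentic pipeline for the hard cases.
    sub_questions = decompose(question)
    contexts = [summarize(retrieve(q), q) for q in sub_questions]
    return synthesize(contexts, question)
```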
Cost Considerations
Agentic RAG costs more than basic RAG. The question is whether the improvement justifies the expense.
Cost Multipliers
LLM inference: Each agent decision, query reformulation, and evaluation step requires inference. A basic RAG query uses one generation call. Agentic RAG might use 5-15 calls for complex questions.
Embedding: Query reformulation means more embedding operations. Usually minor compared to LLM costs, but adds up at scale.
Vector search: More queries against your index. Managed vector databases charge per query. Self-hosted systems need capacity for higher QPS.
Latency as cost: Slower responses reduce user satisfaction. Agent loops add 2-10x latency compared to basic RAG.
Typical Cost Ratios
Based on production systems we've built:
| Query Type | Basic RAG | Agentic RAG | Multiplier |
|---|---|---|---|
| Simple lookup | $0.002 | $0.002 | 1x |
| Multi-hop | $0.002 | $0.01 | 5x |
| Complex analysis | $0.002 | $0.03 | 15x |
These are estimates. Your costs depend on model selection, index size, and architecture decisions.
Optimization Strategies
Classify queries first. Route simple queries to basic RAG, complex queries to agentic pipelines. A lightweight classifier costs less than unnecessary agent iterations.
Cache aggressively. Query reformulations often repeat. Cache embeddings and retrieval results. Sub-question decompositions for similar queries can be reused.
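For embeddings specifically, an in-process LRU cache is a cheap first step, since embedding is a pure function of the query text. A sketch assuming a hypothetical `embed` function that returns a vector:

```python
from functools import lru_cache

@lru_cache(maxsize=10_000)
def cached_embed(query: str) -> tuple[float, ...]:
    # Reformulated queries repeat often; cache hits skip the embedding call.
    # Return an immutable tuple so callers can't mutate the cached value.
    return tuple(embed(query))
```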
Use smaller models for intermediate steps. Full-scale models for final generation, faster models for query reformulation and relevance assessment.
Set hard limits. Cap iterations, cap tokens, cap wall-clock time. Unbounded agents are expensive agents.
Monitor cost per query. Track the distribution. Investigate outliers. A few runaway queries can dominate your bill.
Preventing Runaway Retrieval Loops
The biggest operational risk in agentic RAG is the infinite loop. An agent that can't find sufficient information keeps searching. Without safeguards, it burns through budget and delivers nothing.
Why Loops Happen
Impossible questions. The information doesn't exist in your corpus. The agent keeps reformulating queries looking for something that isn't there.
Poor termination criteria. "Search until confident" is not a termination condition. The agent never becomes confident enough to stop.
Relevance misjudgment. The agent retrieves relevant documents but assesses them as irrelevant. It discards good context and searches for better.
Circular reformulation. Query A produces no results, reformulates to Query B, which reformulates back to Query A.
Prevention Mechanisms
Hard iteration limits. No query exceeds N retrieval cycles. Period. Start with 5. Adjust based on observed behavior.
```python
MAX_ITERATIONS = 5
all_results = []
for _ in range(MAX_ITERATIONS):
    results = retrieve(query)
    all_results.extend(results)
    if sufficient(all_results):
        break
    query = reformulate(query, all_results)
else:
    # The for/else fires only when the loop exhausts without a break:
    # degrade gracefully instead of failing outright.
    return generate_with_best_available(all_results)
```
Query history tracking. Store embeddings of all queries in a session. Before executing a new query, check similarity to previous queries. If above threshold, force termination.
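A sketch of that similarity check, assuming query embeddings are available as numpy vectors. The 0.95 cutoff is illustrative; tune it against real session logs:

```python
import numpy as np

SIMILARITY_CUTOFF = 0.95  # illustrative starting point

def is_repeat(query_vec: np.ndarray, history: list[np.ndarray]) -> bool:
    """True if this query is near-identical to one already run this session."""
    for past in history:
        cosine = float(np.dot(query_vec, past)
                       / (np.linalg.norm(query_vec) * np.linalg.norm(past)))
        if cosine >= SIMILARITY_CUTOFF:
            return True
    return False
```

Run the check before executing each reformulated query; on a hit, stop the loop instead of searching again.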
Token budgets. Set a maximum token spend per request. Track cumulative usage across all LLM calls. Terminate when budget is exhausted.
Wall-clock timeouts. Set maximum request duration. Agent loops that exceed it return best-effort answers with explicit uncertainty flags.
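Token budgets and wall-clock timeouts compose naturally into a single guard. A minimal sketch; the default limits are placeholders, not recommendations:

```python
import time

class RequestBudget:
    """Signals termination when either token spend or elapsed time runs out."""

    def __init__(self, max_tokens: int = 20_000, max_seconds: float = 30.0):
        self.max_tokens = max_tokens
        self.max_seconds = max_seconds
        self.tokens_used = 0
        self.started = time.monotonic()

    def charge(self, tokens: int) -> None:
        # Call after every LLM response with that call's total token count.
        self.tokens_used += tokens

    def exhausted(self) -> bool:
        return (self.tokens_used >= self.max_tokens
                or time.monotonic() - self.started >= self.max_seconds)
```

Check `exhausted()` at the top of each loop iteration and drop into graceful degradation the moment it returns True.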
Confidence degradation. If retrieval confidence decreases across iterations, stop. You're not finding better information; you're finding worse.
Graceful Degradation
When limits trigger, the system should still provide value:
- Return the best answer possible with retrieved context
- Explicitly flag uncertainty or incompleteness
- Indicate what information would improve the answer
- Log the failure mode for analysis
Never return nothing. A partial answer with honest uncertainty beats a timeout error.
Decision Framework
Use this framework to decide between basic RAG and agentic approaches.
Start with Basic RAG When:
- Questions typically have single-document answers
- Query patterns are predictable
- Latency requirements are strict (<2s)
- Cost sensitivity is high
- You're early in development and need to validate the use case
Upgrade to Agentic RAG When:
- Users report incomplete or missing information
- Questions frequently require synthesis across sources
- Query complexity varies significantly
- You can tolerate 5-10s latency
- The cost increase is justified by quality improvement
Evaluation Criteria
Before committing to agentic RAG, measure:
Answer quality on complex queries. Sample 50 complex questions. Rate basic RAG answers. Estimate agentic RAG improvement. Is the delta worth the investment?
Cost tolerance. Model the worst case: every query hits maximum iterations. Can you absorb that cost? What's your alert threshold?
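The worst case is simple arithmetic. The figures below are placeholders; substitute your own volume and per-call pricing:

```python
# Worst case: every query hits the iteration cap (illustrative numbers only).
queries_per_month = 500_000
max_llm_calls_per_query = 15   # agent decisions + evaluations + final generation
cost_per_call = 0.002          # dollars; depends on model and context size

worst_case = queries_per_month * max_llm_calls_per_query * cost_per_call
print(f"${worst_case:,.0f}/month")  # -> $15,000/month
```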
Latency requirements. User-facing? Real-time? Batch? Agentic RAG fits async workflows better than synchronous ones.
Corpus characteristics. How is information distributed? If answers span many documents, agentic RAG helps. If answers are concentrated, it's overkill.
Migration Path
If you decide to upgrade:
- Implement agentic RAG alongside basic RAG
- Route 10% of traffic to the new system (a routing sketch follows below)
- Compare quality metrics and costs
- Expand routing as confidence increases
- Maintain the basic path for simple queries
Don't rip out basic RAG entirely. Hybrid routing optimizes cost and performance.
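A sketch of the step-2 traffic split. Hashing a stable user ID rather than rolling per-request dice keeps each user pinned to one pipeline, which makes the quality comparison in step 3 cleaner; `agentic_answer` and `basic_answer` are hypothetical entry points for the two systems:

```python
import hashlib

AGENTIC_FRACTION = 0.10  # step 2: start at 10%, raise as metrics hold up

def route(user_id: str, question: str) -> str:
    # Stable hash keeps each user on the same pipeline across requests.
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    if bucket < AGENTIC_FRACTION * 100:
        return agentic_answer(question)
    return basic_answer(question)
```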
Key Takeaways
- Agentic RAG gives agents control over retrieval. They decide what to search, evaluate results, and iterate. Basic RAG runs a fixed pipeline regardless of query complexity.
- Traditional RAG fails on multi-hop questions, comparisons, and queries requiring synthesis across many documents. If your queries fit this pattern, agentic RAG will improve answer quality.
- Three architecture patterns cover most needs: iterative retrieval for refinement, query decomposition for complex questions, retrieval-augmented reasoning for research tasks. Start simple.
- Agentic RAG costs 5-15x more than basic RAG on complex queries. Optimize by classifying queries, caching aggressively, using smaller models for intermediate steps, and setting hard limits.
- Runaway loops are the primary operational risk. Prevent them with iteration limits, query history tracking, token budgets, and wall-clock timeouts. Always return something, even if incomplete.
- Decide based on answer quality improvement versus cost increase. Start with basic RAG. Upgrade when users report incomplete answers on complex queries and you can tolerate higher latency and cost.
Conclusion
Agentic RAG is not universally better than basic RAG. It's a tradeoff: better answers on complex queries in exchange for higher costs and new failure modes. The right choice depends on your query patterns, cost tolerance, and quality requirements.
If your users ask simple questions with single-document answers, basic RAG is correct. If they ask questions requiring reasoning about what to retrieve, agentic RAG will serve them better.
Build the infrastructure to support both. Route intelligently. Monitor costs. Prevent runaway loops. The result is a retrieval system that handles simple queries efficiently and complex queries effectively.
StencilWash builds retrieval systems for companies that need answers, not just search results. If you're evaluating RAG architectures for production use, we should talk.

