From Long Prompt to RAG: How to Build Robust AI Agents with Your Knowledge Base
Nov 14, 2025
Reading time: 5 minutes

Do we really need RAG – or is it enough to just put everything into the prompt? In this article, we look at how RAG and Contextual Retrieval actually work, why "everything" never ends up in a request anyway – and when you should switch from "just put everything into the prompt" to a structured retrieval architecture.
What exactly is RAG – and why does "everything" never end up in the prompt anyway?
Before we discuss "long prompt vs. RAG," a clarification:
Your entire knowledge base never lands in a single AI query.
Even with classic RAG, the process looks like this:
Prepare documents:
Convert PDFs, Confluence, SharePoint, code, and manuals into text, clean them up, and break them down into meaningful chunks (paragraphs, chapters, functions).
Build an index:
Vector index (embeddings) for semantic similarity.
Optionally, an additional classic full-text index (BM25) to cover keywords well, since semantic similarity can miss exact phrases (a minimal indexing sketch follows this list).
Retrieval per user query:
The user asks a question.
The system searches for the most relevant 10–50 chunks in the index.
Only these chunks land as "context" in the prompt.
Generation:
The LLM receives the system prompt, the user question, and the retrieved chunks.
An answer is generated based on this excerpt – not based on the entire dataset.
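To make the first two steps concrete, here is a minimal sketch of chunking and index building, assuming the sentence-transformers and rank-bm25 packages; the embedding model name is just a common default, and the documents are placeholders.

```python
# Minimal sketch of steps 1 and 2: chunk the cleaned text and build both a
# vector index and a BM25 full-text index over the chunks.
# Assumptions: sentence-transformers and rank-bm25 are installed; the model
# name is a common default, not a recommendation.
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer

def chunk(text: str, max_chars: int = 1500) -> list[str]:
    """Very naive chunking by paragraphs; in practice you would split along
    chapters, headings, or functions as described above."""
    chunks, current = [], ""
    for para in text.split("\n\n"):
        if current and len(current) + len(para) > max_chars:
            chunks.append(current.strip())
            current = ""
        current += para + "\n\n"
    if current.strip():
        chunks.append(current.strip())
    return chunks

documents = ["...full text of the manual...", "...exported wiki pages..."]  # placeholder content
all_chunks = [c for doc in documents for c in chunk(doc)]

# Vector index: one embedding per chunk, for semantic similarity.
embedder = SentenceTransformer("all-MiniLM-L6-v2")
chunk_vectors = embedder.encode(all_chunks, normalize_embeddings=True)

# Full-text index: BM25 over tokenized chunks, for exact keyword matches.
bm25 = BM25Okapi([c.lower().split() for c in all_chunks])
```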
Even when you use RAG, a retrieval step always decides which small parts of your knowledge base end up in the prompt. The myth of "we just load all our knowledge into the AI" is therefore never true – with current context windows it is technically not possible.
The simplest solution: just put everything up to ~200,000 tokens in the prompt
Now to the exciting part: Do I even need to build RAG – or is a very long prompt sufficient?
If your knowledge base is manageable (e.g., a manual, an internal wiki, 100–500 pages) and doesn't change constantly, the simplest idea is often the best: take your entire, cleaned knowledge base (up to about 200k tokens), put it in the prompt – done.
Of course, not as a 500-page PDF in one go, but processed neatly. Structure documents into meaningful sections (chapters, headings) and use structured representations like JSON/YAML with title, type, content. Clear system instructions additionally help the model understand how to handle these contents ("Answer based only on the following information").
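A sketch of what this "long prompt" variant can look like, assuming the knowledge base already fits comfortably under ~200k tokens; the section structure (title, type, content) mirrors the JSON/YAML representation described above, and the example content is a placeholder.

```python
# Sketch: packing a cleaned knowledge base into one structured long prompt.
# Assumption: `sections` already contains your cleaned chapters/pages.
import json

sections = [
    {"title": "Warranty", "type": "manual_chapter", "content": "Warranty claims must ..."},
    {"title": "Onboarding FAQ", "type": "wiki_page", "content": "New employees receive ..."},
    # ... the rest of the cleaned knowledge base
]

system_prompt = (
    "You are an assistant for our internal knowledge base. "
    "Answer based only on the following information. "
    "If something is not covered, say so explicitly."
)

# The entire knowledge base travels with every request as structured JSON.
long_prompt = (
    system_prompt
    + "\n\nKnowledge base:\n"
    + json.dumps(sections, ensure_ascii=False, indent=2)
)
```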
Classic RAG: search selectively instead of sending everything
The core of classic RAG is:
The AI receives a request.
You search the vector index for similar chunks.
The best hits (e.g., the top 20) go into the prompt and serve as additional context.
This way, you send the model only a small, relevant excerpt of your knowledge base – instead of everything.
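The retrieval and generation steps can look like the following sketch. It reuses `embedder`, `all_chunks`, and `chunk_vectors` from the indexing sketch above; `llm_complete` is a hypothetical helper that calls your LLM of choice.

```python
# Sketch of retrieval and generation: embed the question, pick the top-k most
# similar chunks, and put only those into the prompt.
# Reuses `embedder`, `all_chunks`, and `chunk_vectors` from the indexing
# sketch above; `llm_complete` is a hypothetical LLM call.
import numpy as np

def retrieve(question: str, top_k: int = 20) -> list[str]:
    """Return the top_k chunks most similar to the question."""
    q = embedder.encode([question], normalize_embeddings=True)[0]
    scores = chunk_vectors @ q  # cosine similarity, vectors are normalized
    best = np.argsort(scores)[::-1][:top_k]
    return [all_chunks[i] for i in best]

def answer(question: str) -> str:
    """Send system prompt, retrieved chunks, and the question to the LLM."""
    context = "\n\n".join(retrieve(question))
    prompt = (
        "Answer based only on the following information.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )
    return llm_complete(prompt)  # hypothetical LLM call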
Contextual Retrieval: fewer false hits, better answers
Anthropic proposes a relatively simple but very effective improvement to this retrieval step with Contextual Retrieval. The idea is to enrich each chunk in the classic RAG system with context from the original document it comes from, before indexing. According to Anthropic, this reduces the rate of failed retrievals for targeted questions about the knowledge base by up to 67%.
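A rough sketch of the idea: for each chunk, an LLM generates a short context from the full source document, which is prepended before embedding and indexing. The prompt wording below is paraphrased from Anthropic's published description, not copied verbatim, and `llm_complete` is again a hypothetical LLM helper.

```python
# Sketch of Contextual Retrieval: before indexing, each chunk is enriched with
# a short, LLM-generated description of where it sits in its source document.
# `llm_complete` is a hypothetical LLM helper; the prompt is paraphrased from
# Anthropic's approach.
CONTEXT_PROMPT = """<document>
{document}
</document>

Here is a chunk from this document:
<chunk>
{chunk}
</chunk>

Write a short, succinct context that situates this chunk within the overall
document, to improve search retrieval of the chunk. Answer with the context only."""

def contextualize(document: str, chunk: str) -> str:
    """Return 'context + chunk'; this enriched text is embedded and indexed
    instead of the raw chunk."""
    context = llm_complete(CONTEXT_PROMPT.format(document=document, chunk=chunk))
    return f"{context.strip()}\n\n{chunk}"
```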
How we approach this topic at Ahoi Kapptn!
Our approach also follows our process Understand → Develop → Optimize:
Understand
The beginning is understanding. Together with you, we clarify the contents you actually have – meaning the formats, quality, and size of your knowledge base. We look at which use cases are in the foreground, such as support, sales, internal onboarding, or sports data, and what requirements exist regarding security, governance, and on-prem or open-source models.
Often we start here with a compact AI workshop, where we prioritize use cases and decide whether a long prompt is sufficient or whether you will need RAG sooner or later.
Develop
In this phase, we implement what was jointly decided. For smaller knowledge bases, this means: clean data preparation, meaningful structure, a well-designed long prompt, and thoughtful prompt design – with this, you are often already productive. For larger setups, we design and implement a RAG architecture with index, retrieval, and suitable guardrails. Where appropriate, we supplement this with a contextual retrieval pipeline with BM25, embeddings, and reranker to further improve hit quality.
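As an illustration only, such a pipeline might take the following shape; the cross-encoder model name is a common default, and `bm25_search` / `vector_search` are placeholders for the two index lookups described earlier.

```python
# Sketch of a hybrid retrieval pipeline with reranking: take candidates from
# BM25 and the vector index, then let a cross-encoder rescore them.
# `bm25_search` and `vector_search` are placeholders for the index lookups
# described earlier; the model name is a common default.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def hybrid_retrieve(question: str, top_k: int = 20) -> list[str]:
    # 1. Collect a generous candidate set from both indexes.
    candidates = list(set(bm25_search(question, k=50)) | set(vector_search(question, k=50)))
    # 2. Score each (question, chunk) pair with the cross-encoder.
    scores = reranker.predict([(question, chunk) for chunk in candidates])
    # 3. Keep only the best-scoring chunks for the prompt.
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return [chunk for chunk, _ in ranked[:top_k]]
```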
Optimize
In the Optimize phase, we look at how the system performs in everyday life. We monitor which questions are actually asked and where the system fails. Based on measurable KPIs like hit quality, latency, usage rates, and possibly manual evaluations, we iterate step by step: We refine prompts, adjust retrieval parameters, and expand the use of contextual retrieval as needed.
Are you planning an AI project with your own knowledge base or want to take your existing system to the next level?
Let's talk and check whether a "simple prompt" is enough or if RAG/contextual retrieval makes sense – request a project now.