Retrieval Augmented Generation (RAG): Building AI Systems That Know Your Data
“The most valuable enterprise AI systems are not trained on the internet. They are built on your organization’s knowledge.”
Introduction
In the previous articles we learned:
Part 1
- LLMs
- Tokens
- Context windows
- Transformers
- Hallucinations
Part 2
- Embeddings
- Vector databases
- Similarity search
- Chunking
- Semantic retrieval
Now we combine everything.
This is the technology that transformed AI from:
“Chat with the internet”
into
“Chat with your company’s knowledge.”
Welcome to:
Retrieval Augmented Generation (RAG)
The Fundamental Problem
Suppose you ask ChatGPT:
What is Rahul’s architecture for the iEvent platform?
The model doesn’t know.
Why?
Because:
- Your documents are private.
- Your architecture diagrams are internal.
- Your APIs are unpublished.
- Your knowledge isn’t in the training data.
Traditional LLM:
Question
↓
LLM
↓
Answer
Result:
- Hallucination
- Wrong answers
- Generic responses
What RAG Solves
RAG adds knowledge retrieval.
Question
↓
Retriever
↓
Relevant Documents
↓
LLM
↓
Answer
The model answers using:
- Company documents
- PDFs
- Wikis
- Architecture documents
- Knowledge bases
- APIs
Why Enterprises Love RAG
RAG provides:
✅ Current information
✅ Private knowledge
✅ Fewer hallucinations
✅ Citations
✅ No expensive model training
✅ Better accuracy
The Complete RAG Architecture
Documents
↓
Text Extraction
↓
Chunking
↓
Embeddings
↓
Vector Database
---------------------
User Question
↓
Question Embedding
↓
Similarity Search
↓
Top Chunks
↓
LLM Prompt
↓
Response
Example: Company HR Policy
Suppose the document contains:
Employees receive 24 annual leave days.
User asks:
How many vacation days do we get?
The steps:
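- The question is converted to a vector.
- Similar chunks are retrieved from the vector database.
- The retrieved chunks are added to the prompt as context.
- The LLM answers from that context.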
Response:
Employees receive 24 annual leave days.
The model did not “know” this.
It retrieved it.
Step 1: Document Ingestion
Documents may come from:
- PDFs
- Word documents
- Confluence
- SharePoint
- Databases
- Wikis
- APIs
- Emails
Example:
Architecture.pdf
Deployment.docx
API-Specification.pdf
Step 2: Text Extraction
Embedding models work on plain text, so PDFs and other binary formats must be converted first.
We extract the text.
Example:
Spring Boot services communicate through Kafka.
Step 3: Chunking
Large documents are divided.
Example:
Chunk 1: Architecture
Chunk 2: Deployment
Chunk 3: Monitoring
Why Chunking Matters
Poor chunk:
Pages 1–50
Good chunk:
Spring Boot deployment on AWS.
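Smaller, focused chunks:
- Improve retrieval precision.
- Reduce prompt size and cost.
Chunks that are too small lose their surrounding context, so chunk size is a trade-off.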
Chunk Size Recommendations
These are rough starting points, not fixed rules. Tune them against your own documents.
| Content | Chunk Size |
|---|---|
| APIs | 300 tokens |
| Documentation | 500 tokens |
| Books | 1000 tokens |
| Code | 200 tokens |
| Policies | 400 tokens |
Overlapping Chunks
Without overlap:
Chunk 1:
Spring Boot deployment
Chunk 2:
using Kubernetes.
The meaning is lost.
With overlap:
Chunk 1:
Spring Boot deployment using
Chunk 2:
deployment using Kubernetes
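The overlap idea above can be sketched as a sliding window. This is a minimal illustration, not a production chunker: the class name `OverlapChunker` is hypothetical, and sizes are counted in whitespace-separated words rather than real model tokens.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Sliding-window chunker. Sizes are counted in whitespace-separated words,
// a simplification; a real pipeline would count model tokens.
class OverlapChunker {

    static List<String> chunk(String text, int chunkSize, int overlap) {
        if (chunkSize <= 0 || overlap < 0 || overlap >= chunkSize) {
            throw new IllegalArgumentException("require 0 <= overlap < chunkSize");
        }
        List<String> chunks = new ArrayList<>();
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return chunks;
        }
        String[] words = trimmed.split("\\s+");
        int step = chunkSize - overlap; // how far the window advances each time
        for (int start = 0; start < words.length; start += step) {
            int end = Math.min(start + chunkSize, words.length);
            chunks.add(String.join(" ", Arrays.copyOfRange(words, start, end)));
            if (end == words.length) {
                break; // the window already covers the end of the text
            }
        }
        return chunks;
    }

    public static void main(String[] args) {
        // Windows of 4 words that share 2 words with their neighbour.
        chunk("Spring Boot deployment using Kubernetes", 4, 2)
                .forEach(System.out::println);
    }
}
```

With a window of 4 words and an overlap of 2, the sentence splits into the same two chunks as the overlap example: "Spring Boot deployment using" and "deployment using Kubernetes".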
Step 4: Embedding Generation
Each chunk becomes a vector.
Chunk
↓
Embedding Model
↓
Vector
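To make "chunk becomes a vector" concrete, here is a toy stand-in for an embedding model. It is an assumption for illustration only: `ToyEmbedder` hashes words into 64 buckets and normalizes the result, so it captures word overlap, not meaning. A real system would call an actual embedding model instead.

```java
import java.util.Locale;

// Toy "embedding": a hashed bag-of-words vector, scaled to unit length.
// NOT a real embedding model; it only shows the text -> fixed-size vector shape.
class ToyEmbedder {

    static final int DIMENSIONS = 64; // arbitrary size chosen for the sketch

    static double[] embed(String text) {
        double[] vector = new double[DIMENSIONS];
        for (String word : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (word.isEmpty()) {
                continue;
            }
            // Each word increments one bucket chosen by its hash code.
            vector[Math.floorMod(word.hashCode(), DIMENSIONS)] += 1.0;
        }
        double sumOfSquares = 0.0;
        for (double v : vector) {
            sumOfSquares += v * v;
        }
        double norm = Math.sqrt(sumOfSquares);
        if (norm > 0.0) {
            for (int i = 0; i < vector.length; i++) {
                vector[i] /= norm; // unit length simplifies similarity comparison
            }
        }
        return vector;
    }

    public static void main(String[] args) {
        double[] v = embed("Spring Boot services communicate through Kafka.");
        System.out.println("dimensions = " + v.length);
    }
}
```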
Step 5: Storage
Store:
- Vector
- Text
- Metadata
Example:
| Text | Team | Year |
|---|---|---|
| API Guide | Platform | 2026 |
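A stored record might look like the sketch below, which keeps the vector, the text, and the metadata together. The names `InMemoryChunkStore` and `StoredChunk` are hypothetical, the vector values are made up, and an in-memory list stands in for a real vector database.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// In-memory stand-in for a vector database table (illustrative only).
class InMemoryChunkStore {

    // One stored row: the chunk text, its embedding, and its metadata.
    record StoredChunk(String id, String text, double[] vector, Map<String, String> metadata) {}

    private final List<StoredChunk> chunks = new ArrayList<>();

    void add(String id, String text, double[] vector, Map<String, String> metadata) {
        // Defensive copies so later changes by the caller do not alter stored data.
        chunks.add(new StoredChunk(id, text, vector.clone(), Map.copyOf(metadata)));
    }

    List<StoredChunk> all() {
        return List.copyOf(chunks);
    }

    public static void main(String[] args) {
        InMemoryChunkStore store = new InMemoryChunkStore();
        store.add("api-guide-1", "API Guide", new double[] {0.12, 0.87},
                Map.of("team", "Platform", "year", "2026"));
        System.out.println(store.all().size() + " chunk(s) stored");
    }
}
```

Keeping metadata next to the vector is what later makes filtering and citations possible.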
Step 6: User Question
User asks:
How are deadlines calculated?
Question becomes:
Embedding
Step 7: Similarity Search
The vector database returns:
- Holiday calculation logic.
- Working day rules.
- Deadline service documentation.
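Under the hood, similarity search usually means scoring every candidate vector against the question vector and keeping the best few. The sketch below uses cosine similarity with a brute-force scan. The class name `CosineSearch` and the sample vectors are assumptions, and real vector databases use approximate indexes rather than scanning everything.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

// Brute-force cosine similarity search over an in-memory index (sketch).
class CosineSearch {

    record Scored(String id, double score) {}

    static double cosine(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("vectors must have the same length");
        }
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0; // a zero vector is treated as similar to nothing
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    static List<Scored> topK(double[] query, Map<String, double[]> index, int k) {
        List<Scored> scored = new ArrayList<>();
        for (Map.Entry<String, double[]> entry : index.entrySet()) {
            scored.add(new Scored(entry.getKey(), cosine(query, entry.getValue())));
        }
        scored.sort(Comparator.comparingDouble(Scored::score).reversed());
        return new ArrayList<>(scored.subList(0, Math.min(k, scored.size())));
    }

    public static void main(String[] args) {
        // Made-up 2-dimensional vectors, just to show the ranking.
        Map<String, double[]> index = Map.of(
                "deadline-service", new double[] {0.9, 0.1},
                "holiday-rules", new double[] {0.5, 0.5},
                "unrelated", new double[] {0.0, 1.0});
        topK(new double[] {1.0, 0.0}, index, 2).forEach(System.out::println);
    }
}
```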
Step 8: Context Construction
The prompt becomes:
Answer using only the provided context.
Context:
------------
Deadline service calculates the nth
working day using holiday data.
Question:
How are deadlines calculated?
Step 9: LLM Generates Response
Output:
The deadline service calculates deadlines by finding the nth working day, using holiday data.
RAG Prompt Template
You are a technical assistant.
Answer only from the context.
If the answer is unavailable, say:
"I don't know."
Context:
{documents}
Question:
{question}
This reduces hallucinations.
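Filling the template is plain string assembly. In this sketch the `%s` placeholders stand in for `{documents}` and `{question}`; the class name `RagPromptBuilder` and the `---` separator between chunks are assumptions.

```java
import java.util.List;

// Builds the final prompt from retrieved chunks and the user's question (sketch).
class RagPromptBuilder {

    // The two %s placeholders correspond to {documents} and {question}.
    static final String TEMPLATE = """
            You are a technical assistant.
            Answer only from the context.
            If the answer is unavailable, say:
            "I don't know."
            Context:
            %s
            Question:
            %s
            """;

    static String build(List<String> chunks, String question) {
        // Separate chunks visibly so the model can tell them apart.
        String documents = String.join("\n---\n", chunks);
        return TEMPLATE.formatted(documents, question);
    }

    public static void main(String[] args) {
        System.out.println(build(
                List.of("Deadline service calculates the nth working day using holiday data."),
                "How are deadlines calculated?"));
    }
}
```

Passing the values as format arguments, rather than chaining string replacements, means a chunk that happens to contain placeholder-like text is inserted verbatim.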
Naive RAG
Simple approach:
Question
↓
Vector Search
↓
Top 5 Chunks
↓
LLM
Works surprisingly well.
Advanced RAG
Enterprise systems often add:
- Metadata filters
- Hybrid search
- Re-ranking
- Context compression
- Citation generation
Metadata Filtering
Example:
Search only finance documents.
department = Finance
year = 2026
Improves precision.
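One way to picture metadata filtering: narrow the candidate set by exact metadata match before any vector comparison happens. The sketch below assumes string-valued metadata; `MetadataFilter` and `Chunk` are hypothetical names, and real vector stores express this as a filter in the query itself.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Pre-filters chunks by metadata before similarity search runs (sketch).
class MetadataFilter {

    record Chunk(String text, Map<String, String> metadata) {}

    // Keeps only chunks whose metadata contains every required key/value pair.
    static List<Chunk> filter(List<Chunk> chunks, Map<String, String> required) {
        return chunks.stream()
                .filter(chunk -> required.entrySet().stream()
                        .allMatch(e -> e.getValue().equals(chunk.metadata().get(e.getKey()))))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Chunk> chunks = List.of(
                new Chunk("Budget forecast", Map.of("department", "Finance", "year", "2026")),
                new Chunk("Old budget", Map.of("department", "Finance", "year", "2024")),
                new Chunk("Leave policy", Map.of("department", "HR", "year", "2026")));
        filter(chunks, Map.of("department", "Finance", "year", "2026"))
                .forEach(c -> System.out.println(c.text()));
    }
}
```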
Hybrid Search
Combines:
- Keyword search
- Vector search
Example:
AWS deployment
Keyword:
- AWS
Semantic:
- cloud deployment
Together:
Better results.
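The article does not say how the two result lists are merged. One common option is reciprocal rank fusion (RRF), sketched below: each document earns 1 / (k + rank) from every list it appears in, and the totals decide the final order. The class name, the document ids, and the choice of k = 60 (a commonly used default) are assumptions.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Reciprocal rank fusion: merges several ranked lists into one (sketch).
class ReciprocalRankFusion {

    static List<String> fuse(List<List<String>> rankings, int k) {
        Map<String, Double> scores = new LinkedHashMap<>();
        for (List<String> ranking : rankings) {
            for (int i = 0; i < ranking.size(); i++) {
                int rank = i + 1; // ranks are 1-based
                scores.merge(ranking.get(i), 1.0 / (k + rank), Double::sum);
            }
        }
        List<String> ids = new ArrayList<>(scores.keySet());
        ids.sort(Comparator.comparingDouble((String id) -> scores.get(id)).reversed());
        return ids;
    }

    public static void main(String[] args) {
        List<String> keywordHits = List.of("aws-guide", "cost-report", "k8s-notes");
        List<String> vectorHits = List.of("cloud-deploy", "aws-guide", "k8s-notes");
        System.out.println(fuse(List.of(keywordHits, vectorHits), 60));
    }
}
```

Documents found by both searches ("aws-guide", "k8s-notes") rise above documents found by only one, which is the behavior hybrid search is after.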
Re-ranking
Vector search may return:
- AWS deployment.
- Kubernetes.
- Docker.
A re-ranker scores each retrieved chunk against the question and reorders the list so the most relevant chunks come first.
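Structurally, re-ranking is a second scoring pass over the chunks that vector search already returned. The sketch below shows that shape with a pluggable scorer. The `termOverlap` scorer is a deliberately simple stand-in chosen for illustration; production re-rankers typically use a trained model to score each question/chunk pair.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.function.ToDoubleBiFunction;

// Second-pass re-ranking of retrieved chunks with a pluggable scorer (sketch).
class Reranker {

    static List<String> rerank(String question, List<String> candidates,
                               ToDoubleBiFunction<String, String> scorer) {
        List<String> sorted = new ArrayList<>(candidates);
        sorted.sort(Comparator.comparingDouble(
                (String chunk) -> scorer.applyAsDouble(question, chunk)).reversed());
        return sorted;
    }

    // Stand-in scorer: fraction of distinct question terms that appear in the chunk.
    static double termOverlap(String question, String chunk) {
        Set<String> questionTerms = terms(question);
        if (questionTerms.isEmpty()) {
            return 0.0;
        }
        Set<String> chunkTerms = terms(chunk);
        long hits = questionTerms.stream().filter(chunkTerms::contains).count();
        return (double) hits / questionTerms.size();
    }

    private static Set<String> terms(String text) {
        Set<String> result = new HashSet<>();
        for (String word : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!word.isEmpty()) {
                result.add(word);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> retrieved = List.of(
                "Docker image build",
                "Kubernetes deployment guide",
                "AWS deployment steps for Spring Boot");
        rerank("AWS deployment steps", retrieved, Reranker::termOverlap)
                .forEach(System.out::println);
    }
}
```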
Context Compression
Suppose:
Retrieved:
10,000 tokens
LLM limit:
4,000 tokens
Compression removes or condenses the least relevant content so the context fits within the limit.
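The simplest form of this is budget-based selection: walk the chunks in relevance order and keep only those that still fit. The sketch below does exactly that and nothing smarter. `ContextBudget` is a hypothetical name, tokens are approximated by whitespace-separated words, and real compression may also summarize or extract sentences.

```java
import java.util.ArrayList;
import java.util.List;

// Keeps the highest-ranked chunks that fit inside a token budget (sketch).
class ContextBudget {

    // Rough approximation: one whitespace-separated word counts as one token.
    static int approxTokens(String text) {
        String trimmed = text.trim();
        return trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
    }

    // rankedChunks must already be ordered from most to least relevant.
    static List<String> fit(List<String> rankedChunks, int maxTokens) {
        List<String> kept = new ArrayList<>();
        int used = 0;
        for (String chunk : rankedChunks) {
            int cost = approxTokens(chunk);
            if (used + cost > maxTokens) {
                continue; // too big for the remaining budget; a later, smaller chunk may fit
            }
            kept.add(chunk);
            used += cost;
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> ranked = List.of(
                "Deadline service calculates the nth working day",
                "Holiday data is loaded once per year from the calendar",
                "Working days exclude weekends");
        fit(ranked, 12).forEach(System.out::println);
    }
}
```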
Citations
Modern RAG systems provide:
According to Architecture.pdf page 8…
This increases trust.
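Citations fall out of the metadata stored with each chunk. A minimal sketch, assuming every retrieved chunk carries a file name and page number (`CitationFormatter` and `Source` are hypothetical names):

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

// Turns the metadata of retrieved chunks into a citation line (sketch).
class CitationFormatter {

    record Source(String fileName, int page) {}

    static String cite(List<Source> sources) {
        if (sources.isEmpty()) {
            return "";
        }
        // Several chunks can come from the same page; cite each page once,
        // in the order the chunks were retrieved.
        Set<Source> unique = new LinkedHashSet<>(sources);
        return unique.stream()
                .map(s -> s.fileName() + ", page " + s.page())
                .collect(Collectors.joining("; ", "Sources: ", ""));
    }

    public static void main(String[] args) {
        System.out.println(cite(List.of(
                new Source("Architecture.pdf", 8),
                new Source("Architecture.pdf", 8),
                new Source("Deployment.docx", 2))));
    }
}
```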
Spring AI RAG Example
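// QuestionAnswerAdvisor runs the similarity search against the vector store
// and adds the retrieved chunks to the prompt, so a separate
// vectorStore.similaritySearch(question) call is not needed here.
// Depending on the Spring AI version, the advisor may instead be created
// with QuestionAnswerAdvisor.builder(vectorStore).build().
String answer = chatClient.prompt()
    .advisors(new QuestionAnswerAdvisor(vectorStore))
    .user(question)
    .call()
    .content();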
Very little code.
Powerful capabilities.
Architecture Example
Suppose you upload:
- HLD
- LLD
- Sequence diagrams
Ask:
Explain the request flow.
AI can answer:
- Gateway
- Mediator
- Choreo
- Atomic
based entirely on your documents.
Enterprise RAG Use Cases
Knowledge Assistant
Company policies.
Architecture Assistant
Technical documentation.
Production Support Assistant
Incident history.
API Assistant
Swagger documentation.
Compliance Assistant
Regulations and rules.
HR Assistant
Employee handbooks.
Problems in RAG
RAG is not magic.
Problems:
- Poor chunking.
- Bad embeddings.
- Wrong documents.
- Missing context.
- Duplicate chunks.
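Of the problems above, duplicate chunks are the easiest to address at ingestion time. The sketch below drops chunks whose text is identical after normalizing case and whitespace. `ChunkDeduplicator` is a hypothetical name, and near-duplicates (reworded copies) would need a similarity check instead.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Removes chunks that are duplicates after case/whitespace normalization (sketch).
class ChunkDeduplicator {

    static List<String> dedupe(List<String> chunks) {
        Set<String> seen = new HashSet<>();
        List<String> unique = new ArrayList<>();
        for (String chunk : chunks) {
            String key = chunk.trim().toLowerCase(Locale.ROOT).replaceAll("\\s+", " ");
            if (seen.add(key)) { // add() returns false if the key was already present
                unique.add(chunk); // keep the first occurrence, unmodified
            }
        }
        return unique;
    }

    public static void main(String[] args) {
        dedupe(List.of("Spring Boot  deployment", "spring boot deployment ", "Kafka topics"))
                .forEach(System.out::println);
    }
}
```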
Garbage In, Garbage Out
Bad document:
Scanned image with OCR errors.
Bad results.
Good documents matter.
Hallucination Reduction
RAG reduces hallucinations because:
- Answers come from documents.
- Knowledge is grounded.
- Context is explicit.
Why RAG Beats Fine-Tuning
| Fine Tuning | RAG |
|---|---|
| Expensive | Cheaper |
| Slow updates | Fast updates (re-index the documents) |
| Retraining required | Just add documents |
| Difficult | Simpler |
Many enterprises therefore choose:
RAG first.
Real Architecture
S3
↓
Document Loader
↓
Chunking Service
↓
Embedding Service
↓
PGVector
↓
Spring Boot API
↓
LLM
AWS Architecture
S3
↓
Lambda
↓
Titan Embeddings
↓
OpenSearch
↓
Spring AI Service
↓
Bedrock
Interview Questions
What is RAG?
Retrieval Augmented Generation combines document retrieval and LLM generation.
Why use RAG?
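To give the LLM external, private, or up-to-date knowledge without retraining it.
What reduces hallucinations?
Grounding the answer in retrieved context.
Why chunk documents?
LLMs and embedding models have input limits, and focused chunks are retrieved more precisely.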
Why embeddings?
To perform semantic search.
Hands-On Exercise
Build:
Employee Policy Bot
Documents:
- Leave policy
- Travel policy
- WFH policy
Features:
- Upload PDF.
- Generate embeddings.
- Store vectors.
- Ask questions.
Project Ideas
Architecture Copilot
Upload:
- HLD
- LLD
- API specs
Ask:
Explain this design.
Meeting Assistant
Upload transcripts.
Ask:
What decisions were made?
Production Support Bot
Upload:
- RCA documents.
- Incident reports.
Ask:
Similar incidents?
Key Takeaways
✔ RAG gives AI external knowledge.
✔ Embeddings enable retrieval.
✔ Chunking affects accuracy.
✔ Vector databases enable semantic search.
✔ Context reduces hallucinations.
✔ Many enterprise AI assistants are built on RAG.
Coming Next
Part 4 — Building Your First RAG Application Using Spring AI
We will build:
- Spring Boot application.
- PGVector integration.
- OpenAI embeddings.
- Document ingestion.
- Similarity search.
- Question answering API.
For the first time in this series, we will move from concepts to code and build a complete AI application using technologies familiar to Java developers.
“Databases store facts. RAG turns those facts into conversations.”