Polaris ML/AI Training

RAG & Retrieval

Retrieval-Augmented Generation, vector databases, embeddings, and chunking strategies for grounding LLMs in external knowledge.

27 concepts · 7 questions · 4 projects

Overview

Retrieval-Augmented Generation (RAG) is a technique that enhances large language models by grounding their responses in external knowledge sources. Instead of relying solely on what the model learned during training, RAG retrieves relevant documents at inference time and includes them as context.

The RAG pipeline typically involves three stages: indexing (chunking documents and creating vector embeddings), retrieval (finding the most relevant chunks for a given query using similarity search), and generation (feeding the retrieved context to an LLM to produce a grounded answer).
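The sketch below illustrates those three stages end to end, assuming the sentence-transformers library for embeddings; the model name, the fixed-size chunker, and the call_llm helper are illustrative stand-ins rather than part of any particular framework.

```python
# Minimal three-stage RAG sketch: index -> retrieve -> generate.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model

# --- Indexing: chunk documents and embed each chunk ---
def chunk(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    """Fixed-size character chunking with overlap."""
    return [text[i:i + size] for i in range(0, len(text), size - overlap)]

documents = ["RAG grounds LLM answers in retrieved external context ..."]
chunks = [c for doc in documents for c in chunk(doc)]
index = model.encode(chunks, normalize_embeddings=True)  # shape: (n_chunks, dim)

# --- Retrieval: embed the query and rank chunks by cosine similarity ---
def retrieve(query: str, k: int = 3) -> list[str]:
    q = model.encode([query], normalize_embeddings=True)[0]
    scores = index @ q                      # cosine similarity (vectors are unit-normalized)
    top = np.argsort(scores)[::-1][:k]
    return [chunks[i] for i in top]

# --- Generation: feed the retrieved context to an LLM ---
def call_llm(prompt: str) -> str:
    # Hypothetical placeholder for a real LLM client call.
    return f"[LLM answer grounded in:]\n{prompt}"

def answer(query: str) -> str:
    context = "\n\n".join(retrieve(query))
    prompt = (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )
    return call_llm(prompt)
```

In practice the in-memory NumPy index would be replaced by a vector database and the placeholder call by a real LLM client, but the stage boundaries stay the same.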

Key concepts include chunking strategies, embedding models, vector databases (Pinecone, ChromaDB, Weaviate), hybrid search (combining dense and sparse retrieval), re-ranking, and evaluation metrics like faithfulness and answer relevance. RAG is now a foundational pattern for building knowledge-grounded AI applications.
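Hybrid search in particular is easy to sketch: blend a dense (embedding) score with a sparse (lexical) score after normalizing both. The term-overlap scorer below is a deliberately crude stand-in for a real sparse ranker such as BM25, and the alpha weight is an assumed tuning knob.

```python
# Minimal hybrid-search sketch: blend normalized dense and sparse scores.
import numpy as np

def sparse_score(query: str, chunk: str) -> float:
    """Fraction of query terms present in the chunk (crude lexical signal)."""
    q_terms = set(query.lower().split())
    c_terms = set(chunk.lower().split())
    return len(q_terms & c_terms) / max(len(q_terms), 1)

def minmax(x: np.ndarray) -> np.ndarray:
    """Scale scores to [0, 1] so dense and sparse signals are comparable."""
    span = x.max() - x.min()
    return (x - x.min()) / span if span > 0 else np.zeros_like(x)

def hybrid_rank(query: str, chunks: list[str], dense_scores: np.ndarray,
                alpha: float = 0.5, k: int = 3) -> list[str]:
    """alpha=1.0 is pure dense retrieval, alpha=0.0 is pure sparse."""
    sparse = np.array([sparse_score(query, c) for c in chunks])
    combined = alpha * minmax(dense_scores) + (1 - alpha) * minmax(sparse)
    top = np.argsort(combined)[::-1][:k]
    return [chunks[i] for i in top]
```

A re-ranking stage would then re-score just these top-k candidates with a heavier model (for example a cross-encoder) before they are passed to the generator.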

ML Concepts

Deep-Dive Concepts (from Projects)