Chapter 2
Knowledge & Retrieval. How AI stores, finds, and uses information.
Memory
Short-Term vs. Long-Term Memory in AI
Memory is what separates a useful AI assistant from a stateless text generator. Without memory, every conversation starts from zero. Understanding the types of memory — and their tradeoffs — is essential for building AI systems that maintain context and learn from interactions. AI models have two fundamental types of memory that affect how they process and retain information, each with distinct characteristics, limitations, and implementation strategies.
Short-term memory is also called “context window memory.” This is the conversation history that the model can see in its current session. Everything in the prompt — system instructions, previous messages, retrieved documents — occupies short-term memory. The key limitation: when the context window fills up, older information must be dropped or summarized.
Strategies for managing short-term memory (a minimal code sketch follows the list):
- Sliding window: Drop the oldest messages as new ones arrive, keeping only the most recent exchanges visible to the model
- Summarization: Compress older conversation history into a concise summary that preserves key facts while freeing up context space for new information
- Priority-based: Keep important context (system instructions, critical user preferences) while dropping routine messages like greetings or acknowledgments
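Here is a minimal sketch of a sliding-window buffer that falls back to summarization. It is illustrative only: summarize() is a placeholder for an LLM call that would compress the older turns.
# Minimal sketch of a sliding-window short-term memory buffer (illustrative only).
# summarize() is a placeholder for an LLM call that compresses old turns.
def summarize(messages):
    return "Summary of earlier conversation: " + "; ".join(m["content"][:40] for m in messages)

class ConversationMemory:
    def __init__(self, max_messages=20, keep_recent=10):
        self.system = {"role": "system", "content": "You are a helpful assistant."}
        self.messages = []
        self.max_messages = max_messages
        self.keep_recent = keep_recent

    def add(self, role, content):
        self.messages.append({"role": role, "content": content})
        if len(self.messages) > self.max_messages:
            # Compress everything except the most recent turns into one summary message.
            old = self.messages[:-self.keep_recent]
            recent = self.messages[-self.keep_recent:]
            self.messages = [{"role": "system", "content": summarize(old)}] + recent

    def context(self):
        # This is what the model actually sees on each call.
        return [self.system] + self.messages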
Long-term memory is persistent memory that survives across sessions. It is implemented through external storage — vector databases, knowledge graphs, or structured databases. When a user returns days later, the system retrieves relevant past interactions, preferences, and context.
Examples of long-term memory in practice: remembering that a user prefers technical explanations over simplified ones, storing project-specific terminology and acronyms, tracking ongoing workflows across sessions so the assistant can pick up where it left off.
The challenge: deciding what to remember, what to forget, and how to keep stored memory accurate as information changes. Stale memories can be worse than no memory at all — an assistant that remembers outdated preferences or incorrect facts actively harms the user experience.
Modern AI products implement memory differently. ChatGPT's Memory feature stores user preferences as text snippets that are injected into future conversations. Claude's Projects provide persistent context through uploaded documents that ground every response in project-specific knowledge. Enterprise systems often combine vector databases (for semantic retrieval of past interactions) with structured databases (for user profiles and preferences). The right approach depends on your use case — a customer support bot needs to remember ticket history, while a coding assistant needs to remember project architecture and conventions.
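As a rough illustration (not any product's actual mechanism), long-term memory can start as a keyed store that survives restarts. The file name and the word-overlap scoring below are invented for the example; a real system would swap in embeddings and a vector database.
import json
import os

MEMORY_FILE = "user_memory.json"  # hypothetical storage location

def _load():
    if os.path.exists(MEMORY_FILE):
        with open(MEMORY_FILE) as f:
            return json.load(f)
    return {}

def remember(user_id, fact):
    # Persist a fact about this user so it survives across sessions.
    store = _load()
    store.setdefault(user_id, []).append(fact)
    with open(MEMORY_FILE, "w") as f:
        json.dump(store, f, indent=2)

def recall(user_id, query, top_k=3):
    # Naive relevance: rank stored facts by word overlap with the query.
    # A production system would use embeddings and a vector database instead.
    words = set(query.lower().split())
    facts = _load().get(user_id, [])
    return sorted(facts, key=lambda f: -len(words & set(f.lower().split())))[:top_k]

remember("user-42", "Prefers technical explanations with code samples.")
print(recall("user-42", "How technical should the explanation be?"))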
RAG Architecture
Combining Knowledge and Creativity
Retrieval-Augmented Generation (RAG) is a powerful architecture that combines retrieval-based and generative approaches to build accurate and reliable AI assistants. Let's explore the evolution from Native RAG to Agentic RAG.
Native RAG: The Foundation
Native RAG, the current standard in AI-powered information retrieval, employs a straightforward yet effective pipeline.
This linear process combines retrieval and generation methods to deliver contextually relevant answers, setting the stage for more advanced implementations.
How Native RAG Works
1. Retrieval
The system searches a knowledge base (e.g., company documents) for relevant information
2. Generation
The retrieved information is fed into a generative model (e.g., GPT) to produce a coherent answer
Benefits of Native RAG
- It allows the assistant to pull accurate information from the knowledge base and generate human-like responses
- Reduces hallucinations by grounding responses in factual information
- Combines the best of both worlds: the accuracy of retrieval-based systems and the fluency of generative models
- Can be updated with new information without retraining the entire model
Agentic RAG: The Game Changer
Agentic RAG is an advanced, agent-based approach to question answering over multiple documents in a coordinated manner. It introduces a network of AI agents and decision points, creating a more dynamic and adaptable system.
This advanced approach introduces intelligent routing, adaptive processing, and continuous improvement mechanisms, creating a system capable of handling complex queries and continuously improving its responses.
Key Components and Architecture
Document agents: each document is assigned a dedicated agent that can answer questions about and summarize its own document.
Meta-agent: a top-level agent manages all the document agents, orchestrating their interactions and integrating their outputs into a coherent and comprehensive response.
The Agentic Advantage
Dynamic routing: the system determines whether to answer from internal knowledge, seek external information, or rely on the language model alone, based on the query type.
Adaptive retrieval: relevance checks and query rewriting refine the retrieval process dynamically, as sketched below.
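The control flow can be sketched as a retrieve, grade, rewrite loop. Everything below is illustrative: retrieve(), is_relevant(), and rewrite_query() are stand-ins for vector search, an LLM-based relevance grader, and an LLM rewrite prompt.
# Illustrative agentic retrieval loop: retrieve, grade relevance, rewrite, retry.
KNOWLEDGE_BASE = {
    "refunds": "Refunds are issued within 14 days of purchase.",
    "shipping": "Standard shipping takes 3-5 business days.",
}

def retrieve(query):
    # Toy retrieval: return entries whose topic appears in the query.
    return [text for topic, text in KNOWLEDGE_BASE.items() if topic in query.lower()]

def is_relevant(query, chunks):
    # A real system would ask an LLM to grade whether the chunks answer the query.
    return len(chunks) > 0

def rewrite_query(query):
    # A real system would prompt an LLM to rephrase; here we just normalize wording.
    return query.lower().replace("money back", "refunds")

def agentic_answer(query, max_attempts=3):
    for _ in range(max_attempts):
        chunks = retrieve(query)
        if is_relevant(query, chunks):
            # Hand the grounded context to the generator (LLM call omitted here).
            return f"Answer grounded in: {chunks}"
        query = rewrite_query(query)  # refine the query and try again
    return "Fall back to web search or the model's own knowledge."

print(agentic_answer("Can I get my money back?"))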
Features and Benefits
Improved accuracy: multiple checkpoints and decision points significantly improve the relevance and accuracy of responses.
Comprehensive answers: the system combines internal data, web searches, and language model capabilities to answer from the best available sources.
Self-correction: the query rewrite mechanism lets the system recover from poor retrievals by reformulating the question and trying again.
RAG in Code
Here is a compact working example of a RAG pipeline using LangChain, OpenAI embeddings, and ChromaDB. This code loads a document, splits it into chunks, embeds them into a vector store, and answers questions grounded in the retrieved context. (Import paths vary across LangChain releases; the ones below follow the current split-package layout, so adjust them for older versions.)
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_community.document_loaders import TextLoader
from langchain_community.vectorstores import Chroma
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain.chains import RetrievalQA
# 1. Load and chunk documents
loader = TextLoader("company_docs.txt")
docs = loader.load()
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = splitter.split_documents(docs)
# 2. Create embeddings and store in vector database
embeddings = OpenAIEmbeddings()
vectorstore = Chroma.from_documents(chunks, embeddings)
# 3. Create retrieval chain
llm = ChatOpenAI(model="gpt-4o")
qa_chain = RetrievalQA.from_chain_type(
llm=llm,
retriever=vectorstore.as_retriever(search_kwargs={"k": 3}),
return_source_documents=True
)
# 4. Ask a question
result = qa_chain.invoke({"query": "What is our refund policy?"})
print(result["result"])
Applications
Agentic RAG is particularly useful in scenarios requiring thorough and nuanced information processing and decision-making, such as:
- Complex research tasks requiring synthesis across multiple documents
- Legal document analysis and contract review
- Medical literature review for clinical decision support
- Financial analysis and investment research
- Technical troubleshooting across multiple knowledge sources
Embeddings & Vectors
Embeddings are the bridge between human language and machine understanding. They convert text into numerical vectors that capture semantic meaning — making it possible for AI systems to find, compare, and reason about information. They're the foundation of RAG, semantic search, and recommendation systems.
How Embeddings Work
An embedding model transforms text into a list of numbers (a vector), typically ranging from 768 to 3072 dimensions. Texts with similar meaning produce similar vectors. For example, "The cat sat on the mat" and "A feline rested on the rug" would have vectors that are close together in this high-dimensional space, even though they share almost no words.
The distance between vectors represents semantic similarity. This is measured using cosine similarity — a value from -1 to 1, where 1 means identical meaning. By converting language into geometry, embedding models allow machines to perform operations on meaning itself: finding the closest match, clustering related ideas, or detecting outliers.
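In code, cosine similarity is just a normalized dot product. The vectors below are made up for illustration; in practice they would come from an embedding model (for example, the OpenAI embeddings endpoint).
import numpy as np

def cosine_similarity(a, b):
    # Normalized dot product: 1 = same direction, 0 = unrelated, -1 = opposite.
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 4-dimensional vectors; real embeddings would come from a model, e.g.
# client.embeddings.create(model="text-embedding-3-small", input=text) with the OpenAI SDK.
cat_on_mat = [0.12, 0.83, 0.44, 0.31]
feline_rug = [0.10, 0.80, 0.47, 0.35]
tax_return = [0.91, 0.02, 0.13, 0.40]

print(cosine_similarity(cat_on_mat, feline_rug))  # close to 1: similar meaning
print(cosine_similarity(cat_on_mat, tax_return))  # much lower: unrelated meaning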
Key Embedding Models
OpenAI text-embedding-3: the text-embedding-3-large variant produces 3072-dimensional vectors and delivers the best overall quality of the family in benchmarks. text-embedding-3-small (1536 dimensions) is a cost-sensitive alternative. Both support Matryoshka-style truncation: you can shorten vectors to fewer dimensions with minimal quality loss, letting you trade storage for accuracy on a sliding scale.
Cohere Embed: a multimodal model that embeds both text and images into the same vector space, enabling cross-modal search. It supports 100+ languages natively and offers Matryoshka-style dimensions (256–1536). Cohere also provides fine-tuned variants optimized for finance, healthcare, and manufacturing use cases.
Open-source models: models like BGE (BAAI), E5 (Microsoft), and GTE (Alibaba) deliver competitive quality and are free to run locally. They are an excellent choice for privacy-sensitive applications where data cannot leave your infrastructure, and they eliminate per-token embedding costs entirely.
Similarity Search Algorithms
Finding the nearest vectors in a database of millions is computationally expensive. A brute-force exact search through 10 million vectors might take seconds — far too slow for real-time applications.
Algorithms like HNSW (Hierarchical Navigable Small World) solve this by creating graph structures for approximate nearest-neighbor search. HNSW builds a multi-layered graph where each node connects to its nearby neighbors. At query time, the algorithm navigates this graph from coarse to fine layers, converging on the nearest results in milliseconds while trading only a tiny amount of accuracy for massive speed gains.
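A small sketch using the open-source hnswlib package, with random vectors standing in for real embeddings:
import numpy as np
import hnswlib

dim, num_vectors = 384, 100_000
vectors = np.random.rand(num_vectors, dim).astype(np.float32)  # stand-ins for real embeddings

# Build the HNSW graph: M controls connectivity, ef_construction controls build-time accuracy.
index = hnswlib.Index(space="cosine", dim=dim)
index.init_index(max_elements=num_vectors, M=16, ef_construction=200)
index.add_items(vectors, np.arange(num_vectors))

index.set_ef(64)  # query-time accuracy/speed trade-off
query = np.random.rand(dim).astype(np.float32)
labels, distances = index.knn_query(query, k=5)  # approximate nearest neighbors in milliseconds
print(labels, distances)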
Vector Databases
Fully managed, serverless architecture that scales to billions of vectors without ops overhead. Pinecone is the most popular choice for teams that want to focus on their application rather than infrastructure. It supports metadata filtering, namespaces for multi-tenant isolation, and hybrid search that combines vector similarity with keyword matching for higher-precision retrieval.
Combines vector search with knowledge graph capabilities, exposed through a GraphQL interface. Weaviate can store objects with properties alongside their vectors, enabling complex queries like "find similar documents written after 2024 about machine learning." It also supports generative search — retrieve and transform results in one query, letting you pipe retrieval results directly into an LLM for summarization or synthesis.
Open-source and written in Rust for performance, Qdrant posts very low tail latencies (single-digit-millisecond p99 in its published benchmarks). It emphasizes durability, using a write-ahead log so data stays consistent through crashes and concurrent writes. Qdrant supports rich filtering, geo-search, and recommendation APIs out of the box, making it a strong choice when performance and data integrity are critical.
Designed for simplicity with a Python-first approach that embeds directly in your application. Chroma is the best option for prototyping and smaller teams who need to move fast. It can run entirely in-memory for development, then switch to persistent storage for production — no configuration changes needed. The API is minimal and intuitive, getting you from zero to working semantic search in minutes.
Open-source with strong throughput (thousands of queries per second in published benchmarks), Milvus supports multiple index types (IVF, HNSW, DiskANN) and GPU acceleration for even faster search. It is well-suited for large-scale deployments with high throughput requirements, and its distributed architecture can scale horizontally across clusters for enterprise workloads.
A PostgreSQL extension that adds vector operations to your existing Postgres database. No new infrastructure needed — if you already use Postgres, just add the extension. pgvector supports IVFFlat and HNSW indexes for fast approximate search. It is the perfect choice when you want vector search without introducing a new database into your stack, and it lets you join vector results with your existing relational data in a single query.
The RAG Pipeline
Embeddings power every stage of a Retrieval-Augmented Generation pipeline. Here is the full flow from document ingestion to answer generation:
Indexing
Split documents into chunks (typically 256–1024 tokens), embed each chunk using an embedding model, and store the resulting vectors along with their metadata in a vector database. This is a one-time (or periodic) batch process.
Query
When a user asks a question, embed the question using the same embedding model that was used during indexing. This produces a query vector in the same semantic space as your document vectors.
Retrieval
Find the top-K most similar document chunks by comparing the query vector against all stored vectors using similarity search (typically cosine similarity with an HNSW index for speed).
Reranking (optional)
Use a cross-encoder model to re-score and reorder the retrieved chunks for better relevance. Cross-encoders are more accurate than bi-encoders but slower, so they are applied only to the small set of candidates returned by step 3.
Generation
Pass the retrieved chunks as context to the LLM, which generates an answer grounded in the retrieved information. The prompt typically instructs the model to only use the provided context, reducing hallucinations.
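Reranking takes only a few lines with the sentence-transformers library. The checkpoint name below is a commonly used public cross-encoder, and the candidate chunks are illustrative.
from sentence_transformers import CrossEncoder

# The cross-encoder scores query and chunk together, which is slower but more
# accurate than comparing precomputed embeddings.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "What is our refund policy?"
candidates = [
    "Refunds are processed within 14 days of the purchase date.",
    "Our office is closed on public holidays.",
    "Shipping costs are non-refundable except for damaged items.",
]

scores = reranker.predict([(query, chunk) for chunk in candidates])
reranked = [chunk for _, chunk in sorted(zip(scores, candidates), reverse=True)]
print(reranked[0])  # the most relevant chunk goes to the LLM first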
A Note on Chunking Strategy
Chunk size matters — too small and you lose context, too large and you dilute relevance. A common starting point is 512 tokens with a 50-token overlap between consecutive chunks. The overlap ensures that information spanning a chunk boundary is still captured in at least one chunk. Experiment with your specific data: technical documentation often benefits from larger chunks (1024 tokens), while FAQ-style content works well with smaller ones (256 tokens).
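A minimal fixed-size chunker with overlap looks like this (word-based for simplicity; production splitters typically count tokens and respect sentence or section boundaries). It reuses the company_docs.txt file from the earlier code example.
def chunk_text(text, chunk_size=512, overlap=50):
    # Slide a fixed-size window over the words so that content near a boundary
    # appears in two consecutive chunks.
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break
    return chunks

with open("company_docs.txt") as f:  # same file as in the RAG code example above
    chunks = chunk_text(f.read())
print(len(chunks), "chunks")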
Fine-tuning vs RAG
Fine-tuning and RAG are the two primary approaches for customizing LLM behavior beyond what prompting alone can achieve. They solve different problems, and the 2025 consensus is that most production systems should use both.
What Is Fine-Tuning?
Fine-tuning takes a pre-trained model and trains it further on your specific data. This modifies the model's weights — the knowledge becomes part of the model itself. It's like teaching someone a new skill through practice: after enough repetition, the behavior becomes second nature and no longer requires external reference.
Common use cases include adapting tone and style (e.g., matching your brand voice), teaching domain-specific terminology the base model doesn't know, improving performance on specialized tasks like medical coding or legal classification, and reducing prompt length by baking instructions into the model so you don't need to repeat them every call.
What Is RAG?
RAG (Retrieval-Augmented Generation) keeps the model unchanged and instead augments each query with relevant external information retrieved at inference time. It's like giving someone a reference book to consult before answering — the person's reasoning ability stays the same, but they now have access to specific facts they can look up.
RAG excels at knowledge-intensive tasks where accuracy matters, situations requiring up-to-date information that changes frequently, auditable responses with source citations (critical in regulated industries), and large document collections that would be impractical to bake into model weights.
Detailed Comparison
| Dimension | Fine-Tuning | RAG |
|---|---|---|
| How it works | Modifies model weights through additional training on your data. The knowledge becomes part of the model — it "learns" patterns, terminology, and behaviors from your dataset and applies them automatically at inference time. | Retrieves relevant information from external sources at query time and includes it in the prompt context. The model itself is unchanged; it simply receives better input to reason over for each specific question. |
| Cost structure | High upfront cost — GPU compute for training runs, data preparation and curation effort, and iteration cycles to get the fine-tune right. However, per-query cost is lower since there is no retrieval step and prompts can be shorter. | Lower upfront cost since no training is required. Per-query cost includes retrieval infrastructure (vector database hosting, embedding API calls) plus larger prompts due to injected context, which increases token usage. |
| Knowledge updates | Requires retraining to incorporate new information. A model fine-tuned on January data doesn't know about February events. Each update means another training run, validation cycle, and deployment. | Update source documents — no retraining needed. The model always retrieves the latest information. You can add, modify, or remove documents from the vector store at any time and the changes take effect immediately. |
| Factual accuracy | The model "hopes" it internalized the fact correctly during training. Hard to verify what the model actually learned vs. what it confabulates. Facts can degrade or blur together, especially for rare or specific data points. | Retrieves facts directly from documents. Can cite sources. If the answer exists in your documents, RAG will find it. Accuracy is bounded by retrieval quality — if the right chunk is retrieved, the answer is almost always correct. |
| Hallucination risk | Can still hallucinate, especially on edge cases not well-represented in fine-tuning data. The model may confidently generate plausible-sounding but incorrect information when it encounters gaps in its training. | Significantly reduced because answers are grounded in retrieved documents. Not eliminated — the model can still misinterpret retrieved context or fill gaps with fabricated details, but the grounding substantially limits this. |
| Transparency | Black box — difficult to trace where specific knowledge came from. You cannot point to a source document for any given claim the model makes. This is a significant barrier in regulated industries like healthcare and finance. | Full attribution — every response can cite the exact documents and passages used. You can build audit trails showing which sources informed each answer. Essential for compliance, legal review, and building user trust. |
| Latency | Fast inference — no retrieval step needed. The model generates responses directly from its weights, so latency is just the generation time. Ideal for real-time applications where every millisecond counts. | Additional latency from the retrieval step, typically 50–200ms. Optimizable with caching, efficient HNSW indexes, and pre-computed results for common queries. For most applications, the added latency is imperceptible to users. |
| Data security | Training data is baked into weights. If the model is shared or deployed externally, the knowledge goes with it. Sensitive information can potentially be extracted through adversarial prompting techniques. | Data stays in your secured environment. Access controls determine what each user can retrieve. You can implement row-level security, role-based access, and per-query permission checks without touching the model. |
| Best for | Behavioral consistency (tone, style, format), domain-specific language and terminology, specialized classification tasks, reducing prompt sizes, and offline or edge deployment scenarios. | Knowledge-intensive applications, frequently updated information, auditable and compliant responses, large document collections, and multi-tenant systems where different users need different knowledge bases. |
The Hybrid Approach: 2025 Best Practice
The 2025 consensus among production AI teams is to combine both approaches, each handling what it does best:
- Fine-tune for BEHAVIOR — how the model responds. This includes tone, style, format, domain terminology, and task-specific patterns. Fine-tuning ensures the model consistently writes in your brand voice, follows your output schema, and uses the right jargon without being told every time.
- RAG for KNOWLEDGE — what the model knows. This includes company documents, product data, policies, recent events, and any information that changes over time. RAG ensures the model always has access to the latest facts and can cite its sources.
Example: A legal AI assistant might be fine-tuned to write in formal legal language, follow specific citation formats (Bluebook vs. ALWD), and structure arguments in IRAC format. Meanwhile, RAG retrieves relevant case law, statutes, and regulatory guidance for each specific query. The fine-tuned behavior ensures professional, consistent output; the RAG layer ensures factual accuracy and up-to-date legal references.
This gives you the best of both worlds: consistent, professional behavior with accurate, up-to-date, citable information. Neither approach alone achieves this — fine-tuning without RAG produces confident but potentially outdated answers, while RAG without fine-tuning produces accurate but stylistically inconsistent responses.
When to Choose Which
Choose fine-tuning when:
- You need consistent style, tone, or format across all outputs
- The model must use domain-specific behavior and terminology natively
- You want to reduce prompt sizes (and therefore per-query costs) by baking instructions into the model
- The model will be deployed offline or at the edge without access to retrieval infrastructure
- You have a well-defined classification or extraction task with clear training examples
Choose RAG when:
- Knowledge changes frequently and you cannot retrain on every update
- You need source citations and audit trails for compliance or trust
- Data is sensitive and must remain in your secured environment with access controls
- You have large document collections (thousands to millions of pages)
- Different users or tenants need access to different knowledge bases
Use both (the hybrid approach) when:
- Building production applications that need both behavioral consistency AND factual grounding
- Your use case demands professional-grade output quality with verifiable accuracy
- You are in a regulated industry (healthcare, finance, legal) where both precision and auditability matter
- You want to minimize hallucinations while maintaining a distinctive, on-brand voice
- You are building a customer-facing product where trust and reliability are paramount
Data Quality
Data Labeling and Cleaning: The Foundation of AI
The quality of the assistant depends on the quality of the data. Without clean, well-labeled data, even the most sophisticated AI models will struggle to provide accurate and helpful responses.
Data labeling means adding meaningful tags or annotations to data to help the AI understand its structure and meaning. Proper labeling transforms raw text into structured, machine-readable information that the AI can navigate efficiently. Common labeling tasks include:
- Labeling questions and answers in a FAQ dataset so the retrieval system can match user queries to the most relevant answers, even when phrased differently
- Categorizing documents by department or topic, enabling filtered searches that return only results from the relevant domain (e.g., HR policies vs. engineering docs)
- Marking entities like names, dates, and locations in text through Named Entity Recognition (NER), which allows the system to extract structured facts from unstructured prose
- Annotating sentiment (positive, negative, neutral) in customer feedback so the AI can prioritize urgent complaints and route them to the appropriate team
Data cleaning means preparing data by removing errors, inconsistencies, and irrelevant information. Cleaning is often the most time-consuming step in a data pipeline, but it has the highest return on investment. Typical cleaning steps (a short code sketch follows the list):
- Removing duplicates to prevent bias in training — if the same document appears three times, the model over-weights its content and may parrot it even when irrelevant
- Correcting spelling and grammatical errors that could confuse the embedding model and cause semantically similar content to end up far apart in vector space
- Standardizing formats (dates, phone numbers, currencies) so that “Jan 5, 2025” and “2025-01-05” are treated as the same date rather than different strings
- Handling missing values appropriately — either filling them with sensible defaults, flagging them explicitly, or removing incomplete records depending on the use case
- Removing personally identifiable information (PII) for privacy and compliance, using techniques like regex-based detection, NER models, or dedicated PII scrubbing tools
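A rough sketch of a cleaning pass with pandas; the file name, column names, and the email pattern are invented for the example.
import pandas as pd

# Hypothetical support-ticket export with a free-text column and a date column.
df = pd.read_csv("support_tickets.csv")

# 1. Remove exact duplicates so repeated documents don't dominate retrieval.
df = df.drop_duplicates(subset=["ticket_text"])

# 2. Standardize dates so "Jan 5, 2025" and "2025-01-05" become the same value.
df["created_at"] = pd.to_datetime(df["created_at"], errors="coerce")
df = df.dropna(subset=["created_at"])  # drop rows whose date could not be parsed

# 3. Redact a simple PII pattern (emails); real pipelines add NER-based detection.
email_pattern = r"[\w.+-]+@[\w-]+\.[\w.]+"
df["ticket_text"] = df["ticket_text"].str.replace(email_pattern, "[EMAIL]", regex=True)

df.to_csv("support_tickets_clean.csv", index=False)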
Why Data Quality Matters
The principle “garbage in, garbage out” applies exponentially to AI. A model trained on biased data produces biased outputs. A RAG system searching poorly labeled documents returns irrelevant results. Data quality isn't just a preprocessing step — it's the foundation that determines the ceiling of your AI system's performance.
Real-world examples of data quality failures:
- Mislabeled training data caused a medical AI system to misdiagnose certain skin conditions at higher rates for patients with darker skin tones, because the training labels did not account for how conditions present differently across skin types
- Duplicate documents in a RAG system caused contradictory answers — when the same policy document existed in three versions (2021, 2023, and 2024), the retriever sometimes surfaced the outdated version, giving users incorrect information about current procedures
- Outdated information being presented as current: a customer support bot trained on old pricing data continued quoting discontinued plans to customers, creating billing disputes and eroding trust
The Impact of Data Quality
The quality of data directly affects the performance of the AI assistant in several ways:
- Accuracy: Clean, well-labeled data leads to more accurate responses. When the underlying data is correct and well-structured, the model can retrieve and synthesize information with high fidelity.
- Relevance: Properly categorized data helps the assistant find the most relevant information. Metadata like departments, dates, and document types act as filters that dramatically improve retrieval precision.
- Bias Reduction: Careful data preparation helps minimize biases in the assistant's responses. Auditing training data for demographic representation, viewpoint balance, and factual accuracy is essential for building fair AI systems.
- Efficiency: Well-structured data enables faster retrieval and processing. Clean embeddings cluster more meaningfully in vector space, reducing the number of irrelevant results and improving response latency.
Pipelines
Building Data Pipelines: Automating the Workflow
To streamline the process, we build data pipelines that automate the collection, cleaning, and processing of data. Tools like Apache Airflow and Pandas help us create efficient workflows for managing the data that powers our AI assistant.
Data Collection
Gather data from various sources, including:
- Internal documents and knowledge bases
- Customer support tickets and FAQs
- Product documentation and specifications
- Employee feedback and questions
Data Cleaning and Preprocessing
Prepare the data for use in the AI system:
- Remove duplicates and irrelevant information
- Standardize formats and correct errors
- Extract key entities and relationships
- Convert data into a consistent format
Data Storage
Store the processed data in a structured format for easy retrieval:
- Vector databases for semantic search
- Document databases for structured content
- Knowledge graphs for complex relationships
- Metadata indexes for efficient filtering
Key Tools for Data Pipelines
Apache Airflow
A platform to programmatically author, schedule, and monitor workflows. It allows you to define complex data pipelines as code and visualize their execution.
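A minimal DAG sketching a collect, clean, index refresh job. The task bodies are placeholders, and the DAG id and schedule are arbitrary; the syntax follows Airflow 2.x.
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def collect():
    print("Pull documents from internal sources")  # placeholder

def clean():
    print("Deduplicate, standardize formats, strip PII")  # placeholder

def index():
    print("Embed chunks and upsert them into the vector database")  # placeholder

with DAG(
    dag_id="knowledge_base_refresh",
    start_date=datetime(2025, 1, 1),
    schedule="@daily",  # refresh the knowledge base once a day
    catchup=False,
) as dag:
    collect_task = PythonOperator(task_id="collect", python_callable=collect)
    clean_task = PythonOperator(task_id="clean", python_callable=clean)
    index_task = PythonOperator(task_id="index", python_callable=index)
    collect_task >> clean_task >> index_task  # run the stages in order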
Pandas
A Python library for data manipulation and analysis. It provides data structures and functions needed to manipulate structured data efficiently.
Apache Spark
A unified analytics engine for large-scale data processing. It can handle massive datasets across distributed clusters.
Elasticsearch
A distributed, RESTful search and analytics engine. It can store documents and retrieve them efficiently, and its full-text (keyword) search complements vector-based semantic retrieval.