Chapter 2
Knowledge & Retrieval. How AI stores, finds, and uses information.
Memory
Short-Term vs. Long-Term Memory in AI
Memory is what separates a useful AI assistant from a stateless text generator. Without memory, every conversation starts from zero. Understanding the types of memory — and their tradeoffs — is essential for building AI systems that maintain context and learn from interactions. AI models have two fundamental types of memory that affect how they process and retain information, each with distinct characteristics, limitations, and implementation strategies.
Short-term memory is also called “context window memory.” This is the conversation history that the model can see in its current session. Everything in the prompt — system instructions, previous messages, retrieved documents — occupies short-term memory. The key limitation: when the context window fills up, older information must be dropped or summarized.
Strategies for managing short-term memory (a minimal code sketch follows the list):
- Sliding window: Drop the oldest messages as new ones arrive, keeping only the most recent exchanges visible to the model
- Summarization: Compress older conversation history into a concise summary that preserves key facts while freeing up context space for new information
- Priority-based: Keep important context (system instructions, critical user preferences) while dropping routine messages like greetings or acknowledgments
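Here is a minimal sketch of a sliding-window buffer that falls back to summarization. It is illustrative only: summarize() is a placeholder for an LLM call that would compress the older turns.
# Minimal sketch of a sliding-window short-term memory buffer (illustrative only).
# summarize() is a placeholder for an LLM call that compresses old turns.
def summarize(messages):
    return "Summary of earlier conversation: " + "; ".join(m["content"][:40] for m in messages)

class ConversationMemory:
    def __init__(self, max_messages=20, keep_recent=10):
        self.system = {"role": "system", "content": "You are a helpful assistant."}
        self.messages = []
        self.max_messages = max_messages
        self.keep_recent = keep_recent

    def add(self, role, content):
        self.messages.append({"role": role, "content": content})
        if len(self.messages) > self.max_messages:
            # Compress everything except the most recent turns into one summary message.
            old = self.messages[:-self.keep_recent]
            recent = self.messages[-self.keep_recent:]
            self.messages = [{"role": "system", "content": summarize(old)}] + recent

    def context(self):
        # This is what the model actually sees on each call.
        return [self.system] + self.messages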
Long-term memory is persistent memory that survives across sessions. It is implemented through external storage — vector databases, knowledge graphs, or structured databases. When a user returns days later, the system retrieves relevant past interactions, preferences, and context.
Examples of long-term memory in practice: remembering that a user prefers technical explanations over simplified ones, storing project-specific terminology and acronyms, tracking ongoing workflows across sessions so the assistant can pick up where it left off.
The challenge: deciding what to remember, what to forget, and how to keep stored memory accurate as information changes. Stale memories can be worse than no memory at all — an assistant that remembers outdated preferences or incorrect facts actively harms the user experience.
Modern AI products implement memory differently. ChatGPT's Memory feature stores user preferences as text snippets that are injected into future conversations. Claude's Projects provide persistent context through uploaded documents that ground every response in project-specific knowledge. Enterprise systems often combine vector databases (for semantic retrieval of past interactions) with structured databases (for user profiles and preferences). The right approach depends on your use case — a customer support bot needs to remember ticket history, while a coding assistant needs to remember project architecture and conventions.
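As a rough illustration (not any product's actual mechanism), long-term memory can start as a keyed store that survives restarts. The file name and the word-overlap scoring below are invented for the example; a real system would swap in embeddings and a vector database.
import json
import os

MEMORY_FILE = "user_memory.json"  # hypothetical storage location

def _load():
    if os.path.exists(MEMORY_FILE):
        with open(MEMORY_FILE) as f:
            return json.load(f)
    return {}

def remember(user_id, fact):
    # Persist a fact about this user so it survives across sessions.
    store = _load()
    store.setdefault(user_id, []).append(fact)
    with open(MEMORY_FILE, "w") as f:
        json.dump(store, f, indent=2)

def recall(user_id, query, top_k=3):
    # Naive relevance: rank stored facts by word overlap with the query.
    # A production system would use embeddings and a vector database instead.
    words = set(query.lower().split())
    facts = _load().get(user_id, [])
    return sorted(facts, key=lambda f: -len(words & set(f.lower().split())))[:top_k]

remember("user-42", "Prefers technical explanations with code samples.")
print(recall("user-42", "How technical should the explanation be?"))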
RAG Architecture
Combining Knowledge and Creativity
Retrieval-Augmented Generation (RAG) is a powerful architecture that combines retrieval-based and generative approaches to build accurate and reliable AI assistants. Let's explore the evolution from Native RAG to Agentic RAG.
Native RAG: The Foundation
Native RAG, the current standard in AI-powered information retrieval, employs a straightforward yet effective pipeline.
This linear process combines retrieval and generation methods to deliver contextually relevant answers, setting the stage for more advanced implementations.
How Native RAG Works
1. Retrieval
The system searches a knowledge base (e.g., company documents) for relevant information
2. Generation
The retrieved information is fed into a generative model (e.g., GPT) to produce a coherent answer
Benefits of Native RAG
- It allows the assistant to pull accurate information from the knowledge base and generate human-like responses
- Reduces hallucinations by grounding responses in factual information
- Combines the best of both worlds: the accuracy of retrieval-based systems and the fluency of generative models
- Can be updated with new information without retraining the entire model
Agentic RAG: The Game Changer
Agentic RAG is an advanced, agent-based approach to question answering over multiple documents in a coordinated manner. It introduces a network of AI agents and decision points, creating a more dynamic and adaptable system.
This advanced approach introduces intelligent routing, adaptive processing, and continuous improvement mechanisms, creating a system capable of handling complex queries and continuously improving its responses.
Key Components and Architecture
Document agents: each document is assigned a dedicated agent that can answer questions about and summarize its own document.
Meta-agent: a top-level agent manages all the document agents, orchestrating their interactions and integrating their outputs into a coherent and comprehensive response.
The Agentic Advantage
Dynamic routing: the system determines whether to answer from internal knowledge, seek external information, or rely on the language model alone, based on the query type.
Adaptive retrieval: relevance checks and query rewriting refine the retrieval process dynamically, as sketched below.
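The control flow can be sketched as a retrieve, grade, rewrite loop. Everything below is illustrative: retrieve(), is_relevant(), and rewrite_query() are stand-ins for vector search, an LLM-based relevance grader, and an LLM rewrite prompt.
# Illustrative agentic retrieval loop: retrieve, grade relevance, rewrite, retry.
KNOWLEDGE_BASE = {
    "refunds": "Refunds are issued within 14 days of purchase.",
    "shipping": "Standard shipping takes 3-5 business days.",
}

def retrieve(query):
    # Toy retrieval: return entries whose topic appears in the query.
    return [text for topic, text in KNOWLEDGE_BASE.items() if topic in query.lower()]

def is_relevant(query, chunks):
    # A real system would ask an LLM to grade whether the chunks answer the query.
    return len(chunks) > 0

def rewrite_query(query):
    # A real system would prompt an LLM to rephrase; here we just normalize wording.
    return query.lower().replace("money back", "refunds")

def agentic_answer(query, max_attempts=3):
    for _ in range(max_attempts):
        chunks = retrieve(query)
        if is_relevant(query, chunks):
            # Hand the grounded context to the generator (LLM call omitted here).
            return f"Answer grounded in: {chunks}"
        query = rewrite_query(query)  # refine the query and try again
    return "Fall back to web search or the model's own knowledge."

print(agentic_answer("Can I get my money back?"))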
Features and Benefits
Improved accuracy: multiple checkpoints and decision points significantly improve the relevance and accuracy of responses.
Comprehensive answers: the system combines internal data, web searches, and language model capabilities to answer from the best available sources.
Self-correction: the query rewrite mechanism lets the system recover from poor retrievals by reformulating the question and trying again.
RAG in Code
Here is a compact working example of a RAG pipeline using LangChain, OpenAI embeddings, and ChromaDB. This code loads a document, splits it into chunks, embeds them into a vector store, and answers questions grounded in the retrieved context. (Import paths vary across LangChain releases; the ones below follow the current split-package layout, so adjust them for older versions.)
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_community.document_loaders import TextLoader
from langchain_community.vectorstores import Chroma
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain.chains import RetrievalQA
# 1. Load and chunk documents
loader = TextLoader("company_docs.txt")
docs = loader.load()
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = splitter.split_documents(docs)
# 2. Create embeddings and store in vector database
embeddings = OpenAIEmbeddings()
vectorstore = Chroma.from_documents(chunks, embeddings)
# 3. Create retrieval chain
llm = ChatOpenAI(model="gpt-4o")
qa_chain = RetrievalQA.from_chain_type(
llm=llm,
retriever=vectorstore.as_retriever(search_kwargs={"k": 3}),
return_source_documents=True
)
# 4. Ask a question
result = qa_chain.invoke({"query": "What is our refund policy?"})
print(result["result"])
Applications
Agentic RAG is particularly useful in scenarios requiring thorough and nuanced information processing and decision-making, such as:
- Complex research tasks requiring synthesis across multiple documents
- Legal document analysis and contract review
- Medical literature review for clinical decision support
- Financial analysis and investment research
- Technical troubleshooting across multiple knowledge sources
Embeddings & Vectors
Embeddings are the bridge between human language and machine understanding. They convert text into numerical vectors that capture semantic meaning — making it possible for AI systems to find, compare, and reason about information. They're the foundation of RAG, semantic search, and recommendation systems.
How Embeddings Work
An embedding model transforms text into a list of numbers (a vector), typically ranging from 768 to 3072 dimensions. Texts with similar meaning produce similar vectors. For example, "The cat sat on the mat" and "A feline rested on the rug" would have vectors that are close together in this high-dimensional space, even though they share almost no words.
The distance between vectors represents semantic similarity. This is measured using cosine similarity — a value from -1 to 1, where 1 means identical meaning. By converting language into geometry, embedding models allow machines to perform operations on meaning itself: finding the closest match, clustering related ideas, or detecting outliers.
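In code, cosine similarity is just a normalized dot product. The vectors below are made up for illustration; in practice they would come from an embedding model (for example, the OpenAI embeddings endpoint).
import numpy as np

def cosine_similarity(a, b):
    # Normalized dot product: 1 = same direction, 0 = unrelated, -1 = opposite.
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 4-dimensional vectors; real embeddings would come from a model, e.g.
# client.embeddings.create(model="text-embedding-3-small", input=text) with the OpenAI SDK.
cat_on_mat = [0.12, 0.83, 0.44, 0.31]
feline_rug = [0.10, 0.80, 0.47, 0.35]
tax_return = [0.91, 0.02, 0.13, 0.40]

print(cosine_similarity(cat_on_mat, feline_rug))  # close to 1: similar meaning
print(cosine_similarity(cat_on_mat, tax_return))  # much lower: unrelated meaning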
Key Embedding Models
OpenAI text-embedding-3: the text-embedding-3-large variant produces 3072-dimensional vectors and delivers the best overall quality of the family in benchmarks. text-embedding-3-small (1536 dimensions) is a cost-sensitive alternative. Both support Matryoshka-style truncation: you can shorten vectors to fewer dimensions with minimal quality loss, letting you trade storage for accuracy on a sliding scale.
Cohere Embed: a multimodal model that embeds both text and images into the same vector space, enabling cross-modal search. It supports 100+ languages natively and offers Matryoshka-style dimensions (256–1536). Cohere also provides fine-tuned variants optimized for finance, healthcare, and manufacturing use cases.
Open-source models: models like BGE (BAAI), E5 (Microsoft), and GTE (Alibaba) deliver competitive quality and are free to run locally. They are an excellent choice for privacy-sensitive applications where data cannot leave your infrastructure, and they eliminate per-token embedding costs entirely.
Similarity Search Algorithms
Finding the nearest vectors in a database of millions is computationally expensive. A brute-force exact search through 10 million vectors might take seconds — far too slow for real-time applications.
Algorithms like HNSW (Hierarchical Navigable Small World) solve this by creating graph structures for approximate nearest-neighbor search. HNSW builds a multi-layered graph where each node connects to its nearby neighbors. At query time, the algorithm navigates this graph from coarse to fine layers, converging on the nearest results in milliseconds while trading only a tiny amount of accuracy for massive speed gains.
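A small sketch using the open-source hnswlib package, with random vectors standing in for real embeddings:
import numpy as np
import hnswlib

dim, num_vectors = 384, 100_000
vectors = np.random.rand(num_vectors, dim).astype(np.float32)  # stand-ins for real embeddings

# Build the HNSW graph: M controls connectivity, ef_construction controls build-time accuracy.
index = hnswlib.Index(space="cosine", dim=dim)
index.init_index(max_elements=num_vectors, M=16, ef_construction=200)
index.add_items(vectors, np.arange(num_vectors))

index.set_ef(64)  # query-time accuracy/speed trade-off
query = np.random.rand(dim).astype(np.float32)
labels, distances = index.knn_query(query, k=5)  # approximate nearest neighbors in milliseconds
print(labels, distances)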
Vector Databases
Fully managed, serverless architecture that scales to billions of vectors without ops overhead. Pinecone is the most popular choice for teams that want to focus on their application rather than infrastructure. It supports metadata filtering, namespaces for multi-tenant isolation, and hybrid search that combines vector similarity with keyword matching for higher-precision retrieval.
Combines vector search with knowledge graph capabilities, exposed through a GraphQL interface. Weaviate can store objects with properties alongside their vectors, enabling complex queries like "find similar documents written after 2024 about machine learning." It also supports generative search — retrieve and transform results in one query, letting you pipe retrieval results directly into an LLM for summarization or synthesis.
Open-source and written in Rust for performance, Qdrant posts very low tail latencies (single-digit-millisecond p99 in its published benchmarks). It emphasizes durability, using a write-ahead log so data stays consistent through crashes and concurrent writes. Qdrant supports rich filtering, geo-search, and recommendation APIs out of the box, making it a strong choice when performance and data integrity are critical.
Designed for simplicity with a Python-first approach that embeds directly in your application. Chroma is the best option for prototyping and smaller teams who need to move fast. It can run entirely in-memory for development, then switch to persistent storage for production — no configuration changes needed. The API is minimal and intuitive, getting you from zero to working semantic search in minutes.
Open-source with strong throughput (thousands of queries per second in published benchmarks), Milvus supports multiple index types (IVF, HNSW, DiskANN) and GPU acceleration for even faster search. It is well-suited for large-scale deployments with high throughput requirements, and its distributed architecture can scale horizontally across clusters for enterprise workloads.
A PostgreSQL extension that adds vector operations to your existing Postgres database. No new infrastructure needed — if you already use Postgres, just add the extension. pgvector supports IVFFlat and HNSW indexes for fast approximate search. It is the perfect choice when you want vector search without introducing a new database into your stack, and it lets you join vector results with your existing relational data in a single query.
The RAG Pipeline
Embeddings power every stage of a Retrieval-Augmented Generation pipeline. Here is the full flow from document ingestion to answer generation:
Indexing
Split documents into chunks (typically 256–1024 tokens), embed each chunk using an embedding model, and store the resulting vectors along with their metadata in a vector database. This is a one-time (or periodic) batch process.
Query
When a user asks a question, embed the question using the same embedding model that was used during indexing. This produces a query vector in the same semantic space as your document vectors.
Retrieval
Find the top-K most similar document chunks by comparing the query vector against all stored vectors using similarity search (typically cosine similarity with an HNSW index for speed).
Reranking (optional)
Use a cross-encoder model to re-score and reorder the retrieved chunks for better relevance. Cross-encoders are more accurate than bi-encoders but slower, so they are applied only to the small set of candidates returned by step 3.
Generation
Pass the retrieved chunks as context to the LLM, which generates an answer grounded in the retrieved information. The prompt typically instructs the model to only use the provided context, reducing hallucinations.
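Reranking takes only a few lines with the sentence-transformers library. The checkpoint name below is a commonly used public cross-encoder, and the candidate chunks are illustrative.
from sentence_transformers import CrossEncoder

# The cross-encoder scores query and chunk together, which is slower but more
# accurate than comparing precomputed embeddings.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "What is our refund policy?"
candidates = [
    "Refunds are processed within 14 days of the purchase date.",
    "Our office is closed on public holidays.",
    "Shipping costs are non-refundable except for damaged items.",
]

scores = reranker.predict([(query, chunk) for chunk in candidates])
reranked = [chunk for _, chunk in sorted(zip(scores, candidates), reverse=True)]
print(reranked[0])  # the most relevant chunk goes to the LLM first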
A Note on Chunking Strategy
Chunk size matters — too small and you lose context, too large and you dilute relevance. A common starting point is 512 tokens with a 50-token overlap between consecutive chunks. The overlap ensures that information spanning a chunk boundary is still captured in at least one chunk. Experiment with your specific data: technical documentation often benefits from larger chunks (1024 tokens), while FAQ-style content works well with smaller ones (256 tokens).
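A minimal fixed-size chunker with overlap looks like this (word-based for simplicity; production splitters typically count tokens and respect sentence or section boundaries). It reuses the company_docs.txt file from the earlier code example.
def chunk_text(text, chunk_size=512, overlap=50):
    # Slide a fixed-size window over the words so that content near a boundary
    # appears in two consecutive chunks.
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break
    return chunks

with open("company_docs.txt") as f:  # same file as in the RAG code example above
    chunks = chunk_text(f.read())
print(len(chunks), "chunks")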
Fine-tuning vs RAG
Fine-tuning and RAG are the two primary approaches for customizing LLM behavior beyond what prompting alone can achieve. They solve different problems, and the 2025 consensus is that most production systems should use both.
What Is Fine-Tuning?
Fine-tuning takes a pre-trained model and trains it further on your specific data. This modifies the model's weights — the knowledge becomes part of the model itself. It's like teaching someone a new skill through practice: after enough repetition, the behavior becomes second nature and no longer requires external reference.
Common use cases include adapting tone and style (e.g., matching your brand voice), teaching domain-specific terminology the base model doesn't know, improving performance on specialized tasks like medical coding or legal classification, and reducing prompt length by baking instructions into the model so you don't need to repeat them every call.
What Is RAG?
RAG (Retrieval-Augmented Generation) keeps the model unchanged and instead augments each query with relevant external information retrieved at inference time. It's like giving someone a reference book to consult before answering — the person's reasoning ability stays the same, but they now have access to specific facts they can look up.
RAG excels at knowledge-intensive tasks where accuracy matters, situations requiring up-to-date information that changes frequently, auditable responses with source citations (critical in regulated industries), and large document collections that would be impractical to bake into model weights.
Detailed Comparison
| Dimension | Fine-Tuning | RAG |
|---|---|---|
| How it works | Modifies model weights through additional training on your data. The knowledge becomes part of the model — it "learns" patterns, terminology, and behaviors from your dataset and applies them automatically at inference time. | Retrieves relevant information from external sources at query time and includes it in the prompt context. The model itself is unchanged; it simply receives better input to reason over for each specific question. |
| Cost structure | High upfront cost — GPU compute for training runs, data preparation and curation effort, and iteration cycles to get the fine-tune right. However, per-query cost is lower since there is no retrieval step and prompts can be shorter. | Lower upfront cost since no training is required. Per-query cost includes retrieval infrastructure (vector database hosting, embedding API calls) plus larger prompts due to injected context, which increases token usage. |
| Knowledge updates | Requires retraining to incorporate new information. A model fine-tuned on January data doesn't know about February events. Each update means another training run, validation cycle, and deployment. | Update source documents — no retraining needed. The model always retrieves the latest information. You can add, modify, or remove documents from the vector store at any time and the changes take effect immediately. |
| Factual accuracy | The model "hopes" it internalized the fact correctly during training. Hard to verify what the model actually learned vs. what it confabulates. Facts can degrade or blur together, especially for rare or specific data points. | Retrieves facts directly from documents. Can cite sources. If the answer exists in your documents, RAG will find it. Accuracy is bounded by retrieval quality — if the right chunk is retrieved, the answer is almost always correct. |
| Hallucination risk | Can still hallucinate, especially on edge cases not well-represented in fine-tuning data. The model may confidently generate plausible-sounding but incorrect information when it encounters gaps in its training. | Significantly reduced because answers are grounded in retrieved documents. Not eliminated — the model can still misinterpret retrieved context or fill gaps with fabricated details, but the grounding substantially limits this. |
| Transparency | Black box — difficult to trace where specific knowledge came from. You cannot point to a source document for any given claim the model makes. This is a significant barrier in regulated industries like healthcare and finance. | Full attribution — every response can cite the exact documents and passages used. You can build audit trails showing which sources informed each answer. Essential for compliance, legal review, and building user trust. |
| Latency | Fast inference — no retrieval step needed. The model generates responses directly from its weights, so latency is just the generation time. Ideal for real-time applications where every millisecond counts. | Additional latency from the retrieval step, typically 50–200ms. Optimizable with caching, efficient HNSW indexes, and pre-computed results for common queries. For most applications, the added latency is imperceptible to users. |
| Data security | Training data is baked into weights. If the model is shared or deployed externally, the knowledge goes with it. Sensitive information can potentially be extracted through adversarial prompting techniques. | Data stays in your secured environment. Access controls determine what each user can retrieve. You can implement row-level security, role-based access, and per-query permission checks without touching the model. |
| Best for | Behavioral consistency (tone, style, format), domain-specific language and terminology, specialized classification tasks, reducing prompt sizes, and offline or edge deployment scenarios. | Knowledge-intensive applications, frequently updated information, auditable and compliant responses, large document collections, and multi-tenant systems where different users need different knowledge bases. |
The Hybrid Approach: 2025 Best Practice
The 2025 consensus among production AI teams is to combine both approaches, each handling what it does best:
- Fine-tune for BEHAVIOR — how the model responds. This includes tone, style, format, domain terminology, and task-specific patterns. Fine-tuning ensures the model consistently writes in your brand voice, follows your output schema, and uses the right jargon without being told every time.
- RAG for KNOWLEDGE — what the model knows. This includes company documents, product data, policies, recent events, and any information that changes over time. RAG ensures the model always has access to the latest facts and can cite its sources.
Example: A legal AI assistant might be fine-tuned to write in formal legal language, follow specific citation formats (Bluebook vs. ALWD), and structure arguments in IRAC format. Meanwhile, RAG retrieves relevant case law, statutes, and regulatory guidance for each specific query. The fine-tuned behavior ensures professional, consistent output; the RAG layer ensures factual accuracy and up-to-date legal references.
This gives you the best of both worlds: consistent, professional behavior with accurate, up-to-date, citable information. Neither approach alone achieves this — fine-tuning without RAG produces confident but potentially outdated answers, while RAG without fine-tuning produces accurate but stylistically inconsistent responses.
When to Choose Which
Choose fine-tuning when:
- You need consistent style, tone, or format across all outputs
- The model must use domain-specific behavior and terminology natively
- You want to reduce prompt sizes (and therefore per-query costs) by baking instructions into the model
- The model will be deployed offline or at the edge without access to retrieval infrastructure
- You have a well-defined classification or extraction task with clear training examples
Choose RAG when:
- Knowledge changes frequently and you cannot retrain on every update
- You need source citations and audit trails for compliance or trust
- Data is sensitive and must remain in your secured environment with access controls
- You have large document collections (thousands to millions of pages)
- Different users or tenants need access to different knowledge bases
Use both (the hybrid approach) when:
- Building production applications that need both behavioral consistency AND factual grounding
- Your use case demands professional-grade output quality with verifiable accuracy
- You are in a regulated industry (healthcare, finance, legal) where both precision and auditability matter
- You want to minimize hallucinations while maintaining a distinctive, on-brand voice
- You are building a customer-facing product where trust and reliability are paramount
Data Quality
Data Labeling and Cleaning: The Foundation of AI
The quality of the assistant depends on the quality of the data. Without clean, well-labeled data, even the most sophisticated AI models will struggle to provide accurate and helpful responses.
Data labeling means adding meaningful tags or annotations to data to help the AI understand its structure and meaning. Proper labeling transforms raw text into structured, machine-readable information that the AI can navigate efficiently. Common labeling tasks include:
- Labeling questions and answers in a FAQ dataset so the retrieval system can match user queries to the most relevant answers, even when phrased differently
- Categorizing documents by department or topic, enabling filtered searches that return only results from the relevant domain (e.g., HR policies vs. engineering docs)
- Marking entities like names, dates, and locations in text through Named Entity Recognition (NER), which allows the system to extract structured facts from unstructured prose
- Annotating sentiment (positive, negative, neutral) in customer feedback so the AI can prioritize urgent complaints and route them to the appropriate team
Data cleaning means preparing data by removing errors, inconsistencies, and irrelevant information. Cleaning is often the most time-consuming step in a data pipeline, but it has the highest return on investment. Typical cleaning steps (a short code sketch follows the list):
- Removing duplicates to prevent bias in training — if the same document appears three times, the model over-weights its content and may parrot it even when irrelevant
- Correcting spelling and grammatical errors that could confuse the embedding model and cause semantically similar content to end up far apart in vector space
- Standardizing formats (dates, phone numbers, currencies) so that “Jan 5, 2025” and “2025-01-05” are treated as the same date rather than different strings
- Handling missing values appropriately — either filling them with sensible defaults, flagging them explicitly, or removing incomplete records depending on the use case
- Removing personally identifiable information (PII) for privacy and compliance, using techniques like regex-based detection, NER models, or dedicated PII scrubbing tools
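A rough sketch of a cleaning pass with pandas; the file name, column names, and the email pattern are invented for the example.
import pandas as pd

# Hypothetical support-ticket export with a free-text column and a date column.
df = pd.read_csv("support_tickets.csv")

# 1. Remove exact duplicates so repeated documents don't dominate retrieval.
df = df.drop_duplicates(subset=["ticket_text"])

# 2. Standardize dates so "Jan 5, 2025" and "2025-01-05" become the same value.
df["created_at"] = pd.to_datetime(df["created_at"], errors="coerce")
df = df.dropna(subset=["created_at"])  # drop rows whose date could not be parsed

# 3. Redact a simple PII pattern (emails); real pipelines add NER-based detection.
email_pattern = r"[\w.+-]+@[\w-]+\.[\w.]+"
df["ticket_text"] = df["ticket_text"].str.replace(email_pattern, "[EMAIL]", regex=True)

df.to_csv("support_tickets_clean.csv", index=False)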
Why Data Quality Matters
The principle “garbage in, garbage out” applies exponentially to AI. A model trained on biased data produces biased outputs. A RAG system searching poorly labeled documents returns irrelevant results. Data quality isn't just a preprocessing step — it's the foundation that determines the ceiling of your AI system's performance.
Real-world examples of data quality failures:
- Mislabeled training data caused a medical AI system to misdiagnose certain skin conditions at higher rates for patients with darker skin tones, because the training labels did not account for how conditions present differently across skin types
- Duplicate documents in a RAG system caused contradictory answers — when the same policy document existed in three versions (2021, 2023, and 2024), the retriever sometimes surfaced the outdated version, giving users incorrect information about current procedures
- Outdated information being presented as current: a customer support bot trained on old pricing data continued quoting discontinued plans to customers, creating billing disputes and eroding trust
The Impact of Data Quality
The quality of data directly affects the performance of the AI assistant in several ways:
- Accuracy: Clean, well-labeled data leads to more accurate responses. When the underlying data is correct and well-structured, the model can retrieve and synthesize information with high fidelity.
- Relevance: Properly categorized data helps the assistant find the most relevant information. Metadata like departments, dates, and document types act as filters that dramatically improve retrieval precision.
- Bias Reduction: Careful data preparation helps minimize biases in the assistant's responses. Auditing training data for demographic representation, viewpoint balance, and factual accuracy is essential for building fair AI systems.
- Efficiency: Well-structured data enables faster retrieval and processing. Clean embeddings cluster more meaningfully in vector space, reducing the number of irrelevant results and improving response latency.
Pipelines
Building Data Pipelines: Automating the Workflow
To streamline the process, we build data pipelines that automate the collection, cleaning, and processing of data. Tools like Apache Airflow and Pandas help us create efficient workflows for managing the data that powers our AI assistant.
Data Collection
Gather data from various sources, including:
- Internal documents and knowledge bases
- Customer support tickets and FAQs
- Product documentation and specifications
- Employee feedback and questions
Data Cleaning and Preprocessing
Prepare the data for use in the AI system:
- Remove duplicates and irrelevant information
- Standardize formats and correct errors
- Extract key entities and relationships
- Convert data into a consistent format
Data Storage
Store the processed data in a structured format for easy retrieval:
- Vector databases for semantic search
- Document databases for structured content
- Knowledge graphs for complex relationships
- Metadata indexes for efficient filtering
Key Tools for Data Pipelines
Apache Airflow
A platform to programmatically author, schedule, and monitor workflows. It allows you to define complex data pipelines as code and visualize their execution.
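A minimal DAG sketching a collect, clean, index refresh job. The task bodies are placeholders, and the DAG id and schedule are arbitrary; the syntax follows Airflow 2.x.
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def collect():
    print("Pull documents from internal sources")  # placeholder

def clean():
    print("Deduplicate, standardize formats, strip PII")  # placeholder

def index():
    print("Embed chunks and upsert them into the vector database")  # placeholder

with DAG(
    dag_id="knowledge_base_refresh",
    start_date=datetime(2025, 1, 1),
    schedule="@daily",  # refresh the knowledge base once a day
    catchup=False,
) as dag:
    collect_task = PythonOperator(task_id="collect", python_callable=collect)
    clean_task = PythonOperator(task_id="clean", python_callable=clean)
    index_task = PythonOperator(task_id="index", python_callable=index)
    collect_task >> clean_task >> index_task  # run the stages in order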
Pandas
A Python library for data manipulation and analysis. It provides data structures and functions needed to manipulate structured data efficiently.
Apache Spark
A unified analytics engine for large-scale data processing. It can handle massive datasets across distributed clusters.
Elasticsearch
A distributed, RESTful search and analytics engine. It can store documents and retrieve them efficiently, and its full-text (keyword) search complements vector-based semantic retrieval.