📚

LLM & RAG fundamentals

M1 · LLM and RAG Fundamentals

Module 1 — Week 1. Master the concepts that underpin everything else: what an LLM does internally, how you talk to it, why RAG exists, how the minimal pattern works, and how to choose the right model. By the end of this guide you will be able to complete the lab without help.

RAGorbit node: model.llm, model.embedding (category model). Reference template: 09-hr-policy-assistant.

What is an LLM? Tokens and context window
Temperature and other inference hyperparameters
Prompting: system, user, few-shot, CoT
Why RAG: hallucination, private data, fresh data
The minimal RAG pattern
Embeddings: geometric intuition and cosine similarity
Model selection and evaluation
Comparison: Claude vs OpenAI vs Gemini vs Llama/Mistral
RAG vs fine-tuning vs pure prompting: when to use each
Layer ③ explained: LangChain from scratch
Checkpoint

1. What is an LLM? Tokens and context window

1.1 The core idea

A Large Language Model (LLM) is a neural network trained to predict the next token given a prefix of tokens. During training it read a significant fraction of text published on the internet, books, and code — which gives it the illusion of "knowing" many things. At inference time (when you use it), it remembers nothing between calls: each call starts from zero.

The key is that the model does not search a database nor run any SQL query. It generates text token by token using probabilities learned from the language distribution. That has important consequences you will see in §4.

1.2 What is a token?

A token is not a word. It is a unit of text determined by the model's tokenizer (typically BPE — Byte Pair Encoding or similar variants). As a rule of thumb:

1 English word ≈ 1–2 tokens
1 Spanish word ≈ 1.3–2.5 tokens (richer morphology = more tokens)
1 exotic Unicode character can be 3–5 tokens
100 tokens ≈ 75 English words

Why it matters: models have a maximum token limit per call (the "context window"). If you exceed it, the model cannot process the input.

Texto:    "Los empleados tienen 15 días de vacaciones anuales."
Tokens:   ["Los", " emple", "ados", " tienen", " 15", " días", " de", " vac", "aciones", " anuales", "."]
Conteo:   11 tokens (aproximado — depende del tokenizador)

1.3 Context window

The context window is the maximum number of tokens the model can "see" at once, including the system prompt, conversation history, retrieved documents, and generated response.

Model	Approx. window (2025)
Claude Opus 4.8	200,000 tokens
GPT-4o	128,000 tokens
Gemini 1.5 Pro	1,000,000 tokens
Llama 3.1 70B	128,000 tokens
Mistral Large	128,000 tokens

When does size matter? When you have long documents (contracts, technical manuals) and want to pass the entire document to the model — this is called "long-context RAG" or even "context stuffing". A large window is convenient but not free: more tokens = more latency and more cost.

When NOT to use huge windows: if the document has 1,000 pages, even 1M tokens is not enough, and the model can lose the thread in the middle ("lost in the middle problem"). RAG solves that by retrieving only the relevant fragments (§5).

1.4 Model parameters

LLMs have billions of parameters (network weights). Size matters but is not everything:

Large models (70B+): stronger reasoning, higher cost, higher latency.
Small models (7B–13B): fast and cheap, good for classifiable tasks.
Distilled models (Haiku 4.5, GPT-4o-mini, Gemma 2B): quality/cost balance for high-volume production.

RAGorbit connection: the model.llm node has a model field that accepts any provider:model-name string. The default is anthropic:claude-opus-4-8. Change that field and you change the model — without touching the rest of the flow. See docs/02-node-catalog.md §model.

2. Temperature and other inference hyperparameters

2.1 Temperature

Temperature controls how "creative" or "deterministic" the model's response is. Technically, it is a divisor of the logit before softmax: low temperature concentrates probability on the most likely tokens; high temperature spreads it out.

temperatura 0.0 → respuesta casi determinista (mismo input, mismo output)
temperatura 0.2 → respuestas muy consistentes, con poca variación
temperatura 0.7 → respuestas variadas, más "creativas"
temperatura 1.0 → distribución sin modificar
temperatura > 1.0 → respuestas caóticas, poco coherentes

Practical rule for RAG: use low temperature (0.0–0.2) when you need factual responses based on documents. Use higher temperature only when you want variety (e.g. generating wording options).

In the HR template (09-hr-policy-assistant/flow.json), the model.llm node has "temperature": 0.2. The assistant must be precise, not creative.

2.2 Top-p and Top-k

Top-p (nucleus sampling): only considers tokens whose cumulative probability reaches p. top_p=0.9 = take the tokens that represent 90% of total probability.
Top-k: only considers the k most likely tokens at each step.

For factual RAG: top_p=0.9 or less. Most APIs expose it but the default is reasonable — you rarely need to touch it.

2.3 Max output tokens

Distinct from the context window: it is the limit you set on the generated response. Useful for controlling cost and avoiding infinite responses. For an HR assistant, 512–1024 output tokens is usually enough.

3. Prompting: system, user, few-shot, CoT

Prompting is how you give instructions to the model. It is not magic — it is text engineering.

3.1 Chat format: system and user roles

Modern chat LLMs use a message format with roles:

[system]  Eres el asistente oficial de RRHH. Responde basándote SOLO en los documentos.
[user]    ¿Cuántos días de vacaciones tengo el primer año?
[assistant]  (respuesta del modelo)

system: persistent instructions that define model behavior. Sent once at the start. The model "remembers" it for the whole conversation (while it stays in the window).
user: the human's message.
assistant: the response the model generated (in multi-turn conversations, history is passed back).

When to use system vs user: put in system what does NOT change (personality, constraints, response format). Put in user what does change (the question, dynamic context like retrieved chunks).

In template 09, the logic.prompt node uses:

system: HR assistant instructions
template: a template with {message} and {chunks} filled dynamically

3.2 Few-shot prompting (In-Context Learning)

In-context learning is the LLM's ability to learn a task simply by seeing examples in the prompt — without retraining. This works because the model saw millions of "input→output" pairs during pretraining and can "imitate" the pattern.

Few-shot = give a few examples (2–5 typically):

[system] Clasifica si la siguiente pregunta es sobre vacaciones, beneficios o nómina.

Pregunta: ¿Cuándo cobro el aguinaldo?
Categoría: nómina

Pregunta: ¿Puedo pedir días por enfermedad de un familiar?
Categoría: vacaciones

Pregunta: ¿Cómo agrego a mi cónyuge al seguro médico?
Categoría: beneficios

Pregunta: {nueva_pregunta}
Categoría:

Zero-shot = no examples. Works well with large models and common tasks. One-shot = a single example.

When to use few-shot:

The task has a specific output format the model does not produce well without examples.
The model makes errors with zero-shot (evaluate with zero-shot first, then add examples only if needed).
Do not overuse: each example consumes context window tokens.

3.3 Chain-of-Thought (CoT)

Chain-of-Thought tells the model to "think out loud" before answering. It significantly improves reasoning on questions that require multiple steps.

Versión sin CoT:
[user] ¿Tiene derecho a vacaciones un empleado que lleva 8 meses?
[assistant] No tiene derecho completo todavía.  ← puede estar bien o mal

Versión con CoT:
[user] ¿Tiene derecho a vacaciones un empleado que lleva 8 meses?
       Piensa paso a paso antes de responder.
[assistant]
1. La política dice que los empleados acumulan 1 día por mes completo trabajado.
2. 8 meses completos = 8 días acumulados.
3. Por tanto, sí tiene derecho a 8 días de vacaciones proporcionales.
Respuesta: Sí, tiene derecho a 8 días de vacaciones proporcionales.

When to use CoT:

Complex reasoning questions (eligibility, calculations, multi-step).
When you need to audit the model's reasoning (the "step by step" exposes it).
Not necessary for simple fact lookup questions.

Zero-shot CoT: simply add "Piensa paso a paso." at the end of the prompt. Works surprisingly well.

3.4 Prompt templates with variables

In production, the prompt is not written "by hand" on each call. Templates with variables are used and substituted dynamically:

TEMPLATE = """Eres el asistente de RRHH.

Pregunta del empleado: {message}

Fragmentos de política relevantes:
{chunks}

Responde en markdown con lenguaje sencillo."""

prompt = TEMPLATE.format(
    message="¿Cuántos días de vacaciones tengo?",
    chunks="§3.1 Vacaciones: Los empleados acumulan 1 día por mes..."
)

This is exactly what the logic.prompt node in template 09 does.

4. Why RAG: hallucination, private data, fresh data

4.1 The hallucination problem

LLMs generate text — they do not retrieve it from a database. When they do not know the answer, instead of saying "I don't know", they tend to invent a plausible answer with total confidence. This is called hallucination.

Pregunta: ¿Cuántos días de vacaciones por ley corresponden en México?
LLM sin RAG: "15 días en el primer año, aumentando 2 días por cada año adicional."
← Correcto para México. Pero si preguntas por la política interna de tu empresa...

Pregunta: ¿Cuántos días de vacaciones da Empresa X el primer año?
LLM sin RAG: "Empresa X otorga 20 días hábiles el primer año..." ← INVENTADO

The model "knows" that companies have vacation policies and generates something plausible — but it has no access to your company's real policy.

RAG solves this by passing the model real documents as context. The model no longer invents: it reasons over text you provide.

4.2 The private data problem

Pretrained LLMs only know what was in their training corpus — which is public. Your employee handbook, your contracts, your customer database are not there.

Options to include private knowledge:

RAG (this module): retrieve relevant documents in real time and put them in the prompt.
Fine-tuning (§9): retrain the model on your data — costly, requires expertise.
Context stuffing: put the entire document in the prompt — only works for small documents.

RAG is the most practical option for most cases.

4.3 The fresh data problem

Pretraining has a knowledge cutoff date. A model trained through March 2024 knows nothing of what happened after — no law changes, no new company policies, no current prices.

RAG enables real-time knowledge because the documents you retrieve are the ones you keep updated. Update the index → the model automatically uses the new information.

4.4 Summary: when you do NOT need RAG

General task: draft an email, summarize text the user pastes directly, translate.
Public knowledge well covered: programming questions, mathematics, general history.
Small static data that fits in the context window: you can do direct "context stuffing".

5. The minimal RAG pattern

RAG = Retrieval-Augmented Generation. In one sentence: before calling the LLM, you retrieve the most relevant document fragments for the question and put them in the prompt.

5.1 The four steps

┌─────────────────────────────────────────────────────────────────────┐
│                       MINIMAL RAG PATTERN                            │
│                                                                       │
│  1. QUESTION                                                          │
│     The user writes: "¿Cuántos días de vacaciones tengo?"         │
│                │                                                      │
│                ▼                                                      │
│  2. RETRIEVE                                                          │
│     Convert the question into an embedding vector.                    │
│     Search the index for the K most similar fragments.               │
│                │                                                      │
│                ▼                                                      │
│  3. AUGMENT PROMPT                                                    │
│     Build the prompt: instructions + retrieved chunks                 │
│     + user question.                                                  │
│                │                                                      │
│                ▼                                                      │
│  4. RESPOND (Generate)                                                │
│     The LLM generates the response using ONLY the given context.         │
│     "Según §3.1, tienes 12 días hábiles el primer año."              │
└─────────────────────────────────────────────────────────────────────┘

5.2 The full flow (with offline indexing)

The RAG pattern has two phases:

Offline phase (indexing): happens once, or when you update documents.

Documentos PDF/texto
        │
        ▼
  Chunking (split into fragments)
        │
        ▼
  Embedding of each chunk → vector
        │
        ▼
  Store vectors in an index (vector store)

Online phase (inference): happens on each user question.

Pregunta del usuario
        │
        ▼
  Embedding of the question → vector
        │
        ▼
  Search the index: top-K most similar chunks
        │
        ▼
  Build augmented prompt:
    system + chunks + question
        │
        ▼
  LLM generates response
        │
        ▼
  Response to user (with citations)

5.3 How it maps to template 09

Offline phase:
  loader.pdf  →  ingest.chunker  →  [model.embedding]  →  store.chroma

Online phase (per question):
  io.input (question)
      │
      ├──▶ retrieval.vector ◀── store.chroma (Retriever)
      │         │ top-4 chunks
      │         ▼
      └──▶ logic.prompt ◀── model.llm
               │ response with citations
               ▼
         logic.citations
               │
               ▼
           io.output

Each node in template 09 corresponds exactly to a step in the minimal RAG pattern.

5.4 Why top-K and not all chunks?

Even with a huge window, passing all chunks to the model is costly (tokens = money) and can confuse it ("lost in the middle"). The topK=4 parameter in template 09's retrieval.vector node means: retrieve only the 4 most relevant fragments. That number is an empirical default — 3–5 is the usual range for policy documents.

6. Embeddings: geometric intuition and cosine similarity

6.1 What is an embedding?

An embedding is a function that converts text (or any data) into a high-dimensional numeric vector. The key property: semantically similar texts end up close in vector space.

"política de vacaciones"    → [0.12, -0.34, 0.89, 0.01, ...]  (1536 números)
"días de descanso anuales"  → [0.11, -0.33, 0.91, 0.02, ...]  (muy cerca)
"precio del petróleo"       → [-0.67, 0.45, -0.23, 0.78, ...]  (muy lejos)

6.2 Geometric intuition

Imagine each text as a point in a 1,536-dimensional space (the size of text-embedding-3-large embeddings). Texts with the same meaning end up in "zones" of the space:

          HR policies
          ┌────────────────────┐
          │  vacaciones ●      │
          │  descanso ●        │     prices
          │  días libres ●     │  ┌──────────────┐
          └────────────────────┘  │ petróleo ●   │
                                  │ gas ●        │
                                  └──────────────┘

Similarity search consists of: "given the question embedding, which is the closest point in the space?"

6.3 Cosine similarity

The most common metric for comparing embeddings is cosine similarity: it measures the angle between two vectors, regardless of magnitude.

similitud_coseno(A, B) = (A · B) / (||A|| × ||B||)

Rango: -1 (opuestos) a 1 (idénticos)

In pure Python:

import math

def coseno(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x**2 for x in a))
    norm_b = math.sqrt(sum(x**2 for x in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

6.4 Production embeddings vs toy embeddings

For the module lab you will use "toy" embeddings (bag-of-words or char n-grams). They are enough to understand the mechanism, but in production you will use dedicated models:

Embedding model	Dimensions	Notes
`text-embedding-3-large` (OpenAI)	3,072 (reducible)	RAGorbit default
`text-embedding-3-small` (OpenAI)	1,536	Cheaper, good quality
`embed-english-v3.0` (Cohere)	1,024	Multilingual available
`bge-large-en-v1.5` (BAAI, local)	1,024	Open weights, no API key
`E5-large` (Microsoft, local)	1,024	Excellent on benchmarks

When to use local embeddings: total privacy (no text leaves your server), no embedding token cost, integrable in Ollama.

RAGorbit connection: the model.embedding node in template 09 has "model": "text-embedding-3-large". Change that field to "local": true and a local model for embeddings without API.

6.5 The difference between embeddings and the LLM

This confuses many people: the LLM (model.llm node) and the embedding model (model.embedding node) are distinct models with distinct roles.

	LLM	Embedding model
Role	Generates response text	Converts text into vector
When used	At inference (generate response)	Offline (index) + online (embed question)
Output	Text tokens	Numeric vector
Example	`claude-opus-4-8`	`text-embedding-3-large`

7. Model selection and evaluation

7.1 The three decision axes

Choosing a model is always a balance of three variables:

         Quality
            ▲
            │       Claude Opus
            │    GPT-4o  ●
            │       ●
            │  Gemini Pro ●
            │
            │      Llama 70B ●
            │  Mistral Large ●
            │
            │    Claude Haiku ●
            │ GPT-4o-mini ●
            └──────────────────────────▶ Speed (low latency)
              ──────────────────────────▶
                      Cost (low)

You cannot maximize all three at once. The choice depends on your use case.

7.2 Evaluation metrics

To choose a model for RAG, evaluate on your own dataset with these metrics:

Metric	What it measures	Tool
Faithfulness	Is the response supported by the chunks?	RAGAS
Answer Relevancy	Does the response answer the question?	RAGAS
Context Precision	Are the retrieved chunks relevant?	RAGAS
Context Recall	Were all necessary chunks retrieved?	RAGAS
P95 Latency	95th percentile response time	Direct measurement
Cost per 1K questions	Input + output tokens × price	Calculated

Principle: do not choose a model by generic benchmark. Evaluate in your specific domain with an eval set of 50–200 questions with reference answers.

7.3 Recommended evaluation process

Build an eval set: 50–100 (question, correct answer) pairs based on your real documents.
Run the full RAG pipeline with each candidate model.
Measure faithfulness, answer relevancy, and latency.
Choose the model with the best quality/cost balance for your SLA.

7.4 When to use a small model

For light classification tasks (is this question about vacation, payroll, or benefits?), a small model (Haiku 4.5, GPT-4o-mini, Llama 3.1 8B) may be enough and 10–100x cheaper.

In RAGorbit, the model.intent node is exactly for this: classify before running expensive RAG. If the question is not about HR, you do not call the full pipeline.

8. Comparison: Claude vs OpenAI vs Gemini vs Llama/Mistral

8.1 Closed models (closed-source / proprietary)

Accessed via API. You cannot see the weights, you cannot deploy on your server.

	Claude (Anthropic)	GPT (OpenAI)	Gemini (Google)
Main models	Opus 4.8, Sonnet 4.6, Haiku 4.5	GPT-4o, GPT-4o-mini	Gemini 1.5 Pro, Flash
Context window	200K	128K	1M
Strengths	Long reasoning, instruction following, safety	Broad ecosystem, function calling	Huge window, multimodal, Google integration
Price (approx, 2025)	Opus: ~$15/MTok output	GPT-4o: ~$10/MTok output	Pro: ~$7/MTok output
Offline mode	No	No	No
RAGorbit default	`anthropic:claude-opus-4-8`	configurable	configurable

RAGorbit uses LangChain's init_chat_model, which supports all these providers by changing only the model field.

8.2 Open-weights models

Weights are public. You can download them, run them on your hardware or in Ollama.

	Llama (Meta)	Mistral	Gemma (Google)
License	Llama 3 Community	Apache 2.0 (Mistral 7B)	Gemma Terms
Models	Llama 3.1 8B/70B/405B	Mistral 7B, Mixtral 8x7B, Mistral Large	Gemma 2 2B/9B/27B
How to use	Ollama, HuggingFace, vLLM	Ollama, Mistral API	Ollama, HuggingFace
Cost	Infrastructure only	Infrastructure only	Infrastructure only
Privacy	Total (no external API)	Total if local	Total if local

8.3 Hugging Face and Ollama

Hugging Face is the central repository for open-weights models. It has an inference API (paid or free with limits) and thousands of models for embeddings, LLMs, vision models, etc.

Ollama is the easiest way to run models locally:

ollama run llama3.1         # descarga y corre Llama 3.1 8B
ollama run mistral          # Mistral 7B
ollama run nomic-embed-text # embeddings locales

In RAGorbit you can point to Ollama by changing model to ollama:llama3.1 in the model.llm node.

8.4 When to use what

Situation	Recommendation
Quick prototype, cost doesn't matter	Claude Opus 4.8 or GPT-4o
High-quality production, flexible budget	Claude Sonnet 4.6 or GPT-4o
High-volume production, minimize cost	Claude Haiku 4.5 or GPT-4o-mini
Confidential data, no cloud	Llama 3.1 70B via Ollama or vLLM
No network / isolated environment	Ollama with local model
Embeddings with total privacy	Local `bge-large` or `E5` via Ollama

9. RAG vs fine-tuning vs pure prompting: when to use each

9.1 Three strategies to "teach" the model

PURE PROMPTING             RAG                      FINE-TUNING
──────────────────         ──────────────────────   ──────────────────
Instructions in            Retrieve relevant        Retrain model
the prompt                 documents and put        weights with
                           them in the prompt       your data

Cost:       minimal        medium                    high
Privacy:    low            medium                    high (if local)
Updatable:  instant        when you update index     when you retrain
Needs data: no             yes (documents)          yes (many Q/A pairs)

9.2 Pure prompting

When it works well:

General tasks where the model already has good knowledge (draft, reformat, translate).
Recent public knowledge well covered in pretraining.
Few-shot for simple classification tasks.

When it is not enough:

The model does not know your private data.
Information changes frequently (knowledge cutoff).
You need to cite the exact source (the model can invent citations).

9.3 RAG

When to use RAG:

You have proprietary documents (manuals, contracts, FAQs, knowledge base).
Information changes and you need the model to always use the latest version.
You need traceability (cite the exact fragment that supported the answer).
Limited budget for fine-tuning.

When RAG alone is not enough:

Tasks requiring very domain-specific reasoning (e.g. interpreting complex legal clauses) — RAG provides context, but the base model may not reason well over it without additional training.
When "knowledge" is procedural and lives in model behavior, not in documents.

9.4 Fine-tuning

When to consider fine-tuning:

You need the model to adopt a very specific style or tone (brand voice).
The domain is so specialized that the base model makes consistent errors even when given context (medicine, highly technical law).
You have many high-quality (input, output) pairs (minimum 500–1000, better 5,000+).
Production volume justifies training cost.

When NOT to use fine-tuning:

As a first option (RAG is cheaper and faster to iterate).
When data changes frequently (retraining is costly).
When you do not have quality data.

9.5 Combination: RAG + fine-tuning

In mature systems they are used together: the fine-tuned model understands the domain better and reasons better over the context RAG provides. Fine-tuning teaches "how to reason"; RAG provides the "what".

9.6 Decision table

Do you have updatable proprietary documents?
  NO → Pure prompting (zero/few-shot)
  YES → RAG

Does RAG + base model give sufficient quality?
  YES → Stay with RAG
  NO → Do you have +1000 quality Q/A pairs?
         NO → Improve prompting/retrieval
         YES → Consider fine-tuning on the base model
              or fine-tuning + RAG

11. Layer ③ explained: LangChain from scratch

Prerequisite: have implemented lab layer ② (lab/solucion_scratch.py) or at least understand each function you wrote by hand. This section is the LangChain foundation for the entire course — modules M2–M11 will link here. Read it completely before attempting to write lab/solucion_framework.py.

Environment: on this study machine there is no pip or network. You will not be able to run the code in this section here. The goal is that, when you have an environment with pip install langchain langchain-community langchain-openai chromadb and an API key, you can write the framework solution yourself — not just read it.

11.1 What LangChain is and why it exists

Imagine you just finished solucion_scratch.py. It works. But to take it to production you need:

Real embeddings (OpenAI API or local Ollama).
A vector store with an efficient index (Chroma, not an in-memory list).
A real LLM (GPT-4o, Claude…).
Reusable prompt templates.
Wire everything: query → retrieve → format → prompt → LLM → text.

In scratch, you wrote that wiring function by function. In a real system, that wiring repeats in every RAG project with small variations. LangChain exists so you don't rewrite the wiring every time — it gives you standard pieces that fit together.

Analogy: you built a circuit with loose wires (scratch). LangChain is a box of modules with standard connectors: each piece has a known interface (Document, Embeddings, VectorStore, Retriever, Runnable) and you connect them with | instead of calling functions by hand.

SCRATCH (you wire everything)          LANGCHAIN (pieces + connectors)
─────────────────────────              ─────────────────────────────────
cargar_chunks()          ────────────▶  TextLoader + CharacterTextSplitter
embed()                  ────────────▶  OpenAIEmbeddings (Embeddings interface)
lista en memoria         ────────────▶  Chroma.from_documents(...)
similitud + sort         ────────────▶  vectorstore.as_retriever(...)
construir_prompt()       ────────────▶  ChatPromptTemplate
llm fake                 ────────────▶  ChatOpenAI / ChatAnthropic
main() secuencial        ────────────▶  LCEL chain with | operator

What problem it removes:

Without LangChain	With LangChain
Reimplement chunking, embedding, top-k search in every project	Loaders, splitters, stores, and retrievers ready
Switching OpenAI to Anthropic = rewrite HTTP calls	Change `ChatOpenAI` to `ChatAnthropic` — one line
Pipeline is imperative code hard to test step by step	LCEL decomposes flow into composable steps (`Runnable`)
No convention: each team names things differently	Same interface in tutorials, RAGorbit, and production

What LangChain does NOT do: it does not improve RAG quality by itself. If your chunks are bad or the prompt is weak, LangChain won't fix it. It only orchestrates better what you already designed in §5 and §6.

11.2 Bridge table: scratch → LangChain

This table maps each function in solucion_scratch.py to its LangChain abstraction in solucion_framework.py:

What you did by hand (layer ②)	LangChain piece (layer ③)	RAGorbit node (template 09)
`cargar_chunks(ruta)` — read txt and split by `---`	`TextLoader(...).load()` + `CharacterTextSplitter(...).split_documents(...)`	`loader` + `ingest.chunker`
`embed(texto)` — bag-of-words → `dict`	`OpenAIEmbeddings(model=...)` (implements `Embeddings` interface)	`model.embedding`
In-memory `chunks` list + vectors computed on the fly	`Chroma.from_documents(documents, embedding, collection_name=...)`	`store.chroma`
`similitud_coseno()` + `sort` in `recuperar()`	`vectorstore.as_retriever(search_type="similarity", search_kwargs={"k": 3})`	`retrieval.vector`
`recuperar()` returns `(index, sim, text)`	`retriever.invoke(query)` returns `list[Document]`	(same retriever)
`construir_prompt()` — f-string with numbered chunks	`ChatPromptTemplate.from_messages([("system",...), ("human",...)])`	`logic.prompt`
(no real LLM in scratch)	`ChatOpenAI(model=..., temperature=...)` or `ChatAnthropic(...)`	`model.llm`
`main()` calls functions in order	LCEL chain: `dict	prompt

11.3 The `Document` object

LangChain does not work with loose strings for indexable documents. It uses Document:

# Conceptual — each chunk is a Document
doc = Document(
    page_content="POLÍTICA DE VACACIONES §3 — Acumulación y disfrute\nLos empleados...",
    metadata={"source": "datos/politicas_rrhh.txt", "chunk": 0},
)

page_content: the fragment text (equivalent to each string in your scratch chunks list).
metadata: dictionary of tags (source, section, date…). In scratch you had no metadata; in production it enables hard filters (M4): "only chunks from section=§3".

Loaders and splitters produce list[Document]. Vector stores consume list[Document]. Retrievers return list[Document].

11.4 Loaders: `TextLoader`

A loader reads an external source and converts it into LangChain documents.

from langchain_community.document_loaders import TextLoader

loader = TextLoader("datos/politicas_rrhh.txt", encoding="utf-8")
documentos_raw = loader.load()
# documentos_raw: list[Document] — typically ONE Document with the whole file

Scratch equivalent: open the file and read contenido = f.read() — but wrapped in a Document with metadata={"source": "datos/politicas_rrhh.txt"}.

In M2 you will see loaders for PDF, web, SQL, etc. The pattern is always the same: .load() → list[Document].

11.5 Text splitters: `CharacterTextSplitter`

The policy file has 8 fragments separated by \n---\n. In scratch you did re.split(r"\n---\n", contenido). In LangChain:

from langchain.text_splitter import CharacterTextSplitter

splitter = CharacterTextSplitter(
    separator="\n---\n",    # dónde cortar (igual que tu separador)
    chunk_size=1000,         # máximo de caracteres por chunk (respaldo si un bloque es enorme)
    chunk_overlap=0,         # cuántos caracteres se solapan entre chunks consecutivos
    keep_separator=False,    # si True, el separador queda dentro del chunk
)
chunks = splitter.split_documents(documentos_raw)
# chunks: list[Document] — 8 Document, uno por fragmento de política

Parameter	What it does
`separator`	String (or regex) where to split. Here it replicates `cargar_chunks()`.
`chunk_size`	Character limit. If a block exceeds 1000, it splits again.
`chunk_overlap`	Repeat N characters from the end of the previous chunk at the start of the next — useful to avoid cutting sentences in half (M2).
`keep_separator`	`False` = the `---` does not appear in `page_content`.

.split_documents(...) receives list[Document] and returns smaller list[Document]. Do not confuse with .split_text(...) which works on strings.

11.6 The `Embeddings` interface

In scratch, embed() returned a dict[str, float]. In production, an embedding is a dense vector of hundreds or thousands of floats. LangChain unifies all providers under the Embeddings interface:

from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
# La API key se lee de la variable de entorno OPENAI_API_KEY

vec = embeddings.embed_query("¿Cuántos días de vacaciones?")
# vec: list[float] de 1536 dimensiones

vectores = embeddings.embed_documents(["chunk 1", "chunk 2"])
# vectores: list[list[float]] — uno por documento

Key methods:

Method	When used	Scratch equivalent
`embed_query(texto)`	A user question (online phase)	`embed(query)`
`embed_documents(lista)`	Many chunks when indexing (offline phase)	`embed(chunk)` in a loop

Local alternative (no API key):

from langchain_community.embeddings import OllamaEmbeddings

embeddings = OllamaEmbeddings(model="nomic-embed-text")

Chroma and other stores do not know whether you use OpenAI or Ollama — they only call .embed_query() / .embed_documents() on the object you pass. That is the power of the interface.

11.7 VectorStore: `Chroma.from_documents`

A vector store stores (Document, vector) pairs and enables similarity search. In scratch it was an in-memory list; in LangChain:

from langchain_community.vectorstores import Chroma

vectorstore = Chroma.from_documents(
    documents=chunks,              # los 8 Document del splitter
    embedding=embeddings,            # objeto OpenAIEmbeddings
    collection_name="hr_policies",   # nombre de la colección (como en template 09)
)

What .from_documents does internally (offline phase):

chunks (8 Document)
    │
    ├──▶ embeddings.embed_documents([doc.page_content for doc in chunks])
    │         → 8 vectores de 1536 floats
    │
    └──▶ Chroma almacena (id, vector, page_content, metadata) en índice HNSW

Equivalent to your loop for chunk in chunks: embed(chunk) + store in memory, but with an index optimized for millions of vectors. To persist to disk: persist_directory="./chroma_db" (M3).

11.8 Retriever: `as_retriever` and `.invoke`

The vector store knows how to search, but the RAG pipeline wants an object with a uniform interface: Retriever. You obtain it like this:

retriever = vectorstore.as_retriever(
    search_type="similarity",      # búsqueda por similitud coseno
    search_kwargs={"k": 3},        # top-3, como k=3 en recuperar()
)

resultado = retriever.invoke("¿Cuántos días de vacaciones si llevo 3 años?")
# resultado: list[Document] — 3 documentos, del más al menos similar

Scratch equivalent:

# recuperar(query, chunks, k=3) → list[tuple[int, float, str]]
# retriever.invoke(query)        → list[Document]  (sin índice ni score expuesto por defecto)

Parameter	Meaning
`search_type="similarity"`	Orders by cosine similarity (Chroma default).
`search_kwargs={"k": 3}`	How many documents to return — template 09's `topK=4`.

Important prediction: retriever.invoke(query) does not return a string or an embedding. It returns list[Document] — objects with .page_content and .metadata. See exercise 19.

11.9 Chat models: `ChatOpenAI` and `ChatAnthropic`

In scratch you did not call a real LLM. In framework:

from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0.2)

# Alternativa Anthropic (default en RAGorbit):
from langchain_anthropic import ChatAnthropic

llm = ChatAnthropic(model="claude-opus-4-8", temperature=0.2)

model: model identifier in the provider API.
temperature: same concept as §2.1 — for factual RAG use 0.0–0.2.

The llm object is a Runnable: you can compose it with | (see §11.10). When it receives a formatted prompt, it returns an AIMessage (not a plain string — that's why you need StrOutputParser at the end).

11.10 Prompt templates: `ChatPromptTemplate`

In scratch, construir_prompt() was an f-string. In LangChain, prompts are templates with variables:

from langchain.prompts import ChatPromptTemplate

prompt = ChatPromptTemplate.from_messages([
    ("system", "Eres el asistente de RRHH. Responde SOLO con los fragmentos dados."),
    ("human", """Fragmentos relevantes:
{contexto}

Pregunta del empleado: {pregunta}

Responde en markdown."""),
])

("system", ...) → system message (§3.1).
("human", ...) → user message with variables {contexto} and {pregunta}.
When invoked with {"contexto": "...", "pregunta": "..."}, LangChain fills the variables and produces a ChatPromptValue ready for the LLM.

Variables must match exactly the keys of the dict that feeds the chain (next section).

11.11 LCEL: the `|` operator, `Runnable`, and the dict pattern

LCEL (LangChain Expression Language) is how you compose pipeline steps. Three key ideas:

Idea 1: everything chainable is a `Runnable`

A Runnable is any LangChain object that implements .invoke(input) (and optionally .stream(), .batch()). Examples: retriever, prompt, llm, StrOutputParser, wrapped functions.

The | operator connects two Runnables: the left output goes into the right.

A | B | C
  ≡  C(B(A(input)))

Idea 2: `RunnablePassthrough` passes input through unchanged

from langchain.schema.runnable import RunnablePassthrough

RunnablePassthrough()  # invoke("hola") → "hola"

Useful when one branch of the pipeline needs the original input (the question) while another branch transforms it (retrieve chunks).

Idea 3: the dict runs branches in parallel and fills the prompt

from langchain.schema.output_parser import StrOutputParser

def formatear_chunks(docs: list) -> str:
    return "\n\n".join(f"[{i+1}] {d.page_content}" for i, d in enumerate(docs))

chain = (
    {
        "contexto": retriever | formatear_chunks,
        "pregunta": RunnablePassthrough(),
    }
    | prompt
    | llm
    | StrOutputParser()
)

Step-by-step flow when you call chain.invoke(query):

INPUT: query = "¿Cuántos días de vacaciones...?"

STEP 1 — The dict (parallel branches):
┌─────────────────────────────────────────────────────────────┐
│  "contexto": retriever | formatear_chunks                   │
│      query ──▶ retriever.invoke(query)                      │
│             ──▶ list[Document] (3 docs)                       │
│             ──▶ formatear_chunks(docs)                      │
│             ──▶ "[1] POLÍTICA §4...\n\n[2] POLÍTICA §3..."   │
│                                                             │
│  "pregunta": RunnablePassthrough()                          │
│      query ──▶ query  (sin cambios)                         │
└─────────────────────────────────────────────────────────────┘
        │
        ▼
  {"contexto": "[1] ...", "pregunta": "¿Cuántos días..."}

STEP 2 — prompt:
  ChatPromptTemplate fills {contexto} and {pregunta}
        │
        ▼
  ChatPromptValue (system + human messages ready)

STEP 3 — llm:
  Provider API → AIMessage with response
        │
        ▼
STEP 4 — StrOutputParser:
  Extracts text string from AIMessage
        │
        ▼
OUTPUT: "Según la Política §3, tienes derecho a 18 días hábiles..."

Why the dict and not a single linear chain: the question must reach the prompt intact ({pregunta}), but the retriever needs the same question as input to search. RunnablePassthrough() prevents the question from being lost or overwritten by chunks.

StrOutputParser: the LLM returns a rich object (AIMessage). The parser extracts .content as str — what you print or return to the user.

11.12 Lab pipeline walkthrough, block by block

This is the complete walkthrough of lab/solucion_framework.py, line by line conceptually:

┌──────────────────────────────────────────────────────────────────┐
│  IMPORTS                                                         │
│  TextLoader, CharacterTextSplitter, OpenAIEmbeddings, Chroma,    │
│  ChatOpenAI, ChatPromptTemplate, RunnablePassthrough,            │
│  StrOutputParser                                                 │
└──────────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌──────────────────────────────────────────────────────────────────┐
│  BLOCK 1 — LOAD AND CHUNK          (≈ cargar_chunks)          │
│  loader = TextLoader("datos/politicas_rrhh.txt")                 │
│  documentos_raw = loader.load()        # 1 Document grande       │
│  splitter = CharacterTextSplitter(separator="\n---\n", ...)       │
│  chunks = splitter.split_documents(...)  # 8 Document            │
└──────────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌──────────────────────────────────────────────────────────────────┐
│  BLOCK 2 — EMBEDDINGS + CHROMA       (≈ embed + índice)         │
│  embeddings = OpenAIEmbeddings(model="text-embedding-3-small")   │
│  vectorstore = Chroma.from_documents(chunks, embeddings, ...)    │
│  # Indexa 8 vectores semánticos en memoria                       │
└──────────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌──────────────────────────────────────────────────────────────────┐
│  BLOCK 3 — RETRIEVER                 (≈ recuperar)              │
│  retriever = vectorstore.as_retriever(                           │
│      search_type="similarity", search_kwargs={"k": 3})           │
└──────────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌──────────────────────────────────────────────────────────────────┐
│  BLOCK 4 — PROMPT + LLM              (≈ construir_prompt + LLM) │
│  llm = ChatOpenAI(model="gpt-4o-mini", temperature=0.2)        │
│  prompt = ChatPromptTemplate.from_messages([                     │
│      ("system", SYSTEM_PROMPT),                                  │
│      ("human", HUMAN_TEMPLATE),  # {contexto}, {pregunta}        │
│  ])                                                              │
└──────────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌──────────────────────────────────────────────────────────────────┐
│  BLOCK 5 — LCEL CHAIN                (≈ main orquestado)        │
│  chain = (                                                       │
│      {"contexto": retriever | formatear_chunks,                  │
│       "pregunta": RunnablePassthrough()}                        │
│      | prompt | llm | StrOutputParser()                         │
│  )                                                               │
└──────────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌──────────────────────────────────────────────────────────────────┐
│  BLOCK 6 — EXECUTE                                             │
│  chunks_recuperados = retriever.invoke(query)  # inspección      │
│  respuesta = chain.invoke(query)               # respuesta final │
└──────────────────────────────────────────────────────────────────┘

Ranking difference vs scratch: with real semantic embeddings, §3 ("Después de 3 años… 18 días") usually ranks first — not §4 as in bag-of-words. The mechanism is identical; vector quality changes (§6.4).

11.13 When to use LangChain / when NOT — and gotchas

When YES:

RAG prototypes and production where you want to switch providers (OpenAI ↔ Anthropic ↔ Ollama) without rewriting.
Pipelines with many steps (retrieve → rerank → prompt → LLM → parser) — LCEL composes them cleanly.
Teams already using the ecosystem (LangSmith, LangGraph in M6+).

When NOT (or not LangChain alone):

One-off 30-line script, no provider change → scratch or direct requests may suffice.
Maximum latency/cost control → direct API calls without intermediate layer.
Already using LlamaIndex/CrewAI with another mental model → don't mix two frameworks without reason (M2 compares).

Common gotchas:

Gotcha	What happens	Solution
Package versions	Imports change between LangChain 0.1 and 0.2+ (`langchain.schema` vs `langchain_core`)	Pin versions in `requirements.txt`; this course uses the style of `solucion_framework.py`
Missing API key	`OpenAIEmbeddings` / `ChatOpenAI` fail without `OPENAI_API_KEY`	Export the variable or use `OllamaEmbeddings` + local model
`\|` with non-Runnable object	`TypeError` when composing	Only Runnables, functions, or dicts of Runnables in LCEL
Prompt variables	`{context}` in template but `"contexto"` in dict → KeyError	Identical names in template and dict
`retriever.invoke()` vs `chain.invoke()`	First returns docs; second returns LLM response	Use retriever only to inspect; chain for final response
`CharacterTextSplitter` with wrong separator	1 giant chunk or too many chunks	Same `\n---\n` as in scratch

11.14 Environment note and next step

Do not run this section in the course environment without network. Study, write solucion_framework.py in the lab (see lab/enunciado.md layer ③), and compare with the reference solution.

Cross-links:

Minimal RAG pattern (the 4 steps): §5
Embeddings and cosine similarity (what replaces embed()): §6
Scratch + framework lab: lab/enunciado.md, lab/solucion_framework.py
Full template 09: ../../examples/09-hr-policy-assistant/

Beyond Lang*: this same HR RAG is implemented with LlamaIndex, with the provider native SDK (no framework), and with Haystack in ../referencia/rag-sin-langchain.md. LangChain is the course default because it is what generates RAGorbit, but the goal is for you to understand the mechanism (layer ②) and be able to use any stack. Also read the honest critiques of the LangChain/LangGraph/LangSmith stack.

12. Checkpoint

You know it if you can…

Explain what a token is and approximately calculate how many tokens a paragraph has.
Describe what happens when temperature is 0 vs 0.7.
Write a system/user prompt for the HR assistant that avoids hallucinations.
Explain the 4 steps of the minimal RAG pattern without looking.
Draw the offline and online phase diagram for template 09.
Calculate cosine similarity between two 3-dimensional vectors by hand.
Decide between RAG, fine-tuning, and pure prompting for a given case.
Name at least 2 open models and how to run them locally.
Map each function in solucion_scratch.py to its LangChain piece (table §11.2).
Explain what retriever.invoke(query) returns and what the | operator does in LCEL.
Write from scratch (on paper or in an editor) the lab LCEL chain without copy-pasting.

If something is unclear, review:

Tokens → §1.2
Temperature → §2.1
Why RAG → §4
Cosine similarity → §6.3
RAG vs fine-tuning → §9
LangChain from scratch → §11

Next: go to ejercicios.md (includes LangChain block) and then to lab/enunciado.md.

Cross-links:

model node catalog: ../../docs/02-node-catalog.md#model--modelos

Full HR template: ../../examples/09-hr-policy-assistant/

Compared technologies (full table of models and stores): ../referencia/tecnologias-comparadas.md

Glossary: ../referencia/glosario.md

← Back to course View on GitHub →