Advanced retrieval
M4 · Advanced retrieval and query operations
Module 4 of the RAG & Agentic AI course · RAGorbit
Nodes covered: retrieval.*, query.*, store.neo4j, store.multi-index
Anchor templates: 05-legal, 07-telecom, 08-manufacturing, 03-healthcare
Week 4 · ~32 h (reading + exercises + workshop)
Table of contents
- The problem of retrieving the right content
- Dense (vector) search
- Keyword search: BM25
- Hybrid search
- Reranking with cross-encoder
- Parent-child retrieval
- Hard filters as a safety guardrail
- Multi-index routing
- Query rewriting and intent detection
- GraphRAG and knowledge graphs (Neo4j)
- Technology comparison
- RAGorbit nodes
- Layer ③ explained: LangChain retrievers from scratch
- Checkpoint
1. The problem of retrieving the right content
RAG has a dependency chain: if retrieval fails, the LLM cannot generate a correct answer even if it is the best model in the world. You can have the most accurate embedding model and the most expensive LLM, but if the retrieved chunks are irrelevant or belong to the wrong domain, the answer will be wrong—or worse, plausibly wrong.
Retrieval fails in three dimensions:
SEMANTIC FAILURE — the user writes "baja de plan"
the embedding does not associate it with "cancelación de servicio"
→ BM25 or query rewriting fixes it
DOMAIN FAILURE — the A320 technician gets a torque limit from the 787
because the embedding is similar
→ hard filters fix it
RANKING FAILURE — the top-5 has relevant fragments but in positions 4 and 5
the LLM uses the first two (noisier)
→ reranking fixes it
This module covers the tools to address each type of failure.
2. Dense (vector) search
What it is
Dense search converts the query and each document into high-dimensional dense vectors (e.g. 1536 dimensions with text-embedding-3-small, or 3072 by default with text-embedding-3-large) and measures cosine similarity or dot product. It was covered in depth in M3. Here we recall it as a comparison point.
Query: "procedimiento inspección tren de aterrizaje"
↓ embedding model
[0.23, -0.11, 0.87, ...] ← 1536-dim vector
Corpus:
doc_A: [0.21, -0.09, 0.85, ...] sim=0.97 ← highly relevant
doc_B: [0.70, 0.45, 0.10, ...] sim=0.31 ← barely relevant
doc_C: [0.22, -0.10, 0.86, ...] sim=0.95 ← highly relevant
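The comparison illustrated above can be sketched in plain Python. This is a minimal sketch, not the course implementation: the function names `cosine_similarity` and `rank_by_similarity` are hypothetical, and the vectors are cut down to the three components shown in the illustration, so the resulting scores differ from the sim values printed there.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention for zero vectors: no similarity
    return dot / (norm_a * norm_b)

def rank_by_similarity(query_vec, corpus):
    """Return (doc_id, similarity) pairs, most similar first."""
    scored = [(doc_id, cosine_similarity(query_vec, vec)) for doc_id, vec in corpus.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Toy 3-component vectors; a real embedding would have e.g. 1536 components.
query = [0.23, -0.11, 0.87]
corpus = {
    "doc_A": [0.21, -0.09, 0.85],
    "doc_B": [0.70, 0.45, 0.10],
    "doc_C": [0.22, -0.10, 0.86],
}
ranking = rank_by_similarity(query, corpus)
```

In a real system this brute-force loop is replaced by an approximate index (such as HNSW) inside the vector store; the scoring idea is the same.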
When it works well
- The query and documents are in the same semantic domain.
- Vocabulary is not highly specialized and does not rely on jargon unseen during embedding model training.
- Chunks are moderate size (200–1000 tokens).
When it fails
- Highly specific technical jargon: "ATA 32-11-00" or "RRF" do not have good representations in general-purpose embeddings.
- Short, exact queries: "GDPR artículo 17" retrieves better with keywords than with vectors.
- Internal company terms: the corporate glossary is not in the training data.
RAGorbit node
retrieval.vector — topK: 4 by default. Accepts hardFilters[] (see §7).
3. Keyword search: BM25
What it is and where it comes from
BM25 (Best Matching 25) is the probabilistic ranking function that Elasticsearch uses by default and that was the state of the art in information retrieval for decades before embeddings took off. Its name comes from a series of experiments in the 1970s–90s (Okapi BM11, BM15… up to BM25).
The BM25 formula
For a query q with terms q_1 ... q_n and a document d:
$$
\mathrm{BM25}(q, d) = \sum_{i=1}^{n} \mathrm{IDF}(q_i) \cdot \frac{f(q_i, d) \cdot (k_1 + 1)}{f(q_i, d) + k_1 \cdot \left(1 - b + b \cdot \frac{|d|}{\mathrm{avgdl}}\right)}
$$
Where:
| Symbol | Meaning |
|---|---|
| `f(q_i, d)` | Term frequency of q_i in document d |
| `IDF(q_i)` | Inverse document frequency: log((N - n_i + 0.5) / (n_i + 0.5) + 1) |
| `N` | Total documents in the corpus |
| `n_i` | Number of documents containing q_i |
| `\|d\|` | Length of document d (number of terms) |
| `avgdl` | Average document length |
| `k1` | TF saturation parameter (typical: 1.2–2.0) |
| `b` | Length normalization parameter (typical: 0.75) |
Formula intuition
IDF: A term that appears in few documents is highly discriminative. "Tren de aterrizaje" appears in few corpus documents → high IDF → that term weighs heavily. "El" appears in all → IDF ≈ 0 → that term does not discriminate.
TF with saturation (k1): Relevance does not grow linearly with frequency. If "mantenimiento" appears 1 time vs 2 times, there is a difference. If it appears 50 vs 51 times, the difference is almost nil. Parameter k1 controls that saturation.
Length normalization (b): A 1000-word document will naturally have more repetitions of any term than a 100-word one. Parameter b penalizes long documents so they do not dominate ranking simply by being long. With b=0.75, partial normalization is applied (not total).
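The three ingredients above (IDF, saturated TF, length normalization) can be put together in a few lines. The following is a minimal sketch of the formula in this section, assuming whitespace tokenization and a corpus with at least one non-empty document; the function name `bm25_scores` and the toy documents are hypothetical, and production code would use an inverted index rather than a loop over every document.

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    """Score each tokenized document in `docs` against `query_terms`."""
    n_docs = len(docs)
    avgdl = sum(len(doc) for doc in docs) / n_docs
    # n_i: number of documents that contain each query term
    doc_freq = {t: sum(1 for doc in docs if t in doc) for t in set(query_terms)}
    scores = []
    for doc in docs:
        tf = Counter(doc)
        score = 0.0
        for term in query_terms:
            n_i = doc_freq[term]
            idf = math.log((n_docs - n_i + 0.5) / (n_i + 0.5) + 1)
            f = tf[term]  # 0 when the term is absent
            length_norm = 1 - b + b * len(doc) / avgdl
            score += idf * f * (k1 + 1) / (f + k1 * length_norm)
        scores.append(score)
    return scores

# Toy corpus (made up for illustration), tokenized on whitespace.
docs = [
    "inspección tren de aterrizaje principal".split(),
    "el tren de morro requiere inspección visual".split(),
    "política de equipaje de cabina".split(),
]
scores = bm25_scores("tren aterrizaje".split(), docs)
```

The first document matches both query terms, including the rarer "aterrizaje", so it scores highest; the third matches neither and scores exactly zero.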
Why BM25 complements embeddings
Query: "ATA 32-11-00"
BM25: ← retrieves exact "ATA 32-11-00", high score
Vector: ← "landing gear chapter 32" may be semantically closer
but exact string "ATA 32-11-00" has better BM25
Query: "procedimiento para revisar sistemas hidráulicos antes de vuelo"
BM25: ← may fail if the doc says "inspección pre-vuelo de actuadores"
Vector: ← captures semantics even when words differ
Embedding models capture semantic intent but lose exact matches of technical terms. BM25 does the opposite. The combination is more robust than either alone.
RAGorbit node: retrieval.hybrid
Internally combines a vector retriever and a BM25 retriever with parameter alpha controlling relative weight.
4. Hybrid search
Fusion strategy: Reciprocal Rank Fusion (RRF)
RRF is the most common fusion method for combining result lists from different retrievers. The idea: instead of combining scores directly (which have different scales), use each document's rank in each list.
$$
\mathrm{RRF}(d) = \sum_{r \in R} \frac{1}{k + r(d)}
$$
Where r(d) is document d's position in retriever r's list, and k is a smoothing constant (typical: 60).
Concrete example:
BM25 returns: doc_A (rank 1), doc_C (rank 2), doc_B (rank 3)
Vector returns: doc_C (rank 1), doc_A (rank 2), doc_D (rank 3)
RRF(doc_A) = 1/(60+1) + 1/(60+2) = 0.01639 + 0.01613 = 0.03252
RRF(doc_C) = 1/(60+2) + 1/(60+1) = 0.01613 + 0.01639 = 0.03252
RRF(doc_B) = 1/(60+3) + 0 = 0.01587
RRF(doc_D) = 0 + 1/(60+3) = 0.01587
Fused result: doc_A, doc_C (tie), doc_B, doc_D
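The worked example above can be reproduced with a short function. This is a sketch under stated assumptions: the name `rrf_fuse` is hypothetical, ranks start at 1, a document missing from a list contributes nothing, and ties are broken alphabetically by document id (the formula itself does not define a tiebreak).

```python
from collections import defaultdict

def rrf_fuse(rankings, k=60):
    """Fuse several ranked lists of doc ids with Reciprocal Rank Fusion."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest score first; alphabetical order breaks ties (an assumption).
    return sorted(scores.items(), key=lambda pair: (-pair[1], pair[0]))

bm25_ranking = ["doc_A", "doc_C", "doc_B"]
vector_ranking = ["doc_C", "doc_A", "doc_D"]
fused = rrf_fuse([bm25_ranking, vector_ranking])
fused_ids = [doc_id for doc_id, _ in fused]
```

Because only ranks enter the sum, the raw BM25 and cosine scores never need to be brought onto a common scale.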
Weighted sum of normalized scores
Alternative to RRF when scores are on the same scale:
score_final(d) = alpha * score_vector(d) + (1 - alpha) * score_bm25(d)
With alpha=0.5 both get equal weight. Tune alpha by domain.
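A sketch of the weighted sum, assuming min-max normalization to bring both score sets into [0, 1] (the text does not say which normalization is used, so this choice is an assumption). The function names and the toy scores are hypothetical.

```python
def min_max(scores):
    """Rescale a {doc_id: score} dict into the [0, 1] range."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}

def weighted_fusion(vector_scores, bm25_scores, alpha=0.5):
    """score_final = alpha * vector + (1 - alpha) * bm25, on normalized scores."""
    vec = min_max(vector_scores)
    kw = min_max(bm25_scores)
    fused = {
        doc_id: alpha * vec.get(doc_id, 0.0) + (1 - alpha) * kw.get(doc_id, 0.0)
        for doc_id in set(vec) | set(kw)
    }
    return sorted(fused.items(), key=lambda pair: (-pair[1], pair[0]))

# Made-up raw scores: cosine similarities and BM25 scores live on different scales.
vector_scores = {"doc_A": 0.82, "doc_C": 0.91, "doc_D": 0.55}
bm25_scores = {"doc_A": 12.4, "doc_C": 7.1, "doc_B": 3.0}

keyword_heavy = weighted_fusion(vector_scores, bm25_scores, alpha=0.3)
semantic_heavy = weighted_fusion(vector_scores, bm25_scores, alpha=0.7)
```

With these toy numbers, alpha=0.3 puts doc_A (the best BM25 hit) first, while alpha=0.7 puts doc_C (the best vector hit) first.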
When to use hybrid
| Situation | Recommended alpha |
|---|---|
| Technical domain with many exact identifiers | 0.3 (more BM25) |
| Conversational / natural language domain | 0.7 (more vector) |
| Unknown a priori | 0.5 (starting point) |
| With user feedback | tune with A/B testing |
Template 07 (Telecom)
The call center copilot uses retrieval.hybrid because agents mix technical jargon ("roaming internacional EE.UU.") with natural language ("¿qué le digo al cliente?"). BM25 captures exact glossary terms; the vector captures question intent.
Template 08 (Manufacturing)
AMM manuals have exact identifiers (ATA, section numbers, part numbers). BM25 is very precise for "Task 32-11-00-581-001". The vector captures "procedimiento inspección tren morro" even when the document says "nose landing gear inspection procedure".
5. Reranking with cross-encoder
The problem it solves
Retrievers (both vector and BM25) encode the query and each document separately and then compute similarity. This is efficient but imprecise: the model does not see the query and document together when producing the representation.
A cross-encoder (reranker) is a model that receives the query and the document together as input and produces a relevance score. It is much more accurate but also slower—which is why it is used only on the retrievers' top-K, not the full corpus.
TWO-STAGE PIPELINE (retrieve + rerank)
Step 1 — Fast retrieve (high recall)
BM25 + Vector → top-20 candidates
[fast, scales to millions of docs, but imprecise]
Step 2 — Precise rerank (high precision)
Cross-encoder scores query ↔ each candidate together
→ keeps top-3 most relevant
[slow, only applies to 20 candidates, very precise]
Why it improves precision
The bi-encoder (separate vectors) compresses the query into a vector without knowing which documents it will compare against. The cross-encoder, seeing both together, can capture subtle interactions:
Query: "límite de torque del actuador del tren de morro"
doc_A: "El torque máximo del actuador del tren principal es 45 Nm" ← mentions torque but of the MAIN gear
doc_B: "Para el tren de morro, el torque del actuador es 32 Nm" ← exactly what is being searched
Bi-encoder: doc_A may score similarly to doc_B (both talk about torque and gear)
Cross-encoder: doc_B scores much higher (nose gear + actuator + torque together)
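The two-stage shape can be sketched with a pluggable pair scorer. Everything here is illustrative: `retrieve_then_rerank` and `overlap_scorer` are hypothetical names, and the token-overlap scorer is only a stand-in for a real cross-encoder (it sees query and document together, but it is not a neural model and will not capture the subtle interactions a cross-encoder can).

```python
def overlap_scorer(query: str, text: str) -> float:
    """Stand-in for a cross-encoder: fraction of query tokens found in the text."""
    q_tokens = set(query.lower().split())
    t_tokens = set(text.lower().replace(",", " ").split())
    return len(q_tokens & t_tokens) / len(q_tokens) if q_tokens else 0.0

def retrieve_then_rerank(query, candidates, pair_scorer, top_n=3):
    """Stage 2 of the pipeline: score (query, document) pairs, keep the best top_n.

    `candidates` is the stage-1 output: a list of (doc_id, text) pairs.
    """
    scored = [(pair_scorer(query, text), doc_id) for doc_id, text in candidates]
    scored.sort(key=lambda pair: -pair[0])  # stable: ties keep retriever order
    return [doc_id for _, doc_id in scored[:top_n]]

query = "límite de torque del actuador del tren de morro"
candidates = [
    ("doc_A", "El torque máximo del actuador del tren principal es 45 Nm"),
    ("doc_B", "Para el tren de morro, el torque del actuador es 32 Nm"),
]
best = retrieve_then_rerank(query, candidates, overlap_scorer, top_n=1)
```

Swapping `overlap_scorer` for a function that calls an actual reranker model leaves the rest of the pipeline unchanged, which is the point of keeping the scorer as a parameter.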
Latency trade-off
| Component | Typical latency | Why |
|---|---|---|
| Vector retrieval (HNSW, top-20) | 10-50 ms | approximate in-memory index |
| BM25 (top-20) | 5-20 ms | inverted index in memory |
| BGE reranker over 20 docs | 50-150 ms | model forward pass per pair |
| Cohere API reranker over 20 docs | 100-300 ms | network call + large model |
The reranker adds ~100-200 ms to the pipeline, but the relevance improvement is usually worth it in domains where precision is critical (legal, healthcare, aviation).
Available reranker models
| Model | Type | Advantage | When to use |
|---|---|---|---|
| `bge-reranker-v2-m3` (BAAI) | Local cross-encoder | Free, no API, fast | Production without external dependencies |
| `rerank-english-v3.0` (Cohere) | Cloud API | High quality, very easy to integrate | Rapid prototyping, English |
| ColBERT | Late interaction | Latency/quality balance, allows pre-compute | Millions of docs |
| FlashRank | Very lightweight cross-encoder | Ultra fast, for edge/mobile | Latency < 50 ms critical |
Template 05 (Legal) and 07 (Telecom)
Both use retrieval.reranker with topN: 3. In legal, the reranker distinguishes a playbook fragment on "indemnización" in software contracts from the fragment on "indemnización" in infrastructure contracts—semantically similar but legally relevant in different ways. In telecom, it adjusts ranking based on agent feedback (feedbackRef).
RAGorbit node: retrieval.reranker
{
"type": "retrieval.reranker",
"config": {
"model": "bge-reranker",
"topN": 3,
"feedbackRef": "feedback_store" // optional: improves with usage signals
}
}
6. Parent-child retrieval
The chunk size dilemma
Small chunks (100-200 tokens): more precise for retrieving the exact fragment, but lose context (a sentence without its paragraph).
Large chunks (800-1200 tokens): have more context, but the vector representation averages the meaning of the whole chunk and can dilute the relevant signal.
Parent-child resolves this dilemma with a two-level strategy:
PARENT LEVEL (large chunks, 800+ tokens)
Complete section 32-11-00 (900-token procedure)
Complete section 32-11-01 (850-token procedure variant)
CHILD LEVEL (small chunks, 100-200 tokens)
Step 1: Coloca la aeronave en jack... (child of 32-11-00)
Step 2: Verifica el juego lateral... (child of 32-11-00)
Step 3: Inspecciona visualmente... (child of 32-11-00)
Step 4: Registra los resultados... (child of 32-11-00)
RETRIEVAL:
1. CHILDREN are indexed and retrieved (high precision)
2. PARENTS are returned to the LLM (full context)
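The retrieval step above amounts to a lookup from each retrieved child to its parent, with duplicates removed. A minimal sketch, assuming the child hits arrive as an ordered list of ids and that parents are held in a simple dict; the function name and the id scheme (`32-11-00#1`) are hypothetical.

```python
def parents_for_hits(child_hits, child_to_parent, parent_store):
    """Map retrieved child ids to their parent texts, keeping first-hit order."""
    seen = set()
    parents = []
    for child_id in child_hits:
        parent_id = child_to_parent[child_id]
        if parent_id not in seen:  # several children may share one parent
            seen.add(parent_id)
            parents.append(parent_store[parent_id])
    return parents

child_to_parent = {
    "32-11-00#1": "32-11-00",
    "32-11-00#2": "32-11-00",
    "32-11-01#1": "32-11-01",
}
parent_store = {
    "32-11-00": "Complete section 32-11-00",
    "32-11-01": "Complete section 32-11-01",
}
# Child ids as returned by the child-level retriever, best match first.
hits = ["32-11-00#2", "32-11-01#1", "32-11-00#1"]
context = parents_for_hits(hits, child_to_parent, parent_store)
```

Deduplicating matters here: two steps of the same procedure should send the full section to the LLM once, not twice.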
When to use parent-child
- Documents with clear hierarchical structure: technical manuals, contracts, clinical guidelines.
- When index chunks are semantically dense but you need expanded context for the LLM to answer well.
- Template 08: each AMM procedure step is a child; the full ATA section is the parent.
When it is not worth it
- Corpus of independent fragments (tweets, individual FAQs, blog posts).
- Chunks already moderate (400-600 tokens) where context is sufficient.
- When the extra latency of fetching the parent (second lookup) is not acceptable.
RAGorbit node: retrieval.parent-child
{
"type": "retrieval.parent-child",
"config": {
"parentField": "parent_id"
}
}
parent_id is set in ingest.metadata at indexing time, linking each child chunk to its parent document.
7. Hard filters as a safety guardrail
The difference between soft filter and hard filter
A soft filter (or soft hint) instructs the LLM to "prefer" documents of a certain type. Example: "Answer only using information from the PPO-Gold plan". The problem: the LLM can ignore it, "forget" it in long prompts, or reason that another document is "relevant enough".
A hard filter is applied at the retrieval layer, before any document reaches the LLM. It is a WHERE clause in SQL, a metadata filter in the vector store. The LLM simply never sees documents that fail the filter.
WITHOUT HARD FILTER:
Query: "criterios de RM de rodilla"
Vector store returns: chunks from PPO-Gold, PPO-Basic, PPO-Platinum mixed
LLM may use PPO-Platinum criteria for a PPO-Basic patient → CLINICAL ERROR
WITH HARD FILTER (hardFilter: plan = "PPO-Basic"):
Query: "criterios de RM de rodilla"
Vector store applies WHERE plan = 'PPO-Basic' before search
Only PPO-Basic chunks reach the LLM → correct by design
Why it is a guardrail, not just a filter
In high-consequence domains, the hard filter acts as a structural safety guardrail:
Healthcare (03-healthcare): A patient on PPO-Basic cannot receive PPO-Platinum criteria (more permissive). An incorrect "approved" is a legal and clinical problem.
Aviation (08-manufacturing): An A320 technician cannot receive 787 torque limits. Aircraft confusion is an FAA/EASA finding.
Civil aviation (01-airline): An Economy passenger's query must never pull Business policies into the LLM context; otherwise the answer could promise upgrades or benefits the passenger did not purchase.
In all these cases, the prompt instruction "use only the correct plan/aircraft data" is not enough. The hard filter is deterministic and inviolable.
Implementation in the retrieval.vector node
{
"type": "retrieval.vector",
"config": {
"topK": 5,
"hardFilters": ["aircraft_type", "ata_chapter"]
}
}
In production, the node converts this into a filtered query:
# Pseudocode for the node generated by RAGorbit
results = pgvector_store.similarity_search(
query_embedding,
k=5,
filter={
"aircraft_type": {"$eq": session.aircraft_type},
"ata_chapter": {"$eq": session.ata_chapter}
}
)
Filter values come from session context, not the LLM.
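The same idea without any vector store: filter by metadata first, then rank only what survives. This is a sketch, not RAGorbit's generated code; `apply_hard_filters` and the chunk layout are hypothetical. One design choice is assumed here: if the session lacks a required field, the function raises instead of silently skipping the filter, so the pipeline fails closed.

```python
def apply_hard_filters(chunks, session, filter_fields):
    """Keep only chunks whose metadata equals the session value for every field."""
    # KeyError on a missing session field is intentional: fail closed.
    required = {field: session[field] for field in filter_fields}
    return [
        chunk
        for chunk in chunks
        if all(chunk["metadata"].get(field) == value for field, value in required.items())
    ]

# Toy chunks, made up for illustration.
chunks = [
    {"id": "c1", "text": "A320 nose gear torque limits",
     "metadata": {"aircraft_type": "A320", "ata_chapter": "32"}},
    {"id": "c2", "text": "787 nose gear torque limits",
     "metadata": {"aircraft_type": "787", "ata_chapter": "32"}},
    {"id": "c3", "text": "A320 cabin lighting check",
     "metadata": {"aircraft_type": "A320", "ata_chapter": "33"}},
]
session = {"aircraft_type": "A320", "ata_chapter": "32"}
allowed = apply_hard_filters(chunks, session, ["aircraft_type", "ata_chapter"])
allowed_ids = [chunk["id"] for chunk in allowed]
```

Similarity search then runs over `allowed` only, so the 787 chunk cannot reach the LLM no matter how close its embedding is.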
Hard filter as a cross-cutting design pattern
This pattern appears in M3, M4, M5, and M9. In RAGorbit, hardFilters[] is available in retrieval.vector and retrieval.hybrid. Filterable fields are those tagged in ingest.metadata. The rule is: any dimension that determines which information is permissible for a specific user must be a hard filter, not a prompt instruction.
8. Multi-index routing
Why not a single index
The "simple" solution is to index everything in one vector store and search there. The problems:
Cross-domain noise: a query about "indemnización" in the context of a software contract may retrieve indemnification fragments from construction contracts—semantically similar but legally irrelevant.
Latency: searching an index of 1 million documents is slower than three indexes of 100k each.
Version control: updating the legal playbook should not affect the regulatory index.
Multi-index routing: architecture
INDEXES:
policy ← regulations, fares, legal terms
procedure ← step-by-step internal procedures
faq ← frequently asked questions
ROUTER RULES:
keyword "facturacion" → index: policy
keyword "procedimiento" → index: procedure
keyword "cómo puedo" → index: faq
fallback → index: faq
QUERY: "¿Cuánto me cobran por superar mi límite de datos?"
Router detects "cobran" → billing keyword → route to policy
Only searches policy → 0 noise from procedure or faq
Latency: 30ms (1 index) vs 90ms (3 indexes in parallel)
Two routing strategies
1. Keyword matching (deterministic)
for rule in rules:
if rule.keyword in query.lower():
return rule.index
return fallback
Advantages: microseconds, predictable, debuggable. Disadvantages: requires manual maintenance of the keyword glossary.
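A runnable variant of the loop above, as a sketch. It assumes rules are checked in order and that matching should ignore case and accents (so "facturación" matches the rule keyword "facturacion"); the `Rule` class, `normalize`, and `route` are hypothetical names.

```python
import unicodedata
from dataclasses import dataclass

@dataclass(frozen=True)
class Rule:
    keyword: str
    index: str

def normalize(text: str) -> str:
    """Lowercase and strip combining accents so 'facturación' matches 'facturacion'."""
    decomposed = unicodedata.normalize("NFD", text.lower())
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

def route(query: str, rules: list[Rule], fallback: str) -> str:
    """Return the index of the first rule whose keyword occurs in the query."""
    normalized_query = normalize(query)
    for rule in rules:
        if normalize(rule.keyword) in normalized_query:
            return rule.index
    return fallback

rules = [
    Rule("facturacion", "policy"),
    Rule("procedimiento", "procedure"),
    Rule("cómo puedo", "faq"),
]
billing_index = route("Tengo una duda sobre mi facturación", rules, fallback="faq")
unmatched_index = route("¿Cuánto me cobran por superar mi límite de datos?", rules, fallback="faq")
```

The second query shows the maintenance cost mentioned above: "cobran" is a billing term, but because it is not in the rule list the query falls through to the fallback index.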
2. Intent-based routing (lightweight ML)
Uses the model.intent classifier (lightweight embeddings, ~5-10ms) to detect query intent and route by label:
intent("¿cuánto me cobran?") → "facturacion" → policy
intent("cómo configuro el router?") → "soporte_tecnico" → procedure
Advantages: captures semantic variants ("¿cuánto es la tarifa?" → facturacion even without "cobran"). Disadvantages: requires training, can fail on ambiguous queries.
RAGorbit nodes
store.multi-index → groups several named Retrievers
retrieval.router → selects the correct index by rules[] or intent
Template 05 (Legal): three indexes, keyword routing
indexes: [playbook, regulations, precedent]
rules:
"indemniz" → playbook
"regulacion" → regulations
"precedente" → precedent
fallback: playbook
Template 07 (Telecom): three indexes, intent routing
indexes: [policy, procedure, faq]
rules:
facturacion → policy
soporte_tecnico → procedure
fallback: faq
9. Query rewriting and intent detection
Query rewriting
The rewriter normalizes the user's query before sending it to the retriever. Its two main functions:
1. Internal jargon normalization
"baja de plan" → "cancelación de servicio"
"roaming gringo" → "roaming internacional EE.UU."
"batería de la laptop" → "bateria litio portatil equipaje cabina"
This is a mapping from internal/colloquial terms to canonical terms that appear in indexed documentation. Without this step, BM25 fails (no term match) and the vector may fail (the colloquial term's embedding differs from the technical one).
2. Query expansion
Adds related terms to improve BM25 recall:
Original query: "RM rodilla"
Expanded query: "resonancia magnética rodilla menisco cartílago articulación"
This is especially useful in medical or legal domains where users submit short queries and documents use full terminology.
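Both functions of the rewriter can be sketched as dictionary lookups. This is a minimal sketch under assumptions: the glossary and expansion tables are hand-written (a real rewriter might use an LLM or a maintained corporate glossary), the query is lowercased before matching, and the names `JARGON`, `EXPANSIONS`, and `rewrite_query` are hypothetical.

```python
# Colloquial or internal phrase -> canonical term used in the indexed documents.
JARGON = {
    "baja de plan": "cancelación de servicio",
    "roaming gringo": "roaming internacional EE.UU.",
}
# Token -> related terms appended to improve BM25 recall.
EXPANSIONS = {
    "rm": ["resonancia", "magnética"],
    "rodilla": ["menisco", "cartílago", "articulación"],
}

def rewrite_query(query, jargon, expansions):
    """Normalize jargon phrases, then append expansion terms not already present."""
    text = query.lower()
    for colloquial, canonical in jargon.items():
        text = text.replace(colloquial, canonical)
    tokens = text.split()
    extra = []
    for token in tokens:
        for term in expansions.get(token, []):
            if term not in tokens and term not in extra:
                extra.append(term)
    return " ".join(tokens + extra)

normalized = rewrite_query("Quiero la baja de plan", JARGON, EXPANSIONS)
expanded = rewrite_query("RM rodilla", JARGON, EXPANSIONS)
```

The original tokens are kept in the expanded query, so an exact match on "rm" in a document still counts.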
Intent detection as the RAG gate
Intent detection is not only for routing: its first function is to be the gate that decides whether the query deserves activating the RAG pipeline at all.
CALL CENTER AUDIO FRAGMENTS:
"Oiga, y si viajo a Cancún..." → intent: facturacion (score 0.71) → RAG
"Sí, claro, aja... un momento" → intent: no_accionable (score 0.82) → DISCARD
"¿Cuánto cuesta el plan familiar?" → intent: facturacion (score 0.88) → RAG
Without this gate, 30-50% of audio fragments activate RAG unnecessarily, generating noise on the agent panel and consuming resources.
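The gate decision itself is a small predicate over the classifier output. A sketch, assuming the classifier returns a label and a confidence score as in the fragments above; the function name `should_run_rag` and the 0.6 confidence threshold are hypothetical and would need tuning on real traffic.

```python
def should_run_rag(intent: str, score: float, min_score: float = 0.6) -> bool:
    """Activate the RAG pipeline only for confident, actionable intents."""
    if intent == "no_accionable":
        return False  # filler speech: discard regardless of confidence
    return score >= min_score

# (intent, score) pairs taken from the call center fragments above.
decisions = [
    should_run_rag("facturacion", 0.71),
    should_run_rag("no_accionable", 0.82),
    should_run_rag("facturacion", 0.88),
]
```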
RAGorbit nodes
query.rewrite → normalizes jargon, expands terms
query.intent → detects intent, filters non-actionable, routes
model.intent → lightweight classifier (embeddings or small-LLM)
The difference between query.intent and model.intent in RAGorbit is that query.intent is oriented to the RAG gate (produces Decision and Query), while model.intent is the underlying classification model usable in more general contexts.
Full query ops pipeline (Template 07)
Audio → STT → model.intent → [if no_accionable: discard]
→ [if actionable: query.rewrite → retrieval.router → ...]
This pipeline removes noise before the first vector store call, with latency of only ~15 ms (intent: 10ms + rewrite: 5ms).
10. GraphRAG and knowledge graphs (Neo4j)
When vectors are not enough
Embeddings capture text semantics but not structural relationships. Consider:
"¿Qué procedimientos están afectados por la Directiva de Aeronavegabilidad AD-2024-0023?"
With vectors:
The query becomes a vector
Similar chunks are searched → may find some procedures
But CANNOT navigate: AD-2024-0023 → afecta a → SB-2023-32-001 → requiere → Task 32-11-001
With knowledge graph:
AD-2024-0023 is a node
It has typed relations: AFECTA_A → [SB-2023-32-001, SB-2023-32-002]
Each SB has: REQUIERE → [Task 32-11-001, Task 32-11-002]
A neighborhood query returns the whole subgraph in 1-2 hops
Knowledge graph fundamentals
Node: A domain entity. In an AMM: a procedure, an airworthiness directive, a part, a certified technician.
Relation (typed edge): A connection with semantics. Not just "A is related to B", but "AFECTA_A", "REQUIERE", "REEMPLAZA_A", "ES_PREREQUISITO_DE".
Neighborhood: The set of nodes and relations 1 or more hops from a given node. "Neighborhood" retrieval is what distinguishes GraphRAG from vector RAG.
GRAPH (partial view — AMM domain):
[AD-2024-0023] --AFECTA_A--> [SB-2023-32-001]
|
REQUIERE
|
[Task 32-11-001] --ES_PARTE_DE--> [Seccion 32-11-00]
|
PREREQUISITO
|
[Task 07-11-001] (jack de mantenimiento)
QUERY: "qué tareas requiere AD-2024-0023?"
GRAPH TRAVERSAL: AD-2024-0023 → AFECTA_A → SBs → REQUIERE → Tasks
RESULT: [Task 32-11-001, Task 07-11-001 (transitive)]
Neo4j and the store.neo4j node
Neo4j is the most widely used graph database in production. Its two main advantages:
- Cypher: declarative graph query language, very readable.
- Embeddings on nodes: Neo4j supports storing embeddings on nodes and doing vector search on them, combining vector search and graph traversal.
// Cypher: find all documents related to a directive
MATCH (ad:Directive {id: "AD-2024-0023"})-[:AFECTA_A*1..2]->(doc:Document)
RETURN doc.text, doc.section, doc.revision
The store.neo4j node in RAGorbit:
- Creates `Chunk` nodes with their text, metadata, and embedding.
- Creates typed relations between chunks according to document structure (`entitySchema`).
- With `buildRelations: true`, the node infers relations automatically (section parent-child, entity co-occurrence).
`retrieval.graph` retrieves by vector similarity on nodes AND by neighborhood traversal.
Hybrid graph + vector retrieval
The GraphRAG flow combines both capabilities:
1. Vector search on graph nodes
→ finds the 3 nodes most similar to the query
2. Graph traversal from those nodes
→ expands 1-2 hops following typed relations
→ collects the context subgraph
3. Returns: vector search nodes + neighborhood
This is especially powerful when the answer to a question is not in a single chunk but in the structure of relations between multiple chunks.
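Step 2 of this flow (neighborhood expansion) can be sketched as a breadth-first traversal over typed edges. This is an in-memory illustration, not a Neo4j query: the adjacency dict mirrors the partial AMM graph shown earlier, the edge directions are an assumption read off that diagram, and the `neighborhood` function and its parameters are hypothetical.

```python
from collections import deque

# node -> list of (relation_type, target_node); directions assumed from the diagram.
EDGES = {
    "AD-2024-0023": [("AFECTA_A", "SB-2023-32-001")],
    "SB-2023-32-001": [("REQUIERE", "Task 32-11-001")],
    "Task 32-11-001": [
        ("ES_PARTE_DE", "Seccion 32-11-00"),
        ("PREREQUISITO", "Task 07-11-001"),
    ],
}

def neighborhood(seeds, edges, max_hops=2, allowed_relations=None):
    """Collect seed nodes plus everything reachable within max_hops typed edges."""
    seen = set(seeds)
    order = list(seeds)
    queue = deque((seed, 0) for seed in seeds)
    while queue:
        node, depth = queue.popleft()
        if depth == max_hops:
            continue  # do not expand beyond the hop limit
        for relation, target in edges.get(node, []):
            if allowed_relations is not None and relation not in allowed_relations:
                continue
            if target not in seen:
                seen.add(target)
                order.append(target)
                queue.append((target, depth + 1))
    return order

# Seeds would come from step 1 (vector search on nodes); here one seed is fixed.
two_hops = neighborhood(["AD-2024-0023"], EDGES, max_hops=2)
tasks_only = neighborhood(
    ["AD-2024-0023"], EDGES, max_hops=3,
    allowed_relations={"AFECTA_A", "REQUIERE", "PREREQUISITO"},
)
```

Restricting `allowed_relations` keeps the subgraph focused: the second call reaches the transitive prerequisite task without pulling in the containing section.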
When to use graphs vs vectors
| Situation | Use |
|---|---|
| Corpus with explicit, complex relations between entities | Graph |
| "What affects what?", "What requires what?" questions | Graph |
| Text- and semantics-based retrieval | Vector |
| Corpus without clear relation structure | Vector |
| When graph maintenance is too costly | Vector |
| When extra precision is worth Neo4j overhead | Graph |
Microsoft GraphRAG
Microsoft Research published a framework called GraphRAG in 2024 that takes the concept further: it uses an LLM to extract entities and relations from the corpus (building the graph automatically), then uses the graph to answer global-level questions ("what are the main themes of the corpus?") that vector RAG cannot answer well.
The key difference from RAGorbit's store.neo4j is that Microsoft GraphRAG uses "communities" (clustering of related entities) to answer holistic questions. RAGorbit uses the graph mainly for neighborhood traversal on specific queries.
11. Technology comparison
Retrievers
| Method | Precision | Recall | Latency | When |
|---|---|---|---|---|
| Pure BM25 | High (exact) | Low (limited semantics) | Very low | Exact IDs, technical terms |
| Pure vector | Medium-high | High | Low | Natural language, semantics |
| Hybrid | High | High | Medium | General case |
| GraphRAG | Very high (structure) | Medium | High | Complex relations |
Rerankers
| Model | Quality | Latency | Cost | When |
|---|---|---|---|---|
| BGE-reranker-v2 | Very high | 50-150ms local | Free | Production without cloud |
| Cohere Rerank v3 | Very high | 100-300ms API | Pay per use | Prototyping, English |
| ColBERT | High | 20-80ms | Free | Large scale |
| FlashRank | Medium-high | 5-20ms | Free | Edge, critical latency |
Fusion strategies
| Method | When to prefer |
|---|---|
| RRF (Reciprocal Rank Fusion) | Scores from different scales (BM25 and cosine) |
| Normalized weighted sum | Scores on same scale, fine alpha control |
| Cross-encoder (reranker) | Maximum precision, tolerable latency |
Frameworks: LangChain vs LlamaIndex for retrieval
| Aspect | LangChain | LlamaIndex |
|---|---|---|
| Built-in retrievers | EnsembleRetriever, BM25Retriever, ContextualCompressionRetriever | SparseTopKRetriever, HybridFusion, RankGPT |
| Rerankers | ContextualCompressionRetriever + Cohere/BGE | CohereRerank, SentenceTransformerRerank, RankLLM |
| Graph RAG | Neo4j Graph RAG Toolkit integration | NebulaGraphStore, Neo4jGraphStore |
| Multi-index | MultiVectorRetriever, MergerRetriever | RouterRetriever, MultiIndexRetriever |
| When to prefer | When the rest of the stack already uses LangChain/LCEL | When the focus is advanced retrieval with many strategies |
12. RAGorbit nodes
retrieval category
| Node | Description | When |
|---|---|---|
| `retrieval.vector` | Similarity search with optional hard filters | Base case, semantic domain |
| `retrieval.hybrid` | Vector + BM25 fused (parameter `alpha`) | Domain with technical jargon + natural language |
| `retrieval.graph` | Similarity retrieval + neighborhood traversal (Neo4j) | Complex relations between entities |
| `retrieval.router` | Selects index by keyword/intent | Multi-index, reduce noise and latency |
| `retrieval.parent-child` | Retrieves children for precision, returns parents for context | Long hierarchical documents |
| `retrieval.reranker` | Reorders and trims with cross-encoder | Always after retrieve in critical domains |
query category
| Node | Description | When |
|---|---|---|
| `query.rewrite` | Normalizes jargon, expands terms | Domain with corporate/technical vocabulary |
| `query.intent` | Detects intent, filters non-actionable, routes | RAG gate, reduce unnecessary calls |
Recommended production pipeline
User
↓ Message
query.intent ← gate: is it actionable?
↓ Query (if actionable)
query.rewrite ← normalizes jargon, expands
↓ Query
retrieval.router ← selects correct index
↓ Chunks (noisy top-K)
retrieval.reranker ← reorders, keeps top-3
↓ Chunks (precise)
logic.prompt + logic.citations
↓ Message (with citations)
io.output
13. Layer ③ explained: LangChain retrievers from scratch
Prerequisite: complete layer ② of the workshop (`lab/solucion_scratch.py`): BM25, cosine, RRF, rerank, and hard filter implemented by hand. Without that, this section will seem like magic.
LangChain reminder (not re-explained here): In M1, §11, you learned what LangChain is, the `Document` object, `HuggingFaceEmbeddings` / `OpenAIEmbeddings`, `Chroma.from_documents`, the `Retriever` abstraction (`as_retriever`, `.invoke`), and the LCEL pattern. This section teaches only what's new in M4: specialized retrievers for hybrid search, fusion, reranking, and hard filtering.
Environment: the course study machine has no `pip` or network. You will not run this code here. The goal is that, with `pip install langchain langchain-community rank-bm25 sentence-transformers chromadb`, you can write `lab/solucion_framework.py` yourself.
13.1 The problem this layer solves
In solucion_scratch.py you wrote ~300 lines for: tokenize, compute BM25, embed with bag-of-words, fuse with RRF, rerank by token intersection, and filter by fare_class. It works and is deterministic. But in production you need:
- Optimized BM25 (not a Python loop over 50k docs).
- Real semantic embeddings (not bag-of-words).
- Hybrid fusion without reimplementing RRF.
- A real cross-encoder (BGE-reranker), not token intersection.
- Everything wired with the same interface so you can swap pieces.
LangChain gives you composable retrievers: each implements the same interface and you chain them like LEGO blocks.
SCRATCH (M4 lab)                       LANGCHAIN (M4 lab)
────────────────────                   ────────────────────────────────────
tokenizar() + manual BM25     ────────▶ BM25Retriever.from_documents(docs)
BoW embed + manual cosine     ────────▶ Chroma + HuggingFaceEmbeddings + as_retriever
manual rrf_fusion()           ────────▶ EnsembleRetriever(retrievers=[...], weights=[...])
rerank by token intersection  ────────▶ CrossEncoderReranker + ContextualCompressionRetriever
filter a Python list          ────────▶ crear_retriever_filtrado() or filter in Chroma
13.2 Bridge table: scratch → LangChain (M4)
| What you did by hand (layer ②) | LangChain piece (layer ③) | Concept section |
|---|---|---|
| `bm25_score()` + ranking over corpus | `BM25Retriever.from_documents(docs)` + `.k` attribute | §3 BM25 |
| `embed_bow()` + `similitud_coseno()` | `HuggingFaceEmbeddings` + `Chroma.from_documents` + `as_retriever(search_kwargs={"k":...})` | §2 Dense search + M1 §11 |
| `rrf_fusion(bm25_rank, vector_rank, k=60)` | `EnsembleRetriever(retrievers=[...], weights=[...])` | §4 Hybrid / RRF |
| `rerank_interseccion()` (cross-encoder proxy) | `CrossEncoderReranker` + `ContextualCompressionRetriever` | §5 Reranking |
| Filter `CORPUS` by `fare_class` before search | `crear_retriever_filtrado()` or `search_kwargs={"filter": {...}}` in Chroma | §7 Hard filters |
| `main()` prints top-3 with/without filter | `.get_relevant_documents(query)` or `.invoke(query)` | M1 §11 (Retriever interface) |
13.3 Retriever as a composable interface
In M1 you learned that a Retriever is any object that, given a query (string), returns list[Document]. The minimal interface:
docs = retriever.invoke("¿puedo hacer cambios sin cargo?")
# docs: list[Document] with page_content and metadata
The legacy method .get_relevant_documents(query) also exists and does the same thing, but it is deprecated. In new code prefer .invoke().
The key M4 idea: you can stack retrievers. A retriever can contain other retrievers:
┌─────────────────────────────────────┐
│ ContextualCompressionRetriever │
│ (reranker on top of ensemble) │
└──────────────────┬──────────────────┘
│ base_retriever
┌──────────────────▼──────────────────┐
│ EnsembleRetriever │
│ (RRF fusion of BM25 + vector) │
└──────────┬─────────────┬────────────┘
│ │
┌──────────▼──┐ ┌──────▼──────────┐
│ BM25Retriever│ │ vector_retriever │
│ (keyword) │ │ (Chroma/dense) │
└─────────────┘ └─────────────────┘
Each box speaks list[Document] upward. You only call .invoke(query) on the outermost retriever.
13.4 Document with filter metadata (brief reminder)
In the lab, each policy from the JSON becomes a Document:
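from langchain_core.documents import Document  # current import path; langchain.schema is the legacy location
# raw: the list of policy dicts loaded from the lab JSON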
documentos = [
Document(
page_content=item["texto"],
metadata={
"id": item["id"],
"fare_class": item["metadata"]["fare_class"], # ← key for hard filter
"route_type": item["metadata"]["route_type"],
"categoria": item["metadata"]["categoria"],
},
)
for item in raw
]
- `page_content` = the text that BM25 and the vector store index.
- `metadata["fare_class"]` = the dimension you use in the hard filter (§7). Without correct metadata, the filter cannot work.
Full Document detail: M1 §11.3.
13.5 BM25Retriever — your manual BM25, packaged
What it does: builds an in-memory BM25 index over a list of Document using the rank-bm25 library (the same §3 formula, optimized).
from langchain_community.retrievers import BM25Retriever
bm25_retriever = BM25Retriever.from_documents(documentos)
bm25_retriever.k = 9 # how many documents to return per query (equivalent to your top-k)
docs = bm25_retriever.invoke("cambios sin cargo adicional")
# docs[0].metadata["id"] → probably pol_008 (Top) without filter
| Parameter / attribute | What it controls | Scratch equivalent |
|---|---|---|
| `.from_documents(docs)` | Builds the BM25 index | Your IDF + TF loop over `CORPUS` |
| `.k` | Top-k to return | Your BM25 ranking `[:9]` |
When to use: domains with exact terms ("cambios", "sin cargo", ATA codes). When NOT: if you only need semantics and there are no exact identifiers — a vector retriever alone may suffice.
Gotcha: BM25Retriever does not accept metadata filter. If you need a hard filter, pass only already-filtered Documents (see §13.10).
13.6 Vector retriever — Chroma + local embeddings
What it does: indexes Documents with dense embeddings and exposes a cosine similarity retriever.
from langchain_community.vectorstores import Chroma
from langchain_community.embeddings import HuggingFaceEmbeddings
embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
# Local model ~80MB; first run downloads from Hugging Face
vector_store = Chroma.from_documents(documentos, embeddings)
vector_retriever = vector_store.as_retriever(search_kwargs={"k": 9})
docs = vector_retriever.invoke("cambios de vuelo sin pagar")
| Piece | Role | M1 reminder |
|---|---|---|
| `HuggingFaceEmbeddings` | Converts text → `list[float]` | M1 §11.6: Embeddings interface |
| `Chroma.from_documents` | Persists vectors + metadata | M1 §11.7: vector store |
| `as_retriever(search_kwargs={"k": N})` | Returns top-N by similarity | M1 §11.8: Retriever |
Difference from your scratch: your BoW embedding does not capture that "modificación de fecha" and "cambio de vuelo" are semantically close. all-MiniLM-L6-v2 does — which is why the framework's vector ranking may differ from scratch, but the pattern (without filter → noise from other fares) holds.
When to use: natural language, synonyms, long queries. When NOT: exact ID search only with no semantic variation.
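What the vector retriever does at query time can be sketched as a cosine top-k over stored vectors. The three-dimensional vectors below are made up for illustration (real all-MiniLM-L6-v2 embeddings have 384 dimensions), and the helper names `cosine` and `top_k` are hypothetical, not part of Chroma or LangChain.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, index, k=2):
    """Return the k (doc_id, similarity) pairs closest to the query vector."""
    scored = [(doc_id, cosine(query_vec, vec)) for doc_id, vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical toy "embeddings" keyed by policy id
index = {
    "pol_001": [0.9, 0.1, 0.0],
    "pol_002": [0.8, 0.3, 0.1],
    "pol_008": [0.0, 0.2, 0.9],
}
result = top_k([1.0, 0.2, 0.0], index, k=2)
print([doc_id for doc_id, _ in result])
```

A real vector store replaces the linear scan with an approximate nearest-neighbor index, but the ranking criterion is the same.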
13.7 EnsembleRetriever — your manual RRF, automated
What it does: runs several retrievers on the same query and fuses their rankings with Reciprocal Rank Fusion (RRF, §4), using c=60 by default, the same constant as k=60 in your rrf_fusion().
from langchain.retrievers import EnsembleRetriever
ensemble_retriever = EnsembleRetriever(
retrievers=[bm25_retriever, vector_retriever],
weights=[0.4, 0.6], # 40% BM25, 60% vector
)
docs = ensemble_retriever.invoke(QUERY)
How it works internally (mapping to your scratch):
Your scratch:                        EnsembleRetriever:
─────────────────                    ─────────────────────
for rank, doc in bm25_results:       Runs retriever[0].invoke(query)
    score += 1/(60+rank)             Runs retriever[1].invoke(query)
for rank, doc in vector_results:     Fuses with RRF (c=60)
    score += 1/(60+rank)             Scales each term by weights[i]
sort by score desc                   Returns list[Document]
On weights: they are NOT multipliers of raw BM25 or cosine scores (the scales are incompatible, which is why RRF uses ranks, §4). In EnsembleRetriever, each weight scales the rank-based contribution of its retriever: a document at position rank in list i adds weights[i] / (c + rank) to its fused score. A doc found only by the vector retriever therefore gets a contribution proportional to weights[1]; a doc that appears in both lists accumulates one weighted term per list.
| weights | Practical interpretation |
|---|---|
| `[0.5, 0.5]` | BM25 / vector tie |
| `[0.4, 0.6]` | More confidence in semantics (conversational domain) |
| `[0.7, 0.3]` | More confidence in keywords (technical domain with exact IDs) |
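The effect of the weights is easier to see with numbers. The sketch below assumes the weighted reciprocal-rank form, score(d) = Σ weights[i] / (c + rank_i(d)) with ranks starting at 1; the function name `weighted_rrf` and the two rankings are hypothetical, and the code is a stand-in for the fusion step, not LangChain's implementation.

```python
from collections import defaultdict


def weighted_rrf(ranked_lists, weights, c=60):
    """Fuse several rankings of doc ids into a single list, best first."""
    scores = defaultdict(float)
    for ranking, weight in zip(ranked_lists, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            # Only the rank matters, never the raw BM25 or cosine score
            scores[doc_id] += weight / (c + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


# Hypothetical rankings returned by each retriever for the same query
bm25_ranking = ["pol_008", "pol_002", "pol_003"]
vector_ranking = ["pol_002", "pol_001", "pol_008"]

semantic_first = weighted_rrf([bm25_ranking, vector_ranking], weights=[0.4, 0.6])
keyword_first = weighted_rrf([bm25_ranking, vector_ranking], weights=[0.7, 0.3])
print(semantic_first)
print(keyword_first)
```

With `[0.4, 0.6]` the vector retriever's favorite (pol_002) leads; with `[0.7, 0.3]` BM25's first choice (pol_008) overtakes it. Because `c=60` flattens the differences between adjacent ranks, the weights only flip documents whose fused scores are already close.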
When to use: whenever you want BM25+vector hybrid (general case, §4). When NOT: if one retriever is clearly useless in your domain — better remove it than give it weight 0.01.
Gotcha: EnsembleRetriever has no hardFilter. The filter must be applied before the retrievers are built (§13.9).
13.8 Reranking — CrossEncoderReranker + ContextualCompressionRetriever
What it does: takes the base retriever output (ensemble), reorders it with a cross-encoder (§5), and trims to top_n.
from langchain.retrievers.document_compressors import CrossEncoderReranker
from langchain_community.cross_encoders import HuggingFaceCrossEncoder
from langchain.retrievers.contextual_compression import ContextualCompressionRetriever
cross_encoder = HuggingFaceCrossEncoder(model_name="BAAI/bge-reranker-base")
reranker = CrossEncoderReranker(model=cross_encoder, top_n=3)
compression_retriever = ContextualCompressionRetriever(
base_compressor=reranker, # the "compressor" reorders and trims
base_retriever=ensemble_retriever, # where candidates come from
)
docs = compression_retriever.invoke(QUERY) # at most 3 docs, reordered
"Compression/reranking over base retriever" pattern:
Query
│
▼
base_retriever (Ensemble) ──▶ top-9 candidates (high recall, §5)
│
▼
base_compressor (Reranker) ──▶ reorders query+doc TOGETHER (high precision)
│
▼
final top-3
The name "ContextualCompression" is historical: it originally compressed long documents. In RAG practice, the most common use is reranking — which is why the base_compressor is almost always a reranker.
Cloud alternative (no local model):
# from langchain_cohere import CohereRerank
# reranker = CohereRerank(model="rerank-multilingual-v3.0", top_n=3)
CohereRerank follows the same pattern: pass it as base_compressor to ContextualCompressionRetriever. It requires COHERE_API_KEY but avoids downloading the BGE weights (roughly 1 GB for bge-reranker-base).
| Option | Advantage | When |
|---|---|---|
| `CrossEncoderReranker` + BGE | Local, free, reasonably multilingual | Production without cloud |
| `CohereRerank` | Very easy, high quality in English | Rapid prototyping with API |
When to use a reranker: critical domains (legal, healthcare, airline) where the retriever's top-5 has noise in positions 1-2. When NOT: when the latency budget is under 100 ms, or when the corpus has fewer than 500 docs and the retriever is already precise.
Gotcha: the reranker receives base_retriever output. If the ensemble returns k=3, the reranker can only reorder 3 docs — set ensemble k high (9-20) and let the reranker trim to top_n=3.
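The gotcha can be reproduced without any model. In this sketch `overlap_score` is a toy stand-in for a cross-encoder (it only counts shared tokens), and the candidate list with its order is hypothetical; the point is the two-stage shape, where the reranker can only promote what the base retriever let through.

```python
def overlap_score(query, text):
    """Toy stand-in for a cross-encoder: fraction of query tokens found in the text."""
    query_tokens = set(query.lower().split())
    return len(query_tokens & set(text.lower().split())) / len(query_tokens)


def retrieve_then_rerank(query, candidates, k, top_n, score_fn=overlap_score):
    # Stage 1 (recall): the base retriever hands over its first k candidates
    pool = candidates[:k]
    # Stage 2 (precision): score each (query, doc) pair together, keep top_n
    reranked = sorted(pool, key=lambda doc: score_fn(query, doc["text"]), reverse=True)
    return reranked[:top_n]


# Hypothetical base-retriever output; the best match sits in position 4
candidates = [
    {"id": "pol_008", "text": "cambios sin cargo en tarifa top"},
    {"id": "pol_005", "text": "equipaje adicional en tarifa plus"},
    {"id": "pol_003", "text": "reembolso en tarifa basic"},
    {"id": "pol_002", "text": "cambios de vuelo con cargo en tarifa basic"},
]
query = "cambios de vuelo tarifa basic"

narrow = retrieve_then_rerank(query, candidates, k=3, top_n=3)
wide = retrieve_then_rerank(query, candidates, k=4, top_n=3)
print([d["id"] for d in narrow])
print([d["id"] for d in wide])
```

With `k=3` the best document never reaches the reranker; with `k=4` it is promoted to first place.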
13.9 Hard filter — why it is not in EnsembleRetriever
EnsembleRetriever fuses BM25 and vector lists. It has no hardFilter parameter because:
- BM25 does not natively support metadata filters.
- Chroma supports `filter` in `search_kwargs`, but that filter only applies to the vector retriever; BM25 would still return docs from other fares.
Strategy A (recommended in the lab): filter the corpus before building retrievers:
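def crear_retriever_filtrado(fare_class: str):
    docs_filtrados = [d for d in documentos if d.metadata["fare_class"] == fare_class]
    bm25_filtrado = BM25Retriever.from_documents(docs_filtrados)
    bm25_filtrado.k = len(docs_filtrados)
    # Sketch of the rebuild (the lab solution may differ); a separate collection name keeps this index apart from the full one
    vector_filtrado = Chroma.from_documents(docs_filtrados, embeddings, collection_name=f"politicas_{fare_class}").as_retriever(search_kwargs={"k": len(docs_filtrados)})
    ensemble = EnsembleRetriever(retrievers=[bm25_filtrado, vector_filtrado], weights=[0.4, 0.6])
    return ContextualCompressionRetriever(base_compressor=reranker, base_retriever=ensemble)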
Strategy B (vector only): filter in Chroma without rebuilding the index:
vector_retriever = vector_store.as_retriever(
search_kwargs={"k": 9, "filter": {"fare_class": "Basic"}}
)
This filters the vector retriever, but the BM25 side of the ensemble would still be unfiltered, so documents from other fares are guaranteed to leak into the fused list. That is why strategy A is the robust one (same as your scratch: filter CORPUS at the start).
Full justification in §7.
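The difference between the two strategies can be checked with a toy hybrid in plain Python. Everything here is an illustrative assumption: `lexical_rank` stands in for both retrievers, `hybrid_candidates` simply concatenates the two candidate lists instead of fusing them, and the four policies are made up.

```python
def lexical_rank(query, docs, k):
    """Toy retriever: rank docs by number of tokens shared with the query."""
    query_tokens = set(query.lower().split())
    ranked = sorted(
        docs,
        key=lambda d: len(query_tokens & set(d["text"].lower().split())),
        reverse=True,
    )
    return ranked[:k]


def hybrid_candidates(query, bm25_docs, vector_docs, k=3):
    """Candidates from both sides of the ensemble (fusion order is irrelevant here)."""
    return lexical_rank(query, bm25_docs, k) + lexical_rank(query, vector_docs, k)


def fares(docs):
    return {d["fare_class"] for d in docs}


# Hypothetical corpus
corpus = [
    {"id": "pol_001", "fare_class": "Basic", "text": "cambios con cargo"},
    {"id": "pol_002", "fare_class": "Basic", "text": "cambios de fecha con cargo adicional"},
    {"id": "pol_005", "fare_class": "Plus", "text": "un cambio sin cargo"},
    {"id": "pol_008", "fare_class": "Top", "text": "cambios sin cargo adicional"},
]
query = "cambios sin cargo adicional"
basic_only = [d for d in corpus if d["fare_class"] == "Basic"]

# Strategy B: only the vector side is filtered, BM25 still sees everything
strategy_b = hybrid_candidates(query, bm25_docs=corpus, vector_docs=basic_only)
# Strategy A: both sides are built from the filtered corpus
strategy_a = hybrid_candidates(query, bm25_docs=basic_only, vector_docs=basic_only)

print(fares(strategy_b))
print(fares(strategy_a))
```

Strategy B still surfaces the Top policy through the unfiltered side; strategy A cannot, because the other fares never enter either index.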
13.10 Block-by-block walkthrough: lab/solucion_framework.py
Open lab/solucion_framework.py and follow this map. Each block corresponds to something you already wrote by hand.
BLOCK 1 — Load corpus → Documents
─────────────────────────────────────
JSON → list[Document] with metadata fare_class, route_type, categoria
Why? BM25Retriever and Chroma consume Document, not loose dicts.
BLOCK 2 — BM25Retriever
────────────────────────
BM25Retriever.from_documents(documentos); bm25_retriever.k = 9
Why k=9? Corpus of 9 policies; we want all candidates
so the ensemble has material to fuse.
BLOCK 3 — Vector store + retriever
────────────────────────────────────
HuggingFaceEmbeddings + Chroma.from_documents + as_retriever(k=9)
Why all-MiniLM-L6-v2? Lightweight local model; sufficient for the lab.
BLOCK 4 — EnsembleRetriever
─────────────────────────────
retrievers=[bm25, vector], weights=[0.4, 0.6]
Why? Replicates your scratch rrf_fusion() with internal RRF c=60.
BLOCK 5 — Reranker + Compression retriever
────────────────────────────────────────────
CrossEncoderReranker(BGE, top_n=3) wrapped in ContextualCompressionRetriever
Why? Replicates your intersection rerank, but with a real cross-encoder.
BLOCK 6 — crear_retriever_filtrado()
──────────────────────────────────────
Filter docs → rebuild BM25 + Chroma + Ensemble + Compression
Why rebuild everything? EnsembleRetriever does not filter; BM25 has no filter.
BLOCK 7 — Execution with/without filter
────────────────────────────────────
compression_retriever.invoke(QUERY) vs crear_retriever_filtrado("Basic").invoke(QUERY)
Why? Demonstrate the same pattern as expected.md from scratch.
Expected result (same pattern as scratch):
| Mode | Top-3 fare_class | Noise |
|---|---|---|
| Without filter | Top, Plus, Basic mixed | Yes — pol_008 (Top) probably first |
| With filter `Basic` | Basic only | No: pol_002, pol_003, pol_001 |
The framework may rank slightly differently from scratch (real embeddings vs BoW), but the noise check must be the same: without filter there are wrong fares; with filter, Basic only.
13.11 When to use / NOT use each piece
| Piece | Use when | Do NOT use when | Main gotcha |
|---|---|---|---|
| `BM25Retriever` | Exact IDs, technical jargon, rare terms | Conversational semantics only | No native metadata filter |
| Vector retriever (Chroma) | Natural language, synonyms | Exact code search only | Chroma filter does not affect ensemble BM25 |
| `EnsembleRetriever` | General hybrid case | One retriever clearly dominant | weights ≠ weighted-sum alpha (§4) |
| `CrossEncoderReranker` | Critical precision, noisy top-k | Latency < 100ms, small corpus | Needs high k on base retriever |
| `ContextualCompressionRetriever` | Whenever you add a reranker | - | Name is confusing; it is a rerank wrapper |
| Hard filter pre-corpus | Safety guardrail (§7) | Soft prompt filter is enough (rare) | Post-filtering after the LLM is too late |
13.12 Full pipeline diagram (framework)
politicas.json
     │
     ▼
list[Document] ─────────────────────────────────────────┐
     │                                                  │
     │ full corpus (9 docs)                             │ filtered docs (3 Basic)
     ▼                                                  ▼
┌─────────┐  ┌──────────────┐               ┌─────────┐  ┌──────────────┐
│  BM25   │  │  Chroma+HF   │               │  BM25   │  │  Chroma+HF   │
│  k=9    │  │  Embeddings  │               │  k=3    │  │  k=3         │
└────┬────┘  └──────┬───────┘               └────┬────┘  └──────┬───────┘
     │              │                            │              │
     └──────┬───────┘                            └──────┬───────┘
            ▼                                           ▼
    EnsembleRetriever                           EnsembleRetriever
    weights=[0.4,0.6]                           weights=[0.4,0.6]
    RRF c=60                                    RRF c=60
            │                                           │
            ▼                                           ▼
ContextualCompressionRetriever              ContextualCompressionRetriever
+ BGE reranker top_n=3                      + BGE reranker top_n=3
            │                                           │
            ▼                                           ▼
WITHOUT FILTER: Top, Plus, Basic            WITH FILTER: Basic only
(noise, §7)                                 (correct, §7)
13.13 Next step: write the framework yourself
- Read `lab/enunciado.md`: the Layer ③ section has staged hints pointing here.
- Try writing `lab/solucion_framework.py` without looking at the solution.
- Compare with `lab/solucion_framework.py` and `lab/solucion.md`.
- When you have `pip` and network access, run it and verify that the with/without filter pattern matches `lab/expected.md`.
Useful cross-links:
- BM25 concepts: §3
- RRF fusion: §4
- Cross-encoder: §5
- Hard filter: §7
- Scratch workshop: `lab/enunciado.md` · `lab/expected.md`
- LangChain base: M1 §11
Beyond Lang*: retrievers and the full RAG pipeline can also be built with LlamaIndex (query engines/retrievers), Haystack, and the native SDK + Chroma; see `../referencia/rag-sin-langchain.md`.
Strategy landscape: beyond hybrid + rerank, there is a full catalog of RAG architectures (HyDE, RAG-Fusion, RAPTOR, Contextual Retrieval, ColBERT, Self-RAG, CRAG, Adaptive/Agentic RAG…). When to apply each is covered in `../referencia/panorama-estrategias-rag.md`.
14. Checkpoint
You know it if you can:
- Explain the BM25 formula and why each parameter (IDF, k1, b) exists.
- Describe RRF and manually compute the fused score for 3 documents.
- Explain why a cross-encoder is more precise than a bi-encoder and when the extra latency is not worth it.
- Design a hard filter for a high-consequence domain and argue why a prompt instruction is not enough.
- Design multi-index routing for the telecom case (3 indexes, keyword + intent rules).
- Explain when a knowledge graph beats vectors and when it does not.
- Map the 6 `retrieval` nodes and 2 `query` nodes to their use cases.
- Explain what each LangChain retriever does (`BM25Retriever`, `EnsembleRetriever`, `ContextualCompressionRetriever`) and map it to what you implemented in scratch.
- Write a LangChain hybrid + rerank + hard filter pipeline without copying the lab solution.
What to review if something is unclear:
- BM25: reread §3 with a 5-document example corpus and compute scores by hand.
- Cross-encoder: reread §5, run the workshop scratch to see the difference with and without reranker.
- Hard filters: read templates 03 and 08 §9 for real-context justification.
- GraphRAG: explore `examples/05-legal-contract-review/flow.json` and the `store.neo4j` node in `docs/02-node-catalog.md`.
- LangChain retrievers: reread §13, write `lab/solucion_framework.py` guided by `lab/enunciado.md`.