M2 · Data ingestion — `loader` + `ingest`

Module goal: understand how raw data (PDFs, tables, web, SQL, S3, images) is converted into chunks with metadata ready to be indexed in a vector store.

RAGorbit nodes covered: loader.*, ingest.chunker, ingest.metadata

Reference templates: 05-legal-contract-review, 02-banking-credit-scoring, 08-manufacturing-maintenance-rag, 04-insurance-claims

The ingestion problem
Data sources and loaders
Parsing: from raw format to structured text
Chunking in depth
Metadata and its role in hard filters
Multimodal: tables and diagrams
Ingestion framework comparison
Full pipelining in RAGorbit
When to use / when not to / alternatives
Layer ③ explained: chunking with LangChain from scratch
Checkpoint

1. The ingestion problem

Before an LLM can answer questions about your company's documents, those documents must go through an ingestion pipeline: load → parse → chunk → metadata → index.

This process looks simple but hides most production RAG failures. Four frequent problems:

Problem	Production symptom	Root cause
Chunks too large	The LLM ignores parts of the context (window full)	Excessive `chunkSize`
Chunks too small	The LLM lacks enough context to answer	Insufficient `chunkSize` or zero `overlap`
Split clause	The answer mixes obligations from different clauses	Character-based chunking on legal text
No metadata	You cannot filter by document type or date	Missing `ingest.metadata`

The right approach is to choose the chunking strategy according to document structure and enrich each chunk with metadata that enables hard filters in the retriever.

Raw documents
      │
      ▼
 ┌─────────┐    parsing    ┌──────────────┐   chunking   ┌────────────┐
 │ Loader  │ ────────────▶ │  clean text  │ ────────────▶ │  chunks[]  │
 └─────────┘               └──────────────┘              └────────────┘
                                                                │
                                                          metadata
                                                                │
                                                                ▼
                                                    ┌─────────────────────┐
                                                    │ {text, metadata,    │
                                                    │  source, chunk_id}  │
                                                    └─────────────────────┘

2. Data sources and loaders

2.1 The six loader types in RAGorbit

The catalog docs/02-node-catalog.md defines six loader.* types. All produce Documents (a list of {text, metadata} objects):

Node	Source	Key config	When to use
`loader.pdf`	Text PDFs	`ocr: false/true`	Contracts, policies, selectable-text PDF manuals
`loader.multimodal`	PDFs with tables and diagrams	`extractTables: true`, `describeImages: true`, `sectionScheme`	Technical manuals (AMM), insurance forms with images
`loader.tabular`	CSV/Parquet/Excel	`schemaHint`	Financial data, inventories, sensor logs
`loader.web`	Web pages / sitemaps	`urls[]`, `crawlDepth`	Public FAQs, API documentation, news
`loader.s3`	S3/GCS objects	`bucket`, `prefix`	Document repositories at scale (millions of PDFs)
`loader.sql`	Database rows	`query`	Product catalogs, customer data, logs

2.2 When to OCR and when not to

PDFs have two variants:

Selectable PDF (text-based): text is encoded in the file. loader.pdf with ocr: false extracts text in milliseconds.
Scanned PDF (image-based): the PDF is a photo. OCR is required. ocr: true enables Tesseract or an external service (slower and costlier).

Practical rule: use ocr: true only when you confirm the PDF is scanned. OCR introduces typos that contaminate the index.
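
One way to apply this rule is to run the cheap non-OCR extraction first and check how much text came back. The sketch below is a heuristic under stated assumptions, not a RAGorbit feature: `needs_ocr` and its `min_chars_per_page` threshold are hypothetical, and the per-page strings stand in for whatever your PDF library returns.

```python
def needs_ocr(page_texts: list[str], min_chars_per_page: int = 50) -> bool:
    """Guess whether a PDF is scanned from the text a non-OCR pass extracted.

    page_texts holds one string per page. A page counts as "empty" when it
    has fewer than min_chars_per_page non-whitespace characters
    (hypothetical threshold; tune it on your own corpus).
    """
    if not page_texts:
        return True  # nothing extracted at all: treat as scanned
    empty_pages = sum(
        1 for text in page_texts
        if len("".join(text.split())) < min_chars_per_page
    )
    # A majority of pages without selectable text suggests an image-based PDF.
    return empty_pages > len(page_texts) / 2


selectable = ["CLÁUSULA 1. OBJETO. " * 20, "CLÁUSULA 2. DURACIÓN. " * 20]
scanned = ["", "  \n", "3"]
print(needs_ocr(selectable))  # False
print(needs_ocr(scanned))     # True
```

Only the documents flagged by a check like this would then be re-loaded with ocr: true.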

2.3 loader.sql: converting rows into documents

loader.sql runs a query and converts each row into a document. Example: the query SELECT sku, descripcion, especificaciones FROM productos WHERE activo = true produces one document per product. This enables RAG over product catalogs without exporting to CSV.

When to use: when data lives in an operational DB and you want ingestion always synced with the source (by running the query periodically).

Alternative: loader.s3 or loader.tabular if data is already exported.
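
The row-to-document conversion can be sketched with the standard library. This is an illustration under assumptions, not the loader.sql implementation: `rows_to_documents` is a hypothetical helper, the in-memory SQLite table and its two rows are made up, and `activo = 1` replaces `activo = true` so the query also runs on older SQLite builds.

```python
import sqlite3


def rows_to_documents(conn: sqlite3.Connection, query: str, source: str) -> list[dict]:
    """Run a query and turn every row into a {text, metadata} document."""
    cursor = conn.execute(query)
    columns = [col[0] for col in cursor.description]
    documents = []
    for row in cursor:
        record = dict(zip(columns, row))
        # "column: value" pairs keep the column names searchable as text.
        text = " | ".join(f"{name}: {value}" for name, value in record.items())
        documents.append({"text": text, "metadata": {"source": source}})
    return documents


# Made-up sample data standing in for an operational database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE productos (sku TEXT, descripcion TEXT, especificaciones TEXT, activo INTEGER)"
)
conn.executemany(
    "INSERT INTO productos VALUES (?, ?, ?, ?)",
    [("A-100", "Bomba hidráulica", "3000 PSI", 1), ("B-200", "Válvula de cierre", "DN25", 0)],
)

docs = rows_to_documents(
    conn,
    "SELECT sku, descripcion, especificaciones FROM productos WHERE activo = 1",
    source="productos",
)
print(docs[0]["text"])  # sku: A-100 | descripcion: Bomba hidráulica | especificaciones: 3000 PSI
```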

2.4 Connection with templates

Template 02 (Banking): uses loader.pdf (tax returns) + loader.tabular (financial CSV) → ingest.chunker with strategy: by-section.
Template 05 (Legal): uses loader.pdf (contracts, playbook, regulations) → ingest.chunker with strategy: by-clause.
Template 08 (Manufacturing): uses loader.multimodal with sectionScheme: ATA to preserve the manual's chapter structure.
Template 04 (Insurance): uses loader.multimodal to extract coverage tables and describe damage photos.

3. Parsing: from raw format to structured text

Parsing converts the binary of the original format (PDF, XLSX, HTML) into clean text. It is the quietest step in the pipeline but the one that most affects index quality.

3.1 PDF parsing under the hood

loader.pdf uses libraries like pdfminer or pypdf to extract text while preserving reading order. Most common problems:

Multiple columns: a two-column PDF may extract as interleaved text if the library follows character flow instead of visual flow.
Headers/footers: can contaminate main text. Advanced tools (Unstructured.io) detect and filter these regions.
Special characters: typographic ligatures (ﬁ, ﬂ), dash characters (—, -, –), and curly quotes (“, ”) may remain as odd characters if the PDF does not embed fonts correctly.

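Practical fix: normalize text after extraction. NFKC folds compatibility characters such as the ligatures ﬁ and ﬂ into plain letters; it leaves dashes and curly quotes unchanged, so map those explicitly if they matter for your index.

import unicodedata

# texto_crudo: the string returned by the loader
texto_limpio = unicodedata.normalize("NFKC", texto_crudo)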
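
A slightly fuller cleanup might combine NFKC with an explicit punctuation map, since NFKC alone does not touch dashes or curly quotes. This is a minimal sketch: `clean_text` and `_PUNCT_MAP` are hypothetical names, and folding every dash to a hyphen is an assumption that may not suit every corpus.

```python
import unicodedata

# Hypothetical mapping for characters that NFKC leaves untouched.
_PUNCT_MAP = str.maketrans({
    "\u2014": "-",   # em dash
    "\u2013": "-",   # en dash
    "\u201c": '"',   # left curly double quote
    "\u201d": '"',   # right curly double quote
    "\u2018": "'",   # left curly single quote
    "\u2019": "'",   # right curly single quote
})


def clean_text(raw: str) -> str:
    """Fold ligatures with NFKC, then map dashes and curly quotes to ASCII."""
    folded = unicodedata.normalize("NFKC", raw)
    return folded.translate(_PUNCT_MAP)


sample = "La \ufb01rma \u2014 \u201cde\ufb01nitiva\u201d"
print(clean_text(sample))  # La firma - "definitiva"
```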

3.2 Tabular parsing

loader.tabular reads CSV/Parquet with pandas (or equivalent). The schemaHint config helps the loader interpret ambiguous columns. For example, a periodo column may hold the string "2023-Q3" or the integer 20233 for the same quarter.

Conversion to text: each row becomes readable text:

concepto: ingreso_anual | valor: 85000 | periodo: 2023

This enables semantic similarity search over data that would otherwise be only numbers.
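
The row-to-text conversion shown above can be reproduced with the csv module. This is a sketch of the idea, not the loader.tabular code: `row_to_text` is a hypothetical helper and the CSV content is made up.

```python
import csv
import io


def row_to_text(row: dict[str, str]) -> str:
    """Render one tabular row as 'column: value' pairs joined by ' | '."""
    return " | ".join(f"{column}: {value}" for column, value in row.items())


# Made-up CSV standing in for a financial export.
csv_data = "concepto,valor,periodo\ningreso_anual,85000,2023\ndeuda_total,12000,2023\n"
rows = list(csv.DictReader(io.StringIO(csv_data)))
texts = [row_to_text(row) for row in rows]
print(texts[0])  # concepto: ingreso_anual | valor: 85000 | periodo: 2023
```

Keeping the column name next to each value is what lets a query such as "annual income" match a row that would otherwise be a bare number.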

3.3 Web parsing

loader.web downloads HTML and extracts visible text (removing scripts, styles, navigation menus). Crawl depth (crawlDepth) controls how many link levels to follow.

Problem: web HTML changes frequently. A RAG system that indexes web content needs periodic re-ingestion. If content is stable (versioned technical documentation), prefer loader.s3 or loader.pdf.
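
Visible-text extraction can be illustrated with the standard library HTML parser. This is a simplified sketch, not the loader.web implementation: `VisibleTextParser`, `visible_text`, and the choice of skipped tags are assumptions, and real pages usually need more rules (cookie banners, footers, hidden elements).

```python
from html.parser import HTMLParser


class VisibleTextParser(HTMLParser):
    """Collect text nodes, ignoring anything inside script, style, or nav."""

    SKIPPED = {"script", "style", "nav"}

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.parts: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are outside every skipped element.
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def visible_text(html: str) -> str:
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


page = (
    "<html><head><style>p{color:red}</style></head><body>"
    "<nav><a href='/'>Home</a></nav>"
    "<p>Refund policy: 30 days.</p><script>track()</script></body></html>"
)
print(visible_text(page))  # Refund policy: 30 days.
```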

4. Chunking in depth

Chunking is the most important design decision in the ingestion pipeline. A poorly sized or poorly delimited chunk contaminates the whole chain: embeddings are less precise, the retriever returns wrong context, and the LLM answers with mixed information.

4.1 Strategy 1 — Fixed chunking (fixed size)

Splits text into blocks of N characters (or N tokens), with an overlap of O characters between consecutive blocks.

Original text:
  [──────── 1000 chars ────────][──────── 1000 chars ────────]
                            [── overlap 200 ──]

Resulting chunks:
  Chunk 0: chars 0..1000
  Chunk 1: chars 800..1800    ← overlap covers the transition context
  Chunk 2: chars 1600..2600

ASCII diagram:

TEXT:  "La indemnización...límite de 2×...plazo de 30 días..."
        |<──── 1000 ────>|<──200──>|<──── 1000 ────>|
        Chunk 0           overlap    Chunk 1

When to use:

Documents without clear semantic structure (continuous text, voice transcripts).
As a fallback when you lack a structural parser.
Quick prototypes.

When NOT to use:

Contracts and regulations (splits clauses in half).
Technical manuals with tables and procedures (mixes steps from different procedures).
Any document where the natural semantic unit is not the paragraph.

Config in RAGorbit (ingest.chunker has no dedicated fixed strategy, see §8.1; recursive is the closest option):

{ "strategy": "recursive", "chunkSize": 1000, "overlap": 150 }
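
The sliding window itself fits in a few lines. The sketch below reproduces the offsets from the diagram at the start of this section (1000-character chunks, 200-character overlap); `fixed_chunks` is a hypothetical helper, not a RAGorbit or LangChain function.

```python
def fixed_chunks(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[tuple[int, int, str]]:
    """Return (start, end, chunk_text) triples for fixed-size chunking."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        end = min(start + chunk_size, len(text))
        chunks.append((start, end, text[start:end]))
        if end == len(text):
            break  # the window already reached the end of the text
    return chunks


text = "x" * 2600
for start, end, _ in fixed_chunks(text):
    print(f"chars {start}..{end}")
# chars 0..1000
# chars 800..1800
# chars 1600..2600
```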

4.2 Strategy 2 — Recursive chunking (hierarchical separators)

Tries separators in order of semantic preference. If the resulting chunk exceeds chunkSize, applies the next separator.

Typical hierarchy: "\n\n" (paragraphs) → "\n" (lines) → ". " (sentences) → " " (words)

TEXT with well-marked paragraphs:
┌──────────────────────────────────────┐
│ Paragraph 1 (400 chars)              │ ← chunk 0 (fits in 1000)
├──────────────────────────────────────┤
│ Paragraph 2 (600 chars)              │ ← chunk 1 (fits in 1000)
├──────────────────────────────────────┤
│ Very long paragraph 3 (1600 chars)   │ ← split by sentences
│   Sentence 1 (400)                   │   chunk 2
│   Sentence 2 (300)                   │   chunk 3
│   Sentence 3 + Sentence 4 (900)      │   chunk 4
└──────────────────────────────────────┘

When to use:

Documents with paragraph structure (articles, reports, company policies with sections).
When you want to respect natural structure without knowing the domain.

When NOT to use:

When documents have very domain-specific structure (numbered clauses, ATA chapters, tables). In that case, use domain semantic strategies.

Config in RAGorbit: this is the default — strategy: recursive.

4.3 Strategy 3 — Semantic chunking (by semantic similarity)

Computes embeddings of consecutive sentences and cuts where similarity falls below a threshold. Each chunk is a coherent "thematic block".

Sentences with their embeddings:
  S1 ─── S2 ─── S3 ─── S4 ─── S5 ─── S6
    high   high    low    low   high     (similarity between neighbors)
                   cut    cut

Resulting chunks:
  Chunk A: S1+S2+S3
  Chunk B: S4
  Chunk C: S5+S6

Advantage: chunks have semantic coherence even when the document has no structural markers.

Disadvantage: requires computing embeddings during ingestion (more costly), and the threshold must be calibrated per document type.

When to use:

Narrative text without explicit structure (annual reports, testimonials, transcripts).
When visible paragraphs do not correspond to real semantic units.

In RAGorbit: there is no native strategy: semantic node. It is implemented in layer ③ with LangChain SemanticChunker or LlamaIndex SemanticSplitterNodeParser.
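
To make the cut rule concrete, here is a toy version that needs no model. It is a sketch of the idea only: `toy_embedding` (a bag-of-words count) stands in for a real embedding model, and `semantic_chunks` and the 0.2 threshold are hypothetical; the LangChain and LlamaIndex classes named above have their own parameters.

```python
import math
import re
from collections import Counter


def toy_embedding(sentence: str) -> Counter:
    """Stand-in for a real embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"\w+", sentence.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def semantic_chunks(sentences: list[str], threshold: float = 0.2) -> list[str]:
    """Start a new chunk wherever neighbor similarity drops below threshold."""
    if not sentences:
        return []
    vectors = [toy_embedding(s) for s in sentences]
    groups = [[sentences[0]]]
    for previous, current, sentence in zip(vectors, vectors[1:], sentences[1:]):
        if cosine(previous, current) < threshold:
            groups.append([sentence])      # similarity dropped: cut here
        else:
            groups[-1].append(sentence)    # same topic: extend the chunk
    return [" ".join(group) for group in groups]


sentences = [
    "The landing gear uses a hydraulic actuator.",
    "The hydraulic actuator moves the landing gear.",
    "Invoices are paid within thirty days.",
    "Late invoices are charged a penalty.",
]
chunks = semantic_chunks(sentences)
print(len(chunks))  # 2
```

The threshold calibration mentioned above shows up here directly: raising it to 0.5 would also split the two invoice sentences.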

4.4 Strategy 4 — By-layout chunking (visual/HTML structure)

Leverages document structure: titles, subtitles, lists, tables. Tools like Unstructured.io classify each PDF block ("Title", "NarrativeText", "Table", "ListItem") and group them semantically.

PDF with structure:
┌─────────────────────────────────────────┐
│ [Title] Chapter 3. Results              │ ─── Chunk "Chapter 3"
│ [NarrativeText] The analysis shows...   │
│ [Table] | Year | Revenue | Costs |      │ ─── Table chunk (→ JSON)
│         | 2022 | 1.2M    | 0.8M  |      │
│ [NarrativeText] The table above...      │ ─── "Post-table text" chunk
└─────────────────────────────────────────┘

When to use:

Financial reports with tables and charts.
Technical documents where visual hierarchy (H1, H2, H3) is semantically relevant.

Tool: Unstructured.io (open source with cloud API). See §7.

4.5 Strategy 5 — By-clause/section chunking (domain-based)

Defines domain-specific separators: CLÁUSULA N. (contracts), ATA-XX-YY-ZZ (aircraft manuals), Artículo N. (regulations), SECCIÓN N. (policies).

This is the most precise strategy when the domain has predictable structure.

Legal contract:
CLÁUSULA 1. OBJETO  ←── domain separator
  text...
CLÁUSULA 2. DURACIÓN  ←── domain separator
  text...
CLÁUSULA 3. PAGO  ←── domain separator
  text...

→ 3 clean chunks, no overlap overhead

When to use:

Contracts (by clause) — template 05-legal.
Technical manuals with ATA numbering — template 08-manufacturing.
Regulations with numbered articles.
Company policies with named sections.

When NOT to use:

Documents without clear semantic structure (narrative text).
When separators are not consistent across all corpus documents.

Config in RAGorbit:

{ "strategy": "by-clause", "chunkSize": 900, "overlap": 120 }
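
Splitting on a domain separator can be sketched with a regular expression. This is an illustration under assumptions, not the by-clause implementation: `CLAUSE_HEADER` is a hypothetical pattern for headers of the form "CLÁUSULA N. TITLE", `split_by_clause` is a hypothetical helper, and the three-clause contract is made up.

```python
import re

# Hypothetical pattern: "CLÁUSULA <number>. <TITLE>" at the start of a line.
CLAUSE_HEADER = re.compile(r"^CLÁUSULA\s+(\d+)\.\s*(.*)$", re.MULTILINE)


def split_by_clause(text: str, source: str) -> list[dict]:
    """One chunk per clause; each chunk runs until the next clause header."""
    matches = list(CLAUSE_HEADER.finditer(text))
    chunks = []
    for i, match in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        chunks.append({
            "text": text[match.start():end].strip(),
            "metadata": {
                "clausula_id": int(match.group(1)),
                "titulo": match.group(2).strip(),
                "source": source,
            },
        })
    # Note: any text before the first header is dropped by this sketch.
    return chunks


contract = (
    "CLÁUSULA 1. OBJETO\nEl proveedor prestará el servicio.\n"
    "CLÁUSULA 2. DURACIÓN\nDoce meses.\n"
    "CLÁUSULA 3. PAGO\nTreinta días.\n"
)
chunks = split_by_clause(contract, source="contrato_muestra.txt")
print([c["metadata"]["clausula_id"] for c in chunks])  # [1, 2, 3]
```

Because the header match already carries the clause number and title, the metadata comes for free, which is the "natural metadata" advantage of this strategy.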

4.6 The overlap parameter

Overlap is the number of characters (or tokens) shared between consecutive chunks. Its role is to preserve context at the boundary between chunks.

Without overlap:
  Chunk 0: "...The clause establishes that the term"
  Chunk 1: "shall be 30 calendar days. The penalty..."
  ← The sentence is split; the retriever may return only Chunk 1,
    and the LLM cannot tell which term the "30 days" refer to.

With a 50-char overlap:
  Chunk 0: "...The clause establishes that the term"
  Chunk 1: "...that the term shall be 30 calendar days. The penalty..."
  ← The context "that the term" is repeated in Chunk 1, which keeps it coherent.

Empirical rule:

Overlap of 10-15% of chunkSize for narrative text (e.g.: chunkSize: 1000, overlap: 150).
Low or zero overlap for structure-based chunks (by-clause, by-section): clauses are already self-contained units.
Excessive overlap (>30%) increases index size without proportional benefit.
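
The index-size cost of overlap follows from simple arithmetic: with a sliding window, each chunk contributes only chunkSize minus overlap new characters, so a long text is stored roughly chunkSize / (chunkSize - overlap) times. A short sketch of that estimate, where `index_inflation` is a hypothetical helper and the formula is an approximation for texts much longer than one chunk:

```python
def index_inflation(chunk_size: int, overlap: int) -> float:
    """Approximate factor by which overlap multiplies the stored text."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    return chunk_size / (chunk_size - overlap)


for overlap in (0, 150, 300, 500):
    print(overlap, round(index_inflation(1000, overlap), 2))
# 0 1.0
# 150 1.18
# 300 1.43
# 500 2.0
```

At 15% overlap the index grows by about 18%; at 30% it grows by about 43%, and the same factor applies to embedding calls during ingestion.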

4.7 Chunking strategy comparison

Strategy	Deterministic	Requires structure	Natural metadata	Ideal case
Fixed	yes	no	no	Quick prototype, free text
Recursive	yes	paragraphs	no	Articles, reports, policies
Semantic	no	no (uses embeddings)	no	Dense narrative text
By-layout	yes (with Unstructured)	visual structure	block type	Reports with tables, rich PDFs
By-clause/section	yes	domain structure	clausula_id, tipo	Contracts, technical manuals, regulations

5. Metadata and its role in hard filters

5.1 What metadata is in chunks

Each chunk in the vector store is more than text + embedding. It carries a metadata dictionary the retriever can use as a filter before computing similarity. This is what RAGorbit docs call "hard filters as guardrails".

chunk = {
    "text": "CLÁUSULA 9. CONFIDENCIALIDAD ...",
    "embedding": [0.023, -0.117, ...],   # generated by model.embedding
    "metadata": {
        "clausula_id": 9,
        "tipo": "confidencialidad",
        "contrato": "CSP-2024-0087",
        "fecha": "2024-01-15",
        "source": "contrato_muestra.txt"
    }
}

5.2 Hard filters vs. soft filters

Hard filter: WHERE condition in the vector store query. Chunks that do not meet the condition are not scored, regardless of similarity.
Soft filter: retrieve N chunks by similarity and then filter. "Wrong" chunks still consume topK.

Hard filter example in RAGorbit:

{
  "type": "retrieval.vector",
  "config": {
    "topK": 5,
    "hardFilters": ["aircraft_type", "ata_chapter"]
  }
}

At query time, the pgvector SQL query is:

SELECT * FROM chunks
WHERE aircraft_type = 'A320' AND ata_chapter = '32'
ORDER BY embedding <=> query_embedding
LIMIT 5;

An A320 technician never sees 787 torque limits, even if the embedding is similar.
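
The difference between the two filter styles shows up even on a three-chunk toy index. The sketch below is an in-memory illustration, not pgvector: `hard_filter_search`, `soft_filter_search`, the two-dimensional embeddings, and the chunk ids are all made up for the example.

```python
import math


def cosine_distance(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def matches(chunk: dict, filters: dict) -> bool:
    return all(chunk["metadata"].get(key) == value for key, value in filters.items())


def hard_filter_search(chunks, query_vec, filters, top_k):
    """Filter first, then rank: only matching chunks compete for top_k."""
    candidates = [c for c in chunks if matches(c, filters)]
    return sorted(candidates, key=lambda c: cosine_distance(c["embedding"], query_vec))[:top_k]


def soft_filter_search(chunks, query_vec, filters, top_k):
    """Rank first, then filter: non-matching chunks still consume top_k."""
    nearest = sorted(chunks, key=lambda c: cosine_distance(c["embedding"], query_vec))[:top_k]
    return [c for c in nearest if matches(c, filters)]


# Made-up index: two 787 chunks sit closer to the query than the A320 chunk.
index = [
    {"id": "787-torque", "embedding": [1.0, 0.0], "metadata": {"aircraft_type": "787"}},
    {"id": "787-limits", "embedding": [0.9, 0.1], "metadata": {"aircraft_type": "787"}},
    {"id": "A320-torque", "embedding": [0.7, 0.3], "metadata": {"aircraft_type": "A320"}},
]
query = [1.0, 0.0]
wanted = {"aircraft_type": "A320"}

hard = hard_filter_search(index, query, wanted, top_k=2)
soft = soft_filter_search(index, query, wanted, top_k=2)
print([c["id"] for c in hard])  # ['A320-torque']
print([c["id"] for c in soft])  # []
```

With the soft filter the two 787 chunks fill both topK slots and the A320 chunk is never returned.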

5.3 Metadata fields by domain

Each domain has its canonical fields. The ingest.metadata node in RAGorbit supports any field:

Domain	Metadata fields	What to filter for
Aviation (template 08)	`aircraft_type`, `ata_chapter`, `revision_date`	Only chunks for the correct aircraft and chapter
Financial (template 02)	`doc_type`, `period`	Only documents from the applicant's fiscal period
Legal (template 05)	`clausula_id`, `tipo`	Only clauses of a specific type
Insurance (template 04)	`fare_class`, `cobertura`	Only policies of the contracted fare class
HR (template 09)	`departamento`, `nivel`, `version`	Only current policies for the department

5.4 How the `ingest.metadata` node produces these fields

In RAGorbit, the ingest.metadata node receives Documents from the chunker and labels each chunk. It can enrich metadata in three ways:

Loader propagation: the loader already adds source, page_number, etc.
Text extraction: domain regex or patterns (e.g., extract clause number from chunk text).
Session context: runtime metadata (e.g., aircraft_type comes from the user's session context).
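
The three enrichment sources can be combined in one small function. This is a sketch under assumptions, not the ingest.metadata implementation: `enrich`, the `CLAUSE_ID` pattern, and the rule that session values override earlier ones are all hypothetical choices.

```python
import re

# Hypothetical pattern for pulling the clause number out of the chunk text.
CLAUSE_ID = re.compile(r"CLÁUSULA\s+(\d+)\.")


def enrich(chunk_text: str, loader_metadata: dict, session_context: dict) -> dict:
    """Build chunk metadata from loader fields, text patterns, and runtime values."""
    metadata = dict(loader_metadata)          # 1. loader propagation
    match = CLAUSE_ID.search(chunk_text)      # 2. text extraction
    if match:
        metadata["clausula_id"] = int(match.group(1))
    metadata.update(session_context)          # 3. session context (wins on conflict)
    return {"text": chunk_text, "metadata": metadata}


chunk = enrich(
    "CLÁUSULA 9. CONFIDENCIALIDAD ...",
    loader_metadata={"source": "contrato_muestra.txt", "page_number": 4},
    session_context={"contrato": "CSP-2024-0087"},
)
print(chunk["metadata"]["clausula_id"])  # 9
```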

5.5 Metadata and reproducibility

The fields contrato, fecha, and revision_date allow re-running exactly the same historical query. If an auditor asks "which manual version answered the technician on March 15, 2024?", the system can filter by revision_date <= 2024-03-15 and reproduce the answer.
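
The audit filter amounts to a date comparison on the metadata. A minimal sketch, assuming revision_date is stored as an ISO 8601 string; `chunks_as_of` is a hypothetical helper and the two manual revisions are made-up data.

```python
from datetime import date


def chunks_as_of(chunks: list[dict], as_of: date) -> list[dict]:
    """Keep only chunks whose revision_date is on or before as_of."""
    return [
        chunk for chunk in chunks
        if date.fromisoformat(chunk["metadata"]["revision_date"]) <= as_of
    ]


# Made-up revisions of the same manual section.
manual = [
    {"text": "Torque 45 Nm", "metadata": {"revision_date": "2023-11-01"}},
    {"text": "Torque 47 Nm", "metadata": {"revision_date": "2024-06-10"}},
]
visible = chunks_as_of(manual, date(2024, 3, 15))
print([c["text"] for c in visible])  # ['Torque 45 Nm']
```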

6. Multimodal: tables and diagrams

6.1 The problem with rich PDFs

A technical manual PDF is not just text. It contains:

Tolerance tables: "maximum bolt torque: 45 Nm ± 5%"
Hydraulic diagrams: line numbers, valves, sensors
Figures with captions: "Fig. 32-11-00-991-010"

If you only extract text, you lose the semantic content of tables and diagrams. The retriever cannot find "bolt torque" because that information is in a table cell that the text extractor turned into "45 Nm ± 5%" without row/column context.

6.2 Tables → JSON

loader.multimodal with extractTables: true detects tables in the PDF and converts them to structured JSON:

{
  "tipo": "tabla",
  "titulo": "Límites de tolerancia — Tren de aterrizaje principal",
  "datos": [
    {"parametro": "juego_lateral_pivote", "min": "0.00 mm", "max": "0.35 mm", "unidad": "mm"},
    {"parametro": "torque_perno_superior", "nominal": "45", "tolerancia": "±5%", "unidad": "Nm"}
  ],
  "referencia": "Tabla 32-11-00-991-001"
}

This JSON is indexed as text. Now the query "what is the maximum lateral play of the pivot?" can retrieve this chunk and the LLM can answer "0.35 mm" with an exact citation.
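
How the JSON becomes indexable text is not specified here; one plausible serialization is a title line followed by one "key: value" line per row, so that parameter names and limits sit next to each other. The sketch below is that assumption only: `table_to_text` is a hypothetical helper and the table is a shortened copy of the example above.

```python
def table_to_text(table: dict) -> str:
    """Flatten an extracted table into one title line plus one line per row."""
    lines = [f"{table['titulo']} ({table['referencia']})"]
    for row in table["datos"]:
        lines.append(", ".join(f"{key}: {value}" for key, value in row.items()))
    return "\n".join(lines)


table = {
    "tipo": "tabla",
    "titulo": "Límites de tolerancia",
    "datos": [
        {"parametro": "juego_lateral_pivote", "min": "0.00 mm", "max": "0.35 mm"},
        {"parametro": "torque_perno_superior", "nominal": "45", "tolerancia": "±5%", "unidad": "Nm"},
    ],
    "referencia": "Tabla 32-11-00-991-001",
}
text = table_to_text(table)
print(text.splitlines()[1])  # parametro: juego_lateral_pivote, min: 0.00 mm, max: 0.35 mm
```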

6.3 Diagrams → vision → text

For diagrams, loader.multimodal with describeImages: true sends each figure to model.vision (Claude Opus 4.8 or another multimodal model). The model returns a text description:

"Diagram of the main landing gear hydraulic system of the A320.
It shows the hydraulic actuator (reference 10-43200-00) connected to the green
hydraulic line (system 1) through two shut-off valves. The nominal system
pressure is 3000 PSI. Figure 32-21-11-991-020."

This description is indexed and retrieved as normal text. The retriever can find "hydraulic actuator" even though the figure does not contain that text explicitly.

6.4 sectionScheme: ATA

The sectionScheme: ATA parameter tells the loader to preserve ATA numeric hierarchy (Chapter-Section-Subject: 32-11-00). This enables:

Chunking by ATA section: each section is an autonomous chunk with metadata.ata_chapter.
Hard filters: retrieval.vector can filter by ata_chapter: "32" before searching.

When to use sectionScheme: whenever the document has a standard numbering hierarchy (ATA, ISO, regulations with articles).

6.5 Limitations and when to scale

The multimodal pipeline is slower and costlier:

Table extraction: +50-200ms per page with tables.
Vision per diagram: 1-3s per vision model call, additional token cost.

Rule: only use extractTables: true and describeImages: true when tabular or visual content is essential to answer user questions. For an HR policy chatbot, you do not need vision. For aircraft maintenance manual RAG, it is essential.
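
The figures above translate into a quick back-of-the-envelope estimate. The sketch assumes the two ranges simply add up and that calls run sequentially; `multimodal_overhead_seconds` is a hypothetical helper, and the 500-page, 120-diagram manual is a made-up example.

```python
def multimodal_overhead_seconds(pages_with_tables: int, diagrams: int) -> tuple[float, float]:
    """Return (low, high) extra ingestion time using the ranges quoted above."""
    low = pages_with_tables * 50 / 1000 + diagrams * 1.0    # 50 ms per page, 1 s per diagram
    high = pages_with_tables * 200 / 1000 + diagrams * 3.0  # 200 ms per page, 3 s per diagram
    return low, high


low, high = multimodal_overhead_seconds(pages_with_tables=500, diagrams=120)
print(f"{low:.0f}s to {high:.0f}s")  # 145s to 460s
```

Vision calls dominate the estimate, so describeImages is usually the first switch to reconsider when ingestion time matters.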

7. Ingestion framework comparison

7.1 LangChain loaders

LangChain includes more than 100 loaders in langchain-community. They are generally simple wrappers around Python libraries:

from langchain_community.document_loaders import PyPDFLoader, CSVLoader, WebBaseLoader

# PDF
loader = PyPDFLoader("contrato.pdf")
docs = loader.load()  # one page = one Document

# CSV
loader = CSVLoader("datos.csv", metadata_columns=["doc_type", "period"])
docs = loader.load()  # one row = one Document

# Web
loader = WebBaseLoader(["https://example.com/politica"])
docs = loader.load()

Pros: easy to install, integrates with the LangChain ecosystem (splitters, stores). Cons: extraction quality varies by underlying library; does not include vision by default; multimodal requires extensions.

7.2 LlamaIndex readers

LlamaIndex uses the term "reader" instead of "loader". The llama-hub ecosystem has readers for dozens of sources:

from llama_index.readers.file import PDFReader, CSVReader
from llama_index.core import SimpleDirectoryReader

# PDF with per-page metadata
reader = PDFReader()
docs = reader.load_data("contrato.pdf")  # loads with page_label

# Whole directory (detects the file type automatically)
reader = SimpleDirectoryReader("data/contracts/", recursive=True)
docs = reader.load_data()

Pros: LlamaIndex's Node abstraction carries richer metadata by default; native integration with its indexes and splitters. Cons: separate ecosystem from LangChain; steeper learning curve.

7.3 Unstructured.io

Unstructured is a tool specialized in parsing unstructured documents. It categorizes each document element:

from unstructured.partition.pdf import partition_pdf

elements = partition_pdf("manual_tecnico.pdf", strategy="hi_res")
# elements is a list of typed objects:
# Title("Chapter 32 Landing Gear")
# NarrativeText("The main landing gear...")
# Table(text="| Parameter | Min | Max |...", metadata={"page_number": 47})
# Image(metadata={"filename": "fig_32-11.png"})

Pros: best extraction quality for complex PDFs; detects tables, lists, titles, figures; hi_res mode uses computer vision for complicated layouts. Cons: slower than simple loaders; hi_res mode requires detectron2 (heavy) or the cloud API.

7.4 When to use each

Tool	Best for	Avoid if
LangChain loaders	Simple PDFs, CSVs, web; LangChain ecosystem	You need very high extraction quality
LlamaIndex readers	LlamaIndex ecosystem; rich metadata; multiple formats in one directory	You only use LangChain
Unstructured.io	Rich PDFs (complex tables, multiple columns, figures); maximum quality	You have limited resources or the PDF is simple
RAGorbit `loader.multimodal`	Technical manuals with `sectionScheme`; tables → JSON; diagrams → vision	The document is text-only without tables/images

8. Full pipelining in RAGorbit

8.1 `ingest.chunker` node

The node receives Documents from the loader and produces Documents (chunks). Key config:

{
  "type": "ingest.chunker",
  "config": {
    "strategy": "by-clause",
    "chunkSize": 900,
    "overlap": 120
  }
}

The three strategies the node supports:

recursive — RecursiveCharacterTextSplitter (default).
by-section — splits on section headers (#, ##, or domain patterns).
by-clause — splits on numbered clauses (CLÁUSULA N., Artículo N.).

8.2 `ingest.metadata` node

Receives Documents from the chunker and adds metadata:

{
  "type": "ingest.metadata",
  "config": {
    "fields": ["doc_type", "period", "aircraft_type", "ata_chapter"]
  }
}

Fields can be populated from three sources:

Propagated from loader (e.g.: source, page_number).
Extracted from chunk text with regex (e.g.: clausula_id from header).
Injected at runtime from session context (e.g.: aircraft_type from user JWT).

8.3 Typical pipeline

[loader.pdf]          [ingest.chunker]       [ingest.metadata]
  Documents ─────────▶   Documents ──────────▶  Documents
                          strategy: by-clause    fields: [clausula_id,
                          chunkSize: 900              tipo, contrato,
                          overlap: 120                fecha]
                                                       │
                                              ┌────────┘
                                              ▼
                                       [store.pgvector]  ◀── [model.embedding]
                                         Embeddings
                                         Documents
                                              │
                                              ▼
                                         Retriever ──▶ [retrieval.vector]
                                                        hardFilters: [tipo]

8.4 Connection with template 09 (HR)

Template 09-hr-policy-assistant (seen in M1) uses the simplest pipeline:

loader.pdf → ingest.chunker (strategy: recursive) → store.chroma

No explicit ingest.metadata because the chatbot does not need to filter by document type — everything is HR policy. Relevance filtering is done by the retriever via similarity.

When you add multiple departments or policy versions, you do need metadata:

{ "fields": ["departamento", "vigente_desde", "version"] }

9. When to use / when not to / alternatives

When to invest in a robust ingestion pipeline

The corpus has more than ~1000 documents and is growing.
Documents have domain-specific structure (contracts, technical manuals, regulations).
Users ask questions that require filtering by type/date/context.
Answer precision has regulatory or safety consequences (aviation, medicine, credit).

When not to over-engineer the pipeline

The corpus is small (<100 documents) and static: a RecursiveCharacterTextSplitter with chunkSize: 1000 is enough.
You are in prototype phase: first validate that RAG solves the problem; then optimize chunking.
Documents are continuous text without structure (novels, blog posts): semantic or fixed chunking works well.

Alternatives to the standard pipeline

Alternative	When to choose it	Tradeoff
Unstructured.io API	You need maximum quality without implementing your own parsing	Cost per call, external dependency
LlamaIndex SimpleDirectoryReader	Multiple file types in one directory	Less flexible for domain metadata
Apache Tika	Heterogeneous corpus with rare formats (DOCX, ODT, PPT)	Java as dependency
No chunking (full context)	Short documents (<4000 tokens) and LLM with large window	Does not scale; expensive in tokens
Fine-tuning instead of RAG	Very stable documents + very repetitive questions	Costly to update; no source traceability

10. Layer ③ explained: chunking with LangChain from scratch

Prerequisite: in M1 you learned what LangChain is, the Document object (page_content + metadata), loaders (TextLoader), and the loader → splitter → store pipeline. If you do not remember, read §11 of the M1 guide first (5 minutes). Here we only teach what is new in M2: LangChain text splitters and how to write a custom one for domain chunking.

This section bridges what you did by hand in the lab (solucion_scratch.py) and what you will see in production with LangChain (lab/solucion_framework.py). When you finish it, you should be able to write Approach A and Approach B of the lab, not just read them.

10.1 Bridge table: scratch → LangChain

What you did by hand (layer ②)	Equivalent piece in LangChain (layer ③)
`open(path).read()`	`TextLoader(path).load()` → list of `Document`
Your `Chunk` dataclass	`Document(page_content=..., metadata={...})`
`re.compile(r'^CLÁUSULA...', re.MULTILINE)`	Logic inside `split_text()` of a custom splitter
Loop `matches[i].start()` → `matches[i+1].start()`	Same algorithm, but encapsulated in `ClauseSplitter`
`clasificar_clausula(titulo)`	`_clasificar(titulo)` inside the custom splitter
`chunk.metadata["source"] = "contrato_muestra.txt"`	Parent `Document` metadata propagated in `split_documents()`
`print(json.dumps(chunk))`	`splitter.split_documents(docs)` → list ready for `Chroma.from_documents()`

Layer ② (scratch)                   Layer ③ (LangChain)
─────────────────                   ─────────────────────
texto = open(...).read()     →      docs = TextLoader(...).load()
regex + manual loop          →      splitter.split_documents(docs)
hand-built metadata dict     →      automatic Document.metadata
standalone script            →      integration with vector stores

10.2 `RecursiveCharacterTextSplitter`: the recursive algorithm

It is LangChain's generic default splitter. It does not know your domain (clauses, ATA, articles); it only tries to cut text respecting separators from most to least semantic until each piece fits in chunk_size.

Installation: pip install langchain-text-splitters

from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    separators=["\n\n", "\n", ". ", " ", ""],  # order: most semantic → least
    chunk_size=1000,
    chunk_overlap=150,
    keep_separator=True,
)
# texto_largo: the raw text to split (any str)
chunks = splitter.create_documents([texto_largo])
# chunks[i] is a Document(page_content=..., metadata={})

The algorithm, step by step

Imagine text of 2500 characters and chunk_size=1000. The splitter works recursively on each fragment:

                    TEXT (2500 chars)
                           │
              Fits in chunk_size=1000?  NO
                           │
         Try separators[0] = "\n\n" (paragraphs)
                           │
              ┌────────────┴────────────┐
         Paragraph A (400)        Paragraph B (2100)
         Fits? YES → chunk 0      Fits? NO
                                        │
                         Try separators[1] = "\n" (lines)
                                        │
                         ┌──────────────┴──────────────┐
                    Line 1 (500)               Rest (1600)
                    Fits? YES → chunk 1        Fits? NO
                                                    │
                                    Try separators[2] = ". " (sentences)
                                                    │
                                    ... and so on until every piece ≤ 1000

Algorithm rules:

1. Receives a text block and the separator list (most to least semantic).
2. Tries to split with the first separator in the list that occurs in the text.
3. For each resulting sub-block:
   - If len(sub_bloque) ≤ chunk_size → it is a candidate chunk (adjacent small candidates are merged while the merged text still fits in chunk_size).
   - If len(sub_bloque) > chunk_size → recursion: return to step 2 with the next separator in the list.
4. If separators are exhausted, cut by characters (the separator "" forces a hard cut).
5. Apply chunk_overlap between consecutive chunks (sliding; see §4.6).
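
The rules above can be sketched in plain Python to see the recursion at work. This is a simplified teaching version, not LangChain's code: `recursive_split` and `merge_small` are hypothetical names, separators are dropped at chunk boundaries instead of kept, every separator is tried in order even if absent from the text, and chunk_overlap (step 5) is left out.

```python
def merge_small(parts: list[str], separator: str, chunk_size: int) -> list[str]:
    """Greedily join neighboring small parts while the result still fits."""
    merged, current = [], ""
    for part in parts:
        candidate = part if not current else current + separator + part
        if len(candidate) <= chunk_size:
            current = candidate
        else:
            merged.append(current)
            current = part
    if current:
        merged.append(current)
    return merged


def recursive_split(text: str, separators: list[str], chunk_size: int) -> list[str]:
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    if not separators or separators[0] == "":
        # Separators exhausted: hard cut by characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    separator, remaining = separators[0], separators[1:]
    chunks, small = [], []
    for part in text.split(separator):
        if not part.strip():
            continue
        if len(part) <= chunk_size:
            small.append(part)  # candidate chunk, may merge with neighbors
        else:
            chunks.extend(merge_small(small, separator, chunk_size))
            small = []
            # Too large: retry this part with the next separator.
            chunks.extend(recursive_split(part, remaining, chunk_size))
    chunks.extend(merge_small(small, separator, chunk_size))
    return chunks


text = "Short paragraph.\n\n" + "A long sentence that repeats. " * 30
chunks = recursive_split(text, ["\n\n", "\n", ". ", " "], chunk_size=200)
print(len(chunks))                         # 6
print(max(len(chunk) for chunk in chunks))  # 178
```

The short paragraph survives intact as the first chunk, while the 900-character paragraph falls through "\n" to ". " and is regrouped six sentences at a time.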

Concrete mini-example:

texto = (
    "Párrafo corto.\n\n"
    # "+" is required: without it the two adjacent literals are joined first
    # and the whole string (short paragraph included) is repeated 30 times.
    + "Párrafo larguísimo que supera el límite. " * 30  # ~1230 chars
)

splitter = RecursiveCharacterTextSplitter(
    separators=["\n\n", "\n", ". ", " "],
    chunk_size=500,
    chunk_overlap=0,
)
chunks = splitter.create_documents([texto])
# Approximate result:
#   Chunk 0: "Párrafo corto."           ← fit as a whole after the split on "\n\n"
#   Chunk 1: first sentences of the long paragraph  ← the long one was split on ". "
#   Chunk 2: the following sentences...

Parameters you must understand

Parameter	What it does	Common gotcha
`separators`	Ordered list of preferred cuts	Order matters: `["\nCLÁUSULA ", "\n\n", "\n", " "]` prioritizes clauses over paragraphs
`chunk_size`	Maximum characters per chunk	Too small fragments excessively; too large fills the LLM window
`chunk_overlap`	Characters repeated between neighboring chunks	With domain separators (clauses), usually `0` — see §4.6
`keep_separator`	If `True`, separator stays at the start of the next chunk	With `"\nCLÁUSULA "` and `keep_separator=True`, each chunk starts with `CLÁUSULA N.`

`.create_documents()` vs `.split_documents()`

# From raw text (no source metadata):
chunks = splitter.create_documents([texto])
# empty metadata: {}

# From Documents that a loader already produced (with source, page, etc.):
from langchain_community.document_loaders import TextLoader
docs = TextLoader("contrato.txt").load()
chunks = splitter.split_documents(docs)
# each chunk inherits the parent Document's metadata (source, etc.)

For real ingestion, you almost always use split_documents() because the loader already added source and other fields. See §7.1 for the loader comparison.

10.3 Writing your own splitter: inherit from `TextSplitter`

When the domain has predictable structure (clauses, articles, ATA sections), a generic splitter is not enough: you need rich metadata (clausula_id, tipo, contrato) that you can only extract with domain regex. The solution is to inherit from TextSplitter.

The interface you must implement

from langchain_text_splitters import TextSplitter
from langchain_core.documents import Document

class MiSplitter(TextSplitter):
    def split_text(self, text: str) -> list[str]:
        """REQUIRED: receives text, returns a list of strings."""
        ...

    def split_documents(self, documents: list[Document]) -> list[Document]:
        """OPTIONAL but recommended: override it to attach rich metadata."""
        ...

Method	Input	Output	When used
`split_text(text)`	One string	`list[str]`	Base API; other methods call it internally
`split_documents(docs)`	`list[Document]`	`list[Document]`	Real pipeline: preserves and enriches metadata

Why override split_documents(): the default TextSplitter implementation calls split_text() and wraps each string in a Document that carries only a copy of the parent Document's metadata. If you only implement split_text(), you lose the chance to add per-chunk fields such as clausula_id or tipo. The override lets you return complete Document objects.

Minimal skeleton connected to the lab

class ClauseSplitter(TextSplitter):
    def split_text(self, text: str) -> list[str]:
        # Delegates to the method that builds complete Documents
        return [d.page_content for d in self._split_to_docs(text)]

    def _split_to_docs(self, text: str) -> list[Document]:
        matches = list(self._PATRON.finditer(text))
        docs = []
        for i, m in enumerate(matches):
            inicio = m.start()
            fin = matches[i + 1].start() if i + 1 < len(matches) else len(text)
            docs.append(Document(
                page_content=text[inicio:fin].strip(),
                metadata={
                    "clausula_id": int(m.group(1)),
                    "titulo": m.group(2).strip(),
                    "tipo": self._clasificar(m.group(2)),
                    # ...
                },
            ))
        return docs

    def split_documents(self, documents: list[Document]) -> list[Document]:
        all_docs = []
        for doc in documents:
            for chunk in self._split_to_docs(doc.page_content):
                # Preserve the parent's metadata (source set by the loader)
                chunk.metadata["source"] = doc.metadata.get("source", "")
                all_docs.append(chunk)
        return all_docs

This is exactly the pattern of ClauseSplitter in lab/solucion_framework.py — the same regex logic as solucion_scratch.py, but packaged for the LangChain ecosystem.
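The skeleton leaves `_PATRON` and `_clasificar` undefined. As a rough standard-library sketch of what they might look like, the block below uses a hypothetical header regex, a hypothetical keyword table, and an invented sample text; the actual definitions in the lab files may differ.

```python
import re

# Hypothetical header pattern: "CLÁUSULA <n>. <TITLE>" on a single line.
# re.MULTILINE makes ^ and $ match at every line boundary, not only at the
# start and end of the whole string.
PATRON = re.compile(r"^CLÁUSULA (\d+)\.\s+(.+)$", re.MULTILINE)

# Hypothetical keyword table used to infer `tipo` from the clause title.
PALABRAS_CLAVE = {
    "responsabilidad": ("responsabilidad", "indemnización"),
    "pago": ("pago", "precio"),
}


def clasificar(titulo: str) -> str:
    """Return the first tipo whose keywords appear in the title, else 'general'."""
    titulo_min = titulo.lower()
    for tipo, palabras in PALABRAS_CLAVE.items():
        if any(p in titulo_min for p in palabras):
            return tipo
    return "general"


def parsear(text: str) -> list[dict]:
    """Delimit each clause from its header to the start of the next header."""
    matches = list(PATRON.finditer(text))
    clausulas = []
    for i, m in enumerate(matches):
        fin = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        clausulas.append({
            "clausula_id": int(m.group(1)),
            "titulo": m.group(2).strip(),
            "tipo": clasificar(m.group(2)),
            "texto": text[m.start():fin].strip(),
        })
    return clausulas


muestra = (
    "CONTRATO DE SERVICIOS\n"
    "CLÁUSULA 1. PRECIO Y PAGO\nEl cliente pagará mensualmente.\n"
    "CLÁUSULA 2. RESPONSABILIDAD\nEl proveedor responde por daños directos.\n"
)
clausulas = parsear(muestra)
for c in clausulas:
    print(c["clausula_id"], c["tipo"], repr(c["titulo"]))
```

Text before the first header (the contract title in the sample) falls outside every match, so it never becomes a clause.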

10.4 Loader → splitter integration (full pipeline)

┌──────────────┐     load()      ┌──────────────────┐   split_documents()   ┌────────────────┐
│ TextLoader   │ ──────────────▶ │ list[Document]   │ ────────────────────▶ │ list[Document] │
│ contrato.txt │                 │ metadata: source │                       │ chunks with    │
└──────────────┘                 └──────────────────┘                       │ metadata       │
                                                                            └────────────────┘

from langchain_community.document_loaders import TextLoader

loader = TextLoader("datos/contrato_muestra.txt")
docs = loader.load()
# docs[0].page_content = full text of the file
# docs[0].metadata = {"source": "datos/contrato_muestra.txt"}

splitter = ClauseSplitter(contract_id="CSP-2024-0087", fecha="2024-01-15")
chunks = splitter.split_documents(docs)
# 13 Documents, each with clausula_id, titulo, tipo, contrato, fecha, source

Where each ingestion framework fits (summary; detail in §7):

Framework	Role in this pipeline	Equivalent piece
LangChain `TextLoader`	Load the file	`open().read()` in scratch
LangChain `TextSplitter`	Split + metadata	Your regex loop in scratch
LlamaIndex `SimpleDirectoryReader`	Loader alternative; detects file type	Several LangChain loaders by hand
Unstructured	Advanced parsing of rich PDFs	Does not replace the splitter; goes before (better input text)

In M2 the focus is the splitter. You compared loaders in §7; in the lab we use TextLoader because the contract is already plain .txt.

10.5 Block-by-block walkthrough of `solucion_framework.py`

Open lab/solucion_framework.py and follow along with this section. The file has three blocks.

Block 1 — Approach A: `RecursiveCharacterTextSplitter`

splitter_a = RecursiveCharacterTextSplitter(
    separators=["\nCLÁUSULA ", "\n\n", "\n", " "],
    chunk_size=1200,
    chunk_overlap=0,
    keep_separator=True,
)
chunks_a = splitter_a.create_documents([texto_contrato])

Line / decision	What it does	Why
`"\nCLÁUSULA "` first	Tries to cut before each clause header	Leverages contract structure without custom regex
`chunk_size=1200`	Limit per chunk	If a clause exceeds 1200 chars, the algorithm falls back to the next separators (`\n\n`, `\n`, `" "`) and splits into smaller pieces
`chunk_overlap=0`	No overlap	Clauses are autonomous units — see §4.6
`keep_separator=True`	Keeps `CLÁUSULA N.` at chunk start	Retriever returns identifiable context
`create_documents([texto])`	Splits raw text	No intermediate loader; metadata stays empty

Pedagogical limitation: Approach A does not produce clausula_id or tipo. It is a good baseline for comparison, not the production solution for contracts.
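The separator fallback described in the table can be sketched in plain Python. This is a simplified model, not LangChain's implementation: it matches separators literally, always keeps the separator at the start of the next piece, and ignores overlap and whitespace stripping. The name `recursive_split` and the sample text are hypothetical.

```python
def recursive_split(text: str, separators: list[str], chunk_size: int) -> list[str]:
    """Simplified recursive splitting: try separators in order, merge small pieces."""
    if len(text) <= chunk_size:
        return [text]
    if not separators:
        # Nothing left to try: hard cut by size.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, remaining = separators[0], separators[1:]
    if sep not in text:
        return recursive_split(text, remaining, chunk_size)

    head, *tail = text.split(sep)
    pieces = [head] + [sep + part for part in tail]  # separator stays with the next piece

    chunks: list[str] = []
    buffer = ""
    for piece in pieces:
        if len(buffer) + len(piece) <= chunk_size:
            buffer += piece  # still fits: merge with the previous pieces
            continue
        if buffer:
            chunks.append(buffer)
        if len(piece) <= chunk_size:
            buffer = piece
        else:
            buffer = ""
            # Too large even on its own: fall back to the next separator.
            chunks.extend(recursive_split(piece, remaining, chunk_size))
    if buffer:
        chunks.append(buffer)
    return chunks


texto = "CONTRATO\nCLÁUSULA 1. Objeto.\nCLÁUSULA 2. " + "palabra " * 10
chunks = recursive_split(texto, ["\nCLÁUSULA ", "\n\n", "\n", " "], chunk_size=40)
for c in chunks:
    print(len(c), repr(c))
```

In this model the preface and Clause 1 end up in the same chunk because together they fit within `chunk_size`, while Clause 2 is too long and gets cut at spaces. Size-driven merging and splitting of this kind is one reason a generic splitter does not necessarily yield one chunk per clause.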

Block 2 — Approach B: custom `ClauseSplitter`

Component	Scratch equivalent	Function
`_PATRON` with `re.MULTILINE`	`_PATRON_CLAUSULA`	Detect only line-start headers
Loop `matches[i].start()` → `fin`	Same loop in `parsear_clausulas()`	Delimit each clause's text
`_clasificar(titulo)`	`clasificar_clausula(titulo)`	Infer `tipo` from keywords
`split_documents([doc_base])`	`main()` that reads and splits	Integration with `source` metadata

doc_base = Document(
    page_content=texto_contrato,
    metadata={"source": "contrato_muestra.txt"},
)
chunks_b = splitter_b.split_documents([doc_base])
# Expected: 13 chunks, same metadata as solucion_scratch.py

Block 3 — Vector store integration (commented out)

# vectordb = Chroma.from_documents(documents=chunks_b, embedding=OpenAIEmbeddings(), ...)
# results = vectordb.similarity_search(query="...", k=3, filter={"tipo": "responsabilidad"})

This block closes the loader → splitter → store pipeline from §8.3. Approach B chunks carry tipo in metadata, enabling the hard filter from §5: only chunks with tipo="responsabilidad" compete in search.
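The effect of the hard filter can be pictured without a vector store: metadata narrows the candidate set first, and only the survivors are ranked. In the sketch below the word-overlap score is a toy stand-in for embedding similarity, and `buscar`, `puntuar`, and the sample chunks are hypothetical.

```python
def puntuar(query: str, texto: str) -> int:
    """Toy relevance score: number of distinct query words present in the chunk."""
    palabras_chunk = set(texto.lower().split())
    return sum(1 for p in set(query.lower().split()) if p in palabras_chunk)


def buscar(chunks: list[dict], query: str, k: int, filtro: dict) -> list[dict]:
    """Apply the hard filter first, then rank only the remaining candidates."""
    candidatos = [
        c for c in chunks
        if all(c["metadata"].get(campo) == valor for campo, valor in filtro.items())
    ]
    candidatos.sort(key=lambda c: puntuar(query, c["texto"]), reverse=True)
    return candidatos[:k]


chunks = [
    {"texto": "el proveedor responde por daños directos",
     "metadata": {"tipo": "responsabilidad"}},
    {"texto": "el cliente responde por daños en el pago",
     "metadata": {"tipo": "pago"}},
    {"texto": "límite de responsabilidad del proveedor",
     "metadata": {"tipo": "responsabilidad"}},
]
resultados = buscar(chunks, "daños del proveedor", k=3,
                    filtro={"tipo": "responsabilidad"})
for r in resultados:
    print(r["metadata"]["tipo"], "|", r["texto"])
```

The `pago` chunk also mentions "daños", yet it never reaches the ranking step because the filter excludes it beforehand.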

10.6 When to use generic vs domain custom splitter

Situation	Recommended splitter	Reason
Quick prototype, unstructured text	`RecursiveCharacterTextSplitter`	Zero custom code; enough to validate RAG
HR policies (paragraphs)	`RecursiveCharacterTextSplitter` with `separators=["\n\n", "\n", ". "]`	Generic paragraph structure — see §4.2
Contracts, regulations, ATA manuals	Custom splitter (`ClauseSplitter`, `ATASplitter`, etc.)	Domain metadata + zero false positives
PDFs with complex tables	Unstructured before + generic or custom splitter after	Better input parsing; see §7.3

Gotchas that appear in production

1. keep_separator and the first chunk

With keep_separator=True and separator "\nCLÁUSULA ", text before the first clause (contract header, date, parties) may remain as a loose chunk 0. In real contracts, discard or merge that preface in post-processing.
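One possible post-processing step for that loose preface, as a sketch: peel off the first chunk when it does not start with a clause header, then decide whether to discard it or store it elsewhere. The helper name `separar_prefacio` and the sample chunks are hypothetical.

```python
def separar_prefacio(chunks: list[str], prefijo: str = "CLÁUSULA ") -> tuple[str, list[str]]:
    """Split off a leading chunk that does not start with a clause header."""
    if chunks and not chunks[0].lstrip().startswith(prefijo):
        return chunks[0].strip(), chunks[1:]
    return "", chunks


chunks = [
    "CONTRATO DE SERVICIOS\nPartes: A y B",
    "\nCLÁUSULA 1. OBJETO\nTexto de la cláusula.",
    "\nCLÁUSULA 2. PAGO\nTexto de la cláusula.",
]
prefacio, clausulas = separar_prefacio(chunks)

# Option A: discard the preface and index only `clausulas`.
# Option B: keep `prefacio` as document-level context, for example in a
# metadata field or as a separate chunk with its own tipo.
print(repr(prefacio), len(clausulas))
```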

2. Overlap in domain chunks

With by-clause chunking, overlap is usually 0: repeating the end of Clause 3 at the start of Clause 4 adds no useful context and duplicates embeddings. Reserve overlap for continuous narrative text (§4.6).

3. Metadata that gets lost

# ❌ split_text only: returns plain strings, so the parent's metadata is dropped
chunks = splitter.split_text(doc.page_content)

# ✅ split_documents: preserves source and enriches
chunks = splitter.split_documents([doc])

If you call only split_text() and then build each Document by hand without copying doc.metadata, you lose source and any other field the loader added. The retriever can then neither filter by source file nor cite it.
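If you do need to wrap raw strings yourself, copying the parent's metadata into every chunk avoids the loss. The sketch below uses a small dataclass as a stand-in for LangChain's Document so that it runs with the standard library only; `Doc`, `envolver`, and the `chunk_index` field are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Minimal stand-in for a LangChain Document (text plus a metadata dict)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def envolver(trozos: list[str], padre: Doc) -> list[Doc]:
    """Wrap raw strings as Docs, copying the parent's metadata into each one."""
    return [
        # {**padre.metadata, ...} builds a new dict, so the parent is not mutated.
        Doc(page_content=t, metadata={**padre.metadata, "chunk_index": i})
        for i, t in enumerate(trozos)
    ]


padre = Doc("CLÁUSULA 1. OBJETO\nCLÁUSULA 2. PAGO", {"source": "contrato.txt"})
trozos = padre.page_content.split("\n")  # stands in for splitter.split_text(...)
chunks = envolver(trozos, padre)
for c in chunks:
    print(c.metadata, "|", c.page_content)
```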

4. RecursiveCharacterTextSplitter without ^ anchor

The separator "\nCLÁUSULA " does not distinguish headers from references like "conforme a la Cláusula 9..." if that reference starts after a line break. That is why Approach A may generate spurious chunks; Approach B with ^ in the regex does not.
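The difference can be checked on a small sample. The sketch below counts header candidates three ways: an unanchored regex, the literal line-start separator, and an anchored pattern that also requires the `N.` plus title shape. The sample text and both patterns are assumptions for illustration; the lab's `_PATRON` may differ.

```python
import re

texto = (
    "CLÁUSULA 1. OBJETO\n"
    "El servicio se presta conforme a la CLÁUSULA 2 y a la\n"
    "CLÁUSULA 3 de este contrato.\n"
    "CLÁUSULA 2. PAGO\n"
    "El pago es mensual.\n"
)

# Unanchored regex: also matches references in the middle of a line.
sin_ancla = re.findall(r"CLÁUSULA \d+", texto)

# Literal separator as in Approach A: any line that starts with "CLÁUSULA " cuts.
# A leading "\n" is prepended so the first line is counted too.
cortes_literal = ("\n" + texto).count("\nCLÁUSULA ")

# Anchored pattern (hypothetical): line start, number, period, then a title.
PATRON = re.compile(r"^CLÁUSULA (\d+)\.\s+\S.*$", re.MULTILINE)
cabeceras = [int(m.group(1)) for m in PATRON.finditer(texto)]

print(len(sin_ancla), cortes_literal, cabeceras)
```

In this sample, anchoring at line start removes the mid-line reference, and requiring the header shape removes the reference that happens to wrap onto a new line.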

10.7 Guided exercise: write your version before looking at the solution

Follow this order in the lab:

1. Finish layer ② (solucion_scratch.py) and verify 13 chunks against expected.md.
2. With pip available, install langchain-text-splitters langchain-community.
3. Write Approach A with RecursiveCharacterTextSplitter: print how many chunks it produces and compare with 13.
4. Write Approach B: ClauseSplitter class inheriting from TextSplitter.
5. Compare your code with lab/solucion_framework.py line by line.

11. Checkpoint

You know it if you can…

Explain the difference between a selectable PDF and a scanned one, and when to use OCR.
Choose the correct chunking strategy given a document type (contract, technical manual, article, CSV).
Calculate how many chunks a 5000-char text produces with chunkSize: 1000 and overlap: 150.
Explain why re.MULTILINE with ^ avoids false positives in the clause chunker.
Define what metadata fields you would add to template 08 (manufacturing) and what filters they serve.
Compare LangChain loaders vs. Unstructured.io for a PDF with complex tables.
Trace the full loader → chunker → metadata → store pipeline for template 02 (banking).
Explain the recursive algorithm of RecursiveCharacterTextSplitter and when to use it vs a custom splitter.
Implement split_text() and split_documents() when inheriting from TextSplitter.
Identify why metadata is lost if you only use split_text() without propagating the parent document's metadata.

What to review if it is still unclear

Full section 4 if chunking is still confusing.
Section 5 if you do not understand how hard filters use metadata.
Section 10 if layer ③ (LangChain splitters) feels abrupt.
Document and basic loaders: M1 §11.
docs/02-node-catalog.md §loaders and §ingestion in the ragorbit repo.
READMEs of examples/05-legal-contract-review/ and examples/08-manufacturing-maintenance-rag/.

Next step

Do the lab (lab/enunciado.md): chunk the contract with layer ② (solucion_scratch.py) and verify 13 chunks against lab/expected.md.
Follow the layer ③ guided task: write Approach A and the Approach B ClauseSplitter using §10, and compare with solucion_framework.py.
Solve exercises 28–30 on LangChain splitters.

When you finish, continue with M3 — Embeddings and Vector Stores (03-embeddings-y-stores/).
