Multimodal — voice & vision
Module 10 · Multimodal — Voice, vision, and generation (io.stt, model.vision)
Prerequisite: M1–M6 (LLMs, ingestion, retrieval, generation with citations, agents). M2 covers loader.multimodal in ingestion; this module goes deeper into multimodal input and output.
RAGorbit nodes: io.stt, model.vision, loader.multimodal, model.llm (multimodal)
Anchor templates: 07-telecom-callcenter-copilot (STT streaming), 04-insurance-claims (vision in claims), 08-manufacturing-maintenance-rag (AMM + ATA diagrams)
1. What "multimodal" means and why it matters in RAG
A multimodal system processes more than one type of signal: text, voice, image, video, scanned tables. In RAG and agents, the goal is not to "look pretty" — it is to convert those signals into structured text that the rest of the pipeline (retrieval, generation, rules) already knows how to handle.
┌─────────────────────────────────────────────────────────────────────┐
│ TYPICAL MULTIMODAL PIPELINE │
│ │
│ Audio (.wav) ──▶ [STT] ──▶ text │
│ Image (.jpg) ──▶ [Vision] ──▶ description + metadata │
│ PDF with photos ──▶ [loader.multimodal] ──▶ enriched chunks │
│ │
│ text + descriptions + metadata │
│ │ │
│ ▼ │
│ [Retrieval / RAG / Agent] │
│ │ │
│ ▼ │
│ Text response ◀─┴──▶ [TTS] ──▶ audio (optional) │
│ └──▶ [DALL·E/SDXL] ──▶ image (optional) │
└─────────────────────────────────────────────────────────────────────┘
1.1 When to use multimodal?
Use multimodal input when:
- The user speaks (call center, voice assistant, dictation on the ramp).
- Documents include diagrams, photos, or tables whose information is not in OCR text.
- You need to classify damage in photos (insurance, industrial inspection).
- The input channel is photo + question (WhatsApp with an invoice image).
Do NOT use multimodal when:
- All relevant content is already in clean text (digital PDF, Markdown, SQL).
- Latency is critical (<500 ms) and you can ask the user to type instead of speaking or photographing.
- Cost per query matters and the visual signal does not add new information (decorative logo in a PDF).
1.2 The RAGorbit pattern: convert → retrieve → cite
RAGorbit does not index pixels or audio waveforms directly in the vector store (except specialized multimodal embeddings). The dominant pattern is:
- Convert audio/image to text (STT, vision).
- Enrich with metadata (ATA chapter, damage type, language).
- Retrieve with fused text + hard filters.
- Generate with mandatory citations (M5).
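The four steps above can be sketched end to end with plain Python. This is a minimal illustration, not RAGorbit code: the `Chunk` class, the function names, the chunk ids, and the word-overlap scoring are all hypothetical stand-ins for the real nodes.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    """One indexed text chunk plus the metadata used for hard filters."""
    chunk_id: str
    text: str
    metadata: dict = field(default_factory=dict)


def build_query(transcript: str, image_description: str) -> str:
    """Step 1: both signals are already text, so they merge into one query."""
    return f"{transcript.strip()} {image_description.strip()}".strip()


def retrieve(query: str, chunks: list[Chunk], hard_filters: dict, k: int = 2) -> list[Chunk]:
    """Step 3: apply hard filters first, then rank by number of shared words."""
    query_words = set(query.lower().split())
    candidates = [
        c for c in chunks
        if all(c.metadata.get(key) == value for key, value in hard_filters.items())
    ]
    scored = [(len(query_words & set(c.text.lower().split())), c) for c in candidates]
    scored = [(score, c) for score, c in scored if score > 0]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in scored[:k]]


def answer_with_citations(hits: list[Chunk]) -> dict:
    """Step 4: an answer is only returned together with the chunk ids it cites."""
    if not hits:
        return {"answer": "no determinable", "citations": []}
    return {"answer": hits[0].text, "citations": [c.chunk_id for c in hits]}


# Step 2 (enrichment) is represented by the metadata on each chunk.
chunks = [
    Chunk("amm-32-11", "inspect hydraulic leak on MLG retraction actuator",
          {"aircraft_type": "A320", "ata_chapter": "32"}),
    Chunk("amm-29-10", "hydraulic reservoir servicing procedure",
          {"aircraft_type": "A320", "ata_chapter": "29"}),
    Chunk("amm-32-11-b737", "inspect hydraulic leak on MLG actuator",
          {"aircraft_type": "B737", "ata_chapter": "32"}),
]

query = build_query("hydraulic leak near the left gear", "fluid stain on MLG actuator")
hits = retrieve(query, chunks, {"aircraft_type": "A320", "ata_chapter": "32"})
result = answer_with_citations(hits)
print(result)
```

The hard filters run before any scoring, so a chunk for the wrong aircraft type can never be cited, however similar its text is.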
2. Multimodal challenges in production
| Challenge | What it means | Mitigation |
|---|---|---|
| Alignment | Voice and image must refer to the same incident | Session IDs; timestamp; merge into a single query |
| Latency | STT streaming + vision + RAG can exceed 3 s | STT streaming (Deepgram); async vision; description cache |
| Cost | Vision ~$0.01–0.03/img; Whisper API per minute | Offline batch with local Whisper; compress images |
| Languages | Multilingual Whisper ≠ all accents | language hint; fine-tune or regional model |
| Formats | Audio codecs, HEIC, scanned PDF | Normalize at ingestion (loader.multimodal) |
| Visual hallucination | The model "sees" damage that does not exist | Citations to the AMM; HITL on WARNING; rules on metadata |
| PII in audio/image | Voices, license plates, faces in photos | Redaction pre-STT; blur in ingestion pipeline |
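As an illustration of the alignment and latency mitigations in the table, the sketch below runs the two conversions concurrently and merges them into a single query. `fake_stt` and `fake_vision` are hypothetical stubs that only sleep; a real pipeline would call the STT and vision providers instead.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_stt(audio_path: str) -> str:
    """Hypothetical stand-in for an STT call; sleeps to imitate I/O latency."""
    time.sleep(0.2)
    return f"transcript of {audio_path}"


def fake_vision(image_path: str) -> str:
    """Hypothetical stand-in for a vision call; sleeps to imitate I/O latency."""
    time.sleep(0.2)
    return f"description of {image_path}"


def convert_in_parallel(audio_path: str, image_path: str) -> tuple[str, str]:
    """Run both conversions at once; they do not depend on each other."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        stt_future = pool.submit(fake_stt, audio_path)
        vision_future = pool.submit(fake_vision, image_path)
        return stt_future.result(), vision_future.result()


start = time.perf_counter()
transcript, description = convert_in_parallel("call.wav", "photo.jpg")
elapsed = time.perf_counter() - start

# One merged query for the same incident, instead of two separate RAG passes.
merged_query = f"{transcript}. {description}"
print(f"{merged_query!r} in {elapsed:.2f} s")
```

Threads are enough here because both calls are I/O bound; the elapsed time should land near the slower of the two calls rather than their sum.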
Typical latency (call center copilot, template 07):
Deepgram STT (streaming) ~300–800 ms
model.intent ~10 ms
retrieval + rerank ~200–400 ms
logic.prompt + citations ~500–1500 ms
─────────────────────────────────────────
Target total: < 1.5 s from end of phrase
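A quick way to read this budget is to add up the stage ranges and compare them with the target. The sketch below does only that arithmetic, under the assumption that the stages run strictly one after another; the dictionary keys are illustrative labels.

```python
# Stage ranges in milliseconds, copied from the budget above.
STAGES_MS = {
    "stt_streaming": (300, 800),
    "model_intent": (10, 10),
    "retrieval_rerank": (200, 400),
    "prompt_citations": (500, 1500),
}
TARGET_MS = 1500


def budget(stages: dict) -> tuple[int, int]:
    """Return (best case, worst case) totals for sequential stages."""
    best = sum(low for low, _ in stages.values())
    worst = sum(high for _, high in stages.values())
    return best, worst


best_ms, worst_ms = budget(STAGES_MS)
print(f"best case:  {best_ms} ms (headroom {TARGET_MS - best_ms} ms)")
print(f"worst case: {worst_ms} ms (overrun {worst_ms - TARGET_MS} ms)")
```

The best case (1010 ms) fits the target, while the worst case (2710 ms) does not, so the target only holds if the slow stages, mainly generation, stay near the low end of their range.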
3. STT — Speech-to-Text and the io.stt node
3.1 What it does
STT converts audio to text. In RAGorbit, the io.stt node produces Message fragments in streaming as audio arrives — it does not wait for the user to finish speaking.
See full reference: referencia/catalogo-nodos.md §io.stt.
{
"type": "io.stt",
"config": {
"provider": "deepgram",
"language": "es"
}
}
3.2 Streaming vs batch
| Mode | When | Example |
|---|---|---|
| Streaming | Real time, copilot, IVR | Template 07: call fragment → suggestion in panel |
| Batch | Full recording, post-processing | Technician voice note on ramp → transcript → RAG |
In batch, local Whisper is the reference option (open-weights, no per-minute API fee). In streaming, Deepgram Nova-2 and Amazon Transcribe Streaming lead on latency.
3.3 When NOT to use io.stt
- If you already have the transcript (.txt file, subtitles, CRM).
- If audio is processed in overnight batch: use Whisper as an io.batch or tool.function step.
4. Whisper — the reference model for offline STT
Whisper (OpenAI, open-weights) is the standard for offline multilingual transcription. Models: tiny, base, small, medium, large — larger = better quality, slower.
# Conceptual — see §10 for full implementation
import whisper
model = whisper.load_model("base")
result = model.transcribe("nota_tecnico.wav", language="es")
print(result["text"])
4.1 Strengths
- Multilingual without fine-tuning (99 languages).
- Open-weights: runs locally via Hugging Face, Ollama, pip install openai-whisper.
- Segments with timestamps: useful for aligning with video or review UI.
- No per-minute API fee if it runs on your own GPU/CPU.
4.2 Limitations
- Not true streaming — processes the full file (or large chunks).
- Latency on CPU: ~10–30× real time with the base model.
- Background noise on ramp/airport degrades quality.
4.3 Whisper vs Deepgram
| | Whisper (local) | Deepgram Nova-2 |
|---|---|---|
| Latency | High (batch) | Low (<300 ms streaming) |
| Cost | Fixed GPU/CPU | ~$0.0043/min |
| Privacy | Data stays local | Audio goes to API |
| Languages | 99 | 36+ with excellent ES |
| Ideal case | Batch, offline, private | Real-time call center (template 07) |
See also: referencia/glosario.md §Whisper, referencia/glosario.md §STT.
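The cost row can be turned into a break-even estimate. The sketch below uses the per-minute price from the table; the monthly call volume and the fixed GPU cost are hypothetical numbers chosen only to show the arithmetic.

```python
DEEPGRAM_USD_PER_MIN = 0.0043  # per-minute price from the comparison table


def api_cost_usd(minutes: float) -> float:
    """Monthly API cost for a given number of transcribed minutes."""
    return minutes * DEEPGRAM_USD_PER_MIN


def break_even_minutes(fixed_monthly_usd: float) -> float:
    """Minutes per month at which a fixed-cost machine matches the API bill."""
    return fixed_monthly_usd / DEEPGRAM_USD_PER_MIN


monthly_minutes = 10_000   # hypothetical call volume
gpu_monthly_usd = 300.0    # hypothetical fixed cost of a local Whisper machine

print(f"API cost: ${api_cost_usd(monthly_minutes):.2f}/month")
print(f"Break-even: {break_even_minutes(gpu_monthly_usd):,.0f} min/month")
```

Under these assumed numbers the API is cheaper until volume approaches 70,000 minutes per month, so at low volume the case for local Whisper rests on privacy rather than price.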
5. Vision — describing images, diagrams, and tables→JSON
5.1 The model.vision node
What it does: multimodal model that describes images, diagrams, and tables as text for the RAG pipeline. Produces Model and connects to loader.multimodal.
{
"type": "model.vision",
"config": {
"model": "anthropic:claude-opus-4-8",
"apiKeyRef": "ANTHROPIC_API_KEY"
}
}
Reference: referencia/catalogo-nodos.md §model.vision.
5.2 Use cases in the templates
Template 04 (insurance): vehicle damage photos → description → logic.rules + logic.structured with cited clause.
Template 08 (manufacturing): hydraulic diagrams from the AMM → text describing components → indexable chunks with sectionScheme: ATA.
AMM PDF
│
▼
loader.multimodal (extractTables: true, describeImages: true)
│ ▲
│ │ Model
└──── model.vision ──┘
│
▼
"Figure 32-11-05: MLG retraction actuator with inspection points A, B, C..."
│
▼
ingest.chunker → store.pgvector → retrieval with hardFilters (aircraft_type, ata_chapter)
5.3 Scanned tables → JSON
loader.multimodal with extractTables: true converts tolerance, deductible, or torque tables into structured JSON — not just plain text. That enables deterministic logic.rules on numeric values (M5).
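To show why structured output matters, here is a small sketch of a deterministic rule applied to an extracted table. The JSON shape, the field names, and every numeric value are made up for illustration; they are not the real output format of loader.multimodal and do not come from any manual.

```python
import json

# Hypothetical extracted table. Values are illustrative, not from any AMM.
table_json = """
{
  "table_id": "torque-limits",
  "rows": [
    {"fastener": "actuator bolt", "min_nm": 40.0, "max_nm": 45.0},
    {"fastener": "bracket nut", "min_nm": 12.0, "max_nm": 14.0}
  ]
}
"""


def check_torque(table: dict, fastener: str, measured_nm: float) -> str:
    """Deterministic rule: compare a measured value with the table limits."""
    for row in table["rows"]:
        if row["fastener"] == fastener:
            if row["min_nm"] <= measured_nm <= row["max_nm"]:
                return "within_tolerance"
            return "out_of_tolerance"
    return "unknown_fastener"


table = json.loads(table_json)
print(check_torque(table, "actuator bolt", 43.0))
print(check_torque(table, "actuator bolt", 46.0))
```

Because the limits are numbers rather than prose, the comparison needs no LLM call and gives the same result on every run.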
5.4 When NOT to use vision
- PDF with selectable text (use loader.pdf directly).
- Decorative images with no technical information.
- When cost per document (hundreds of pages with figures) exceeds budget — consider OCR + layout parser (Unstructured) first.
5.5 Competing vision models
| Model | Strength | Weakness |
|---|---|---|
| GPT-4o | Excellent on diagrams + JSON | API cost |
| Claude Opus | Reasoning over technical figures | Latency |
| Gemini Pro Vision | Long context, video | Google ecosystem |
| LLaVA / Qwen-VL | Local, open-weights | Lower quality on dense tables |
| IBM Granite Vision | Enterprise, watsonx | Integration curve |
| Pixtral (Mistral) | Good EU balance | Smaller ecosystem |
6. Image and audio generation (multimodal output)
So far we covered multimodal input (audio→text, image→text). Generation converts text into image or audio — useful for spoken responses, illustrations, or IVR.
6.1 Image generation
| Model | Type | When to use |
|---|---|---|
| DALL·E 3 | OpenAI API | Illustrations, mockups, conceptual diagrams |
| Stable Diffusion XL | Local/open | High volume, LoRA control, no API |
| Sora | Video (OpenAI) | Short clips, prototypes — high cost and latency |
| Midjourney | Discord/web app | Artistic quality |
In production RAG, image generation is rare — the value is in retrieving and citing real documents. It is used more in creative assistants or to generate explanatory diagrams ad hoc.
6.2 TTS — Text-to-Speech
Converts the LLM response into synthetic audio:
# Conceptual
client.audio.speech.create(model="tts-1", voice="nova", input=respuesta)
When to use TTS:
- Bidirectional voice bot: STT (input) + LLM + TTS (output).
- IVR that reads simplified policies.
- Accessibility.
Alternatives: ElevenLabs (natural voices), Amazon Polly, Google Cloud TTS, Coqui TTS (local).
6.3 Full voice pipeline
User speaks ──▶ io.stt ──▶ text ──▶ RAG/agent ──▶ text response ──▶ TTS ──▶ audio
Template 07 implements only the left half (STT → copilot for human agent). A closed voice bot would add TTS at the end.
7. Multimodal embeddings and multimodal vector DB
7.1 The problem
Text embeddings (M3) do not represent images directly. Multimodal embeddings project text and image into the same vector space (or aligned spaces):
- CLIP (OpenAI): image ↔ text
- ImageBind (Meta): image, audio, text, depth…
- Cohere embed-multilingual-v3: text + image in API
7.2 Use cases
| Case | How it works |
|---|---|
| Search "photos similar to this description" | Embed text query + compare with image embeddings |
| Cross-modal search | "Find diagrams related to this damage photo" |
| Indexing in Weaviate/Milvus multimodal | Native modules for image + text |
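The first row of the table comes down to a nearest-neighbour search in a shared vector space. The sketch below uses tiny hand-written vectors in place of real CLIP-style embeddings; the file names and all numbers are invented for illustration.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Toy "image embeddings". A real system would get these from a multimodal model.
image_index = {
    "photo_leak.jpg": [0.9, 0.1, 0.0],
    "photo_tire.jpg": [0.1, 0.9, 0.1],
    "diagram_actuator.png": [0.7, 0.0, 0.6],
}

# Toy embedding of a text query, assumed to live in the same space.
text_query_vec = [0.8, 0.0, 0.2]

ranked = sorted(
    image_index,
    key=lambda name: cosine(text_query_vec, image_index[name]),
    reverse=True,
)
print(ranked)
```

The search returns file names ranked by similarity, not text passages, which is why this approach is harder to cite than the vision-to-text pattern.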
7.3 When to use multimodal embeddings vs vision→text?
| Approach | Advantage | Disadvantage |
|---|---|---|
| Vision → text → embed text (RAGorbit pattern) | Compatible with any vector store; citations on text | Loses fine detail; vision cost at ingestion |
| Embed image directly | Visual similarity search | Less interpretable; hard to cite; specialized stores |
For regulated domains (MRO, insurance, healthcare), vision→text→RAG with citations is the preferred pattern because you can audit the indexed text.
7.4 Vector DB with multimodal support
Weaviate and Milvus offer modules for image+text. Chroma and pgvector in RAGorbit index mainly text — the natural path is to describe images at ingestion (loader.multimodal).
8. Technology comparison — quick decision
8.1 STT
| Provider | Latency | Streaming | Privacy | Best for |
|---|---|---|---|---|
| Local Whisper | High | No | ✅ Total | Batch, offline, lab |
| Deepgram | Very low | ✅ | API | Call center (07) |
| OpenAI Whisper API | Medium | No | API | Rapid prototypes |
| Amazon Transcribe | Medium | ✅ | AWS | AWS stack |
8.2 Vision
| Provider | Technical quality | Local | Best for |
|---|---|---|---|
| GPT-4o | High | No | Diagrams, JSON, insurance |
| Claude Opus | High | No | Technical manuals (08) |
| LLaVA-1.6 | Medium | ✅ | Prototypes without API |
| watsonx Granite Vision | High | Hybrid | IBM enterprise |
8.3 Open platforms (HF / Ollama / watsonx)
- Hugging Face: download Whisper, LLaVA, CLIP, SDXL; transformers + pipeline.
- Ollama: ollama run llava for local vision; Whisper via bindings.
- watsonx / Granite: enterprise models with governance; vision + LLM under one contract.
See extended table: referencia/tecnologias-comparadas.md.
9. Connection with industry templates
Template 07 · Telecom Copilot — STT streaming
Call audio → io.stt (Deepgram) → model.intent → query.rewrite → retrieval.router → io.panel
- STT feeds partial fragments; model.intent discards greetings.
- Target latency: < 1.5 s from end of phrase.
- See examples/07-telecom-callcenter-copilot/README.md.
Template 04 · Insurance — vision in claims
io.batch → loader.multimodal (photos + policy) → model.vision → logic.rules → logic.structured
- Damage photos → description → deductible rules → JSON with clausula_aplicada.
- See examples/04-insurance-claims/README.md.
Template 08 · Manufacturing MRO — multimodal AMM
model.vision → loader.multimodal (sectionScheme: ATA) → retrieval.vector (hardFilters) → hitl.escalate
- AMM diagrams → text; filters by aircraft_type and ata_chapter.
- WARNING/CAUTION → escalation to inspector.
- See examples/08-manufacturing-maintenance-rag/README.md.
10. Layer ③ explained: Whisper, vision, and generation from scratch
Prerequisite: you have implemented layer ② of the workshop (lab/solucion_scratch.py) or understand each mock piece. Read this section in full before writing lab/solucion_framework.py.
Environment: the study machine has no pip or network. The goal is that, once you can run pip install openai-whisper openai langchain langchain-openai (plus langchain-community and faiss-cpu for the FAISS retriever in §10.4) and have API keys, you can write the framework solution yourself.
10.1 Bridge table: scratch mocks → real APIs
| What you simulated with mocks (layer ②) | Real API (layer ③) | Where in the lab |
|---|---|---|
| Read fixed transcript from audio_notificacion.json | whisper.load_model("base").transcribe(audio_path) | transcribe_with_whisper() |
| Read fixed description from foto_fuga.json | client.chat.completions.create with base64 image (GPT-4o) | describe_image_with_vision() |
| BoW + cosine on amm_chunks.json | FAISS + OpenAIEmbeddings + retriever with filter | build_retriever() |
| Deterministic template generate_answer() | ChatOpenAI + prompt with mandatory citations | generate_with_llm() |
| None (not implemented in scratch) | client.audio.speech.create (TTS) | synthesize_tts() |
| None (not implemented in scratch) | client.images.generate (DALL·E 3) | generate_image_dalle() |
Mental model: in scratch, mock JSON is the STT/vision output. In production you replace only those two functions; retrieval and generation can migrate gradually (real embeddings first, then real LLM).
10.2 Whisper from scratch — transcribe_with_whisper
Installation and model loading
pip install openai-whisper
# Requires ffmpeg on the system: brew install ffmpeg (macOS)
import whisper
model = whisper.load_model("base") # tiny|base|small|medium|large
result = model.transcribe(
    "nota_tecnico.wav",
    language="es",  # language hint improves accuracy
    fp16=False,     # required on CPU/MPS
)
text = result["text"]
segments = result["segments"] # [{start, end, text}, ...]
What it does internally (simplified):
- Converts audio to spectrogram (mel bins).
- Transformer encoder processes the audio.
- Decoder generates text tokens autoregressively.
- Returns text + timestamps per segment.
Choosing model size
| Model | Params | VRAM ~ | ES quality | CPU speed |
|---|---|---|---|---|
| tiny | 39M | 1 GB | Basic | Fast |
| base | 74M | 1 GB | Good | Medium |
| small | 244M | 2 GB | Very good | Slow |
| medium | 769M | 5 GB | Excellent | Very slow |
For technician notes on the ramp (noise, jargon), small is the minimum recommended for production batch.
Whisper gotchas
- Do not confuse with streaming io.stt: Whisper processes files; for call center use Deepgram.
- language="es" reduces errors vs automatic detection on short audio.
- Hallucinations in silence: Whisper can invent text on empty audio; validate no_speech_prob in segments.
- First run downloads the model (~150 MB for base); plan Docker cache.
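The silence gotcha can be handled with a small post-filter on the segments. This is a sketch under assumptions: the function name is hypothetical, the sample segments are invented, and the 0.6 cut-off mirrors Whisper's default no_speech_threshold but should be tuned on your own recordings.

```python
def drop_silent_segments(segments: list[dict], max_no_speech_prob: float = 0.6) -> list[dict]:
    """Keep only segments that Whisper considers likely to contain speech."""
    return [
        seg for seg in segments
        if seg.get("no_speech_prob", 0.0) <= max_no_speech_prob
    ]


# Invented example in the shape of result["segments"] (only the fields used here).
segments = [
    {"start": 0.0, "end": 2.1, "text": " Fuga en el actuador del tren principal.",
     "no_speech_prob": 0.02},
    {"start": 2.1, "end": 6.0, "text": " Gracias por ver el video.",
     "no_speech_prob": 0.91},  # typical invented text over silence
]

kept = drop_silent_segments(segments)
clean_text = "".join(seg["text"] for seg in kept).strip()
print(clean_text)
```

Rebuilding the text from the kept segments, instead of reading result["text"], keeps the invented sentence out of the query sent to retrieval.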
10.3 Vision from scratch — describe_image_with_vision
OpenAI GPT-4o pattern (used by solucion_framework.py)
import base64
import json
from openai import OpenAI
client = OpenAI()  # reads OPENAI_API_KEY from the environment
# Encode the photo as base64 so it can be sent inside a data URL.
with open("foto_fuga.jpg", "rb") as f:
    b64 = base64.b64encode(f.read()).decode()
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe daño aeronáutico. Responde JSON."},
            {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
        ],
    }],
    response_format={"type": "json_object"},  # JSON mode; the prompt must mention JSON
    temperature=0.0,
)
vision = json.loads(response.choices[0].message.content)
LangChain pattern (equivalent)
from langchain_openai import ChatOpenAI
from langchain_core.messages import HumanMessage
llm = ChatOpenAI(model="gpt-4o", temperature=0)
msg = HumanMessage(content=[
{"type": "text", "text": "Describe..."},
{"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
])
result = llm.invoke([msg])
Local pattern — LLaVA via Ollama
ollama pull llava
ollama run llava "Describe esta imagen de tren de aterrizaje: ./foto.jpg"
Vision gotchas
- Resolution: resize to max 2048px — huge images increase cost without gaining detail.
- JSON mode (response_format) reduces parsing errors in ATA metadata.
- Damage hallucination: in insurance/MRO, combine with logic.citations on documents, not just the visual description.
- PII: photos may have license plates, faces; consider prior blur.
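JSON mode yields syntactically valid JSON, but it does not guarantee that the fields you need are present. The sketch below validates the model output before it reaches retrieval. The required keys are the ones this module asks the vision prompt to return; the function name and error messages are hypothetical.

```python
import json

REQUIRED_KEYS = ("aircraft_type", "ata_chapter", "severity_hint")


def parse_vision_json(raw: str) -> dict:
    """Parse the vision model output and fail loudly if metadata is missing."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"vision output is not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("vision output must be a JSON object")
    missing = [key for key in REQUIRED_KEYS if key not in data]
    if missing:
        raise ValueError(f"vision output is missing keys: {missing}")
    return data


good = '{"aircraft_type": "A320", "ata_chapter": "32", "severity_hint": "leak"}'
parsed = parse_vision_json(good)
print(parsed["ata_chapter"])

try:
    parse_vision_json('{"aircraft_type": "A320"}')
except ValueError as err:
    print(f"rejected: {err}")
```

Rejecting incomplete metadata early is safer than letting a missing ata_chapter silently disable a hard filter downstream.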
10.4 Retrieval and generation — LangChain over AMM chunks
Scratch uses BoW; in framework:
from langchain_core.documents import Document
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import FAISS
docs = [Document(page_content=c["text"], metadata=c["metadata"]) for c in chunks]
store = FAISS.from_documents(docs, OpenAIEmbeddings(model="text-embedding-3-small"))
retriever = store.as_retriever(search_kwargs={"k": 2, "filter": {"aircraft_type": "A320"}})
Generation with citations:
from langchain_openai import ChatOpenAI
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0.0)
# Prompt: "Responde SOLO con evidencia del AMM. Incluye citations: [...]"
10.5 TTS and DALL·E — output generation
TTS (OpenAI)
response = client.audio.speech.create(
    model="tts-1",  # or tts-1-hd for higher quality
    voice="nova",   # alloy, echo, fable, onyx, nova, shimmer
    input="Procedimiento AMM 32-11-00: inspeccionar actuador...",
)
response.stream_to_file("respuesta.mp3")
When to use / NOT:
- ✅ Voice bot, IVR, accessibility.
- ❌ Copilot for human agent (template 07) — the agent reads the panel, no audio needed.
DALL·E 3
result = client.images.generate(
model="dall-e-3",
prompt="Diagrama técnico esquemático de actuador MLG A320, estilo manual mantenimiento",
size="1024x1024",
quality="standard",
)
url = result.data[0].url
When to use / NOT:
- ✅ Training material, conceptual illustration.
- ❌ Regulatory evidence — never replaces a real photo or cited AMM document.
10.6 Block-by-block walkthrough of lab/solucion_framework.py
Open lab/solucion_framework.py and follow this map:
Block 1 — Chunk loading (lines ~20–30)
Identical to scratch: amm_chunks.json feeds the retriever. No surprises.
Block 2 — transcribe_with_whisper (lines ~35–60)
Scratch bridge: transcribe_audio() reads JSON → here whisper.transcribe() reads .wav.
Detail: fp16=False on CPU; language="es" for notes in Spanish.
Block 3 — describe_image_with_vision (lines ~65–110)
Scratch bridge: describe_image() reads JSON → here GPT-4o on base64.
Detail: response_format={"type": "json_object"} for structured ATA metadata.
Block 4 — build_retriever + retrieve_with_langchain (lines ~115–155)
Scratch bridge: retrieve_chunks() BoW → here FAISS with OpenAI embeddings.
Detail: filter {"aircraft_type": "A320", "ata_chapter": "32"} replicates scratch hard-filters.
Block 5 — generate_with_llm (lines ~160–195)
Scratch bridge: generate_answer() template → here ChatOpenAI with citations prompt.
Block 6 — TTS and DALL·E (lines ~200–240)
No scratch equivalent — they are optional multimodal output.
Block 7 — run_multimodal_pipeline_real (lines ~245–290)
Orchestrates everything with use_mocks_if_missing=True: if there is no .wav/.jpg, falls back to lab JSON (useful for testing without recording real audio).
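The fallback in Block 7 could look roughly like the sketch below. It is not the lab's actual code: the helper name, the injected transcribe callable, and the "text" key in the mock JSON are assumptions made for illustration.

```python
import json
import tempfile
from pathlib import Path


def transcribe_or_mock(audio_path, mock_path, transcribe, use_mocks_if_missing=True):
    """Use the real transcriber when the audio exists, else fall back to mock JSON."""
    audio = Path(audio_path)
    if audio.exists():
        return transcribe(str(audio))
    if use_mocks_if_missing:
        # Assumed mock shape: {"text": "..."}; adapt to the lab's real JSON.
        mock = json.loads(Path(mock_path).read_text(encoding="utf-8"))
        return mock["text"]
    raise FileNotFoundError(f"audio file not found: {audio}")


with tempfile.TemporaryDirectory() as tmp:
    mock_file = Path(tmp) / "audio_notificacion.json"
    mock_file.write_text(json.dumps({"text": "mock transcript"}), encoding="utf-8")
    missing_audio = Path(tmp) / "nota_tecnico.wav"  # deliberately never created
    result = transcribe_or_mock(
        missing_audio, mock_file, transcribe=lambda path: "real transcript"
    )

print(result)
```

Passing the transcriber in as an argument keeps the fallback testable on a machine without Whisper installed.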
10.7 When to use each technology — final gotchas
| Situation | Use | Avoid |
|---|---|---|
| Real-time call center | Deepgram (io.stt) | Whisper batch |
| Voice notes in overnight batch | Local Whisper | Deepgram (cost/min) |
| Claim photos with JSON | GPT-4o + response_format | LLaVA if you need damage precision |
| 1000+ page technical manuals | loader.multimodal at ingestion | Vision on every query |
| Full spoken bot | STT + LLM + TTS | Text only if channel is chat |
| Regulatory evidence | Vision→text→RAG with citations | DALL·E as "proof" |
Production gotchas:
- Merge signals before retrieving — do not run two separate RAG passes (voice vs image); combine transcript + description into one query.
- Image description cache — the same photo should not go through vision twice; store hash→description.
- Chain timeouts — STT + vision + RAG can exceed 10 s; parallelize STT and vision if they are independent.
- Audio formats — normalize to 16 kHz mono WAV before Whisper; avoid exotic codecs.
- Vision cost in batch ingestion — template 04 with 500 photos/day × $0.02 ≈ $10/day on vision alone.
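The description cache from the list above can be as small as a dictionary keyed by a content hash. The class name and the injected describe callable below are hypothetical; a production version would persist the mapping instead of keeping it in memory.

```python
import hashlib


class DescriptionCache:
    """Maps sha256(image bytes) to a description so each photo is described once."""

    def __init__(self, describe):
        self._describe = describe  # callable that performs the real vision call
        self._store = {}
        self.misses = 0            # number of times the vision call actually ran

    def get(self, image_bytes: bytes) -> str:
        key = hashlib.sha256(image_bytes).hexdigest()
        if key not in self._store:
            self.misses += 1
            self._store[key] = self._describe(image_bytes)
        return self._store[key]


def fake_vision(image_bytes: bytes) -> str:
    """Stand-in for a paid vision call."""
    return f"description of {len(image_bytes)} bytes"


cache = DescriptionCache(fake_vision)
first = cache.get(b"same-photo")
second = cache.get(b"same-photo")   # served from the cache, no second call
other = cache.get(b"another-photo")
print(first, cache.misses)
```

Hashing the bytes rather than the file name means a re-uploaded or renamed copy of the same photo still hits the cache.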
10.8 Checklist before writing your solucion_framework.py
- Does transcribe_with_whisper use language="es" and fp16=False on CPU?
- Does describe_image_with_vision request JSON with aircraft_type, ata_chapter, severity_hint?
- Does the FAISS retriever apply filter aircraft_type + ata_chapter?
- Does the generation prompt require non-empty citations or "no determinable"?
- Does run_multimodal_pipeline_real degrade to mocks if binary files are missing?
- (Optional) Does synthesize_tts generate audio for the response?
- (Optional) Do you understand why DALL·E does not replace real photographic evidence?
Next step: open lab/enunciado.md (Part B) and try to write the file yourself before looking at solucion_framework.py.
11. RAGorbit nodes in this module
io.stt
Input port: Audio (stream)
Output port: Message (text fragments)
Config: provider: deepgram, language: es. See template 07.
model.vision
Output port: Model → (connects to loader.multimodal or downstream logic)
Config: model: anthropic:claude-opus-4-8 or openai:gpt-4o.
loader.multimodal
Ports: Documents → Documents (enriched)
Model input: model.vision (optional)
Config: extractTables: true, describeImages: true, sectionScheme: ATA.
12. Checkpoint — You know it if you can…
- Explain why RAGorbit converts audio/image to text before indexing (in the dominant pattern).
- Differentiate STT streaming (io.stt/Deepgram) vs STT batch (local Whisper).
- Describe what model.vision adds in templates 04 and 08.
- Justify when to use vision→text vs direct multimodal embeddings.
- Name alternatives to Whisper, GPT-4o, and DALL·E with one trade-off each.
- Implement a scratch pipeline that merges mock transcript + mock description and retrieves with hard filters.
- Map each scratch mock function to its real API (table §10.1).
- Explain why citations remains mandatory even when input is multimodal.
- Read template 07 flow.json and identify where io.stt enters.
- Read template 08 flow.json and identify model.vision + loader.multimodal.
If you cannot: review §3–§5 (STT and vision), §10 (framework), and lab/enunciado.md. Compare the three anchor templates in examples/.