Landscape: processes
Production process landscape — orchestration, serving, data, and deployment
RAGorbit course reference. Vendor-neutral map of the process ecosystem around a RAG or agentic system in production. Complements (does not replace) Module 9: Production & Security, which already covered the four RAGorbit deploymentTarget values, guardrails, observability, and basic io.* nodes.
Audience: Python programmers who completed M9 and want to know the full market of orchestration, serving/inference, data pipelines, and deployment, not just Temporal, Kafka, and FastAPI.
Prerequisites: M9 §5–§6 (IO and deployment targets), tecnologias-comparadas.md §14 and §12 (observability).
Introduction: a production AI system is more than the LLM
In M6–M8 you built the cognitive core: retrieval, ReAct agents, tools, and MCP. In M9 you learned to wrap that core with guardrails, audit, and the correct input node (io.input, io.event-source, io.trigger, io.batch). But in a real company, four process layers coexist around that graph, and rarely does a single framework cover them all:
┌─────────────────────────────────────────────────────────────────────────────┐
│ LAYER 4 — DEPLOYMENT / RUNTIME │
│ Docker, K8s, Lambda, Cloud Run, Modal, LLM gateways (LiteLLM, Portkey…) │
├─────────────────────────────────────────────────────────────────────────────┤
│ LAYER 3 — DATA PIPELINES / INGESTION │
│ Kafka, Spark, dbt, Flink, CDC, vector store reindexing │
├─────────────────────────────────────────────────────────────────────────────┤
│ LAYER 2 — MODEL SERVING / INFERENCE │
│ vLLM, TGI, managed APIs, TEI (embeddings), batching, quantization │
├─────────────────────────────────────────────────────────────────────────────┤
│ LAYER 1 — WORKFLOW / DURABLE AGENT ORCHESTRATION │
│ Temporal, Prefect, Airflow, queues + DB state, cron + batch │
├─────────────────────────────────────────────────────────────────────────────┤
│ RAGORBIT CORE — graph: retrieval + agent + guardrails + observability │
└─────────────────────────────────────────────────────────────────────────────┘
Mental model: the RAGorbit graph answers "what does the agent do with each request". The four layers answer "how the request arrives, where the model lives, where chunks come from, and on which machine everything runs".
Summary table of the four layers
| Layer | Question it answers | Representative examples | Typical RAGorbit node |
|---|---|---|---|
| 1. Orchestration | How do I coordinate steps that last seconds, hours, or days? | Temporal, Prefect, Kafka + Postgres | io.trigger, io.event-source |
| 2. Serving / inference | Where do I run the LLM and embeddings? | vLLM, Bedrock, TEI, Groq | model.llm, model.embedding |
| 3. Data pipelines | How do I index and update knowledge? | Spark, dbt, Flink, cron batch | io.batch, loader.*, ingest.* |
| 4. Deployment / runtime | On what infrastructure does the service run? | K8s, Cloud Run, LiteLLM gateway | deploymentTarget of io.* node |
Timing note (2025/2026): the LLM serving and gateway market evolves every quarter. The principles in this document (durability vs latency, batch vs streaming, self-host vs API) are stable; concrete names and benchmarks change — verify official documentation before deciding.
1. Workflow / durable agent orchestration
Orchestration answers: "who executes which step, when, with what retries, and what happens if the server restarts mid-flow?"
In M9 you already saw Temporal (io.trigger) vs Kafka + DB state (io.event-source) and cron + batch (io.batch). This section expands the full market catalog.
1.1 Comparative map of orchestrators
| Tool | Mental model | Durability | Best for | When NOT |
|---|---|---|---|---|
| Temporal | Code as workflow; durable event history | High — survives restarts, days/weeks | Long processes, HITL, compensations (sagas), durable cron | Second-scale events, massive stateless volume |
| Prefect | Python flows with @flow / @task; modern UI | Medium-high (with Prefect Cloud or server) | Data + ML pipelines, Python-first teams | Workflows with complex human signals (Temporal wins) |
| Dagster | Assets as unit (tables, models, indexes) | Medium-high | Data pipelines with lineage, RAG index as asset | Real-time chat, sub-second latency |
| Apache Airflow | DAGs defined in Python code; central scheduler | Medium (retries per task) | Classic ETL, nightly jobs, batch dependencies | Streaming, interactive agents, low latency |
| Flyte | Typed workflows on K8s; reproducibility | High on K8s | ML training + batch inference at scale | Teams without K8s, fast prototypes |
| Argo Workflows | Native K8s DAGs (CRD) | High on K8s | Containerized pipelines in existing cluster | Outside K8s, teams without cluster ops |
| Kestra | Declarative orchestrator (YAML); UI + plugins | Medium-high | Teams preferring YAML over Python code | Very dynamic agent logic at runtime |
| Queue + workers (Celery, RQ, Kafka consumer) | Message → stateless worker → DB state | Medium (depends on idempotency) | High throughput, fan-out, processing < minutes | Multi-day processes without orchestrator on top |
| Cron + script | crontab / K8s CronJob | Low | Nightly indexing, short idempotent jobs | Anything with HITL or complex state |
1.2 Per-tool sheet (catalog style)
Temporal
What it does: Durable workflow engine. Workflow code re-executes from event history after a failure; activities (external calls) have native retry, timeout, and compensation. Supports day-long timers, human signals (signal), and cron.
When to use: Multi-day banking onboarding, medical approvals with waits, any flow where hitl.escalate implies hour-long pauses and the process must survive deploys.
When NOT to use: Fan-out of 50,000 events/hour with processing < 30 s (template 10 — use io.event-source). A nightly reindex cron (use io.batch).
Alternatives: AWS Step Functions (AWS vendor lock-in), Prefect with manual pauses (less robust for weeks), Cadence (Temporal open-source predecessor).
RAGorbit: io.trigger → deploymentTarget: temporal.
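The phrase "workflow code re-executes from event history" can be illustrated without Temporal itself. The following is a toy, in-memory sketch of the replay idea, not the Temporal SDK: `run_workflow`, `history`, `crash_after`, and the activity names are all hypothetical, and a real engine persists the history in a database and handles timers, signals, and retries.

```python
from typing import Callable, Optional

def run_workflow(history: list[tuple[str, str]],
                 activities: dict[str, Callable[[], str]],
                 crash_after: Optional[int] = None) -> list[str]:
    """Run activities in order, recording each result in `history`.

    On a re-run, steps already present in `history` are replayed from the
    recorded result instead of calling the activity (side effect) again.
    """
    results = []
    for step, (name, activity) in enumerate(activities.items()):
        if step < len(history):
            # Replay: reuse the recorded result, do not repeat the side effect.
            recorded_name, recorded_result = history[step]
            assert recorded_name == name, "workflow code must be deterministic"
            results.append(recorded_result)
            continue
        if crash_after is not None and step >= crash_after:
            # Simulate the worker dying mid-flow (deploy, restart, OOM).
            raise RuntimeError(f"worker crashed before {name}")
        result = activity()
        history.append((name, result))
        results.append(result)
    return results

calls: list[str] = []

def make_activity(name: str) -> Callable[[], str]:
    def activity() -> str:
        calls.append(name)  # stands in for an external call (API, email, DB)
        return f"{name}:ok"
    return activity

activities = {n: make_activity(n)
              for n in ("verify_identity", "score_risk", "open_account")}
history: list[tuple[str, str]] = []

try:
    run_workflow(history, activities, crash_after=2)  # dies before step 3
except RuntimeError:
    pass

final = run_workflow(history, activities)  # resumes after the "restart"
print(calls)
```

In this sketch each activity runs exactly once across both executions, because the second run replays the first two steps from `history` and only executes the missing one.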
Prefect
What it does: Python-first orchestrator. You define flows with decorators; Prefect manages scheduling, retries, logging, and an execution UI. Natural integration with ML pipelines and ingest tasks.
When to use: Reindex the vector store nightly with clear steps (load → chunk → embed → upsert), scheduled RAGAS evaluations, data sync from APIs.
When NOT to use: Workflows with interactive human signals lasting days (Temporal). Real-time event streaming.
Alternatives: Dagster (if data lineage is priority), Airflow (if you already have it in the company), simple cron (if the pipeline fits in one script).
Dagster
What it does: Orchestrator centered on software-defined assets — each table, index, or model is an asset with explicit dependencies. Excellent lineage: "this chunk in pgvector comes from this PDF processed yesterday".
When to use: Data teams already thinking in pipeline terms; RAG as a versioned asset (vector_index_v3 depends on raw_documents_v3).
When NOT to use: Orchestrating a conversational agent at runtime. Interactive latency.
Alternatives: Prefect (simpler for ML teams without data-engineering culture), dbt (SQL transformations only, not general orchestration).
Apache Airflow
What it does: De facto batch ETL standard since 2015. DAGs with operators (PythonOperator, BashOperator, sensors). Central scheduler firing tasks by dependencies and cron.
When to use: Companies with Airflow already operated; massive nightly ingest jobs; data warehouse sync before reindexing.
When NOT to use: Anti-pattern: Airflow for chat or low-latency events — the scheduler is not designed for that. Agents with conversational state.
Alternatives: Prefect/Dagster (modern DX), cron + Docker (if the DAG has 2 steps).
Flyte
What it does: Typed workflow platform on Kubernetes. Each task is a container; typed inputs/outputs; strong caching and reproducibility. Widely used in ML at scale (Lyft, Spotify).
When to use: Training/fine-tuning + batch inference on K8s; pipelines with native GPU scheduling.
When NOT to use: Teams without K8s. RAG prototypes on a laptop.
Alternatives: Argo Workflows (less typed, more flexible), Kubeflow Pipelines.
Argo Workflows
What it does: Native Kubernetes workflow engine (Custom Resource Definition). Each step is a pod; parallelism via DAG. Integrates with the K8s ecosystem (Argo CD, events).
When to use: You already have K8s and want containerized pipelines without adding another cluster (Temporal, Prefect server).
When NOT to use: Pure serverless environments or VMs without K8s.
Alternatives: Flyte (more ML structure), Tekton (CI/CD more than data).
Kestra
What it does: Declarative orchestrator (YAML) with UI, plugins, and distributed execution. Less embedded Python code than Prefect; more configurable than cron.
When to use: Teams preferring YAML/GitOps for pipelines; plug-and-play integrations (S3, databases, notifications).
When NOT to use: Very dynamic agent logic (the agent graph changes according to the LLM at runtime — better Python code + Temporal or LangGraph).
Alternatives: Prefect (Python-first), Airflow (more mature ecosystem).
1.3 Durability vs simplicity — the spectrum
SIMPLICITY ──────────────────────────────────────────────▶ DURABILITY / COMPLEXITY
cron + script Celery/RQ + Redis Kafka + Postgres Prefect/Dagster Temporal
│ │ │ │ │
io.batch async tasks io.event-source pipelines io.trigger
nightly emails, jobs massive fan-out data assets multi-day HITL
indexing short template 10 reindexing onboarding
1.4 When is queue + DB state enough vs an orchestrator?
| Signal | Queue + DB (Celery, RQ, Kafka) | Orchestrator (Temporal, Prefect…) |
|---|---|---|
| Maximum flow duration | Seconds – few minutes | Hours – weeks |
| Intermediate human steps | Manual polling or pending_approval table | Native signals (Temporal) or pauses (Prefect) |
| Compensation / saga | Manual (hard to maintain) | Native (Temporal) |
| Volume (events/hour) | High — horizontal workers | Medium — one workflow per business instance |
| Ops complexity | Low–medium (you already have Kafka) | High (additional Temporal cluster) |
| RAGorbit example | Template 10 (logistics) | Banking onboarding with io.trigger |
Course practical rule (expanded from §14):
- Real-time chat → FastAPI (io.input), no orchestrator.
- Massive stateless events → Kafka (io.event-source), no Temporal.
- Long-running processes with humans → Temporal (io.trigger).
- Scheduled reindexing → cron, Prefect, Dagster, or Airflow (io.batch as graph origin).
2. LLM serving / inference
Serving answers: "where does the model run, how do I serve tokens with low latency and high throughput, and when do I pay API vs own GPU?"
RAGorbit model.llm and model.embedding nodes consume an inference endpoint; this layer is independent of the agent framework.
2.1 Self-hosted — open source inference engines
| Engine | What it does | Strength | Limitation | When to choose |
|---|---|---|---|---|
| vLLM | LLM server with PagedAttention, continuous batching, OpenAI-compatible API | Very high GPU throughput; de facto OSS standard in 2025 for production | Most mature on NVIDIA GPUs (support on other backends varies); non-trivial ops | High QPS, 7B–70B models, cost control at scale |
| TGI (Text Generation Inference) | Hugging Face server; optimized for transformers | Native HF integration, GPTQ/AWQ quantization | Less flexible than vLLM on some models | Hugging Face stack, deployment on HF Inference Endpoints or self-host |
| SGLang | Runtime with radix attention, aggressive batching | Throughput competitive with vLLM in recent benchmarks | Younger ecosystem (2024–2025) | Performance experimentation; multi-turn with prefix cache |
| llama.cpp | CPU/GPU inference in C++; GGUF | Runs on laptop, Apple Silicon, without datacenter GPU | Lower throughput than vLLM in cluster | Local development, edge, offline demos |
| Ollama | Friendly wrapper over llama.cpp (+ more backends) | ollama run llama3: zero friction locally | Not designed for multi-tenant production | Development, POCs, teams without GPU ops |
| TensorRT-LLM | NVIDIA optimization (kernels, FP8, inflight batching) | Maximum performance on NVIDIA hardware | NVIDIA lock-in; compilation curve | Demanding production on NVIDIA GPU at scale |
| LMDeploy | OpenMMLab serving; TurboMind quantization | Good balance on Chinese/alternative GPUs | Smaller community than vLLM outside Asia | Environments with hardware restrictions |
| Ray Serve | General serving framework (not LLM-only) | Composes LLM + preprocessing + postprocessing in one deployment | More pieces (Ray cluster) | Mixed ML pipelines: embed + rerank + LLM in one service |
| BentoML | Packages models as containerized APIs | Simple DX to deploy any model | Extra layer over vLLM/TGI | Teams wanting model CI/CD without writing FastAPI by hand |
2.2 Managed APIs (hosted inference)
| Provider | Consumption model | Strength | When NOT |
|---|---|---|---|
| OpenAI / Azure OpenAI | Pay-per-token | Quality, mature tool calling, enterprise SLA | Cost at scale; sensitive data without BAA/DPA contract |
| Amazon Bedrock | Pay-per-token + varied models (Claude, Llama, Titan) | AWS integration, native guardrails, VPC | AWS lock-in; variable latency by region |
| Google Vertex AI | Pay-per-token + Gemini + open models | GCP integration, native grounding | GCP lock-in |
| Anthropic direct | Claude API | Long reasoning, context window | No self-host of proprietary model |
| Together AI / Fireworks | API over open-weights models | Llama/Mistral without managing GPU | Less control than self-host |
| Groq | LPU inference — ultra-low latency | Very low TTFT on supported models | Limited catalog; not self-host |
When managed API: prototype, team without GPU ops, unpredictable spikes, compliance already resolved with provider.
When self-host: > 1M tokens/day sustained, data that cannot leave VPC, predictable latency on own fine-tuned model.
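Where the self-host threshold actually falls depends on your own prices. A rough break-even sketch follows, assuming a flat GPU rental rate and a flat blended API price per million tokens. The function name and the example numbers are placeholders for illustration, not market prices, and the calculation ignores engineering and ops time, as well as whether the GPU can actually serve that volume.

```python
def breakeven_tokens_per_day(gpu_cost_per_hour: float,
                             api_price_per_million_tokens: float,
                             gpus: int = 1) -> float:
    """Daily token volume at which always-on GPUs cost the same as the API.

    Below this volume the managed API is cheaper; above it, self-hosting
    starts to pay off (hardware cost only).
    """
    daily_gpu_cost = gpu_cost_per_hour * 24 * gpus
    return daily_gpu_cost / api_price_per_million_tokens * 1_000_000

# Placeholder prices: 2.00 USD per GPU-hour, 4.00 USD per million tokens.
threshold = breakeven_tokens_per_day(gpu_cost_per_hour=2.0,
                                     api_price_per_million_tokens=4.0)
print(f"break-even at about {threshold:,.0f} tokens/day")
```

Plug in your real invoice numbers before drawing conclusions; the point of the sketch is only that the threshold is a ratio of two prices, not a universal constant.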
2.3 Embeddings serving — TEI and alternatives
Embeddings feed model.embedding → store.*. Unlike the generative LLM, embedding serving is cheaper and is usually the bottleneck at ingest, not in chat.
| Tool | What it does | When to use |
|---|---|---|
| TEI (Text Embeddings Inference) | Hugging Face server optimized for embedding models (sentence-transformers) | Self-host embeddings in ingest batch; OpenAI-compatible API |
| vLLM / TGI | Also serve some embedding models | If you already have the cluster and want a single stack |
| Provider API (OpenAI text-embedding-3-*, Cohere, Voyage) | Zero ops | Prototypes, low volumes, no GPU |
| Sentence Transformers local | model.encode() in process | Small batch scripts (io.batch), development |
Throughput intuition: in nightly ingest, embedding is usually processed in batch (hundreds of chunks per call). In chat, it is usually 1 query → 1 vector — latency matters more than throughput.
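The batch side of that intuition can be sketched as a small helper that groups chunks before each embedding call. This is a minimal illustration: `embed_all` and `fake_embed_batch` are hypothetical names, and the fake embedder stands in for whatever backend you use (TEI, a provider API, or a local model).

```python
from typing import Callable, Iterator, Sequence

def batched(items: Sequence[str], batch_size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most `batch_size` items."""
    if batch_size < 1:
        raise ValueError("batch_size must be >= 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

def embed_all(chunks: Sequence[str],
              embed_batch: Callable[[Sequence[str]], list[list[float]]],
              batch_size: int = 256) -> list[list[float]]:
    """Embed every chunk using one backend call per batch."""
    vectors: list[list[float]] = []
    for batch in batched(chunks, batch_size):
        vectors.extend(embed_batch(batch))
    return vectors

call_sizes: list[int] = []

def fake_embed_batch(batch: Sequence[str]) -> list[list[float]]:
    # Placeholder embedder: records the batch size, returns 1-d "vectors".
    call_sizes.append(len(batch))
    return [[float(len(text))] for text in batch]

chunks = [f"chunk {i}" for i in range(1000)]
vectors = embed_all(chunks, fake_embed_batch, batch_size=256)
print(len(vectors), call_sizes)
```

With 1,000 chunks and a batch size of 256, the backend is called four times instead of a thousand, which is where ingest throughput comes from.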
2.4 Key concepts (intuition, not benchmarks)
| Concept | What it means for your RAG system |
|---|---|
| Continuous batching | Server groups concurrent requests in one GPU pass → more tokens/second, slight P95 latency increase |
| Quantization (INT8, INT4, GPTQ, AWQ) | Smaller model in memory → fits on cheaper GPU; may degrade quality on fine tasks |
| TTFT (time to first token) | Critical in streaming chat (io.output with SSE) — Groq and optimized vLLM compete here |
| TPOT (time per output token) | Critical in long responses (reports, extended JSON) |
| Prefix caching / KV cache | Reuse repeated context (system prompt, fixed documents) — saves cost in multi-turn |
Honesty 2025/2026: benchmarks published by each vendor are hard to compare (different hardware, model, batch size). Profile with your real prompt and your hardware before committing.
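When profiling with your own prompts, TTFT and TPOT can be derived from the arrival time of each streamed token. The sketch below assumes you already collect those timestamps (for example with `time.perf_counter()` in your streaming client); `latency_metrics` is a hypothetical helper name, and TPOT is computed here as the mean gap between consecutive tokens.

```python
def latency_metrics(request_sent: float, token_times: list[float]) -> dict[str, float]:
    """Compute TTFT and mean TPOT (both in seconds) from token arrival times."""
    if not token_times:
        raise ValueError("no tokens were received")
    ttft = token_times[0] - request_sent
    if len(token_times) > 1:
        # Mean gap between consecutive tokens after the first one.
        tpot = (token_times[-1] - token_times[0]) / (len(token_times) - 1)
    else:
        tpot = 0.0
    return {"ttft_s": ttft, "tpot_s": tpot}

# Synthetic timestamps: request at t=0, four tokens arriving afterwards.
metrics = latency_metrics(0.0, [0.5, 0.75, 1.0, 1.25])
print(metrics)
```

Collect these over many real requests and look at percentiles, not a single run.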
2.5 How to choose serving for RAGorbit
Can data leave your VPC?
NO → self-host (vLLM/TGI/TEI) or Azure OpenAI/Bedrock in VPC
YES → Sustained volume > economic threshold?
NO → managed API (OpenAI, Anthropic, Bedrock)
YES → self-host vLLM + TEI; LiteLLM gateway in front (§4)
Is it batch ingest or chat?
INGEST → TEI/vLLM with large batch; you do not need low TTFT
CHAT → vLLM/SGLang/Groq; SSE streaming; measure TTFT
3. Data pipelines / ingest at scale
This layer answers: "how do documents reach the vector store, how do I detect changes, and when do I reindex?" It is offline relative to chat — but determines RAG quality more than the LLM.
3.1 Technology map
| Tool | Paradigm | Scale | Best for | When NOT |
|---|---|---|---|---|
| Kafka | Distributed log; pub/sub; retention | Very high | Change events, audit trail, decouple ingest from indexing | Processing 100 PDFs/day |
| Apache Spark | Distributed batch/streaming processing | Massive (TB+) | Chunking + embedding millions of docs, heavy ETL | < 10 GB of documents |
| Ray Data | Distributed dataset on Ray cluster | High | Parallel ML pipelines (embed, transform) integrated with Ray Serve | Without Ray cluster |
| dbt | Versioned SQL transformations | Warehouse | Enrich tabular metadata before hybrid RAG | Binary PDF ingest |
| Apache Flink | Stream processing with state | High (streaming) | Near-real-time CDC → incremental reindex | Simple nightly batch |
| Apache Beam | Unified batch + streaming API (runner: Dataflow, Flink, Spark) | Variable | Portability across clouds | Small team without portability need |
| Cron + Python | Sequential script | Low–medium | Templates 02, 04 (io.batch) | TB of data, freshness SLA < 1 h |
3.2 Batch vs streaming
| Dimension | Batch (nightly, io.batch) | Streaming (Kafka + Flink) |
|---|---|---|
| Index freshness | Hours (acceptable for HR policies, manuals) | Minutes (prices, inventory, news) |
| Complexity | Low | High |
| Cost | Minimum | Continuous infra |
| Idempotency | Re-run entire job | Upsert per document/event |
| Typical orchestration | cron, Prefect, Airflow | Kafka → consumer → embed → upsert; Flink for aggregations |
3.3 CDC (Change Data Capture) and reindexing
CDC captures changes in operational databases (PostgreSQL, MySQL) and publishes them as events — ideal for RAG over data that changes without re-reading the entire warehouse.
PostgreSQL ──(Debezium/Logical Replication)──▶ Kafka topic "doc-changes"
│
▼
consumer: delete old vectors
+ embed new version
+ upsert pgvector
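The consumer step in the diagram (delete old vectors, embed the new version, upsert) can be sketched as an idempotent handler. Everything here is an assumption made for illustration: the event shape (`op`, `doc_id`, `text`) is simplified and is not Debezium's real envelope, `InMemoryVectorStore` stands in for pgvector, and `fake_embed` replaces a real embedding call.

```python
from dataclasses import dataclass, field

@dataclass
class InMemoryVectorStore:
    """Stand-in for pgvector: maps (doc_id, chunk_index) to a vector."""
    rows: dict[tuple[str, int], list[float]] = field(default_factory=dict)

    def delete_document(self, doc_id: str) -> None:
        self.rows = {k: v for k, v in self.rows.items() if k[0] != doc_id}

    def upsert(self, doc_id: str, chunk_index: int, vector: list[float]) -> None:
        self.rows[(doc_id, chunk_index)] = vector

def fake_embed(text: str) -> list[float]:
    # Placeholder embedding: a real consumer would call the embedding service.
    return [float(len(text))]

def handle_change(event: dict, store: InMemoryVectorStore) -> None:
    """Apply one change event; replaying the same event leaves the same state."""
    doc_id = event["doc_id"]
    store.delete_document(doc_id)  # drop stale chunks of the old version first
    if event["op"] == "delete":
        return
    for index, chunk in enumerate(event["text"].split("\n\n")):
        store.upsert(doc_id, index, fake_embed(chunk))

store = InMemoryVectorStore()
created = {"op": "upsert", "doc_id": "policy-1",
           "text": "First paragraph.\n\nSecond paragraph."}
edited = {"op": "upsert", "doc_id": "policy-1", "text": "Only paragraph."}

handle_change(created, store)
rows_after_create = len(store.rows)
handle_change(edited, store)
handle_change(edited, store)  # redelivery of the same event
rows_after_edit = len(store.rows)
handle_change({"op": "delete", "doc_id": "policy-1"}, store)
print(rows_after_create, rows_after_edit, len(store.rows))
```

Deleting by document ID before upserting is what keeps the handler safe under redelivery: a shorter new version cannot leave orphaned chunks from the old one.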
When to full reindex vs incremental:
| Signal | Full reindex | Incremental (CDC / delta) |
|---|---|---|
| Embedding model changed | Yes — all vectors invalid | N/A |
| Chunking strategy changed | Yes | N/A |
| New document or edited paragraph | No | Yes — upsert/delete by ID |
| < 0.1% of docs change/day | No (full reindex is overkill) | Yes |
| Source is daily snapshot (S3 dump) | Yes — batch job | Streaming overkill |
3.4 Orchestrating ingest with orchestrators (§1)
Offline ingest usually lives in layer 1, not in the chat graph:
┌─────────────────────────────────────────────────────────────┐
│ ORCHESTRATOR (Prefect / Airflow / Dagster / cron) │
│ 1. loader.* → 2. ingest.* → 3. model.embedding │
│ 4. store.* (upsert) → 5. observability (metrics) │
└─────────────────────────────────────────────────────────────┘
▲ │
│ schedule │ index ready
│ ▼
io.batch (origin) RUNTIME: io.input / io.event-source
RAGorbit: the offline pipeline shares loader, ingest, store nodes with the graph; deploymentTarget: batch (io.batch) generates a CLI/cron job. See templates 02-banking and 04-insurance.
4. Deployment / runtime / serverless
This layer answers: "on which machine does each piece run and how do I expose it securely?"
M9 covered the four RAGorbit targets (FastAPI, Kafka worker, Temporal, batch). Here we expand the infrastructure spectrum and LLM gateways.
4.1 Containers and container orchestration
| Option | What it solves | When to use |
|---|---|---|
| Docker | Reproducible packaging | Always — base of any serious deployment |
| Docker Compose | Multi-container local/staging | Development, demos, template 10 with Kafka |
| Kubernetes (K8s) | Scheduling, autoscaling, secrets, ingress | Multi-service production, GPU scheduling, > 1 team |
| Helm charts | Package K8s manifests | vLLM, TGI, TEI in cluster — community charts exist |
| Nomad / ECS / Cloud Run (K8s-lite) | Less ops than full K8s | Small teams with managed containers |
RAGorbit mapping:
| deploymentTarget | Typical runtime |
|---|---|
| chat-service | K8s Deployment + Ingress, or Cloud Run, or VM + systemd |
| event-worker | K8s Deployment (replicas = consumer group), autoscale by Kafka lag |
| temporal | Temporal workers + managed or self-host Temporal cluster |
| batch | K8s CronJob, AWS Batch, or docker compose run job |
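"Autoscale by Kafka lag" for the event-worker target usually reduces to a small piece of arithmetic. The sketch below shows one plausible rule; `desired_replicas` and `target_lag_per_replica` are hypothetical names, and in practice an autoscaler (for example KEDA's Kafka scaler) applies a rule of this kind for you. The cap at the partition count reflects that extra consumers in the same consumer group stay idle.

```python
import math

def desired_replicas(total_lag: int, target_lag_per_replica: int,
                     partitions: int, min_replicas: int = 1) -> int:
    """Replica count so each worker handles about `target_lag_per_replica` messages."""
    if target_lag_per_replica < 1:
        raise ValueError("target_lag_per_replica must be >= 1")
    wanted = math.ceil(total_lag / target_lag_per_replica)
    # More consumers than partitions would sit idle, so cap at `partitions`.
    return max(min_replicas, min(wanted, partitions))

print(desired_replicas(total_lag=5_000, target_lag_per_replica=1_000, partitions=12))
```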
4.2 Serverless and on-demand GPU
| Platform | Model | Best for | Limitation |
|---|---|---|---|
| AWS Lambda / Azure Functions | Pay-per-invocation | Light APIs, preprocessing, not heavy LLM | Timeout (15 min max), no traditional GPU |
| Google Cloud Run | Serverless container | FastAPI chat-service with scale-to-zero | Cold start; GPU Cloud Run is newer — verify region |
| Modal | Python serverless with ephemeral GPU | On-demand GPU jobs, spot fine-tuning | Vendor-specific; unpredictable cost without limits |
| RunPod / Vast.ai | Bare-metal/on-demand GPU | Self-host vLLM without capex | Manual ops; no enterprise SLA out-of-the-box |
| HF Inference Endpoints | Managed TGI/vLLM | Fast deployment of HF models | Cost; less control than own cluster |
Rule: serverless works well for the agent wrapper (FastAPI, light Kafka consumer). The heavy LLM usually goes in a dedicated service (vLLM on persistent GPU or managed API), not in Lambda.
4.3 LLM gateways and proxies
Gateways sit in front of one or more backends (model.llm) and centralize cross-cutting concerns:
| Gateway | What it does | When to use |
|---|---|---|
| LiteLLM | OpenAI-compatible proxy; routes to 100+ providers; fallback, retry, budget | Multi-provider, dev/prod with same SDK, migrate OpenAI → Azure without code change |
| OpenRouter | Model marketplace via one API | Experiment with models without contract with each vendor |
| Portkey | Gateway with observability, semantic cache, guardrails | Multi-team production; cost metrics per project |
| Kong AI Gateway | Kong API gateway extension for LLM | Companies already using Kong for REST APIs |
| Envoy + ext_proc / APISIX | Generic gateway with plugins | Unified LLM + non-AI microservices infra |
Typical gateway functions:
- Routing: GPT-4 for complex cases, Llama-8B for cheap classification.
- Fallback: if OpenAI goes down → Bedrock.
- Rate limiting / budget: USD/day cap per API key.
- Semantic cache: identical or similar response without calling the LLM (watch sensitive data).
- Unified logging: complements observability.audit and §12 observability.
Client / io.input ──▶ FastAPI ──▶ LiteLLM/Portkey ──▶ OpenAI | vLLM | Bedrock
│
├── rate limit
├── cost tracking
└── fallback
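The fallback and budget functions from the diagram can be sketched in a few lines. This is a toy illustration, not how LiteLLM or Portkey are implemented: `TinyGateway`, `BudgetExceeded`, the backend callables, and the flat per-call cost are all assumptions (real gateways price by token and track spend in shared storage).

```python
from typing import Callable

class BudgetExceeded(Exception):
    """Raised when the next call would exceed the daily budget."""

class TinyGateway:
    def __init__(self, backends: list[tuple[str, Callable[[str], str], float]],
                 daily_budget_usd: float) -> None:
        # Each backend is (name, call function, flat cost per call in USD).
        self.backends = backends
        self.daily_budget_usd = daily_budget_usd
        self.spent_usd = 0.0

    def complete(self, prompt: str) -> tuple[str, str]:
        """Try backends in order; return (backend name, answer)."""
        errors = []
        for name, call, cost in self.backends:
            if self.spent_usd + cost > self.daily_budget_usd:
                raise BudgetExceeded(f"{name} would exceed the daily cap")
            try:
                answer = call(prompt)
            except Exception as exc:  # fall through to the next backend
                errors.append(f"{name}: {exc}")
                continue
            self.spent_usd += cost  # only successful calls are charged here
            return name, answer
        raise RuntimeError("all backends failed: " + "; ".join(errors))

def primary(prompt: str) -> str:
    raise ConnectionError("provider is down")  # simulated outage

def secondary(prompt: str) -> str:
    return f"echo: {prompt}"

gateway = TinyGateway([("primary", primary, 0.25), ("secondary", secondary, 0.5)],
                      daily_budget_usd=1.0)
first = gateway.complete("hi")
print(first, gateway.spent_usd)
```

The application code only sees `gateway.complete(...)`; which provider answered is a routing detail, which is the main reason to place a gateway in front of model.llm.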
5. When to keep it simple?
Not every RAG system needs four enterprise layers. M9 taught the targets so that you can recognize when to scale, not so that you use everything from day one.
5.1 The minimum viable stack
| Piece | Simple stack | Sufficient when… |
|---|---|---|
| Runtime | One FastAPI process (io.input) | < 100 concurrent users, one team |
| Ingest | Python script + cron (io.batch) | < 10k documents, nightly reindex |
| LLM | Direct OpenAI/Anthropic API | Prototype or low volume |
| Embeddings | API or sentence-transformers in batch script | Same condition |
| State / queue | PostgreSQL + nothing else | No massive events, no multi-day workflows |
| Observability | Logs + Langfuse free tier | No regulatory audit trail requirement |
5.2 Signals that you need to scale
| Signal | Symptom | Layer to add |
|---|---|---|
| P95 latency > SLA in chat | Users wait > 3 s for first token | Dedicated serving (vLLM), gateway with fallback |
| Reconnections duplicate actions | Double charge, double booking | guardrail.idempotency + Redis (M9) |
| > 10k events/hour | Single queue saturated | Kafka (io.event-source) + horizontal workers |
| Workflow > 24 h with humans | Unmanageable state in tables | Temporal (io.trigger) |
| Index stale > 4 h | Users see obsolete info | Streaming CDC or more frequent ingest |
| LLM cost > budget | Unpredictable bill | Gateway routing to cheap models + cache |
| Regulatory audit | Cannot reconstruct who did what | observability.audit → Kafka + retention |
| Multi-team on same cluster | Deploy conflicts | K8s + namespaces; Dagster/Prefect for pipelines |
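The "reconnections duplicate actions" row is worth a concrete picture. The sketch below shows the idea behind an idempotency key; it is not the implementation of guardrail.idempotency. `IdempotencyGuard` is a hypothetical class, and the in-memory dict stands in for Redis, where a production version would typically rely on an atomic set-if-absent with an expiry.

```python
import hashlib
import json
from typing import Any, Callable

class IdempotencyGuard:
    def __init__(self) -> None:
        self._results: dict[str, Any] = {}  # stand-in for Redis

    @staticmethod
    def key_for(user_id: str, action: str, payload: dict) -> str:
        """Derive a stable key from who does what with which arguments."""
        raw = json.dumps({"user": user_id, "action": action, "payload": payload},
                         sort_keys=True)
        return hashlib.sha256(raw.encode("utf-8")).hexdigest()

    def run_once(self, key: str, action_fn: Callable[[], Any]) -> Any:
        """Execute `action_fn` only the first time a key is seen."""
        if key in self._results:
            return self._results[key]  # retry: return the stored result
        result = action_fn()
        self._results[key] = result
        return result

charges: list[int] = []

def charge_card() -> str:
    charges.append(100)  # the side effect that must not happen twice
    return "charged"

guard = IdempotencyGuard()
key = guard.key_for("user-42", "charge", {"booking": "ABC123", "amount": 100})
first_result = guard.run_once(key, charge_card)
retry_result = guard.run_once(key, charge_card)  # client reconnected and retried
print(first_result, retry_result, len(charges))
```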
5.3 Deliberate anti-complexity
| Temptation | Reality | Keep simple with |
|---|---|---|
| "Let's add Temporal just in case" | Cluster ops without multi-day workflows | FastAPI + Postgres |
| "Spark for 500 PDFs" | JVM cluster for hours of work | io.batch + Python |
| "K8s on day one" | YAML before product-market fit | Docker Compose or PaaS |
| "Self-host vLLM for 10 queries/day" | GPU idle 99% | Managed API |
6. Master decision table
6.1 By business scenario
| Scenario | RAGorbit input | Orchestration (§1) | Serving (§2) | Data (§3) | Deployment (§4) |
|---|---|---|---|---|---|
| Real-time chat (copilot, HR bot) | io.input | None (request/response) | Managed API or vLLM + SSE | Nightly batch (io.batch) | FastAPI on K8s/Cloud Run |
| Massive event-driven (disruptions, fraud) | io.event-source | Kafka + workers; not Temporal | Low-latency API (Groq) or rules without LLM | Kafka as event bus | Workers autoscale by lag |
| Nightly batch (scoring, claims) | io.batch | cron / Prefect / Airflow | TEI batch + LLM only at inference | Spark if > TB; else Python | CronJob / AWS Batch |
| Long human-in-the-loop workflow (onboarding, prior auth) | io.trigger | Temporal | Managed API (low volume) | Initial batch + spot updates | Temporal workers + FastAPI for UI |
| High volume on-prem (banking, defense) | io.input + io.event-source | Kafka + Temporal only where sagas exist | vLLM + TEI self-host | Spark/Flink + CDC | On-prem K8s + LiteLLM gateway |
| Demo / workshop | io.input | None | Local Ollama | Manual script | Gradio / local uvicorn |
6.2 Frequent anti-patterns
| Anti-pattern | Why it fails | Alternative |
|---|---|---|
| Airflow for chat | Scheduling latency is seconds to minutes, not milliseconds | FastAPI + io.input |
| Temporal for a simple cron | Temporal cluster for a 5 min/night job | io.batch + cron |
| Kafka for 10 messages/day | Broker ops without benefit | Webhook + FastAPI or SQS |
| Spark for 200 PDF ingest | JVM/cluster overhead | Python + io.batch |
| LLM in Lambda | Timeout, no GPU, cold start on large models | Managed API or dedicated vLLM |
| Full reindex every hour | Unnecessary embedding cost | Incremental CDC |
| One expensive model for everything | Simple classification at GPT-4 price | Gateway: small model → large if low confidence |
| Observability only in LangSmith | Lock-in; no Kafka/infra metrics | §12: audit + Langfuse + OTel |
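The alternative in the "one expensive model for everything" row (small model first, large model only on low confidence) can be sketched as a cascade. The two model functions and the 0.8 threshold are placeholders; how you obtain a usable confidence score from a real model is an open design choice that this sketch does not address.

```python
from typing import Callable

Classifier = Callable[[str], tuple[str, float]]  # returns (label, confidence)

def cascade(prompt: str, small_model: Classifier, large_model: Classifier,
            min_confidence: float = 0.8) -> tuple[str, str]:
    """Return (label, model tier used), escalating only when unsure."""
    label, confidence = small_model(prompt)
    if confidence >= min_confidence:
        return label, "small"
    label, _ = large_model(prompt)
    return label, "large"

def small_model(prompt: str) -> tuple[str, float]:
    # Placeholder: confident only on very short inputs.
    return ("refund", 0.95) if len(prompt) < 20 else ("unknown", 0.4)

def large_model(prompt: str) -> tuple[str, float]:
    # Placeholder for the expensive model.
    return ("complaint", 0.9)

easy = cascade("refund please", small_model, large_model)
hard = cascade("my bag was lost and nobody answers the phone", small_model, large_model)
print(easy, hard)
```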
6.3 Quick decision tree (graph input)
Is the input conversational in real time?
YES → io.input (chat-service)
NO → Triggered by broker events?
YES → io.event-source (event-worker)
NO → Business process duration?
> 1 day or HITL with waits → io.trigger (temporal)
NO → io.batch (batch/cron)
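The same tree, written as a function so it can be unit-tested or embedded in a project checklist. The node and target names come from the tree above; the function name and parameters are illustrative.

```python
def choose_input_node(realtime_conversation: bool, broker_events: bool,
                      duration_days: float = 0.0,
                      hitl_with_waits: bool = False) -> tuple[str, str]:
    """Return (RAGorbit input node, deploymentTarget) following the tree."""
    if realtime_conversation:
        return "io.input", "chat-service"
    if broker_events:
        return "io.event-source", "event-worker"
    if duration_days > 1 or hitl_with_waits:
        return "io.trigger", "temporal"
    return "io.batch", "batch"

print(choose_input_node(realtime_conversation=False, broker_events=False,
                        duration_days=5))
```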
Closing — how it all fits in RAGorbit
A mature RAG/agentic system does not choose a single tool — it combines layers:
[OFFLINE — Layer 3 + orchestrator §1]
io.batch → loader.* → ingest.* → model.embedding → store.*
(Prefect/Airflow/cron)
[RUNTIME — Layer 2 + 4 + RAGorbit graph]
io.input | io.event-source | io.trigger
→ agent.* + retrieval + guardrails
→ model.llm (via gateway §4)
→ observability.* → io.output | io.notify
[CROSS-CUTTING — M9 + §12]
observability.audit (Kafka) + Langfuse (dev) + OTel (prod)
guardrail.* + hitl.escalate
Your job as an engineer is not to master all 30 tools in this document, but to place each piece in the correct layer, recognize scaling signals (§5), and avoid anti-patterns (§6.2).
Cross-links
- M9, Production & Security: guia.md §5–§6 (io.* nodes, deployment targets, guardrails, Temporal vs Kafka comparison; the base this document expands).
- Expanded orchestration: tecnologias-comparadas.md §14 (Temporal vs queues + DB vs cron table).
- Observability: tecnologias-comparadas.md §12 (LangSmith vs Langfuse vs OTel; complements observability.audit, .metrics, .feedback).
- IO node catalog: io.input, io.event-source, io.trigger, io.batch.
- Anchor templates: 01-airline (chat + guardrails), 10-logistics (Kafka fan-out), 02-banking (batch), 03-healthcare (HITL).
- Concepts: docs/01-concepts.md §5 (deploymentTarget table and port types).