AI Workflows & Integration

Treating an LLM call like a normal function call is the root of most pain. LLM calls are slow, non-deterministic, rate-limited, occasionally wrong, and billed per token. Engineer around those properties.

Reliability

Wrap every model/tool call so it's retryable (idempotent where possible) with exponential backoff on 429/5xx. Temporal activities are a good home for this, since retry policies and timeouts are configured per activity.
Set hard timeouts — a hung generation shouldn't wedge a request or a workflow.
Have a fallback path: cheaper/smaller model, cached answer, or graceful degradation when the primary is down or over budget.
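The three points above can be sketched in plain Python. This is a minimal illustration, not a production client: TransientError is a hypothetical stand-in for a provider's 429/5xx error, the wrapped function is assumed to enforce its own hard timeout and raise TimeoutError, and the jittered delay is one common choice rather than a requirement.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical stand-in for a 429/5xx response from a provider."""


# Errors worth retrying; the wrapped call is assumed to raise TimeoutError
# itself when its hard timeout expires.
RETRYABLE = (TransientError, TimeoutError)


def call_with_retry(fn, *, attempts=4, base_delay=0.5, max_delay=8.0, sleep=time.sleep):
    """Call fn(); on a retryable error, back off exponentially with jitter."""
    for attempt in range(attempts):
        try:
            return fn()
        except RETRYABLE:
            if attempt == attempts - 1:
                raise  # out of attempts: let the caller decide
            cap = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, cap))  # jitter avoids synchronized retries


def call_with_fallback(primary, fallback, **retry_kwargs):
    """Try the primary path with retries; degrade to fallback if it stays down."""
    try:
        return call_with_retry(primary, **retry_kwargs)
    except RETRYABLE:
        return fallback()
```

Here fallback could call a smaller model or return a cached answer. Inside a Temporal activity the retry loop would normally be replaced by the activity's own retry policy.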

Cost & rate control

Cache deterministic-ish calls (embeddings of fixed text, repeated prompts) in Redis/Postgres. Embeddings especially — recomputing them is pure waste.
Rate-limit and budget per user/tenant to protect spend (sliding-window in Redis).
Track tokens in/out per call; log them so cost is observable, not a month-end surprise.
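A rough sketch of the caching and rate-limiting ideas above, using in-memory structures as stand-ins for Redis/Postgres so the logic is visible. The class names, the cache key layout (model name plus a SHA-256 of the text), and the injectable clock are assumptions made for illustration.

```python
import hashlib
import time
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """In-memory stand-in for a Redis sliding-window limiter, keyed per tenant."""

    def __init__(self, max_calls, window_s, clock=time.monotonic):
        self.max_calls = max_calls
        self.window_s = window_s
        self.clock = clock
        self._hits = defaultdict(deque)

    def allow(self, tenant):
        now = self.clock()
        hits = self._hits[tenant]
        # Drop timestamps that have slid out of the window.
        while hits and now - hits[0] >= self.window_s:
            hits.popleft()
        if len(hits) >= self.max_calls:
            return False
        hits.append(now)
        return True


class EmbeddingCache:
    """Caches embeddings by (model, sha256(text)); a dict stands in for the real store."""

    def __init__(self, embed_fn, model):
        self.embed_fn = embed_fn  # hypothetical callable: text -> list of floats
        self.model = model
        self.store = {}
        self.misses = 0

    def get(self, text):
        key = (self.model, hashlib.sha256(text.encode("utf-8")).hexdigest())
        if key not in self.store:
            self.misses += 1  # only misses cost money
            self.store[key] = self.embed_fn(text)
        return self.store[key]
```

Including the model name in the key matters: embeddings from different models are not interchangeable, so a model upgrade should miss the cache rather than return stale vectors.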

Structured output

Demand structured output (JSON schema / tool-calling / function-calling) instead of parsing prose. Validate the response with Pydantic before using it; on validation failure, retry with the validation error fed back to the model.
Keep a strict boundary: the model proposes, your code validates and decides. Never let raw model output trigger irreversible side effects unchecked.
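A sketch of the validate-and-retry loop described above. To stay dependency-free it uses a hand-written validator where a Pydantic model would normally sit; the Refund type, its fields, and the call_model callable are hypothetical.

```python
import json
from dataclasses import dataclass


@dataclass
class Refund:
    """Hypothetical target type; a Pydantic model would play this role."""
    order_id: str
    amount_cents: int


def parse_refund(raw):
    """Validate raw model output; raise ValueError with a message the model can act on."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc.msg}") from exc
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    order_id = data.get("order_id")
    amount = data.get("amount_cents")
    if not isinstance(order_id, str) or not order_id:
        raise ValueError("order_id must be a non-empty string")
    if isinstance(amount, bool) or not isinstance(amount, int) or amount <= 0:
        raise ValueError("amount_cents must be a positive integer")
    return Refund(order_id, amount)


def get_structured(call_model, prompt, *, max_attempts=3):
    """Ask the model, validate, and feed the validation error back on failure."""
    feedback = ""
    for _ in range(max_attempts):
        raw = call_model(prompt + feedback)
        try:
            return parse_refund(raw)
        except ValueError as exc:
            feedback = f"\n\nYour previous reply was rejected: {exc}. Reply with JSON only."
    raise ValueError(f"no valid output after {max_attempts} attempts")
```

The returned Refund is still only a proposal. Whether to issue it stays a decision for application code, which is the boundary the second point asks for.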

Orchestration patterns

Pipeline/chain: each step is an activity; pass typed state between them.
Agent loop: model picks a tool → you execute it in code → feed the result back. Cap the loop iterations and total tokens to prevent runaway cost.
Fan-out/fan-in: run independent sub-tasks in parallel, then aggregate.
Human-in-the-loop: pause on a Temporal signal for approval before anything consequential.
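The agent loop with its two caps might look roughly like this. The Step shape, the next_step callable, and the BudgetExceeded exception are invented for the sketch; a real provider SDK would return its own tool-call objects and token usage.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Step:
    """Hypothetical shape of one model turn: a tool request or a final answer."""
    tokens: int
    tool: Optional[str] = None
    args: dict = field(default_factory=dict)
    answer: Optional[str] = None


class BudgetExceeded(Exception):
    """Raised when the loop hits its iteration or token cap."""


def run_agent(next_step, tools, task, *, max_iters=5, max_tokens=4000):
    """Model picks a tool, code executes it, result is fed back; both caps enforced."""
    history = [("user", task)]
    used = 0
    for _ in range(max_iters):
        step = next_step(history)
        used += step.tokens
        if used > max_tokens:
            raise BudgetExceeded(f"token budget exceeded: {used} > {max_tokens}")
        if step.answer is not None:
            return step.answer, used
        if step.tool not in tools:
            # The model proposed an unknown tool; report it instead of crashing.
            history.append(("tool", f"error: unknown tool {step.tool!r}"))
            continue
        # Tools run in our code, never inside the model.
        history.append(("tool", str(tools[step.tool](**step.args))))
    raise BudgetExceeded(f"no answer within {max_iters} iterations")
```

Only tools present in the tools dict can run, so the allow-list doubles as the place to gate consequential actions behind human approval.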

RAG & data integration

Store embeddings in pgvector (keeps vectors next to your relational data — one less system) or a dedicated vector DB at larger scale.
Retrieval quality dominates output quality: invest in chunking strategy, hybrid search (keyword + vector), and re-ranking before blaming the model.
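Hybrid search needs some way to merge the keyword ranking with the vector ranking. The notes above do not name one; reciprocal rank fusion (RRF) is one common option and is used here purely as an example. The function name and the document IDs are illustrative, and k=60 is the conventional default rather than a tuned value.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one list, best first.

    Each document scores 1 / (k + rank) per list it appears in, so documents
    ranked well by both keyword and vector search rise to the top.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by score descending; break ties by ID for a stable order.
    return sorted(scores, key=lambda d: (-scores[d], d))


keyword_hits = ["a", "b", "c"]  # e.g. from full-text search
vector_hits = ["b", "d", "a"]   # e.g. from a pgvector similarity query
fused = reciprocal_rank_fusion([keyword_hits, vector_hits])
# fused == ["b", "a", "d", "c"]
```

RRF uses ranks only, so it sidesteps the problem that keyword scores and vector distances live on different scales. A re-ranker would then reorder the top of the fused list.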

Async by nature

Generation is usually too slow for a blocking request. Kick off the work, return a job ID, and stream results (SSE/WebSocket) or let the client poll. Do the heavy lifting in a background worker or a Temporal workflow, not in the request thread.
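The job-ID-and-poll shape can be sketched with asyncio. This is an assumption-heavy toy: an in-process task stands in for the worker, a dict stands in for a persistent job table, and JobStore is a made-up name. In production the work would run in a separate worker or Temporal workflow so it survives a restart.

```python
import asyncio
import uuid


class JobStore:
    """In-memory stand-in for a job table; a real system would persist this."""

    def __init__(self):
        self.jobs = {}
        self._tasks = set()

    def submit(self, coro_fn, *args):
        """Start the work in the background and return a job ID immediately."""
        job_id = uuid.uuid4().hex
        self.jobs[job_id] = {"status": "running", "result": None, "error": None}
        task = asyncio.create_task(self._run(job_id, coro_fn, *args))
        # Keep a reference so the task is not garbage-collected mid-flight.
        self._tasks.add(task)
        task.add_done_callback(self._tasks.discard)
        return job_id

    async def _run(self, job_id, coro_fn, *args):
        try:
            result = await coro_fn(*args)
            self.jobs[job_id].update(status="done", result=result)
        except Exception as exc:
            self.jobs[job_id].update(status="failed", error=str(exc))

    def status(self, job_id):
        """What a polling endpoint would return for this job."""
        return dict(self.jobs[job_id])
```

A request handler would call submit and return the ID at once; a second endpoint would serve status until it reports done or failed. submit has to be called from inside a running event loop.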

Observability

Log prompts, responses, token counts, latency, and model version for every call. You cannot debug or improve an AI feature you can't see. Evaluate outputs against a fixed test set before shipping prompt or model changes.
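A minimal sketch of per-call logging plus a fixed-set evaluation, under a few assumptions: call_model is a hypothetical callable returning (text, tokens_in, tokens_out), the log sink is a plain list where a real system would use a logger or table, and each test case pairs a prompt with a check function.

```python
import json
import time


def logged_call(call_model, prompt, *, model_version, sink):
    """Call the model and append one JSON log line per call to sink."""
    start = time.perf_counter()
    text, tokens_in, tokens_out = call_model(prompt)
    sink.append(json.dumps({
        "model_version": model_version,
        "prompt": prompt,
        "response": text,
        "tokens_in": tokens_in,
        "tokens_out": tokens_out,
        "latency_ms": round((time.perf_counter() - start) * 1000, 2),
    }))
    return text


def pass_rate(answer_fn, cases):
    """Fraction of fixed (prompt, check) cases whose output passes its check."""
    passed = sum(1 for prompt, check in cases if check(answer_fn(prompt)))
    return passed / len(cases)
```

Running pass_rate on the same cases before and after a prompt or model change gives a number to compare, and the token counts in each log line are what makes cost observable per call.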
