Engineering
AI Workflows & Integration
Engineering around LLMs: reliability and retries, cost control, structured output, orchestration, RAG, observability.
Treating an LLM call like a normal function call is the root of most pain. LLM calls are slow, non-deterministic, rate-limited, occasionally wrong, and billed per token. Engineer around those properties.
Reliability
- Wrap every model/tool call so it's retryable (idempotent where possible) with exponential backoff on 429/5xx. Temporal activities are an excellent home for this.
- Set hard timeouts — a hung generation shouldn't wedge a request or a workflow.
- Have a fallback path: cheaper/smaller model, cached answer, or graceful degradation when the primary is down or over budget.
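A minimal sketch of the retry and fallback bullets above, using only the standard library. `TransientError`, `call_with_retry`, and `call_with_fallback` are hypothetical names; a real client would map HTTP 429/5xx responses onto the retryable exception and enforce its own request timeout.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical stand-in for a retryable failure such as HTTP 429 or 5xx."""


def call_with_retry(fn, *, attempts=4, base_delay=0.5, max_delay=8.0, sleep=time.sleep):
    """Call fn(), retrying TransientError with exponential backoff and full jitter."""
    for attempt in range(attempts):
        try:
            return fn()
        except TransientError:
            if attempt == attempts - 1:
                raise  # out of attempts: let the caller decide what to do
            # The delay cap doubles each attempt: 0.5s, 1s, 2s, ... up to max_delay.
            cap = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, cap))


def call_with_fallback(primary, fallback, **retry_kwargs):
    """Try the primary path with retries, then degrade to the fallback."""
    try:
        return call_with_retry(primary, **retry_kwargs)
    except TransientError:
        return fallback()
```

Because `fn` may run several times, it should be idempotent or carry an idempotency key. The `sleep` parameter is injectable so the backoff can be tested without waiting.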
Cost & rate control
- Cache calls whose output is effectively deterministic (embeddings of fixed text, repeated prompts) in Redis/Postgres. Embeddings are the clearest case: the same text and model give the same vector, so recomputing them is pure waste.
- Rate-limit and budget per user/tenant to protect spend (sliding-window in Redis).
- Track tokens in/out per call; log them so cost is observable, not a month-end surprise.
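The caching and rate-limiting bullets above could look roughly like this. It is an in-memory sketch: the dict and deques stand in for Redis or Postgres, and `EmbeddingCache`, `SlidingWindowLimiter`, and `embed_fn` are hypothetical names rather than any library's API.

```python
import hashlib
import time
from collections import defaultdict, deque


class EmbeddingCache:
    """Content-addressed cache keyed by a hash of model name plus text."""

    def __init__(self, embed_fn, model):
        self._embed_fn = embed_fn  # hypothetical: text -> list of floats
        self._model = model
        self._store = {}  # a Redis hash or Postgres table in a real deployment
        self.misses = 0

    def get(self, text):
        # Include the model name so switching models never serves stale vectors.
        key = hashlib.sha256(f"{self._model}\x00{text}".encode("utf-8")).hexdigest()
        if key not in self._store:
            self.misses += 1
            self._store[key] = self._embed_fn(text)
        return self._store[key]


class SlidingWindowLimiter:
    """Allow at most `limit` calls per tenant in any `window`-second span."""

    def __init__(self, limit, window, clock=time.monotonic):
        self._limit = limit
        self._window = window
        self._clock = clock
        self._hits = defaultdict(deque)  # tenant -> timestamps of recent calls

    def allow(self, tenant):
        now = self._clock()
        hits = self._hits[tenant]
        # Drop timestamps that have aged out of the window.
        while hits and hits[0] <= now - self._window:
            hits.popleft()
        if len(hits) >= self._limit:
            return False
        hits.append(now)
        return True
```

The same limiter shape works for token budgets if each hit stores a token count instead of counting as one call.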
Structured output
- Demand structured output (JSON schema / tool-calling / function-calling) instead of parsing prose. Validate the model's response with Pydantic before anything else consumes it; on validation failure, retry with the error fed back to the model.
- Keep a strict boundary: the model proposes, your code validates and decides. Never let raw model output trigger irreversible side effects unchecked.
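A sketch of the validate-then-retry loop described above. To stay dependency-free it uses a hand-written validator; with Pydantic v2, a `BaseModel` subclass and `model_validate_json` would take the place of `parse_ticket`. The `Ticket` schema and the `ask_model` callable are hypothetical.

```python
import json
from dataclasses import dataclass


@dataclass
class Ticket:
    title: str
    priority: int


def parse_ticket(raw):
    """Parse and validate model output; raise ValueError with a readable message."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    title, priority = data.get("title"), data.get("priority")
    if not isinstance(title, str) or not title:
        raise ValueError("'title' must be a non-empty string")
    # bool is a subclass of int in Python, so reject it explicitly.
    if not isinstance(priority, int) or isinstance(priority, bool) or not 1 <= priority <= 5:
        raise ValueError("'priority' must be an integer from 1 to 5")
    return Ticket(title, priority)


def generate_validated(ask_model, prompt, max_attempts=3):
    """ask_model(prompt) -> str is a hypothetical model client."""
    current = prompt
    for _ in range(max_attempts):
        raw = ask_model(current)
        try:
            return parse_ticket(raw)
        except ValueError as err:
            # Feed the validation error back so the model can correct itself.
            current = (
                f"{prompt}\n\nYour previous reply was rejected: {err}\n"
                "Reply with corrected JSON only."
            )
    raise ValueError(f"no valid output after {max_attempts} attempts")
```

Only the validated `Ticket` leaves this function, which keeps the boundary explicit: downstream code never sees raw model text.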
Orchestration patterns
- Pipeline/chain: each step is an activity; pass typed state between them.
- Agent loop: model picks a tool → you execute it in code → feed the result back. Cap the loop iterations and total tokens to prevent runaway cost.
- Fan-out/fan-in: run independent sub-tasks in parallel, then aggregate.
- Human-in-the-loop: pause on a Temporal signal for approval before anything consequential.
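As an illustration of the agent loop above with both caps in place, here is a minimal sketch. `Step`, `decide`, and the `tools` mapping are hypothetical; `decide` stands in for a model call that returns either a tool request or a final answer along with its token usage.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Step:
    """One model decision: either request a tool or give a final answer."""
    tokens: int
    tool: Optional[str] = None
    args: dict = field(default_factory=dict)
    answer: Optional[str] = None


class BudgetExceeded(RuntimeError):
    """Raised when the loop hits its step or token cap."""


def run_agent(decide, tools, task, max_steps=5, max_tokens=4000):
    history = [("user", task)]
    used = 0
    for _ in range(max_steps):
        step = decide(history)  # the model proposes
        used += step.tokens
        if used > max_tokens:
            raise BudgetExceeded(f"token budget exceeded: {used} > {max_tokens}")
        if step.answer is not None:
            return step.answer
        if step.tool not in tools:
            # Report the problem to the model instead of executing anything.
            history.append(("tool_error", f"unknown tool {step.tool!r}"))
            continue
        result = tools[step.tool](**step.args)  # your code executes
        history.append(("tool", f"{step.tool} -> {result}"))
    raise BudgetExceeded(f"no answer within {max_steps} steps")
```

The tool allow-list doubles as the safety boundary: the model can only name a tool, and anything not registered in `tools` is never run.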
RAG & data integration
- Store embeddings in pgvector, which keeps vectors next to your relational data and saves you one system to run, or in a dedicated vector DB at larger scale.
- Retrieval quality dominates output quality: invest in chunking strategy, hybrid search (keyword + vector), and re-ranking before blaming the model.
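One common way to combine keyword and vector results is reciprocal rank fusion (RRF). The notes above do not name a fusion method, so treat this as one possible choice. The sketch assumes each retriever returns document IDs best-first; `k=60` is a conventional default, not a tuned value.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of doc IDs: score(d) = sum over lists of 1 / (k + rank)."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by doc ID for stable output.
    return sorted(scores, key=lambda d: (-scores[d], d))


keyword_hits = ["a", "b", "c"]  # e.g. from full-text search
vector_hits = ["b", "d", "a"]   # e.g. from a nearest-neighbour query
fused = reciprocal_rank_fusion([keyword_hits, vector_hits])
```

RRF uses only ranks, so it sidesteps the problem that keyword scores and cosine similarities live on different scales. A re-ranker can then reorder the top of `fused`.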
Async by nature
- Generation is usually too slow for a blocking request. Kick off the work, return a job ID, and stream results (SSE/WebSocket) or let the client poll. Do the heavy lifting in a background worker or a Temporal workflow, not in the request thread.
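The submit-then-poll shape described above, sketched with a thread pool and an in-memory job table. `JobRunner` is a hypothetical name; in production the job state would live in a durable store or a workflow engine so that it survives restarts.

```python
import uuid
from concurrent.futures import ThreadPoolExecutor


class JobRunner:
    """Run slow work off the request thread and expose it by job ID."""

    def __init__(self, workers=4):
        self._pool = ThreadPoolExecutor(max_workers=workers)
        self._jobs = {}  # job_id -> Future; in-memory only, lost on restart

    def submit(self, fn, *args):
        """Start the work and return immediately with an opaque job ID."""
        job_id = uuid.uuid4().hex
        self._jobs[job_id] = self._pool.submit(fn, *args)
        return job_id

    def status(self, job_id):
        """What a polling endpoint would return for this job."""
        future = self._jobs[job_id]
        if not future.done():
            return {"state": "running"}
        error = future.exception()
        if error is not None:
            return {"state": "failed", "error": str(error)}
        return {"state": "done", "result": future.result()}
```

A request handler would call `submit` and return the ID with a 202-style response. A second endpoint would call `status`, or a streaming endpoint would push tokens as they arrive.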
Observability
- Log prompts, responses, token counts, latency, and model version for every call. You cannot debug or improve an AI feature you can't see.
- Evaluate outputs against a fixed test set before shipping prompt or model changes.
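A rough sketch of per-call logging and a fixed-test-set check, as described above. The `client(model, prompt) -> (text, tokens_in, tokens_out)` interface is an assumption for illustration; real SDKs report usage in their own response objects.

```python
import json
import logging
import time

logger = logging.getLogger("llm")


def logged_call(client, model, prompt):
    """Call the (hypothetical) client and emit one structured log record."""
    start = time.perf_counter()
    text, tokens_in, tokens_out = client(model, prompt)
    record = {
        "model": model,
        "prompt": prompt,
        "response": text,
        "tokens_in": tokens_in,
        "tokens_out": tokens_out,
        "latency_ms": round((time.perf_counter() - start) * 1000, 1),
    }
    logger.info(json.dumps(record))  # one JSON line per call, easy to query later
    return text, record


def pass_rate(generate, cases):
    """Fraction of (prompt, check) cases where check(generate(prompt)) is true."""
    passed = sum(1 for prompt, check in cases if check(generate(prompt)))
    return passed / len(cases)
```

Running `pass_rate` over the same cases before and after a prompt or model change gives a simple regression signal. Prompts and responses may contain personal data, so the log sink needs the same access controls as the data itself.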