sparklesEngineering

AI Workflows & Integration

Engineering around LLMs: reliability and retries, cost control, structured output, orchestration, RAG, observability.

1 item

Links1

01NotesNote

Treating an LLM call like a normal function call is the root of most pain. They are slow, non-deterministic, rate-limited, occasionally wrong, and billed per token. Engineer around those properties.

Reliability

  • Wrap every model/tool call so it's retryable (idempotent where possible) with exponential backoff on 429/5xx. Temporal activities are an excellent home for this.
  • Set hard timeouts — a hung generation shouldn't wedge a request or a workflow.
  • Have a fallback path: cheaper/smaller model, cached answer, or graceful degradation when the primary is down or over budget.

Cost & rate control

  • Cache deterministic-ish calls (embeddings of fixed text, repeated prompts) in Redis/Postgres. Embeddings especially — recomputing them is pure waste.
  • Rate-limit and budget per user/tenant to protect spend (sliding-window in Redis).
  • Track tokens in/out per call; log them so cost is observable, not a month-end surprise.

Structured output

  • Demand structured output (JSON schema / tool-calling / function-calling) instead of parsing prose. Validate with Pydantic on the way out; on validation failure, retry with the error fed back to the model.
  • Keep a strict boundary: the model proposes, your code validates and decides. Never let raw model output trigger irreversible side effects unchecked.

Orchestration patterns

  • Pipeline/chain: each step is an activity; pass typed state between them.
  • Agent loop: model picks a tool → you execute it in code → feed the result back. Cap the loop iterations and total tokens to prevent runaway cost.
  • Fan-out/fan-in: run independent sub-tasks in parallel, then aggregate.
  • Human-in-the-loop: pause on a Temporal signal for approval before anything consequential.

RAG & data integration

  • Store embeddings in pgvector (keeps vectors next to your relational data — one less system) or a dedicated vector DB at larger scale.
  • Retrieval quality dominates output quality: invest in chunking strategy, hybrid search (keyword + vector), and re-ranking before blaming the model.

Async by nature

  • Generation is too slow for a blocking request. Kick off the work, return a job ID, and stream results (SSE/WebSocket) or poll. Do the heavy lifting in a worker/Temporal, not in the request thread.

Observability

  • Log prompts, responses, token counts, latency, and model version for every call. You cannot debug or improve an AI feature you can't see. Evaluate outputs against a fixed test set before shipping prompt or model changes.