Temporal — Durable Workflows
Durable execution for long-running and AI workflows: workflows vs activities, signals, retries, watch-outs.
The problem Temporal solves
Long-running, multi-step processes that must survive crashes, restarts, and retries — onboarding flows, payment + provisioning sequences, multi-step AI pipelines. Doing this with cron + a state column + ad-hoc retries gets fragile fast. Temporal gives you durable execution: your workflow code resumes exactly where it left off, even if the worker dies mid-run.
Core mental model
- Workflow: orchestration logic. Must be deterministic: no direct I/O, no random, no datetime.now(), no network calls. Temporal replays workflow code to rebuild state, so it must produce the same path given the same history.
- Activity: where all the side effects live (DB writes, HTTP calls, LLM calls). Activities are retried automatically with configurable backoff.
- The engine persists every step to its event history, so a restarted worker replays the history and continues seamlessly.
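To make the replay idea concrete, here is a toy model in plain Python. It is not Temporal's API: Context, Crash, and the onboarding functions are hypothetical names invented for illustration, and a real engine records far more than a list of results. The sketch only shows why a restarted run can skip work that already finished.

```python
from typing import Any, Callable, List


class Crash(Exception):
    """Simulates the worker process dying mid-run."""


class Context:
    """Toy workflow context: replays recorded results before running anything new."""

    def __init__(self, history: List[Any]) -> None:
        self._history = history  # stands in for the persisted event history
        self._cursor = 0         # how far this run has progressed through it

    def activity(self, fn: Callable[..., Any], *args: Any) -> Any:
        if self._cursor < len(self._history):
            result = self._history[self._cursor]  # replay: reuse the recorded result
        else:
            result = fn(*args)                    # first time: perform the side effect
            self._history.append(result)          # record it before moving on
        self._cursor += 1
        return result


charges: List[str] = []  # every real call to charge_card is logged here
crash_next = [True]      # makes provision_account fail exactly once


def charge_card(user: str) -> str:
    charges.append(user)
    return f"charged:{user}"


def provision_account(user: str) -> str:
    if crash_next:
        crash_next.clear()
        raise Crash("worker died before provisioning finished")
    return f"provisioned:{user}"


def onboarding(ctx: Context) -> List[str]:
    # Orchestration only: all side effects go through ctx.activity.
    receipt = ctx.activity(charge_card, "alice")
    account = ctx.activity(provision_account, "alice")
    return [receipt, account]


history: List[Any] = []
try:
    onboarding(Context(history))       # first run crashes in the second step
except Crash:
    pass                               # the history list survives the crash

result = onboarding(Context(history))  # a "restarted worker" replays from the top
print(result, charges)
```

The second run starts the workflow function from its first line, yet the card is charged only once, because the first step's result comes from the history instead of a fresh call.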
Why it fits AI workflows especially well
- LLM/API calls are slow, flaky, and rate-limited → wrap each in an Activity and get automatic retries with backoff for free.
- Multi-agent / multi-step chains (plan → call tool → summarize → call again) map cleanly onto a workflow with sequential/parallel activities.
- Human-in-the-loop: a workflow can wait on a signal (approval, user reply), for days if needed, without tying up a worker in the meantime.
- Built-in timeouts per activity and per workflow stop a hung model call from wedging the pipeline.
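As a rough picture of what "retries with backoff" means for a flaky model call, the sketch below computes a delay schedule and applies it in a plain retry loop. The parameter names are borrowed from Temporal's retry policy settings (initial_interval, backoff_coefficient, maximum_interval, maximum_attempts). The functions backoff_delays, run_with_retries, and flaky_llm_call are hypothetical helpers, not SDK code, and the numbers are arbitrary.

```python
import time
from typing import Callable, List, TypeVar

T = TypeVar("T")


def backoff_delays(initial_interval: float, backoff_coefficient: float,
                   maximum_interval: float, maximum_attempts: int) -> List[float]:
    """Wait time before each retry; maximum_attempts includes the first try."""
    return [
        min(initial_interval * backoff_coefficient ** i, maximum_interval)
        for i in range(maximum_attempts - 1)
    ]


def run_with_retries(fn: Callable[[], T], delays: List[float],
                     sleep: Callable[[float], None] = time.sleep) -> T:
    """Call fn, sleeping for the next delay after each failure."""
    for delay in delays:
        try:
            return fn()
        except Exception:
            sleep(delay)
    return fn()  # final attempt: any exception propagates to the caller


attempts: List[int] = []


def flaky_llm_call() -> str:
    """Hypothetical model call that times out twice, then succeeds."""
    attempts.append(1)
    if len(attempts) < 3:
        raise TimeoutError("model timed out")
    return "summary"


delays = backoff_delays(initial_interval=1.0, backoff_coefficient=2.0,
                        maximum_interval=10.0, maximum_attempts=6)
slept: List[float] = []                       # record delays instead of really sleeping
answer = run_with_retries(flaky_llm_call, delays, sleep=slept.append)
print(delays, slept, answer)
```

With Temporal the loop itself lives in the engine rather than in your code; the activity just raises, and the policy decides when to try again.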
Key features to lean on
- Signals — send data into a running workflow (e.g. "user approved").
- Queries — read a running workflow's current state without affecting it.
- Child workflows — compose and fan out.
- Heartbeats — for long activities, so the engine can detect a dead worker and reschedule.
- Versioning / patching — safely deploy changes to workflows that are mid-flight.
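The signal and query pair can be pictured with a small asyncio stand-in. This is a toy, not the SDK: ApprovalWorkflow, signal_decision, and query_status are hypothetical names. In the Temporal Python SDK the corresponding pieces are, as far as I know, the @workflow.signal and @workflow.query decorators plus workflow.wait_condition, and the waiting state is durable rather than held in process memory as it is here.

```python
import asyncio
from typing import Tuple


class ApprovalWorkflow:
    """Toy workflow that parks until a human decision arrives."""

    def __init__(self) -> None:
        self._status = "waiting_for_approval"
        self._approved = False
        self._decided = asyncio.Event()

    async def run(self, draft: str) -> str:
        await self._decided.wait()  # blocks without polling until signalled
        self._status = "approved" if self._approved else "rejected"
        return f"{draft} [{self._status}]"

    def signal_decision(self, approved: bool) -> None:
        """Signal: pushes data into the running workflow."""
        self._approved = approved
        self._decided.set()

    def query_status(self) -> str:
        """Query: reads current state and changes nothing."""
        return self._status


async def demo() -> Tuple[str, str, str]:
    wf = ApprovalWorkflow()
    task = asyncio.create_task(wf.run("refund $40"))
    await asyncio.sleep(0)        # let the workflow start and reach its wait
    before = wf.query_status()    # query while it is still waiting
    wf.signal_decision(True)      # the "user approved" signal
    result = await task
    return before, result, wf.query_status()


before, result, after = asyncio.run(demo())
print(before, result, after)
```

The split matters in practice: the signal handler is the only place that mutates state, so a query can be served at any time without affecting the run.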
Watch-outs
- Keep non-determinism out of workflow code — the most common beginner bug. Anything non-deterministic goes in an activity.
- It's an extra piece of infrastructure (server + workers). For genuinely simple background jobs, a Postgres/Redis queue is lighter. Reach for Temporal when durability and orchestration are the actual requirement.
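The non-determinism watch-out is easier to recognize once you have seen it fail. The toy below (hypothetical names throughout, not Temporal's API) records which activity a workflow asked for, then replays the same code. A fake clock stands in for datetime.now(), so the replay takes a different branch than the recorded run and the mismatch is detected.

```python
import itertools
from typing import List, Optional


class NonDeterminismError(Exception):
    """Raised when replayed code disagrees with the recorded history."""


class ReplayContext:
    def __init__(self, history: List[str]) -> None:
        self._history = history
        self._cursor = 0

    def activity(self, name: str) -> None:
        if self._cursor < len(self._history):
            recorded = self._history[self._cursor]
            if recorded != name:  # the code took a different path than last time
                raise NonDeterminismError(
                    f"history has {recorded!r}, code asked for {name!r}"
                )
        else:
            self._history.append(name)  # first run: record the request
        self._cursor += 1


fake_clock = itertools.count(9)  # stand-in for the wall-clock hour: 9, 10, 11, ...


def bad_workflow(ctx: ReplayContext) -> None:
    hour = next(fake_clock)  # like datetime.now(): a new value on every run
    ctx.activity("send_email" if hour < 10 else "queue_for_tomorrow")


def good_workflow(ctx: ReplayContext, hour: int) -> None:
    # The hour arrives as workflow input, so it is identical on replay.
    ctx.activity("send_email" if hour < 10 else "queue_for_tomorrow")


bad_history: List[str] = []
bad_workflow(ReplayContext(bad_history))      # original run sees hour 9
replay_error: Optional[str] = None
try:
    bad_workflow(ReplayContext(bad_history))  # replay sees hour 10
except NonDeterminismError as exc:
    replay_error = str(exc)

good_history: List[str] = []
good_workflow(ReplayContext(good_history), hour=9)  # original run
good_workflow(ReplayContext(good_history), hour=9)  # replay matches the history
print(replay_error, good_history)
```

The fix in the toy is to pass the value in as input; the other standard option is to fetch it inside an activity so the result is recorded. The Temporal Python SDK also offers replay-safe helpers such as workflow.now() and workflow.random() for this purpose.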