Temporal — Durable Workflows

Durable execution for long-running and AI workflows: workflows vs activities, signals, retries, watch-outs.

The problem Temporal solves

Temporal targets long-running, multi-step processes that must survive crashes, restarts, and retries: onboarding flows, payment + provisioning sequences, multi-step AI pipelines. Doing this with cron + a state column + ad-hoc retries gets fragile fast. Temporal gives you durable execution: if a worker dies mid-run, another worker rebuilds the workflow's state from its recorded history and continues from the last completed step.

Core mental model

  • Workflow: orchestration logic. Must be deterministic: no direct I/O (network, disk), no randomness, no datetime.now(). Temporal replays workflow code to rebuild state, so the code must take the same path given the same history.
  • Activity: where all the side effects live: DB writes, HTTP calls, LLM calls. Activities are retried automatically with configurable backoff, so they should be idempotent where possible, since a retry can re-run work that already partly happened.
  • The engine persists every step to its event history, so a restarted worker replays the history, reuses the recorded results of completed activities, and continues from there.
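
A toy model of the replay idea above, in plain Python. This is not the Temporal SDK: ToyEngine, WorkerCrash, execute_activity, and the two activities are hypothetical names made up for illustration, and the "history" is just an in-memory list. It only aims to show why a restarted run can skip side effects that already completed.

```python
class WorkerCrash(Exception):
    """Simulates the worker process dying mid-run."""


class ToyEngine:
    """Toy stand-in for a durable-execution engine (NOT the Temporal SDK)."""

    def __init__(self):
        self.history = []       # recorded activity results, in order
        self.activity_runs = 0  # how many times a side effect really executed

    def run(self, workflow, crash_after=None):
        """Run the workflow from the top, replaying recorded results first."""
        cursor = 0

        def execute_activity(fn, *args):
            nonlocal cursor
            if cursor < len(self.history):
                # Replay: reuse the recorded result, skip the side effect.
                result = self.history[cursor]
            else:
                if crash_after is not None and len(self.history) >= crash_after:
                    raise WorkerCrash()
                result = fn(*args)
                self.activity_runs += 1
                self.history.append(result)  # "persist" before moving on
            cursor += 1
            return result

        return workflow(execute_activity)


def charge_card(amount):
    return f"charged {amount}"


def provision_account(user):
    return f"provisioned {user}"


def onboarding_workflow(execute_activity):
    # Orchestration only: every side effect goes through execute_activity.
    receipt = execute_activity(charge_card, 42)
    account = execute_activity(provision_account, "ada")
    return [receipt, account]


engine = ToyEngine()
try:
    engine.run(onboarding_workflow, crash_after=1)  # dies after the first activity
except WorkerCrash:
    pass

# A "restarted worker" re-runs the same code: step 1 is replayed, step 2 executes.
result = engine.run(onboarding_workflow)
print(result, engine.activity_runs)
```

The card is charged once even though the workflow function ran twice. That only works because the workflow takes the same path on both runs, which is the reason for the determinism rule.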

Why it fits AI workflows especially well

  • LLM/API calls are slow, flaky, and rate-limited → wrap each in an Activity and get automatic retries with backoff for free.
  • Multi-agent / multi-step chains (plan → call tool → summarize → call again) map cleanly onto a workflow with sequential/parallel activities.
  • Human-in-the-loop: a workflow can wait on a signal (approval, user reply), for days if needed, without tying up a worker while it waits.
  • Built-in timeouts per activity and per workflow stop a hung model call from wedging the pipeline.
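
To make the retry and timeout points concrete, here is roughly what you would otherwise write by hand around every model call. A minimal asyncio sketch, not Temporal code: run_with_retries, flaky_llm_call, hung_llm_call, and all parameter names and values are assumptions for illustration. With Temporal the equivalent policy is configured on the activity rather than coded in a loop.

```python
import asyncio


async def run_with_retries(activity, *, max_attempts=4, initial_interval=0.01,
                           backoff=2.0, timeout=1.0):
    """Call an async function with a per-attempt timeout and exponential backoff."""
    delay = initial_interval
    for attempt in range(1, max_attempts + 1):
        try:
            return await asyncio.wait_for(activity(), timeout)
        except (asyncio.TimeoutError, ConnectionError):
            if attempt == max_attempts:
                raise  # out of attempts: surface the last failure
            await asyncio.sleep(delay)
            delay *= backoff  # 0.01s, 0.02s, 0.04s, ...


calls = {"count": 0}


async def flaky_llm_call():
    """Hypothetical model call that fails twice, then succeeds."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("rate limited")
    return "summary text"


async def hung_llm_call():
    """Hypothetical model call that never returns in time."""
    await asyncio.sleep(10)


answer = asyncio.run(run_with_retries(flaky_llm_call))

try:
    asyncio.run(run_with_retries(hung_llm_call, max_attempts=2, timeout=0.01))
    timed_out = False
except asyncio.TimeoutError:
    timed_out = True

print(answer, calls["count"], timed_out)
```

The hand-rolled loop loses its attempt count if the process dies, which is the gap durable execution closes.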

Key features to lean on

  • Signals — send data into a running workflow (e.g. "user approved").
  • Queries — read a running workflow's current state without affecting it.
  • Child workflows — compose and fan out.
  • Heartbeats — for long activities, so the engine can detect a dead worker and reschedule.
  • Versioning / patching — safely deploy changes to workflows that are mid-flight.
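
Signals and queries can be pictured with a small in-memory model. Again a sketch, not the Temporal SDK: ApprovalWorkflow and its method names are hypothetical, and an asyncio.Event stands in for the engine delivering a signal. Nothing here is durable.

```python
import asyncio


class ApprovalWorkflow:
    """Toy in-memory workflow (NOT the Temporal SDK)."""

    def __init__(self):
        self.status = "waiting_for_approval"
        self._approved = asyncio.Event()
        self._approver = None

    async def run(self):
        await self._approved.wait()  # parked here until the signal arrives
        self.status = "approved"
        return f"approved by {self._approver}"

    def signal_approve(self, approver):
        """Signal: pushes data into the running workflow and changes its state."""
        self._approver = approver
        self._approved.set()

    def query_status(self):
        """Query: read-only view of current state."""
        return self.status


async def main():
    wf = ApprovalWorkflow()
    task = asyncio.create_task(wf.run())
    await asyncio.sleep(0)          # let the workflow start and block on the signal
    before = wf.query_status()      # reading does not unblock anything
    wf.signal_approve("dana")
    outcome = await task
    return before, wf.query_status(), outcome


before, after, outcome = asyncio.run(main())
print(before, after, outcome)
```

The split to keep in mind: a signal may mutate workflow state, a query must not.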

Watch-outs

  • Keep non-determinism out of workflow code — the most common beginner bug. Anything non-deterministic goes in an activity.
  • It's an extra piece of infrastructure (server + workers). For genuinely simple background jobs, a Postgres/Redis queue is lighter. Reach for Temporal when durability and orchestration are the actual requirement.
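
The non-determinism bug from the first watch-out can be reproduced in a few lines. This is a toy checker, not how Temporal is implemented: record, replay, NonDeterminismError, fake_now, and the activity names are all hypothetical, and fake_now is a stand-in for datetime.now() that returns a different value on every call.

```python
import itertools


class NonDeterminismError(Exception):
    """Raised when replayed workflow code diverges from recorded history."""


def record(workflow):
    """First execution: record the activity names the workflow schedules."""
    history = []
    workflow(history.append)
    return history


def replay(workflow, history):
    """Replay: each scheduled activity must match the recorded history, in order."""
    position = 0

    def check(name):
        nonlocal position
        expected = history[position] if position < len(history) else None
        if name != expected:
            raise NonDeterminismError(
                f"history has {expected!r}, code scheduled {name!r}")
        position += 1

    workflow(check)


_ticks = itertools.count()


def fake_now():
    """Stand-in for datetime.now(): a different value on every call."""
    return next(_ticks)


def bad_workflow(schedule):
    schedule("fetch_documents")
    # Bug: branching on the clock inside workflow code.
    if fake_now() % 2 == 0:
        schedule("summarize_short")
    else:
        schedule("summarize_long")


def good_workflow(schedule):
    schedule("fetch_documents")
    schedule("pick_summary_style")  # the clock read would live inside this activity
    schedule("summarize")


replay(good_workflow, record(good_workflow))  # same path both times, no error

bad_history = record(bad_workflow)
try:
    replay(bad_workflow, bad_history)
    diverged = False
except NonDeterminismError:
    diverged = True

print(bad_history, diverged)
```

The fix is the one in good_workflow: move the clock read into an activity so its result is recorded once and reused on replay.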