Temporal on Kubernetes: Why I Actually Reach for It (And the Time It Saved Me)

I reach for Temporal whenever a workflow spans more than two API calls or touches more than one data store. Everything else is a script. I learned this after a workflow crashed mid-run and I spent half a day figuring out what state was lost.

Temporal is not the easiest workflow tool. It has a server, a persistence backend, workers, task queues, and a learning curve. But once I understood the model, it became the default for anything I cannot afford to lose mid-flight.

This post explains why I use it and what I think a reasonable Kubernetes deployment looks like. It grew out of a real incident that cost me half a day.

What Temporal Actually Gives You

The core idea is durable execution. Temporal records every event in a workflow’s history. If the worker process dies, a new worker replays that history and resumes exactly where the previous one stopped. This is not retry logic you wrote. It is a property of the platform.
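
To make the replay idea concrete, here is a toy sketch in plain Python. It is not how Temporal is implemented, and the History class, run_step method, and step names are all hypothetical. The only point it illustrates is that a step with a recorded result is read back from history instead of being executed again.

```python
from typing import Any, Callable, Dict, List, Optional


class History:
    """Toy stand-in for a workflow's event history (not Temporal's API)."""

    def __init__(self) -> None:
        self.results: Dict[str, Any] = {}  # step name -> recorded result
        self.executed: List[str] = []      # steps that actually ran, in order

    def run_step(self, name: str, fn: Callable[[], Any]) -> Any:
        # On replay, a step with a recorded result is not executed again.
        if name in self.results:
            return self.results[name]
        result = fn()
        self.results[name] = result
        self.executed.append(name)
        return result


def pipeline(history: History, fail_at: Optional[str] = None) -> str:
    def step(name: str, value: str) -> str:
        def work() -> str:
            if name == fail_at:
                raise RuntimeError(f"worker died during {name}")
            return value
        return history.run_step(name, work)

    content = step("generate", "draft text")
    post_id = step("store", "post-1")
    step("embed", f"embedding of {content!r} for {post_id}")
    return post_id


history = History()
try:
    pipeline(history, fail_at="embed")  # first worker crashes at the embedding step
except RuntimeError:
    pass
first_run = list(history.executed)      # steps completed before the crash

post_id = pipeline(history)             # a "new worker" replays the same history
```

After the second call, history.executed contains each step exactly once: "generate" and "store" were served from the recorded results, and only "embed" actually ran.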

For AI pipelines, this matters because LLM APIs are flaky. A call times out, a pod gets evicted, a deployment rolls out. Without durability, those events leave partial state behind: half-written database rows, sent webhooks, charged LLM calls with nowhere to store the result.

I learned this the hard way. I had a Python script that processed AI-generated blog posts: generate content, store in database, create embedding, post to Slack. The script crashed during the embedding step because the embedding API timed out. I had 5 posts in the database, 3 embeddings created, and 2 Slack notifications sent. The remaining 3 posts were generated but not stored. I spent 4 hours reconciling the state: checking which posts existed, which had embeddings, which were posted to Slack, and manually fixing the gaps.

With Temporal, that script would have been a workflow. The embedding activity would have timed out, Temporal would have retried it, and the workflow would have completed. No manual reconciliation. No lost afternoon.

The Mental Model

Temporal separates code into two things:

Workflows: orchestrate the steps. They must be deterministic because they replay from history.
Activities: do the actual work. They are the only place side effects happen.

This separation feels strict at first, but it is what makes retries and compensation possible. A workflow can retry an activity, or run a compensation activity if something downstream fails. You cannot do that cleanly if business logic is mixed with orchestration.

My blog post processing workflow now looks like this:

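from datetime import timedelta

from temporalio import workflow
from temporalio.common import RetryPolicy

# Assumption: the four activities are defined with @activity.defn in a
# separate activities module (the module name here is illustrative).
with workflow.unsafe.imports_passed_through():
    from activities import (
        create_embedding,
        generate_content,
        post_to_slack,
        store_post,
    )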

@workflow.defn
class BlogPostWorkflow:
    @workflow.run
    async def run(self, post: dict) -> dict:
        # Activity: generate content
        content = await workflow.execute_activity(
            generate_content,
            post['topic'],
            start_to_close_timeout=timedelta(minutes=5),
            retry_policy=RetryPolicy(maximum_attempts=3),
        )

        # Activity: store in database
        post_id = await workflow.execute_activity(
            store_post,
            {'content': content, 'title': post['title']},
            start_to_close_timeout=timedelta(seconds=30),
        )

        # Activity: create embedding
        await workflow.execute_activity(
            create_embedding,
            {'post_id': post_id, 'content': content},
            start_to_close_timeout=timedelta(minutes=2),
            retry_policy=RetryPolicy(maximum_attempts=3),
        )

        # Activity: post to Slack
        await workflow.execute_activity(
            post_to_slack,
            {'post_id': post_id, 'title': post['title']},
            start_to_close_timeout=timedelta(seconds=10),
        )

        return {'post_id': post_id}

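Each step is its own activity. The content and embedding steps cap retries at three attempts. The database and Slack steps set no retry_policy, so they fall back to Temporal's default policy, which keeps retrying with exponential backoff. If the embedding API times out, Temporal retries it. If the database is down, Temporal backs off and retries. If the worker crashes, a new worker picks up the workflow and resumes from the last completed activity.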
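
For a feel of how retries are spaced, here is a small sketch of an exponential backoff schedule. It assumes Temporal's documented default retry policy values (1 second initial interval, backoff coefficient 2.0, maximum interval 100 times the initial interval). The backoff_intervals function is a hypothetical helper for illustration, not part of the SDK.

```python
from typing import List


def backoff_intervals(
    retries: int,
    initial: float = 1.0,      # seconds before the first retry (assumed default)
    coefficient: float = 2.0,  # multiplier applied after each retry (assumed default)
    maximum: float = 100.0,    # cap on any single wait (assumed: 100 x initial)
) -> List[float]:
    """Return the wait in seconds before each of the first `retries` retries."""
    waits: List[float] = []
    wait = initial
    for _ in range(retries):
        waits.append(min(wait, maximum))
        wait *= coefficient
    return waits


waits = backoff_intervals(8)
total = sum(waits)
```

Under these assumptions the eighth wait is capped at 100 seconds instead of 128. With maximum_attempts=3 there are at most two retries, so an activity waits roughly 1 second and then 2 seconds before it fails for good.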

When I Use It

I use Temporal for:

Multi-step AI pipelines with external side effects.
Workflows that wait for humans or external systems.
Anything where losing state would require manual cleanup.
Processes that need to survive pod restarts or API outages.

I do not use it for:

One-off scripts.
Simple webhook-to-notification flows.
Anything where rerunning from the start is good enough.

The blog post incident was the tipping point. Before that, I thought Temporal was overkill for my homelab. After that, I realized that “homelab” does not mean “does not matter.” My time matters.

Kubernetes Deployment: My Baseline

For a production-like setup, I run:

A dedicated temporal namespace.
PostgreSQL for persistence.
Elasticsearch for visibility (optional but strongly recommended).
The Temporal server as a Deployment or via Helm.
Worker pods as separate Deployments per task queue.
Secrets for PostgreSQL credentials, provider API keys, and encryption.

I keep workers separate from the server because they are the part that changes most often. Updating a worker should not require restarting the server. I learned this from my n8n setup, where the worker and server were coupled and a worker update caused downtime.

The Idempotency Rule

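Temporal workflows behave as if they run exactly once, because completed steps are read back from history instead of being executed again. Activities are at-least-once. If a worker crashes after an activity starts but before its result is recorded, Temporal retries that activity on another worker once the activity's timeout expires.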

That means every activity with a side effect must be idempotent. LLM calls should use idempotency keys if the provider supports them. Database writes should be upserts. External notifications should be guarded by deduplication.

This is the part people skip, and it is the part that causes real incidents. I skipped it initially. The blog post workflow had a bug where duplicate Slack notifications were sent after a retry. I fixed it by adding a deduplication key based on the post ID.
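
A minimal sketch of that kind of deduplication, using an in-memory SQLite table as the key store. The table name, the key format, and the post_to_slack_once function are assumptions for illustration, and the Slack call is replaced by a list append so the example runs on its own.

```python
import sqlite3
from typing import List

sent_messages: List[str] = []  # stands in for real Slack API calls


def post_to_slack_once(db: sqlite3.Connection, post_id: str, title: str) -> bool:
    """Send the notification only if this post_id has not been claimed yet."""
    # INSERT OR IGNORE does nothing when the key already exists,
    # so rowcount tells us whether this attempt claimed the key.
    cur = db.execute(
        "INSERT OR IGNORE INTO slack_sent (dedup_key) VALUES (?)",
        (f"slack:{post_id}",),
    )
    db.commit()
    if cur.rowcount == 0:
        return False  # an earlier attempt already claimed this key
    sent_messages.append(f"New post: {title}")
    return True


db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE slack_sent (dedup_key TEXT PRIMARY KEY)")

first = post_to_slack_once(db, "post-1", "Temporal on Kubernetes")
retry = post_to_slack_once(db, "post-1", "Temporal on Kubernetes")  # simulated retry
```

This version claims the key before sending, so a crash between the two steps would drop a notification rather than duplicate it. Which side to err on is a design choice. For a Slack message I would rather miss one than send two.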

The Saga Pattern

For workflows with multiple steps that cannot be rolled back as a single transaction, the pattern I plan around is the saga: each step has a corresponding compensation activity. If step 3 fails, I run the compensations for steps 2 and 1, in that order.

Example for my blog post pipeline:

Step 1: Generate content. Compensation: nothing to undo.
Step 2: Store in database. Compensation: delete the post.
Step 3: Create embedding. Compensation: delete the embedding.
Step 4: Post to Slack. Compensation: delete the Slack message.

This is not free to implement, but it is cheaper than cleaning up partial state manually at 2 AM. I have not needed the saga pattern yet because my activities are idempotent and reversible. But I have the structure ready.
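
The structure can be sketched in plain Python. This is a simplified illustration, not Temporal code: run_saga and the do/undo helpers are hypothetical, and in a real workflow each action and compensation would be an activity call.

```python
from typing import Callable, List, Tuple

Step = Tuple[str, Callable[[], None], Callable[[], None]]
log: List[str] = []


def run_saga(steps: List[Step]) -> bool:
    """Run steps in order; on failure, undo completed steps in reverse order."""
    compensations: List[Callable[[], None]] = []
    for name, action, compensate in steps:
        try:
            action()
        except Exception:
            log.append(f"failed: {name}")
            for compensation in reversed(compensations):
                compensation()
            return False
        # Only steps that completed get their compensation registered.
        compensations.append(compensate)
    return True


def do(name: str) -> Callable[[], None]:
    return lambda: log.append(f"do: {name}")


def undo(name: str) -> Callable[[], None]:
    return lambda: log.append(f"undo: {name}")


def failing_embedding() -> None:
    raise TimeoutError("embedding API timed out")


ok = run_saga([
    ("generate content", do("generate content"), lambda: None),  # nothing to undo
    ("store post", do("store post"), undo("store post")),
    ("create embedding", failing_embedding, undo("create embedding")),
    ("post to Slack", do("post to Slack"), undo("post to Slack")),
])
```

Here the failed step's own compensation does not run, because the step never completed. If a step can fail after partially applying its effect, registering the compensation before running the step is the safer variant, provided the compensation is idempotent.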

What I Watch

The metrics that matter to me:

Workflow success and failure rates. I alert if failure rate > 5%.
Activity retry rates. High retry rates mean flaky APIs or bad timeouts.
Task schedule-to-start latency (how long a task waits in the queue before a worker picks it up). If this grows, workers are overloaded.
Task queue backlog. If this grows, I need more workers.
End-to-end workflow duration. For blog posts, this should be under 10 minutes.

Temporal exposes these through Prometheus and its Web UI. The Web UI is especially useful because it lets you inspect the exact event history of a failed workflow. I used this to debug the duplicate Slack notification bug.
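
The failure-rate alert above is simple arithmetic. In practice it would be an alerting rule on the Prometheus side. The sketch below only shows the threshold logic in Python, and the function names and counts are made up for illustration.

```python
def failure_rate(succeeded: int, failed: int) -> float:
    """Fraction of finished workflows that failed; 0.0 when nothing has finished."""
    total = succeeded + failed
    return failed / total if total else 0.0


def should_alert(succeeded: int, failed: int, threshold: float = 0.05) -> bool:
    # Strictly greater than, matching "failure rate > 5%".
    return failure_rate(succeeded, failed) > threshold


quiet = should_alert(succeeded=95, failed=5)  # exactly 5%: no alert
noisy = should_alert(succeeded=94, failed=6)  # 6%: alert
```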

Common Mistakes I Made

Putting side effects in workflows. Workflows replay. Side effects must be in activities. I made this mistake early and got duplicate database writes.
Ignoring activity idempotency. This is the fastest way to create duplicate charges or writes. I learned this with the Slack notifications.
Running one giant worker deployment. Separate workers by task queue and failure domain. I started with one deployment and had to split it later.
Skipping Elasticsearch. Without it, debugging workflows becomes much harder. I added it after struggling with CLI-based debugging.

Conclusion

Temporal is not a tool for every workflow. It is a tool for workflows where failure is expensive and state matters. On Kubernetes, it adds some operational complexity, but it pays for itself the first time a worker crashes mid-pipeline and you do not have to do anything.

Start with the problem, not the tool. If the problem needs durability, Temporal is usually the right answer. The blog post incident cost me half a day. Temporal would have prevented it. That is why I use it now.