Incident Report: ey-eu-west-1 Workflow Queue Pile-Up

Date: 2026-04-14
Duration: ~05:00 - 07:00 CST (11:00 - 13:00 UTC)
Environment: ey-eu-west-1
Severity: S2 — User-facing workflow delays, incomplete executions
Status: Investigating


Timeline

gantt
    title Incident Timeline (CST)
    dateFormat HH:mm
    axisFormat %H:%M

    section Trigger
    EU business hours begin / workflow burst    :crit, 05:00, 06:00

    section Detection
    Queue depth rising, workflows stalling      :active, 05:15, 06:30
    OOMKill pods cycling (all 12 pods)          :crit, 05:20, 07:00

    section Response
    Jordan alerted via phone call               :milestone, 06:50, 0min
    Manual pod count increase                   :06:50, 07:00
    Queue begins draining                       :07:00, 07:30

    section Recovery
    Queue fully drained                         :07:30, 08:00

Architecture Context

flowchart LR
    subgraph API["API Pods"]
        A[FastAPI Router]
    end

    subgraph Redis
        Q["workflow_jobs\n(Redis List)"]
        H["worker:jobs:{pod}\n(Heartbeat Hash)"]
    end

    subgraph Workers["Worker Pods (12 min, 96 max)"]
        W1["Pod 1\nconcurrency=1\nmem_limit=3Gi"]
        W2["Pod 2\nconcurrency=1\nmem_limit=3Gi"]
        W3["Pod N..."]
    end

    subgraph KEDA["KEDA Autoscaler"]
        K1["Redis list length trigger\nthreshold=1"]
        K2["Prometheus active jobs trigger\nthreshold=1"]
    end

    A -->|LPUSH| Q
    Q -->|BRPOP| W1
    Q -->|BRPOP| W2
    Q -->|BRPOP| W3
    W1 -->|heartbeat refresh| H
    W2 -->|heartbeat refresh| H

    Q --> K1
    K2 -->|"sum(workflow_jobs_active)"| Workers
    KEDA -->|scale| Workers

Root Cause: Three Interacting Failures

flowchart TD
    WF["Workflow submitted\n(memory-hungry)"] --> BRPOP["Pod picks up job\nvia BRPOP"]
    BRPOP --> EXEC["Workflow executing\nmemory climbing"]

    EXEC --> OOM{"Container memory > 3Gi\nlimit?"}
    OOM -->|Yes| SIGKILL["SIGKILL (code 137)\nNo finally block\nNo cleanup"]
    OOM -->|No| LIVE{"Health server\nresponsive?"}

    LIVE -->|No: event loop blocked| LIVEKILL["Liveness probe timeout\nK8s kills pod"]
    LIVE -->|Yes| COMPLETE["Workflow completes\nnormally"]

    SIGKILL --> PARTIAL["Partial results\n3/5 nodes persisted\nStatus stuck 'running'"]
    LIVEKILL --> PARTIAL

    PARTIAL --> HEARTBEAT["Heartbeat expires\n(TTL=120s, no refresh)"]
    HEARTBEAT --> SWEEP["Recovery sweeper\nre-enqueues job"]
    SWEEP --> BRPOP

    style SIGKILL fill:#ff6b6b,color:#fff
    style LIVEKILL fill:#ff6b6b,color:#fff
    style PARTIAL fill:#ffa94d,color:#fff
    style SWEEP fill:#ffa94d,color:#fff

Problem 1: OOM-Kill + Recovery = Poison Pill Loop

A single workflow can consume enough memory to push container RSS past the 3Gi limit. Even with concurrency=1, a memory-hungry workflow exceeds the cgroup limit and gets OOMKilled (SIGKILL, code 137).

The SIGKILL bypasses Python's finally block in the executor, so:

  • persist_node_results() never runs
  • update_workflow_run_status() never runs
  • Workflow status stays "running" or gets incorrectly marked by a later recovery attempt

The recovery sweeper detects the dead heartbeat after 120s and re-enqueues the same job. Another pod picks it up, hits the same memory wall, gets OOMKilled. Infinite loop.
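A minimal sketch of that re-enqueue path, assuming the heartbeat and queue key names shown in the architecture diagram (the real sweeper differs in detail):

# Hypothetical sketch of the recovery sweep, not the actual worker code.
# Assumes claimed jobs are tracked in a Redis set and each claim writes a
# per-job heartbeat key with a 120s TTL, per the architecture diagram above.
import redis.asyncio as redis

r = redis.Redis()

async def sweep_once() -> None:
    for raw_id in await r.smembers("workflow:running"):
        job_id = raw_id.decode()
        # Heartbeat expired => the owning pod died mid-execution
        # (OOMKill or liveness kill). The job is re-enqueued unconditionally...
        if not await r.exists(f"workflow:heartbeat:{job_id}"):
            await r.lpush("workflow_jobs", job_id)
            # ...so a workflow that *causes* the kill loops forever.

Nothing in this path counts attempts, which is the gap the dead-letter queue (recommendation 2, ENG-1162) closes.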

Problem 2: Liveness Probe Kills Healthy Pods

Long-running workflows (even those within memory limits) can saturate the asyncio event loop. The aiohttp health server on port 8080 goes unresponsive. K8s liveness probe times out after 3 consecutive failures (30s period x 3 = 90s window) and kills the pod. Same cascade as OOM: no cleanup, heartbeat dies, sweeper re-enqueues.
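Removing the probe is the short-term mitigation below; a longer-term option is to keep the event loop responsive by pushing CPU-bound steps off it. A hedged sketch (function names are illustrative, not the worker's actual API):

import asyncio

async def run_node(node) -> dict:
    # parse_input_files is assumed to be the blocking, CPU-heavy step
    # (xlsx/pptx/pdf parsing). Running it in a worker thread leaves the
    # event loop free to answer the aiohttp /healthz probe on port 8080.
    parsed = await asyncio.to_thread(parse_input_files, node.inputs)
    return await call_llm(node, parsed)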

Problem 3: Incomplete Workflow Results

Users see 3/5 nodes completed, then nothing. Because the SIGKILL prevents the finally block from running:

  • Nodes that completed before the kill have results (persisted via WebSocket/Redis during execution)
  • Remaining nodes never execute
  • No failure notification reaches the user

Evidence: Pod OOMKill Data (kubectl)

All 12 worker pods show OOMKilled or Exit Code 137 as their last termination reason. 398 total restarts across the fleet in ~4.5 days:

| Pod | Restarts | Last Reason | Last OOMKill Time (UTC) |
|---|---|---|---|
| h4nzf | 62 | OOMKilled | 14:44 |
| 7vvj5 | 56 | OOMKilled | 15:32 |
| lrv5w | 55 | OOMKilled | 13:50 |
| 682fx | 53 | OOMKilled | 14:58 |
| 9bfxc | 48 | Error (137) | 12:40 |
| j7fgp | 43 | OOMKilled | 13:36 |
| 5zj7j | 40 | OOMKilled | 13:43 |
| n4qhv | 29 | OOMKilled | 14:24 |
| dd2d9 | 6 | Error (137) | 15:26 |
| knf8t | 3 | OOMKilled | 13:29 |
| xqm7q | 2 | OOMKilled | 15:26 |
| lzx8s | 1 | OOMKilled | 13:08 |

Evidence: Prometheus Memory Metrics

The worker exposes workflow_worker_rss_bytes (Python process peak RSS via getrusage(RUSAGE_SELF).ru_maxrss) and workflow_job_memory_delta_bytes (RSS change per workflow job).
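For reference, a gauge built on ru_maxrss behaves like the sketch below (assumed to mirror the worker's _get_rss_bytes; the exact implementation may differ). Note that ru_maxrss is a peak value and covers only this process:

import resource
import sys

def _get_rss_bytes() -> int:
    # ru_maxrss is the peak RSS of the current process only: kilobytes on
    # Linux, bytes on macOS. Child processes and page cache are excluded,
    # which is why this gauge under-reports container memory (see below).
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    return peak if sys.platform == "darwin" else peak * 1024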

Python Process RSS (surviving pods, last 6h)

| Pod IP | Peak RSS | Note |
|---|---|---|
| 10.0.11.148 | 1.01 GB | Sustained 1.01GB from 11:27-14:49 UTC |
| 10.0.2.215 | 1.00 GB | |
| 10.0.2.107 | 0.99 GB | |
| 10.0.10.154 | 0.99 GB | |
| 10.0.10.172 | 0.94 GB | |
| 10.0.2.248 | 0.85 GB | |
| 10.0.1.219 | 0.80 GB | |
| 10.0.0.110 | 0.65 GB | |

Per-Job Memory Delta (P99, last 12h)

| Pod IP | P99 Memory Delta | Note |
|---|---|---|
| 10.0.2.107 | 995 MB | Single workflow added ~1GB to process RSS |
| 10.0.2.215 | 990 MB | |
| 10.0.10.154 | 985 MB | |
| 10.0.11.148 | 975 MB | |
| 10.0.2.248 | 950 MB | |
| 10.0.10.172 | 940 MB | |

RSS Timeline (pod 10.0.11.148 — characteristic spike)

07:37 UTC: 0.45 GB  =========
07:53 UTC: 0.00 GB  (pod restarted after OOMKill)
08:53 UTC: 0.35 GB  =======  (baseline after restart)
11:23 UTC: 0.37 GB  =======
11:27 UTC: 1.01 GB  ====================  ← workflow starts, +640MB spike
11:28-14:49: 1.01 GB sustained (memory never freed — Python heap fragmentation)
14:50 UTC: 0.99 GB
14:54 UTC: 0.00 GB  (pod OOMKilled again)

Why Python RSS < 3Gi but pods still OOMKill

workflow_worker_rss_bytes measures only the Python process RSS (getrusage(RUSAGE_SELF)). The container cgroup memory limit (3Gi) includes:

  • Python process RSS (~1GB peak observed)
  • Shared libraries and mmap'd files
  • Page cache from file I/O (document generation, S3 downloads)
  • Kernel overhead and slab cache
  • Memory from subprocess calls (e.g., node.js for document generation)

The ~2GB gap between observed Python RSS (1GB) and the 3Gi OOM threshold is consumed by these non-Python allocations. For the pods that did OOMKill, we can't see their final RSS because the Prometheus scrape (15-30s interval) misses the spike — the pod dies before the next scrape.
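Until cAdvisor scraping is in place (recommendation 3), the breakdown can be read in-process from cgroup v2; a sketch, assuming the standard /sys/fs/cgroup mount:

def cgroup_memory_breakdown() -> dict:
    """Return cgroup v2 memory counters for this container (bytes)."""
    stats = {}
    with open("/sys/fs/cgroup/memory.stat") as f:
        for line in f:
            key, value = line.split()
            stats[key] = int(value)
    with open("/sys/fs/cgroup/memory.current") as f:
        stats["current"] = int(f.read().strip())
    # "anon" covers process heaps (Python + node.js children), "file" is
    # page cache from S3 downloads / document generation, "slab" is kernel.
    return stats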


Evidence: Poison Pill Workflow IDs (from Coralogix logs)

These workflows appeared on OOMKilled pods and were re-enqueued multiple times:

| Workflow | Workflow ID | Times Seen | Session |
|---|---|---|---|
| Nursery AI Agent | 0b84cd38-5c5e-4280-859d-893252a8a3f0 | 5x | 09f8c491 |
| EYP-Valutation-Approach Mapping | 55811da3-0b6f-4548-8462-beb2608349af | 3x | |
| Proposal development Nordics | ff90fc3a-41d6-4ef6-929c-bddc4e305ff4 | 3x | |
| Agentic workflow Carvature Transactions | 85ebc6c8-3386-4f2e-98ef-388a45d76383 | 2x | |
| Competition Workflow (Research agent) | a73e7534-9a26-48ec-ac96-f940cd61c5a6 | 2x | |
| Deal Intelligence | 211339d8-7f27-43f5-b4a5-33ce50cac69a | 2x | |

Nursery AI Agent is the clearest poison pill — 5 appearances across multiple OOMKilled pods in a single session.


Evidence: Workflow Composition Analysis (from DB)

All five workflows belong to org 8370c192 (EY). Database analysis of the latest workflow version payloads reveals what makes them memory-heavy.

Workflow Profiles

| Workflow | Nodes | Input Files | Total Input Size | Output Node | Model |
|---|---|---|---|---|---|
| Nursery AI Agent | 3 | Ofsted report 200 entries.xlsx (6.2MB), PreK List.xlsx (1.3MB) | 7.5 MB | Data Analysis Agent | Opus |
| Proposal dev Nordics | 3 | EY-Parthenon...Proposal.pptx (6.4MB), latest_word_document.pdf (616KB) | 7.1 MB | EY PowerPoint | Opus |
| EYP-Valutation Mapping | 8 | Execution approach.pptx (4MB), Kabanga RFP.pdf (2.2MB), playbook.docx (84KB) | 6.3 MB | Data Analysis + 2 Agents + 2 Human Intervention | Opus + Sonnet |
| Competition Workflow | 11 | Competition template.xlsx (26KB) | 26 KB | 3 Agents + Spreadsheet + 6 tool nodes (Serpapi, Exa, Elasticsearch, Composio) | Opus |
| Deal Intelligence | 10 | Opportunities screening.xlsx (22KB) | 22 KB | Agent + Spreadsheet + 8 tool nodes | Opus |

Run History (Nursery AI Agent — worst offender)

| Run ID | Status | Duration | Node Results | Output Size |
|---|---|---|---|---|
| 6bd8b67f | FAILED | 4h 35m | 2 completed, 1 failed | 730 KB |
| 48ee130b | SUCCESS | 4m 3s | 3 completed | 757 KB |
| 31055bad | FAILED | 10s | 2 completed, 1 failed | 755 KB |
| af57641d | RUNNING | stuck | 0 results | 0 bytes |

The 6bd8b67f run ran for 4.5 hours before failing — a single workflow occupying one pod for that entire duration. The af57641d run is still stuck in RUNNING status (orphaned after OOMKill, never cleaned up).

Memory Consumption Patterns

Four distinct patterns cause high memory:

1. Large file parsing (Nursery, Proposal dev, EYP-Valutation)

  • 6-7MB input files (xlsx, pptx, pdf) are downloaded from S3 and parsed entirely into Python memory
  • xlsx parsing via openpyxl/pandas can inflate a 6MB file to a 50-100MB in-memory representation (see the streaming-parse sketch after this list)
  • The parsed content is then serialized into the LLM prompt, creating another large string allocation

2. Document generation subprocesses (Proposal dev, EYP-Valutation)

  • ey_powerpoint_chat and data_analysis nodes spawn node.js child processes for document generation
  • Child process memory is counted against the container cgroup but invisible to RUSAGE_SELF
  • This explains the 2GB gap between Python RSS (1GB) and OOMKill threshold (3Gi)

3. Multi-tool accumulation (Competition, Deal Intelligence)

  • 6-8 external API tool calls (Serpapi, Exa, Elasticsearch, Composio) each return response data
  • All responses are held in memory as part of the workflow execution context until completion
  • 11 nodes × average response size compounds into significant memory pressure

4. Opus model context windows

  • All five workflows use Claude Opus (except 2 agent nodes in EYP-Valutation using Sonnet)
  • Opus supports larger context → larger request/response payloads held in memory during API calls
  • Streaming responses accumulate in buffers before being written to node results
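Pattern 1 is the most directly addressable. A hedged sketch of the streaming direction ENG-1165 points at, assuming openpyxl is the parser in use:

from openpyxl import load_workbook

def iter_xlsx_rows(path: str):
    # read_only=True streams rows from disk instead of materializing the
    # whole worksheet, so a 6MB file no longer inflates to 50-100MB of
    # Python objects before prompt serialization.
    wb = load_workbook(path, read_only=True)
    try:
        for ws in wb.worksheets:
            for row in ws.iter_rows(values_only=True):
                yield ws.title, row
    finally:
        wb.close()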

Configuration at Time of Incident

| Setting | Value | Notes |
|---|---|---|
| worker_concurrency | 1 | Intentionally low to limit per-pod memory |
| worker_memory_limit | 3Gi | Insufficient for large workflows |
| worker_memory_request | 1.5Gi | |
| worker_cpu_limit | 1.5 | |
| worker_cpu_request | 0.5 | |
| keda_min_replicas | 12 | |
| keda_max_replicas | 96 | |
| keda_cooldown_period | 120s | |
| Liveness probe | enabled | /healthz on port 8080, 30s period, 3 failures |
| Readiness probe | enabled | /healthz on port 8080, 15s period, 3 failures |
| termination_grace_period | 300s | Irrelevant for SIGKILL (OOM) |
| Prometheus retention | 15d | Application-level only (no cAdvisor) |

Capacity Math

graph LR
    subgraph Current["Current (broken)"]
        C1["12 pods x 1 concurrent = 12 max workflows"]
        C2["3Gi limit < actual need = OOMKill"]
    end

    subgraph Proposed["Proposed (short-term)"]
        P1["40 pods x 3 concurrent = 120 max workflows"]
        P2["10Gi limit / 3 concurrent ≈ 3.3Gi per workflow + headroom"]
    end

    Current -->|fix| Proposed

Proposed Short-Term Fix

| Change | Current | Proposed | Rationale |
|---|---|---|---|
| worker_concurrency | 1 | 3 | Balance throughput vs memory |
| worker_memory_limit | 3Gi | 10Gi | ~3.3Gi per workflow slot + headroom |
| worker_memory_request | 1.5Gi | 4Gi | Guarantee scheduling |
| keda_min_replicas | 12 | 40 | Meet EY adoption demand |
| keda_max_replicas | 96 | 100 | Slight increase |
| Liveness probe | enabled | removed | Prevents killing long-running healthy pods |
| Readiness probe | enabled | kept | Still needed for traffic routing |

Capacity after fix:

  • Min throughput: 40 pods x 3 = 120 concurrent workflows
  • Max throughput: 100 pods x 3 = 300 concurrent workflows
  • Memory per workflow slot: 10Gi / 3 = 3.3Gi (with OS/runtime overhead, safe for most workflows)

Gap: No Memory-Aware Scheduling

The worker's concurrency gate is count-based only. The BRPOP loop checks a semaphore (asyncio.Semaphore(WORKER_MAX_CONCURRENCY)) to decide whether to accept more work. It never checks memory.
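Simplified, the accept loop looks like the sketch below (names are assumptions; the real loop also handles job parsing, heartbeats, and errors):

import asyncio

semaphore = asyncio.Semaphore(WORKER_MAX_CONCURRENCY)

async def accept_loop(redis_client) -> None:
    while True:
        await semaphore.acquire()  # the only gate: a free slot count
        job = await redis_client.brpop("workflow_jobs", timeout=5)
        if job is None:
            semaphore.release()  # timed out with nothing queued
            continue
        # Memory is never consulted here -- the gap described above.
        # run_job (assumed) releases the semaphore when the workflow ends.
        asyncio.create_task(run_job(job, semaphore))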

flowchart TD
    BRPOP["BRPOP: job available\non Redis queue"] --> SEM{"Semaphore:\nslots < max\nconcurrency?"}
    SEM -->|"Yes (slots free)"| ACCEPT["Accept job\nStart workflow"]
    SEM -->|"No (all slots busy)"| WAIT["Block until\nslot opens"]

    ACCEPT --> EXEC["Workflow executing...\nmemory growing"]
    EXEC --> CHECK{"Is pod at\n2.99Gi RAM?"}
    CHECK -->|"Nobody checks"| NEXT["BRPOP again\n(if slots free)"]
    NEXT --> SEM
    CHECK -->|"Still nobody checks"| OOM["3rd workflow pushes\npast 3Gi → OOMKill"]

    style CHECK fill:#ff6b6b,color:#fff
    style OOM fill:#ff6b6b,color:#fff

Example with concurrency=5, memory_limit=10Gi:

  1. Pod accepts workflow A → RSS grows to 3Gi
  2. Semaphore says 4 slots free → accepts workflow B → RSS now 5.5Gi
  3. Semaphore says 3 slots free → accepts workflow C → RSS now 8Gi
  4. Semaphore says 2 slots free → accepts workflow D → RSS pushes past 10Gi → OOMKill

The _get_rss_bytes() function already exists in the worker and is called after every job for metrics. It just isn't consulted before accepting work.

Proposed Fix: Memory-Gated BRPOP

Add a memory check before the semaphore acquire in the BRPOP loop:

MEMORY_PRESSURE_THRESHOLD = 0.75  # 75% of cgroup limit

def _get_cgroup_limit() -> float:
    """Read the container memory limit from cgroup v2."""
    try:
        with open("/sys/fs/cgroup/memory.max") as f:
            val = f.read().strip()
            return int(val) if val != "max" else float('inf')
    except FileNotFoundError:
        return float('inf')  # not in a container

def _under_memory_pressure() -> bool:
    """Check if container memory usage is above the threshold."""
    rss = _get_rss_bytes()
    limit = _get_cgroup_limit()
    return rss / limit > MEMORY_PRESSURE_THRESHOLD

# In the BRPOP loop:
while True:
    if _under_memory_pressure():
        logger.warning("Memory pressure: %.1f%% of limit, skipping BRPOP",
                       (_get_rss_bytes() / _get_cgroup_limit()) * 100)
        await asyncio.sleep(5)  # back off, let running workflows finish
        continue
    await semaphore.acquire()
    job = await redis.brpop("workflow_jobs", timeout=5)
    if job is None:
        semaphore.release()  # brpop timed out; release the slot and retry
        continue
    ...
flowchart TD
    BRPOP["BRPOP loop iteration"] --> MEM{"RSS > 75% of\ncgroup limit?"}
    MEM -->|"Yes (memory pressure)"| BACK["Sleep 5s\nSkip this iteration\nLet running jobs finish"]
    MEM -->|"No (headroom available)"| SEM{"Semaphore:\nslots < max?"}

    SEM -->|Yes| ACCEPT["Accept job"]
    SEM -->|No| WAIT["Block until slot opens"]
    BACK --> BRPOP

    ACCEPT --> EXEC["Workflow executes"]
    EXEC --> DONE["Job completes\nRSS may drop via GC"]
    DONE --> BRPOP

    style MEM fill:#51cf66,color:#fff
    style BACK fill:#ffa94d,color:#fff

Key behaviors:

  • Pod stays healthy — it just stops accepting new work when under pressure
  • Running workflows continue unaffected
  • KEDA sees the Redis list growing (unprocessed jobs) and scales up more pods
  • Once running workflows complete and RSS drops, the pod resumes accepting work
  • Works alongside the concurrency semaphore, not replacing it

Limitation: _get_rss_bytes() uses RUSAGE_SELF which only measures the Python process, not child processes or page cache. A more accurate check would read /sys/fs/cgroup/memory.current for true container memory usage:

def _get_container_memory() -> int:
    """Read actual container memory from cgroup v2 (includes children + cache)."""
    try:
        with open("/sys/fs/cgroup/memory.current") as f:
            return int(f.read().strip())
    except FileNotFoundError:
        return _get_rss_bytes()  # fallback to process RSS

Long-Term Recommendations

  1. Memory-gated BRPOP (described above) — Check container memory before accepting new work. Lightweight, no infrastructure changes, addresses the root scheduling gap.
  2. Dead-letter queue for poison pills — Track the re-enqueue count per job_id. After N failures (e.g., 3), move the job to a dead-letter queue instead of re-enqueuing it (see the sketch after this list). Without this, increasing resources just makes poison pills take longer to kill each pod.
  3. Container-level memory monitoring — Add cAdvisor/kubelet metrics scraping to the namespace Prometheus so we can see actual container memory (not just Python RSS). Current blind spot: the ~2GB of non-Python memory is invisible.
  4. Per-workflow memory tracking — Instrument RSS/PSS per workflow to identify outlier workflows before they OOMKill. The existing workflow_job_memory_delta_bytes histogram only captures Python heap changes.
  5. Workflow-level resource hints — Allow workflow definitions to declare expected resource class (small/medium/large) and route to appropriately sized worker pools.
  6. Temporal migration — Replace Redis BRPOP + recovery sweeper with Temporal's built-in workflow orchestration, retry policies, and heartbeat management.
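
For recommendation 2, a minimal sketch of the re-enqueue guard, assuming an attempt counter keyed by job_id (key names are illustrative, not the current schema):

MAX_ATTEMPTS = 3  # after this many re-enqueues, divert to the DLQ

async def requeue_or_deadletter(r, job_id: str) -> None:
    attempts = await r.incr(f"workflow:attempts:{job_id}")
    await r.expire(f"workflow:attempts:{job_id}", 24 * 3600)
    if attempts > MAX_ATTEMPTS:
        # Park the job for inspection and mark the run FAILED instead of
        # feeding the same poison pill to another pod.
        await r.lpush("workflow_jobs:dead", job_id)
        return
    await r.lpush("workflow_jobs", job_id)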

Remediation Plan — Jira Tickets

All application-level fixes tracked under epic ENG-1161. Terraform infra changes (10Gi limit, concurrency=3, remove liveness probe, bump KEDA min replicas) are separate.

| # | Ticket | Summary | Priority | Rationale |
|---|---|---|---|---|
| Epic | ENG-1161 | Workflow Worker Memory Consumption Fixes | High | Parent epic linking all remediation work to this incident |
| 1 | ENG-1162 | Dead-letter queue for poison-pill workflows + load test enhancements | High | Stops infinite OOMKill loop — without this, increasing memory limits just makes poison pills take longer to kill each pod. Also adds mock-LLM load test mode and memory-heavy fixtures to validate all subsequent fixes. |
| 2 | ENG-1163 | Memory-gated BRPOP — check container memory before accepting work | High | Prevents OOM stacking — pod stops accepting new work when container memory exceeds 75% of cgroup limit. KEDA sees queue growing and scales up instead. |
| 3 | ENG-1164 | Intermediate node result eviction in WorkflowExecutor | High | Core memory reduction — currently ALL node outputs stay in memory for the entire workflow duration. An 11-node workflow holds all outputs simultaneously. Evicting after downstream consumption could cut peak memory 50-70%. |
| 4 | ENG-1165 | Streaming file parsing for large input files (xlsx, pptx, pdf) | Medium | Directly addresses the 3 worst offenders with 6-7MB input files. openpyxl/pandas inflates a 6MB xlsx to 50-100MB in memory. |
| 5 | ENG-1166 | Output spilling to Redis for large node outputs | Medium | Moves large payloads out of the Python heap into Redis. Extends the existing SerializingDataStore pattern in data_store.py. |
| 6 | ENG-1167 | Subprocess memory budgets for document generation (node.js) | Medium | Addresses the ~2GB blind spot — node.js child processes are invisible to RUSAGE_SELF but counted against the cgroup. Caps child heap + switches metrics to cgroup-based measurement. |

Sequencing

  • Ship together (short-term): ENG-1162 (DLQ + load test) + ENG-1163 (memory gate) + Terraform infra changes
  • Follow-up (memory reduction): ENG-1164 → ENG-1165 → ENG-1166 → ENG-1167, each validated with the load test from ENG-1162

Open Questions

  • What specific workflow types consume the most non-Python memory? Is it document generation subprocesses (node.js), large S3 file downloads held in page cache, or both?
  • Should we implement the dead-letter queue before increasing concurrency to prevent poison-pill loops? Yes — ENG-1162 ships with infra changes.
  • Is removing the liveness probe safe long-term, or should we make it workflow-aware (e.g., longer timeout, check heartbeat instead of HTTP)?
  • Do ey-ap-southeast-1 and ey-us-east-2 have the same OOMKill pattern? (same config: concurrency=1, 3Gi limit)
  • Can we add RUSAGE_CHILDREN tracking to capture subprocess memory alongside RUSAGE_SELF?