Incident Report: ey-eu-west-1 Workflow Queue Pile-Up

Date: 2026-04-14
Duration: ~05:00 - 07:00 CST (11:00 - 13:00 UTC)
Environment: ey-eu-west-1
Severity: S2 — User-facing workflow delays, incomplete executions
Status: Investigating


Timeline

gantt
    title Incident Timeline (CST)
    dateFormat HH:mm
    axisFormat %H:%M

    section Trigger
    EU business hours begin / workflow burst    :crit, 05:00, 06:00

    section Detection
    Queue depth rising, workflows stalling      :active, 05:15, 06:30
    OOMKill pods cycling (all 12 pods)          :crit, 05:20, 07:00

    section Response
    Jordan alerted via phone call               :milestone, 06:50, 0min
    Manual pod count increase                   :06:50, 07:00
    Queue begins draining                       :07:00, 07:30

    section Recovery
    Queue fully drained                         :07:30, 08:00

Architecture Context

flowchart LR
    subgraph API["API Pods"]
        A[FastAPI Router]
    end

    subgraph Redis
        Q["workflow_jobs\n(Redis List)"]
        H["worker:jobs:{pod}\n(Heartbeat Hash)"]
    end

    subgraph Workers["Worker Pods (12 min, 96 max)"]
        W1["Pod 1\nconcurrency=1\nmem_limit=3Gi"]
        W2["Pod 2\nconcurrency=1\nmem_limit=3Gi"]
        W3["Pod N..."]
    end

    subgraph KEDA["KEDA Autoscaler"]
        K1["Redis list length trigger\nthreshold=1"]
        K2["Prometheus active jobs trigger\nthreshold=1"]
    end

    A -->|LPUSH| Q
    Q -->|BRPOP| W1
    Q -->|BRPOP| W2
    Q -->|BRPOP| W3
    W1 -->|heartbeat refresh| H
    W2 -->|heartbeat refresh| H

    Q --> K1
    K2 -->|"sum(workflow_jobs_active)"| Workers
    KEDA -->|scale| Workers

Root Cause: Three Interacting Failures

flowchart TD
    WF["Workflow submitted\n(memory-hungry)"] --> BRPOP["Pod picks up job\nvia BRPOP"]
    BRPOP --> EXEC["Workflow executing\nmemory climbing"]

    EXEC --> OOM{"Container memory > 3Gi\nlimit?"}
    OOM -->|Yes| SIGKILL["SIGKILL (code 137)\nNo finally block\nNo cleanup"]
    OOM -->|No| LIVE{"Health server\nresponsive?"}

    LIVE -->|No: event loop blocked| LIVEKILL["Liveness probe timeout\nK8s kills pod"]
    LIVE -->|Yes| COMPLETE["Workflow completes\nnormally"]

    SIGKILL --> PARTIAL["Partial results\n3/5 nodes persisted\nStatus stuck 'running'"]
    LIVEKILL --> PARTIAL

    PARTIAL --> HEARTBEAT["Heartbeat expires\n(TTL=120s, no refresh)"]
    HEARTBEAT --> SWEEP["Recovery sweeper\nre-enqueues job"]
    SWEEP --> BRPOP

    style SIGKILL fill:#ff6b6b,color:#fff
    style LIVEKILL fill:#ff6b6b,color:#fff
    style PARTIAL fill:#ffa94d,color:#fff
    style SWEEP fill:#ffa94d,color:#fff

Problem 1: OOM-Kill + Recovery = Poison Pill Loop

A single workflow can consume enough memory to push container RSS past the 3Gi limit. Even with concurrency=1, a memory-hungry workflow exceeds the cgroup limit and gets OOMKilled (SIGKILL, code 137).

The SIGKILL bypasses Python's finally block in the executor, so:

  • persist_node_results() never runs
  • update_workflow_run_status() never runs
  • Workflow status stays "running" or gets incorrectly marked by a later recovery attempt

The recovery sweeper detects the dead heartbeat after 120s and re-enqueues the same job. Another pod picks it up, hits the same memory wall, gets OOMKilled. Infinite loop.
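A minimal sketch of that re-enqueue path, assuming the heartbeat and queue key names shown in the architecture diagram (the real sweeper differs in detail):

# Hypothetical sketch of the recovery sweep, not the actual worker code.
# Assumes claimed jobs are tracked in a Redis set and each claim writes a
# per-job heartbeat key with a 120s TTL, per the architecture diagram above.
import redis.asyncio as redis

r = redis.Redis()

async def sweep_once() -> None:
    for raw_id in await r.smembers("workflow:running"):
        job_id = raw_id.decode()
        # Heartbeat expired => the owning pod died mid-execution
        # (OOMKill or liveness kill). The job is re-enqueued unconditionally...
        if not await r.exists(f"workflow:heartbeat:{job_id}"):
            await r.lpush("workflow_jobs", job_id)
            # ...so a workflow that *causes* the kill loops forever.

Nothing in this path counts attempts, which is the gap the dead-letter queue (recommendation 2, ENG-1162) closes.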

Problem 2: Liveness Probe Kills Healthy Pods

Long-running workflows (even those within memory limits) can saturate the asyncio event loop. The aiohttp health server on port 8080 goes unresponsive. K8s liveness probe times out after 3 consecutive failures (30s period x 3 = 90s window) and kills the pod. Same cascade as OOM: no cleanup, heartbeat dies, sweeper re-enqueues.
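Removing the probe is the short-term mitigation below; a longer-term option is to keep the event loop responsive by pushing CPU-bound steps off it. A hedged sketch (function names are illustrative, not the worker's actual API):

import asyncio

async def run_node(node) -> dict:
    # parse_input_files is assumed to be the blocking, CPU-heavy step
    # (xlsx/pptx/pdf parsing). Running it in a worker thread leaves the
    # event loop free to answer the aiohttp /healthz probe on port 8080.
    parsed = await asyncio.to_thread(parse_input_files, node.inputs)
    return await call_llm(node, parsed)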

Problem 3: Incomplete Workflow Results

Users see 3/5 nodes completed, then nothing. Because the SIGKILL prevents the finally block from running:

  • Nodes that completed before the kill have results (persisted via WebSocket/Redis during execution)
  • Remaining nodes never execute
  • No failure notification reaches the user

Evidence: Pod OOMKill Data (kubectl)

All 12 worker pods show OOMKilled or Exit Code 137 as their last termination reason. 398 total restarts across the fleet in ~4.5 days:

| Pod | Restarts | Last Reason | Last OOMKill Time (UTC) |
|---|---|---|---|
| h4nzf | 62 | OOMKilled | 14:44 |
| 7vvj5 | 56 | OOMKilled | 15:32 |
| lrv5w | 55 | OOMKilled | 13:50 |
| 682fx | 53 | OOMKilled | 14:58 |
| 9bfxc | 48 | Error (137) | 12:40 |
| j7fgp | 43 | OOMKilled | 13:36 |
| 5zj7j | 40 | OOMKilled | 13:43 |
| n4qhv | 29 | OOMKilled | 14:24 |
| dd2d9 | 6 | Error (137) | 15:26 |
| knf8t | 3 | OOMKilled | 13:29 |
| xqm7q | 2 | OOMKilled | 15:26 |
| lzx8s | 1 | OOMKilled | 13:08 |

Evidence: Prometheus Memory Metrics

The worker exposes workflow_worker_rss_bytes (Python process peak RSS via getrusage(RUSAGE_SELF).ru_maxrss) and workflow_job_memory_delta_bytes (RSS change per workflow job).
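For reference, a gauge built on ru_maxrss behaves like the sketch below (assumed to mirror the worker's _get_rss_bytes; the exact implementation may differ). Note that ru_maxrss is a peak value and covers only this process:

import resource
import sys

def _get_rss_bytes() -> int:
    # ru_maxrss is the peak RSS of the current process only: kilobytes on
    # Linux, bytes on macOS. Child processes and page cache are excluded,
    # which is why this gauge under-reports container memory (see below).
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    return peak if sys.platform == "darwin" else peak * 1024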

Python Process RSS (surviving pods, last 6h)

| Pod IP | Peak RSS | Note |
|---|---|---|
| 10.0.11.148 | 1.01 GB | Sustained 1.01GB from 11:27-14:49 UTC |
| 10.0.2.215 | 1.00 GB | |
| 10.0.2.107 | 0.99 GB | |
| 10.0.10.154 | 0.99 GB | |
| 10.0.10.172 | 0.94 GB | |
| 10.0.2.248 | 0.85 GB | |
| 10.0.1.219 | 0.80 GB | |
| 10.0.0.110 | 0.65 GB | |

Per-Job Memory Delta (P99, last 12h)

| Pod IP | P99 Memory Delta | Note |
|---|---|---|
| 10.0.2.107 | 995 MB | Single workflow added ~1GB to process RSS |
| 10.0.2.215 | 990 MB | |
| 10.0.10.154 | 985 MB | |
| 10.0.11.148 | 975 MB | |
| 10.0.2.248 | 950 MB | |
| 10.0.10.172 | 940 MB | |

RSS Timeline (pod 10.0.11.148 — characteristic spike)

07:37 UTC: 0.45 GB  =========
07:53 UTC: 0.00 GB  (pod restarted after OOMKill)
08:53 UTC: 0.35 GB  =======  (baseline after restart)
11:23 UTC: 0.37 GB  =======
11:27 UTC: 1.01 GB  ====================  ← workflow starts, +640MB spike
11:28-14:49: 1.01 GB sustained (memory never freed — Python heap fragmentation)
14:50 UTC: 0.99 GB
14:54 UTC: 0.00 GB  (pod OOMKilled again)

Why Python RSS < 3Gi but pods still OOMKill

workflow_worker_rss_bytes measures only the Python process RSS (getrusage(RUSAGE_SELF)). The container cgroup memory limit (3Gi) includes:

  • Python process RSS (~1GB peak observed)
  • Shared libraries and mmap'd files
  • Page cache from file I/O (document generation, S3 downloads)
  • Kernel overhead and slab cache
  • Memory from subprocess calls (e.g., node.js for document generation)

The ~2GB gap between observed Python RSS (1GB) and the 3Gi OOM threshold is consumed by these non-Python allocations. For the pods that did OOMKill, we can't see their final RSS because the Prometheus scrape (15-30s interval) misses the spike — the pod dies before the next scrape.
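Until cAdvisor scraping is in place (recommendation 3), the breakdown can be read in-process from cgroup v2; a sketch, assuming the standard /sys/fs/cgroup mount:

def cgroup_memory_breakdown() -> dict:
    """Return cgroup v2 memory counters for this container (bytes)."""
    stats = {}
    with open("/sys/fs/cgroup/memory.stat") as f:
        for line in f:
            key, value = line.split()
            stats[key] = int(value)
    with open("/sys/fs/cgroup/memory.current") as f:
        stats["current"] = int(f.read().strip())
    # "anon" covers process heaps (Python + node.js children), "file" is
    # page cache from S3 downloads / document generation, "slab" is kernel.
    return stats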


Evidence: Poison Pill Workflow IDs (from Coralogix logs)

These workflows appeared on OOMKilled pods and were re-enqueued multiple times:

| Workflow | Workflow ID | Times Seen | Session |
|---|---|---|---|
| Nursery AI Agent | 0b84cd38-5c5e-4280-859d-893252a8a3f0 | 5x | 09f8c491 |
| EYP-Valutation-Approach Mapping | 55811da3-0b6f-4548-8462-beb2608349af | 3x | |
| Proposal development Nordics | ff90fc3a-41d6-4ef6-929c-bddc4e305ff4 | 3x | |
| Agentic workflow Carvature Transactions | 85ebc6c8-3386-4f2e-98ef-388a45d76383 | 2x | |
| Competition Workflow (Research agent) | a73e7534-9a26-48ec-ac96-f940cd61c5a6 | 2x | |
| Deal Intelligence | 211339d8-7f27-43f5-b4a5-33ce50cac69a | 2x | |

Nursery AI Agent is the clearest poison pill — 5 appearances across multiple OOMKilled pods in a single session.


Evidence: Workflow Composition Analysis (from DB)

All five workflows belong to org 8370c192 (EY). Database analysis of the latest workflow version payloads reveals what makes them memory-heavy.

Workflow Profiles

| Workflow | Nodes | Input Files | Total Input Size | Output Node | Model |
|---|---|---|---|---|---|
| Nursery AI Agent | 3 | Ofsted report 200 entries.xlsx (6.2MB), PreK List.xlsx (1.3MB) | 7.5 MB | Data Analysis Agent | Opus |
| Proposal dev Nordics | 3 | EY-Parthenon...Proposal.pptx (6.4MB), latest_word_document.pdf (616KB) | 7.1 MB | EY PowerPoint | Opus |
| EYP-Valutation Mapping | 8 | Execution approach.pptx (4MB), Kabanga RFP.pdf (2.2MB), playbook.docx (84KB) | 6.3 MB | Data Analysis + 2 Agents + 2 Human Intervention | Opus + Sonnet |
| Competition Workflow | 11 | Competition template.xlsx (26KB) | 26 KB | 3 Agents + Spreadsheet + 6 tool nodes (Serpapi, Exa, Elasticsearch, Composio) | Opus |
| Deal Intelligence | 10 | Opportunities screening.xlsx (22KB) | 22 KB | Agent + Spreadsheet + 8 tool nodes | Opus |

Run History (Nursery AI Agent — worst offender)

| Run ID | Status | Duration | Node Results | Output Size |
|---|---|---|---|---|
| 6bd8b67f | FAILED | 4h 35m | 2 completed, 1 failed | 730 KB |
| 48ee130b | SUCCESS | 4m 3s | 3 completed | 757 KB |
| 31055bad | FAILED | 10s | 2 completed, 1 failed | 755 KB |
| af57641d | RUNNING | stuck | 0 results | 0 bytes |

The 6bd8b67f run ran for 4.5 hours before failing — a single workflow occupying one pod for that entire duration. The af57641d run is still stuck in RUNNING status (orphaned after OOMKill, never cleaned up).

Memory Consumption Patterns

Four distinct patterns cause high memory:

1. Large file parsing (Nursery, Proposal dev, EYP-Valutation)

  • 6-7MB input files (xlsx, pptx, pdf) are downloaded from S3 and parsed entirely into Python memory
  • xlsx parsing via openpyxl/pandas can inflate a 6MB file to a 50-100MB in-memory representation (see the streaming-parse sketch after this list)
  • The parsed content is then serialized into the LLM prompt, creating another large string allocation

2. Document generation subprocesses (Proposal dev, EYP-Valutation)

  • ey_powerpoint_chat and data_analysis nodes spawn node.js child processes for document generation
  • Child process memory is counted against the container cgroup but invisible to RUSAGE_SELF
  • This explains the 2GB gap between Python RSS (1GB) and OOMKill threshold (3Gi)

3. Multi-tool accumulation (Competition, Deal Intelligence)

  • 6-8 external API tool calls (Serpapi, Exa, Elasticsearch, Composio) each return response data
  • All responses are held in memory as part of the workflow execution context until completion
  • 11 nodes × average response size compounds into significant memory pressure

4. Opus model context windows

  • All five workflows use Claude Opus (except 2 agent nodes in EYP-Valutation using Sonnet)
  • Opus supports larger context → larger request/response payloads held in memory during API calls
  • Streaming responses accumulate in buffers before being written to node results
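Pattern 1 is the most directly addressable. A hedged sketch of the streaming direction ENG-1165 points at, assuming openpyxl is the parser in use:

from openpyxl import load_workbook

def iter_xlsx_rows(path: str):
    # read_only=True streams rows from disk instead of materializing the
    # whole worksheet, so a 6MB file no longer inflates to 50-100MB of
    # Python objects before prompt serialization.
    wb = load_workbook(path, read_only=True)
    try:
        for ws in wb.worksheets:
            for row in ws.iter_rows(values_only=True):
                yield ws.title, row
    finally:
        wb.close()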

Configuration at Time of Incident

| Setting | Value | Notes |
|---|---|---|
| worker_concurrency | 1 | Intentionally low to limit per-pod memory |
| worker_memory_limit | 3Gi | Insufficient for large workflows |
| worker_memory_request | 1.5Gi | |
| worker_cpu_limit | 1.5 | |
| worker_cpu_request | 0.5 | |
| keda_min_replicas | 12 | |
| keda_max_replicas | 96 | |
| keda_cooldown_period | 120s | |
| Liveness probe | enabled | /healthz on port 8080, 30s period, 3 failures |
| Readiness probe | enabled | /healthz on port 8080, 15s period, 3 failures |
| termination_grace_period | 300s | Irrelevant for SIGKILL (OOM) |
| Prometheus retention | 15d | Application-level only (no cAdvisor) |

Capacity Math

graph LR
    subgraph Current["Current (broken)"]
        C1["12 pods x 1 concurrent = 12 max workflows"]
        C2["3Gi limit < actual need = OOMKill"]
    end

    subgraph Proposed["Proposed (short-term)"]
        P1["40 pods x 3 concurrent = 120 max workflows"]
        P2["10Gi limit / 3 concurrent ≈ 3.3Gi per workflow + headroom"]
    end

    Current -->|fix| Proposed

Proposed Short-Term Fix

| Change | Current | Proposed | Rationale |
|---|---|---|---|
| worker_concurrency | 1 | 3 | Balance throughput vs memory |
| worker_memory_limit | 3Gi | 10Gi | ~3.3Gi per workflow slot + headroom |
| worker_memory_request | 1.5Gi | 4Gi | Guarantee scheduling |
| keda_min_replicas | 12 | 40 | Meet EY adoption demand |
| keda_max_replicas | 96 | 100 | Slight increase |
| Liveness probe | enabled | removed | Prevents killing long-running healthy pods |
| Readiness probe | enabled | kept | Still needed for traffic routing |

Capacity after fix:

  • Min throughput: 40 pods x 3 = 120 concurrent workflows
  • Max throughput: 100 pods x 3 = 300 concurrent workflows
  • Memory per workflow slot: 10Gi / 3 = 3.3Gi (with OS/runtime overhead, safe for most workflows)

Gap: No Memory-Aware Scheduling

The worker's concurrency gate is count-based only. The BRPOP loop checks a semaphore (asyncio.Semaphore(WORKER_MAX_CONCURRENCY)) to decide whether to accept more work. It never checks memory.
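Simplified, the accept loop looks like the sketch below (names are assumptions; the real loop also handles job parsing, heartbeats, and errors):

import asyncio

semaphore = asyncio.Semaphore(WORKER_MAX_CONCURRENCY)

async def accept_loop(redis_client) -> None:
    while True:
        await semaphore.acquire()  # the only gate: a free slot count
        job = await redis_client.brpop("workflow_jobs", timeout=5)
        if job is None:
            semaphore.release()  # timed out with nothing queued
            continue
        # Memory is never consulted here -- the gap described above.
        # run_job (assumed) releases the semaphore when the workflow ends.
        asyncio.create_task(run_job(job, semaphore))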

flowchart TD
    BRPOP["BRPOP: job available\non Redis queue"] --> SEM{"Semaphore:\nslots < max\nconcurrency?"}
    SEM -->|"Yes (slots free)"| ACCEPT["Accept job\nStart workflow"]
    SEM -->|"No (all slots busy)"| WAIT["Block until\nslot opens"]

    ACCEPT --> EXEC["Workflow executing...\nmemory growing"]
    EXEC --> CHECK{"Is pod at\n2.99Gi RAM?"}
    CHECK -->|"Nobody checks"| NEXT["BRPOP again\n(if slots free)"]
    NEXT --> SEM
    CHECK -->|"Still nobody checks"| OOM["3rd workflow pushes\npast 3Gi → OOMKill"]

    style CHECK fill:#ff6b6b,color:#fff
    style OOM fill:#ff6b6b,color:#fff

Example with concurrency=5, memory_limit=10Gi:

  1. Pod accepts workflow A → RSS grows to 3Gi
  2. Semaphore says 4 slots free → accepts workflow B → RSS now 5.5Gi
  3. Semaphore says 3 slots free → accepts workflow C → RSS now 8Gi
  4. Semaphore says 2 slots free → accepts workflow D → RSS pushes past 10Gi → OOMKill

The _get_rss_bytes() function already exists in the worker and is called after every job for metrics. It just isn't consulted before accepting work.

Proposed Fix: Memory-Gated BRPOP

Add a memory check before the semaphore acquire in the BRPOP loop:

MEMORY_PRESSURE_THRESHOLD = 0.75  # 75% of cgroup limit

def _get_cgroup_limit() -> float:
    """Read the container memory limit from cgroup v2."""
    try:
        with open("/sys/fs/cgroup/memory.max") as f:
            val = f.read().strip()
            return int(val) if val != "max" else float('inf')
    except FileNotFoundError:
        return float('inf')  # not in a container

def _under_memory_pressure() -> bool:
    """Check if container memory usage is above the threshold."""
    rss = _get_rss_bytes()
    limit = _get_cgroup_limit()
    return rss / limit > MEMORY_PRESSURE_THRESHOLD

# In the BRPOP loop:
while True:
    if _under_memory_pressure():
        logger.warning("Memory pressure: %.1f%% of limit, skipping BRPOP",
                       (_get_rss_bytes() / _get_cgroup_limit()) * 100)
        await asyncio.sleep(5)  # back off, let running workflows finish
        continue
    await semaphore.acquire()
    job = await redis.brpop("workflow_jobs", timeout=5)
    if job is None:
        semaphore.release()  # brpop timed out; release the slot and retry
        continue
    ...
flowchart TD
    BRPOP["BRPOP loop iteration"] --> MEM{"RSS > 75% of\ncgroup limit?"}
    MEM -->|"Yes (memory pressure)"| BACK["Sleep 5s\nSkip this iteration\nLet running jobs finish"]
    MEM -->|"No (headroom available)"| SEM{"Semaphore:\nslots < max?"}

    SEM -->|Yes| ACCEPT["Accept job"]
    SEM -->|No| WAIT["Block until slot opens"]
    BACK --> BRPOP

    ACCEPT --> EXEC["Workflow executes"]
    EXEC --> DONE["Job completes\nRSS may drop via GC"]
    DONE --> BRPOP

    style MEM fill:#51cf66,color:#fff
    style BACK fill:#ffa94d,color:#fff

Key behaviors:

  • Pod stays healthy — it just stops accepting new work when under pressure
  • Running workflows continue unaffected
  • KEDA sees the Redis list growing (unprocessed jobs) and scales up more pods
  • Once running workflows complete and RSS drops, the pod resumes accepting work
  • Works alongside the concurrency semaphore, not replacing it

Limitation: _get_rss_bytes() uses RUSAGE_SELF which only measures the Python process, not child processes or page cache. A more accurate check would read /sys/fs/cgroup/memory.current for true container memory usage:

def _get_container_memory() -> int:
    """Read actual container memory from cgroup v2 (includes children + cache)."""
    try:
        with open("/sys/fs/cgroup/memory.current") as f:
            return int(f.read().strip())
    except FileNotFoundError:
        return _get_rss_bytes()  # fallback to process RSS

Long-Term Recommendations

  1. Memory-gated BRPOP (described above) — Check container memory before accepting new work. Lightweight, no infrastructure changes, addresses the root scheduling gap.
  2. Dead-letter queue for poison pills — Track the re-enqueue count per job_id. After N failures (e.g., 3), move the job to a dead-letter queue instead of re-enqueuing it (see the sketch after this list). Without this, increasing resources just makes poison pills take longer to kill each pod.
  3. Container-level memory monitoring — Add cAdvisor/kubelet metrics scraping to the namespace Prometheus so we can see actual container memory (not just Python RSS). Current blind spot: the ~2GB of non-Python memory is invisible.
  4. Per-workflow memory tracking — Instrument RSS/PSS per workflow to identify outlier workflows before they OOMKill. The existing workflow_job_memory_delta_bytes histogram only captures Python heap changes.
  5. Workflow-level resource hints — Allow workflow definitions to declare expected resource class (small/medium/large) and route to appropriately sized worker pools.
  6. Temporal migration — Replace Redis BRPOP + recovery sweeper with Temporal's built-in workflow orchestration, retry policies, and heartbeat management.
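
For recommendation 2, a minimal sketch of the re-enqueue guard, assuming an attempt counter keyed by job_id (key names are illustrative, not the current schema):

MAX_ATTEMPTS = 3  # after this many re-enqueues, divert to the DLQ

async def requeue_or_deadletter(r, job_id: str) -> None:
    attempts = await r.incr(f"workflow:attempts:{job_id}")
    await r.expire(f"workflow:attempts:{job_id}", 24 * 3600)
    if attempts > MAX_ATTEMPTS:
        # Park the job for inspection and mark the run FAILED instead of
        # feeding the same poison pill to another pod.
        await r.lpush("workflow_jobs:dead", job_id)
        return
    await r.lpush("workflow_jobs", job_id)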

Remediation Plan — Jira Tickets

All application-level fixes tracked under epic ENG-1161. Terraform infra changes (10Gi limit, concurrency=3, remove liveness probe, bump KEDA min replicas) are separate.

| # | Ticket | Summary | Priority | Rationale |
|---|---|---|---|---|
| Epic | ENG-1161 | Workflow Worker Memory Consumption Fixes | High | Parent epic linking all remediation work to this incident |
| 1 | ENG-1162 | Dead-letter queue for poison-pill workflows + load test enhancements | High | Stops infinite OOMKill loop — without this, increasing memory limits just makes poison pills take longer to kill each pod. Also adds mock-LLM load test mode and memory-heavy fixtures to validate all subsequent fixes. |
| 2 | ENG-1163 | Memory-gated BRPOP — check container memory before accepting work | High | Prevents OOM stacking — pod stops accepting new work when container memory exceeds 75% of cgroup limit. KEDA sees queue growing and scales up instead. |
| 3 | ENG-1164 | Intermediate node result eviction in WorkflowExecutor | High | Core memory reduction — currently ALL node outputs stay in memory for the entire workflow duration. An 11-node workflow holds all outputs simultaneously. Evicting after downstream consumption could cut peak memory 50-70%. |
| 4 | ENG-1165 | Streaming file parsing for large input files (xlsx, pptx, pdf) | Medium | Directly addresses the 3 worst offenders with 6-7MB input files. openpyxl/pandas inflates a 6MB xlsx to 50-100MB in memory. |
| 5 | ENG-1166 | Output spilling to Redis for large node outputs | Medium | Moves large payloads out of the Python heap into Redis. Extends the existing SerializingDataStore pattern in data_store.py. |
| 6 | ENG-1167 | Subprocess memory budgets for document generation (node.js) | Medium | Addresses the ~2GB blind spot — node.js child processes are invisible to RUSAGE_SELF but counted against the cgroup. Caps child heap + switches metrics to cgroup-based measurement. |

Sequencing

  • Ship together (short-term): ENG-1162 (DLQ + load test) + ENG-1163 (memory gate) + Terraform infra changes
  • Follow-up (memory reduction): ENG-1164 → ENG-1165 → ENG-1166 → ENG-1167, each validated with the load test from ENG-1162

Open Questions

  • What specific workflow types consume the most non-Python memory? Is it document generation subprocesses (node.js), large S3 file downloads held in page cache, or both?
  • Should we implement the dead-letter queue before increasing concurrency to prevent poison-pill loops? Yes — ENG-1162 ships with infra changes.
  • Is removing the liveness probe safe long-term, or should we make it workflow-aware (e.g., longer timeout, check heartbeat instead of HTTP)?
  • Do ey-ap-southeast-1 and ey-us-east-2 have the same OOMKill pattern? (same config: concurrency=1, 3Gi limit)
  • Can we add RUSAGE_CHILDREN tracking to capture subprocess memory alongside RUSAGE_SELF?