- Date: 2026-04-14
- Duration: ~05:00 - 07:00 CST (11:00 - 13:00 UTC)
- Environment: ey-eu-west-1
- Severity: S2 — User-facing workflow delays, incomplete executions
- Status: Investigating
```mermaid
gantt
title Incident Timeline (CST)
dateFormat HH:mm
axisFormat %H:%M
section Trigger
EU business hours begin / workflow burst :crit, 05:00, 06:00
section Detection
Queue depth rising, workflows stalling :active, 05:15, 06:30
OOMKill pods cycling (all 12 pods) :crit, 05:20, 07:00
section Response
Jordan alerted via phone call :milestone, 06:50, 0min
Manual pod count increase :06:50, 07:00
Queue begins draining :07:00, 07:30
section Recovery
Queue fully drained :07:30, 08:00
```
```mermaid
flowchart LR
subgraph API["API Pods"]
A[FastAPI Router]
end
subgraph Redis
Q["workflow_jobs\n(Redis List)"]
H["worker:jobs:{pod}\n(Heartbeat Hash)"]
end
subgraph Workers["Worker Pods (12 min, 96 max)"]
W1["Pod 1\nconcurrency=1\nmem_limit=3Gi"]
W2["Pod 2\nconcurrency=1\nmem_limit=3Gi"]
W3["Pod N..."]
end
subgraph KEDA["KEDA Autoscaler"]
K1["Redis list length trigger\nthreshold=1"]
K2["Prometheus active jobs trigger\nthreshold=1"]
end
A -->|LPUSH| Q
Q -->|BRPOP| W1
Q -->|BRPOP| W2
Q -->|BRPOP| W3
W1 -->|heartbeat refresh| H
W2 -->|heartbeat refresh| H
Q --> K1
K2 -->|"sum(workflow_jobs_active)"| Workers
KEDA -->|scale| Workers
```
```mermaid
flowchart TD
WF["Workflow submitted\n(memory-hungry)"] --> BRPOP["Pod picks up job\nvia BRPOP"]
BRPOP --> EXEC["Workflow executing\nmemory climbing"]
EXEC --> OOM{"Container memory > 3Gi\nlimit?"}
OOM -->|Yes| SIGKILL["SIGKILL (code 137)\nNo finally block\nNo cleanup"]
OOM -->|No| LIVE{"Health server\nresponsive?"}
LIVE -->|No: event loop blocked| LIVEKILL["Liveness probe timeout\nK8s kills pod"]
LIVE -->|Yes| COMPLETE["Workflow completes\nnormally"]
SIGKILL --> PARTIAL["Partial results\n3/5 nodes persisted\nStatus stuck 'running'"]
LIVEKILL --> PARTIAL
PARTIAL --> HEARTBEAT["Heartbeat expires\n(TTL=120s, no refresh)"]
HEARTBEAT --> SWEEP["Recovery sweeper\nre-enqueues job"]
SWEEP --> BRPOP
style SIGKILL fill:#ff6b6b,color:#fff
style LIVEKILL fill:#ff6b6b,color:#fff
style PARTIAL fill:#ffa94d,color:#fff
style SWEEP fill:#ffa94d,color:#fff
```
A single workflow can consume enough memory to push container RSS past the 3Gi limit. Even with concurrency=1, a memory-hungry workflow exceeds the cgroup limit and gets OOMKilled (SIGKILL, code 137).
The SIGKILL bypasses Python's finally block in the executor, so:
- `persist_node_results()` never runs
- `update_workflow_run_status()` never runs
- Workflow status stays "running" or gets incorrectly marked by a later recovery attempt
The recovery sweeper detects the dead heartbeat after 120s and re-enqueues the same job. Another pod picks it up, hits the same memory wall, gets OOMKilled. Infinite loop.
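For illustration, a minimal sketch of this failure shape (not the actual executor; `run_workflow` and the signatures below are stand-ins for the real pieces named above):

```python
import asyncio

# Stand-ins for the executor pieces named above; real signatures and bodies differ.
async def run_workflow(job: dict) -> dict:
    return {}

async def persist_node_results(results: dict) -> None:
    pass

async def update_workflow_run_status(run_id: str, status: str) -> None:
    pass

async def _execute_job(job: dict) -> None:
    status, results = "failed", {}
    try:
        results = await run_workflow(job)  # RSS climbs here; OOMKill = SIGKILL
        status = "succeeded"
    finally:
        # Reached on ordinary exceptions and SIGTERM-driven cancellation, but a
        # SIGKILL (exit code 137) ends the interpreter immediately: neither call
        # below runs, the run stays "running", and the heartbeat simply expires
        # after its 120s TTL for the sweeper to find.
        await persist_node_results(results)
        await update_workflow_run_status(job["run_id"], status)

if __name__ == "__main__":
    asyncio.run(_execute_job({"run_id": "demo"}))
```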
Long-running workflows (even those within memory limits) can saturate the asyncio event loop. The aiohttp health server on port 8080 goes unresponsive. K8s liveness probe times out after 3 consecutive failures (30s period x 3 = 90s window) and kills the pod. Same cascade as OOM: no cleanup, heartbeat dies, sweeper re-enqueues.
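A minimal repro sketch of that probe failure (not the worker's actual code; the synchronous sleep stands in for a CPU-bound node step, such as parsing a large file entirely in memory):

```python
import asyncio
import time

from aiohttp import web


async def healthz(request: web.Request) -> web.Response:
    return web.Response(text="ok")


async def main() -> None:
    app = web.Application()
    app.router.add_get("/healthz", healthz)
    runner = web.AppRunner(app)
    await runner.setup()
    await web.TCPSite(runner, port=8080).start()

    # Stand-in for a workflow step that never awaits: it holds the event loop,
    # so GET /healthz hangs for the full two minutes. Three missed probes
    # (30s period x 3 = 90s window) and the kubelet restarts an otherwise
    # healthy pod.
    time.sleep(120)


if __name__ == "__main__":
    asyncio.run(main())
```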
Users see 3/5 nodes completed, then nothing. Because SIGKILL prevents the finally block from running:
- Nodes that completed before the kill have results (persisted via WebSocket/Redis during execution)
- Remaining nodes never execute
- No failure notification reaches the user
All 12 worker pods show OOMKilled or Exit Code 137 as their last termination reason. 398 total restarts across the fleet in ~4.5 days:
| Pod | Restarts | Last Reason | Last OOMKill Time (UTC) |
|---|---|---|---|
| `h4nzf` | 62 | OOMKilled | 14:44 |
| `7vvj5` | 56 | OOMKilled | 15:32 |
| `lrv5w` | 55 | OOMKilled | 13:50 |
| `682fx` | 53 | OOMKilled | 14:58 |
| `9bfxc` | 48 | Error (137) | 12:40 |
| `j7fgp` | 43 | OOMKilled | 13:36 |
| `5zj7j` | 40 | OOMKilled | 13:43 |
| `n4qhv` | 29 | OOMKilled | 14:24 |
| `dd2d9` | 6 | Error (137) | 15:26 |
| `knf8t` | 3 | OOMKilled | 13:29 |
| `xqm7q` | 2 | OOMKilled | 15:26 |
| `lzx8s` | 1 | OOMKilled | 13:08 |
The worker exposes `workflow_worker_rss_bytes` (Python process peak RSS via `getrusage(RUSAGE_SELF).ru_maxrss`) and `workflow_job_memory_delta_bytes` (RSS change per workflow job).
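As a reference sketch of where such a reading comes from (this is not the worker's actual helper; note that `ru_maxrss` is kilobytes on Linux but bytes on macOS):

```python
import resource
import sys


def peak_rss_bytes() -> int:
    """Peak RSS of the current Python process, normalised to bytes."""
    ru_maxrss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    return ru_maxrss if sys.platform == "darwin" else ru_maxrss * 1024
```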
| Pod IP | Peak RSS | Note |
|---|---|---|
| 10.0.11.148 | 1.01 GB | Sustained 1.01GB from 11:27-14:49 UTC |
| 10.0.2.215 | 1.00 GB | |
| 10.0.2.107 | 0.99 GB | |
| 10.0.10.154 | 0.99 GB | |
| 10.0.10.172 | 0.94 GB | |
| 10.0.2.248 | 0.85 GB | |
| 10.0.1.219 | 0.80 GB | |
| 10.0.0.110 | 0.65 GB |
| Pod IP | P99 Memory Delta | Note |
|---|---|---|
| 10.0.2.107 | 995 MB | Single workflow added ~1GB to process RSS |
| 10.0.2.215 | 990 MB | |
| 10.0.10.154 | 985 MB | |
| 10.0.11.148 | 975 MB | |
| 10.0.2.248 | 950 MB | |
| 10.0.10.172 | 940 MB |
RSS timeline for pod 10.0.11.148 (the sustained-1.01GB pod above):
```
07:37 UTC: 0.45 GB =========
07:53 UTC: 0.00 GB (pod restarted after OOMKill)
08:53 UTC: 0.35 GB ======= (baseline after restart)
11:23 UTC: 0.37 GB =======
11:27 UTC: 1.01 GB ==================== ← workflow starts, +640MB spike
11:28-14:49: 1.01 GB sustained (memory never freed — Python heap fragmentation)
14:50 UTC: 0.99 GB
14:54 UTC: 0.00 GB (pod OOMKilled again)
```
`workflow_worker_rss_bytes` measures only the Python process RSS (`getrusage(RUSAGE_SELF)`). The container cgroup memory limit (3Gi) includes:
- Python process RSS (~1GB peak observed)
- Shared libraries and mmap'd files
- Page cache from file I/O (document generation, S3 downloads)
- Kernel overhead and slab cache
- Memory from subprocess calls (e.g., node.js for document generation)
The ~2GB gap between observed Python RSS (1GB) and the 3Gi OOM threshold is consumed by these non-Python allocations. For the pods that did OOMKill, we can't see their final RSS because the Prometheus scrape (15-30s interval) misses the spike — the pod dies before the next scrape.
These workflows appeared on OOMKilled pods and were re-enqueued multiple times:
| Workflow | Workflow ID | Times Seen | Session |
|---|---|---|---|
| Nursery AI Agent | `0b84cd38-5c5e-4280-859d-893252a8a3f0` | 5x | 09f8c491 |
| EYP-Valutation-Approach Mapping | `55811da3-0b6f-4548-8462-beb2608349af` | 3x | |
| Proposal development Nordics | `ff90fc3a-41d6-4ef6-929c-bddc4e305ff4` | 3x | |
| Agentic workflow Carvature Transactions | `85ebc6c8-3386-4f2e-98ef-388a45d76383` | 2x | |
| Competition Workflow (Research agent) | `a73e7534-9a26-48ec-ac96-f940cd61c5a6` | 2x | |
| Deal Intelligence | `211339d8-7f27-43f5-b4a5-33ce50cac69a` | 2x | |
Nursery AI Agent is the clearest poison pill — 5 appearances across multiple OOMKilled pods in a single session.
All five workflows belong to org 8370c192 (EY). Database analysis of the latest workflow version payloads reveals what makes them memory-heavy.
| Workflow | Nodes | Input Files | Total Input Size | Output Node | Model |
|---|---|---|---|---|---|
| Nursery AI Agent | 3 | `Ofsted report 200 entries.xlsx` (6.2MB), `PreK List.xlsx` (1.3MB) | 7.5 MB | Data Analysis Agent | Opus |
| Proposal dev Nordics | 3 | `EY-Parthenon...Proposal.pptx` (6.4MB), `latest_word_document.pdf` (616KB) | 7.1 MB | EY PowerPoint | Opus |
| EYP-Valutation Mapping | 8 | `Execution approach.pptx` (4MB), `Kabanga RFP.pdf` (2.2MB), `playbook.docx` (84KB) | 6.3 MB | Data Analysis + 2 Agents + 2 Human Intervention | Opus + Sonnet |
| Competition Workflow | 11 | `Competition template.xlsx` (26KB) | 26 KB | 3 Agents + Spreadsheet + 6 tool nodes (Serpapi, Exa, Elasticsearch, Composio) | Opus |
| Deal Intelligence | 10 | `Opportunities screening.xlsx` (22KB) | 22 KB | Agent + Spreadsheet + 8 tool nodes | Opus |
| Run ID | Status | Duration | Node Results | Output Size |
|---|---|---|---|---|
| `6bd8b67f` | FAILED | 4h 35m | 2 completed, 1 failed | 730 KB |
| `48ee130b` | SUCCESS | 4m 3s | 3 completed | 757 KB |
| `31055bad` | FAILED | 10s | 2 completed, 1 failed | 755 KB |
| `af57641d` | RUNNING | stuck | 0 results | 0 bytes |
The 6bd8b67f run ran for 4.5 hours before failing — a single workflow occupying one pod for that entire duration. The af57641d run is still stuck in RUNNING status (orphaned after OOMKill, never cleaned up).
Four distinct patterns cause high memory:
1. Large file parsing (Nursery, Proposal dev, EYP-Valutation)
- 6-7MB input files (xlsx, pptx, pdf) are downloaded from S3 and parsed entirely into Python memory
- xlsx parsing via openpyxl/pandas can inflate a 6MB file to 50-100MB in-memory representation
- The parsed content is then serialized into the LLM prompt, creating another large string allocation
2. Document generation subprocesses (Proposal dev, EYP-Valutation)
- `ey_powerpoint_chat` and `data_analysis` nodes spawn node.js child processes for document generation
- Child process memory is counted against the container cgroup but invisible to `RUSAGE_SELF`
- This explains the 2GB gap between Python RSS (1GB) and OOMKill threshold (3Gi)
3. Multi-tool accumulation (Competition, Deal Intelligence)
- 6-8 external API tool calls (Serpapi, Exa, Elasticsearch, Composio) each return response data
- All responses are held in memory as part of the workflow execution context until completion
- 11 nodes × average response size compounds into significant memory pressure
4. Opus model context windows
- All five workflows use Claude Opus (except 2 agent nodes in EYP-Valutation using Sonnet)
- Opus supports larger context → larger request/response payloads held in memory during API calls
- Streaming responses accumulate in buffers before being written to node results
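Pattern 1 is the most directly fixable in code and is what ENG-1165 later targets. A hedged sketch of that direction (not the worker's current parser): openpyxl's read-only mode streams worksheet rows instead of building the full in-memory object tree.

```python
from typing import Any, Iterator

from openpyxl import load_workbook


def iter_xlsx_rows(path: str, batch_size: int = 500) -> Iterator[list[tuple[Any, ...]]]:
    """Yield rows from an xlsx in small batches rather than loading the whole sheet."""
    wb = load_workbook(path, read_only=True)  # streams XML instead of materialising every cell
    try:
        ws = wb.active
        batch: list[tuple[Any, ...]] = []
        for row in ws.iter_rows(values_only=True):
            batch.append(row)
            if len(batch) >= batch_size:
                yield batch
                batch = []
        if batch:
            yield batch
    finally:
        wb.close()  # read-only workbooks hold the file handle open until closed
```

Each batch can be summarised or serialised into the prompt and then dropped, so peak memory tracks the batch size rather than the whole workbook.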
| Setting | Value | Notes |
|---|---|---|
| `worker_concurrency` | 1 | Intentionally low to limit per-pod memory |
| `worker_memory_limit` | 3Gi | Insufficient for large workflows |
| `worker_memory_request` | 1.5Gi | |
| `worker_cpu_limit` | 1.5 | |
| `worker_cpu_request` | 0.5 | |
| `keda_min_replicas` | 12 | |
| `keda_max_replicas` | 96 | |
| `keda_cooldown_period` | 120s | |
| Liveness probe | enabled | /healthz on port 8080, 30s period, 3 failures |
| Readiness probe | enabled | /healthz on port 8080, 15s period, 3 failures |
| `termination_grace_period` | 300s | Irrelevant for SIGKILL (OOM) |
| Prometheus retention | 15d | Application-level only (no cAdvisor) |
```mermaid
graph LR
subgraph Current["Current (broken)"]
C1["12 pods x 1 concurrent = 12 max workflows"]
C2["3Gi limit < actual need = OOMKill"]
end
subgraph Proposed["Proposed (short-term)"]
P1["40 pods x 3 concurrent = 120 max workflows"]
P2["10Gi limit / 3 concurrent ≈ 3.3Gi per workflow + headroom"]
end
Current -->|fix| Proposed
```
| Change | Current | Proposed | Rationale |
|---|---|---|---|
| `worker_concurrency` | 1 | 3 | Balance throughput vs memory |
| `worker_memory_limit` | 3Gi | 10Gi | ~3.3Gi per workflow slot + headroom |
| `worker_memory_request` | 1.5Gi | 4Gi | Guarantee scheduling |
| `keda_min_replicas` | 12 | 40 | Meet EY adoption demand |
| `keda_max_replicas` | 96 | 100 | Slight increase |
| Liveness probe | enabled | removed | Prevents killing long-running healthy pods |
| Readiness probe | enabled | kept | Still needed for traffic routing |
- Min throughput: 40 pods x 3 = 120 concurrent workflows
- Max throughput: 100 pods x 3 = 300 concurrent workflows
- Memory per workflow slot: 10Gi / 3 = 3.3Gi (with OS/runtime overhead, safe for most workflows)
The worker's concurrency gate is count-based only. The BRPOP loop checks a semaphore (asyncio.Semaphore(WORKER_MAX_CONCURRENCY)) to decide whether to accept more work. It never checks memory.
```mermaid
flowchart TD
BRPOP["BRPOP: job available\non Redis queue"] --> SEM{"Semaphore:\nslots < max\nconcurrency?"}
SEM -->|"Yes (slots free)"| ACCEPT["Accept job\nStart workflow"]
SEM -->|"No (all slots busy)"| WAIT["Block until\nslot opens"]
ACCEPT --> EXEC["Workflow executing...\nmemory growing"]
EXEC --> CHECK{"Is pod at\n2.99Gi RAM?"}
CHECK -->|"Nobody checks"| NEXT["BRPOP again\n(if slots free)"]
NEXT --> SEM
CHECK -->|"Still nobody checks"| OOM["3rd workflow pushes\npast 3Gi → OOMKill"]
style CHECK fill:#ff6b6b,color:#fff
style OOM fill:#ff6b6b,color:#fff
```
Example with concurrency=5, memory_limit=10Gi:
- Pod accepts workflow A → RSS grows to 3Gi
- Semaphore says 4 slots free → accepts workflow B → RSS now 5.5Gi
- Semaphore says 3 slots free → accepts workflow C → RSS now 8Gi
- Semaphore says 2 slots free → accepts workflow D → RSS pushes past 10Gi → OOMKill
The _get_rss_bytes() function already exists in the worker and is called after every job for metrics. It just isn't consulted before accepting work.
Add a memory check before the semaphore acquire in the BRPOP loop:
```python
MEMORY_PRESSURE_THRESHOLD = 0.75  # 75% of cgroup limit


def _get_cgroup_limit() -> float:
    """Read the container memory limit from cgroup v2."""
    try:
        with open("/sys/fs/cgroup/memory.max") as f:
            val = f.read().strip()
        return int(val) if val != "max" else float("inf")
    except FileNotFoundError:
        return float("inf")  # not in a container


def _under_memory_pressure() -> bool:
    """Check whether container memory usage is above the threshold."""
    rss = _get_rss_bytes()
    limit = _get_cgroup_limit()
    return rss / limit > MEMORY_PRESSURE_THRESHOLD


# In the BRPOP loop:
while True:
    if _under_memory_pressure():
        logger.warning(
            "Memory pressure: %.1f%% of limit, skipping BRPOP",
            (_get_rss_bytes() / _get_cgroup_limit()) * 100,
        )
        await asyncio.sleep(5)  # back off, let running workflows finish
        continue
    await semaphore.acquire()
    job = await redis.brpop("workflow_jobs", timeout=5)
    ...
```

```mermaid
flowchart TD
BRPOP["BRPOP loop iteration"] --> MEM{"RSS > 75% of\ncgroup limit?"}
MEM -->|"Yes (memory pressure)"| BACK["Sleep 5s\nSkip this iteration\nLet running jobs finish"]
MEM -->|"No (headroom available)"| SEM{"Semaphore:\nslots < max?"}
SEM -->|Yes| ACCEPT["Accept job"]
SEM -->|No| WAIT["Block until slot opens"]
BACK --> BRPOP
ACCEPT --> EXEC["Workflow executes"]
EXEC --> DONE["Job completes\nRSS may drop via GC"]
DONE --> BRPOP
style MEM fill:#51cf66,color:#fff
style BACK fill:#ffa94d,color:#fff
```
Key behaviors:
- Pod stays healthy — it just stops accepting new work when under pressure
- Running workflows continue unaffected
- KEDA sees the Redis list growing (unprocessed jobs) and scales up more pods
- Once running workflows complete and RSS drops, the pod resumes accepting work
- Works alongside the concurrency semaphore, not replacing it
Limitation: _get_rss_bytes() uses RUSAGE_SELF which only measures the Python process, not child processes or page cache. A more accurate check would read /sys/fs/cgroup/memory.current for true container memory usage:
```python
def _get_container_memory() -> int:
    """Read actual container memory from cgroup v2 (includes children + cache)."""
    try:
        with open("/sys/fs/cgroup/memory.current") as f:
            return int(f.read().strip())
    except FileNotFoundError:
        return _get_rss_bytes()  # fallback to process RSS
```

- Memory-gated BRPOP (described above) — Check container memory before accepting new work. Lightweight, no infrastructure changes, addresses the root scheduling gap.
- Dead-letter queue for poison pills — Track re-enqueue count per `job_id`. After N failures (e.g., 3), move the job to a dead-letter queue instead of re-enqueuing (see the sketch after this list). Without it, increasing resources just makes poison pills take longer to kill each pod.
- Container-level memory monitoring — Add cAdvisor/kubelet metrics scraping to the namespace Prometheus so we can see actual container memory (not just Python RSS). Current blind spot: the ~2GB of non-Python memory is invisible.
- Per-workflow memory tracking — Instrument RSS/PSS per workflow to identify outlier workflows before they OOMKill. The existing `workflow_job_memory_delta_bytes` histogram only captures Python heap changes.
- Workflow-level resource hints — Allow workflow definitions to declare an expected resource class (small/medium/large) and route to appropriately sized worker pools.
- Temporal migration — Replace Redis BRPOP + recovery sweeper with Temporal's built-in workflow orchestration, retry policies, and heartbeat management.
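A minimal sketch of the dead-letter behaviour described above, as it might look on the sweeper side (the `workflow_job_requeues` hash and `workflow_jobs_dead` list are hypothetical names; only `workflow_jobs` exists today):

```python
import json

import redis.asyncio as redis

MAX_REQUEUES = 3  # after this many recoveries, park the job instead of retrying


async def requeue_or_dead_letter(r: redis.Redis, job: dict) -> None:
    """Count recoveries per job_id and divert poison pills to a dead-letter list."""
    attempts = await r.hincrby("workflow_job_requeues", job["job_id"], 1)
    if attempts > MAX_REQUEUES:
        # Poison pill: keep it for inspection instead of feeding it back into
        # the BRPOP loop, where it would just OOMKill another pod.
        await r.lpush("workflow_jobs_dead", json.dumps(job))
        return
    await r.lpush("workflow_jobs", json.dumps(job))
```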
All application-level fixes are tracked under epic ENG-1161. Terraform infra changes (10Gi limit, concurrency=3, remove liveness probe, bump KEDA min replicas) are tracked separately.
| # | Ticket | Summary | Priority | Rationale |
|---|---|---|---|---|
| Epic | ENG-1161 | Workflow Worker Memory Consumption Fixes | High | Parent epic linking all remediation work to this incident |
| 1 | ENG-1162 | Dead-letter queue for poison-pill workflows + load test enhancements | High | Stops infinite OOMKill loop — without this, increasing memory limits just makes poison pills take longer to kill each pod. Also adds mock-LLM load test mode and memory-heavy fixtures to validate all subsequent fixes. |
| 2 | ENG-1163 | Memory-gated BRPOP — check container memory before accepting work | High | Prevents OOM stacking — pod stops accepting new work when container memory exceeds 75% of cgroup limit. KEDA sees queue growing and scales up instead. |
| 3 | ENG-1164 | Intermediate node result eviction in WorkflowExecutor | High | Core memory reduction — currently ALL node outputs stay in memory for entire workflow duration. An 11-node workflow holds all outputs simultaneously. Evicting after downstream consumption could cut peak memory 50-70%. |
| 4 | ENG-1165 | Streaming file parsing for large input files (xlsx, pptx, pdf) | Medium | Directly addresses the 3 worst offenders with 6-7MB input files. openpyxl/pandas inflates a 6MB xlsx to 50-100MB in-memory. |
| 5 | ENG-1166 | Output spilling to Redis for large node outputs | Medium | Moves large payloads out of Python heap into Redis. Extends the existing SerializingDataStore pattern in data_store.py. |
| 6 | ENG-1167 | Subprocess memory budgets for document generation (node.js) | Medium | Addresses the ~2GB blind spot — node.js child processes are invisible to RUSAGE_SELF but counted against cgroup. Caps child heap + switches metrics to cgroup-based measurement. |
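To make the ENG-1164 row concrete, a hypothetical sketch of downstream-consumption eviction (the class and its API are illustrative, not the existing WorkflowExecutor interface):

```python
from typing import Any


class NodeResultCache:
    """Evict a node's output once every downstream consumer has read it."""

    def __init__(self, consumers_by_node: dict[str, set[str]]):
        # consumers_by_node: producer node id -> ids of nodes that read its output
        self._pending = {node: set(consumers) for node, consumers in consumers_by_node.items()}
        self._results: dict[str, Any] = {}

    def put(self, node_id: str, output: Any) -> None:
        self._results[node_id] = output

    def consume(self, producer_id: str, consumer_id: str) -> Any:
        output = self._results[producer_id]
        self._pending[producer_id].discard(consumer_id)
        if not self._pending[producer_id]:
            # Last reader is done: drop the reference so the allocation can be
            # GC'd instead of surviving until the end of an 11-node workflow.
            del self._results[producer_id]
        return output
```

With this shape, a workflow holds only the outputs that still have unread consumers, rather than every output for the entire run.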
- Ship together (short-term): ENG-1162 (DLQ + load test) + ENG-1163 (memory gate) + Terraform infra changes
- Follow-up (memory reduction): ENG-1164 → ENG-1165 → ENG-1166 → ENG-1167, each validated with the load test from ENG-1162
- What specific workflow types consume the most non-Python memory? Is it document generation subprocesses (node.js), large S3 file downloads held in page cache, or both?
- Should we implement the dead-letter queue before increasing concurrency to prevent poison-pill loops? Yes — ENG-1162 ships with the infra changes.
- Is removing the liveness probe safe long-term, or should we make it workflow-aware (e.g., longer timeout, check heartbeat instead of HTTP)?
- Do `ey-ap-southeast-1` and `ey-us-east-2` have the same OOMKill pattern? (same config: concurrency=1, 3Gi limit)
- Can we add `RUSAGE_CHILDREN` tracking to capture subprocess memory alongside `RUSAGE_SELF`?
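On that last question, a sketch of what `RUSAGE_CHILDREN` would add and where it still falls short (the helper name is hypothetical):

```python
import resource


def peak_rss_with_children_bytes() -> int:
    """Peak RSS of this process plus the largest maxrss among reaped children.

    RUSAGE_CHILDREN only reflects child processes that have terminated and been
    wait()ed on, so a node.js subprocess that is still running (or that died with
    the pod) is invisible; cgroup memory.current remains the more reliable signal.
    """
    self_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    children_kb = resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss
    return (self_kb + children_kb) * 1024  # ru_maxrss is kilobytes on Linux
```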