| Status | PROVISIONAL — ASU-specific loss confirmed, root cause unknown |
| Supersedes | Previous RCA |
| Incident date | 2026-03-10 |
| Analysis date | 2026-03-16 |
| Reverted | intel-requests-api !48 (batch_id change from !47) |
2,798/300,839 intel requests on March 10 never had facts created (~0.93%).
The previous RCA said all consumers were losing messages due to a shutdown race in amqp-consumer. Wrong. Only ASU is losing them. event-recorder and AIPT received 100%. The previous analysis searched logs by intel_request_id — a field that doesn't exist on the log lines that prove delivery.
Counted intel_requests.created processing logs per batch per consumer. Each message = 2 log lines (Processing + Processed) from the shared amqp-consumer library (source).
| batch_id | event-recorder | AIPT | ASU | ASU lost |
|---|---|---|---|---|
| c2a627b3 | 174 | 174 | 168 | 6 (3.4%) |
| 215181db | 22 | 22 | 21 | 1 (4.5%) |
| 812f79b7 | 68 | 68 | 66 | 2 (2.9%) |
| d334c245 | 64 | 64 | 62 | 2 (3.1%) |
| 2186cea6 | 22 | 22 | 20 | 2 (9.1%) |
| 7438ef94 | 54 | 54 | 51 | 3 (5.6%) |
| Total | 404 | 404 | 388 | 16 (3.96%) |
Verify: ASU batch c2a627b3 (expect 336 lines) vs event-recorder (expect 348 lines) vs AIPT (expect 348 lines) — or ASU all 6 batches (expect 776 lines)
Evidence: Raw VictoriaLogs stats query + output
Query (run per service — replace {service} with service name):
```
service_name:exact("{service}") AND message_type:exact("intel_requests.created")
AND (batch_id:exact("c2a627b3-3f7c-4c89-808c-4d79b19c69cb")
  OR batch_id:exact("215181db-45e2-474f-9c05-839eecf61cc5")
  OR batch_id:exact("812f79b7-c50d-4fd2-ac28-cc5d0318ee23")
  OR batch_id:exact("d334c245-898b-4caa-b96c-954b8be31414")
  OR batch_id:exact("2186cea6-3ae5-40c1-805e-878cc123ca5c")
  OR batch_id:exact("7438ef94-89a7-48ae-83d2-71eb8a138f52"))
_time:[2026-03-10T13:00:00Z, 2026-03-10T20:00:00Z]
| stats by (batch_id) count(*) as log_lines
```
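For repeatability, the query can be run once per consumer from a small script. A minimal sketch, assuming the standard VictoriaLogs `/select/logsql/query` HTTP endpoint; the base URL and the exact `service_name` values are placeholders:

```python
# Sketch only: run the stats query above once per consumer and print the raw
# per-batch counts. VLOGS base URL and the service_name values are placeholders.
import requests

VLOGS = "http://victorialogs:9428"  # placeholder

QUERY = '''service_name:exact("{service}") AND message_type:exact("intel_requests.created")
AND (batch_id:exact("c2a627b3-3f7c-4c89-808c-4d79b19c69cb")
  OR batch_id:exact("215181db-45e2-474f-9c05-839eecf61cc5")
  OR batch_id:exact("812f79b7-c50d-4fd2-ac28-cc5d0318ee23")
  OR batch_id:exact("d334c245-898b-4caa-b96c-954b8be31414")
  OR batch_id:exact("2186cea6-3ae5-40c1-805e-878cc123ca5c")
  OR batch_id:exact("7438ef94-89a7-48ae-83d2-71eb8a138f52"))
_time:[2026-03-10T13:00:00Z, 2026-03-10T20:00:00Z]
| stats by (batch_id) count(*) as log_lines'''

for service in ("event-recorder", "aipt", "asu"):  # placeholder service_name values
    resp = requests.post(f"{VLOGS}/select/logsql/query",
                         data={"query": QUERY.format(service=service)})
    resp.raise_for_status()
    print(f"--- {service} ---")
    print(resp.text)  # one result line per batch_id with its log_lines count
```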
View raw logs: event-recorder | AIPT | ASU
| service | c2a627b3 | 215181db | 812f79b7 | d334c245 | 2186cea6 | 7438ef94 | total | ÷2 |
|---|---|---|---|---|---|---|---|---|
| event-recorder | 348 | 44 | 136 | 128 | 44 | 108 | 808 | 404 |
| AIPT | 348 | 44 | 136 | 128 | 44 | 108 | 808 | 404 |
| ASU | 336 | 42 | 132 | 124 | 40 | 102 | 776 | 388 |
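The ÷2 step and the loss rate follow directly from those line counts; a quick sanity check of the arithmetic:

```python
# Two log lines (Processing + Processed) per message, so messages = log_lines / 2.
line_counts = {"event-recorder": 808, "AIPT": 808, "ASU": 776}  # totals from the table above

messages = {svc: lines // 2 for svc, lines in line_counts.items()}
lost = messages["event-recorder"] - messages["ASU"]

print(messages)                                           # {'event-recorder': 404, 'AIPT': 404, 'ASU': 388}
print(lost, f"{lost / messages['event-recorder']:.2%}")   # 16 3.96%
```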
Cody provided intel_request_id values. These don't appear on message-processing log lines — those lines only have correlation_id (CID) and batch_id. To trace, you must map through the DB:
intel_request_id → metadata->>'correlation_id' (per-location, logged as CID)
→ metadata->>'batch_id' (per-request, = original correlation_id)
The batch_id is set to self.correlation_id in intel-requests-api (source). The mapping query was run on production-app-db-2-restore partition intel_requests_p20260301.
Evidence: DB query and ID mapping results
```sql
SELECT intel_request_id::text,
       (metadata->>'correlation_id')::text as correlation_id,
       (metadata->>'batch_id')::text as batch_id,
       created::text
FROM intel_requests.intel_requests_p20260301
WHERE intel_request_id IN ('56a219a6-...', 'bc2d6cef-...', ...);
```

| intel_request_id | correlation_id | batch_id | created (UTC) |
|---|---|---|---|
| 56a219a6 | 11c84002 | c2a627b3 | 13:55:46 |
| bc2d6cef | e12e237a | c2a627b3 | 13:55:46 |
| eac57da4 | cec14379 | 215181db | 13:55:50 |
| 27b6e33c | e64cd0a6 | 215181db | 13:55:50 |
| 57007346 | e5234837 | 812f79b7 | 15:18:44 |
| 819448ae | 59355b93 | 812f79b7 | 15:18:44 |
| 55941d68 | ab310e6b | 812f79b7 | 15:18:44 |
| 9f3454be | af861b31 | d334c245 | 15:19:14 |
| 2cfbd887 | a578785d | 2186cea6 | 15:20:27 |
| 90125c20 | 602b7ea5 | 7438ef94 | 15:20:27 |
Why searching by intel_request_id is misleading:
| Service | intel_request_id in VictoriaLogs? | Source | On intel_requests.created lines? | Proof |
|---|---|---|---|---|
| ASU | Yes | Vector extracts from (Intel Request {uuid}) in _msg (VRL, Python) | No — only on power_status.updated discard lines | search returns 0 results, free-text finds only discard line |
| AIPT | Yes | Python logs it on fetch prediction step (source) | No — different log line | finds fetch prediction, not Processing/Processed |
| event-recorder | No | Only stores correlation_id, organization_id as metadata (source) | N/A | — |
The shared amqp-consumer library (source) logs correlation_id, request_id, and batch_id on every Processing/Processed line. It never logs intel_request_id. These context fields are set at L632-642.
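For illustration only (this is not the library's actual code), the shape of those lines is roughly as follows; the point is that `intel_request_id` is simply never part of the bound context:

```python
# Hypothetical sketch of the context bound to every Processing/Processed line.
# Field names match what appears in VictoriaLogs; the function itself is illustrative.
import json
import logging

logger = logging.getLogger("amqp-consumer")

def log_stage(stage: str, message_type: str, context: dict) -> None:
    logger.info(json.dumps({
        "_msg": f"{stage} message",
        "message_type": message_type,                     # e.g. "intel_requests.created"
        "correlation_id": context.get("correlation_id"),  # per-location CID
        "request_id": context.get("request_id"),
        "batch_id": context.get("batch_id"),              # = original correlation_id
        # intel_request_id is never included, so searching by it misses these lines
    }))
```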
Vector's VRL pipeline (07-extract-ids.vrl) also extracts IDs from unstructured _msg text — (CID {uuid}) → correlation_id (VRL), (Batch ID {uuid}) → batch_id (VRL). All patterns use if !exists(...) guards so they don't conflict with structured JSON fields.
Impact: The structured fields in VictoriaLogs are correct for our batch-counting methodology. But searching by intel_request_id will find the wrong log lines (discard messages, not delivery proof).
- Lock contention between `incidents.received` and `intel_requests.created` in message-amplifier IS real (confirmed in message-amplifier logs for batch c2a627b3)
- The `_on_message` diagnostic logging in MR 17 is still the right next diagnostic step
- Consumer-side loss IS happening, but only in ASU, not everywhere
A replay tool was built into the ASU image and run as K8s Jobs in production. It reads events from Loki (the same events event-recorder stored), replays them through ASU's Processor, and counts: 409 = already processed (ASU had it), 2xx = new (ASU missed it). The tool intentionally does NOT add 409 to allowed_errors so duplicates are counted separately (source).
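A minimal sketch of that accounting, assuming a hypothetical `replay_event()` that returns the HTTP status ASU's Processor produced (or `None` when the event is discarded without an API call):

```python
# Sketch only: how new vs already-processed events are tallied during replay.
from collections import Counter

def summarize(events, replay_event) -> Counter:
    counts = Counter()
    for event in events:
        status = replay_event(event)       # hypothetical: push one Loki event through ASU's Processor
        if status is None:
            counts["discarded"] += 1       # no API call made
        elif status == 409:
            counts["duplicates_409"] += 1  # ASU already had it; deliberately not an allowed_error
        elif 200 <= status < 300:
            counts["new"] += 1             # ASU missed it the first time around
        else:
            counts["errors"] += 1
    return counts
```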
200 replay jobs ran across 4 dates, covering multiple days of production data:
| Run date | Jobs | New (missed) | 409s (had it) | Loss rate |
|---|---|---|---|---|
| 2026-01-15 | 35 | 4,438 | 428,977 | 1.024% |
| 2026-01-16 | 112 | 608 | 2,968,829 | 0.021% |
| 2026-01-17 | 5 | 272 | 3,024,879 | 0.009% |
| 2026-02-16 | 48 | 183 | 522,793 | 0.035% |
| Total | 200 | 5,501 | 6,945,478 | 0.079% |
Verify: `{service_name=~"replay-from-loki-hour-.*"} "TOTAL"` over Jan 15 – Mar 16
The Jan 15 run had the highest rate (1.02%) because it was the first replay — subsequent runs found fewer new events as the replay itself backfilled gaps. The steady-state rate of ~0.02-0.04% on reruns suggests the loss is ongoing but at a lower rate than Cody's Mar 10 sample (0.93%).
Evidence: Sample replay output (hour 18, Dec 29 data)
Pod: replay-from-loki-hour-18-4bht6, ran 2026-01-17T13:47Z, replaying Dec 29 18:00-19:00 UTC
```
REPLAY SUMMARY
Total events fetched from Loki: 952,391
Successfully processed (new):   104
Duplicates (409):               872,513
Discarded (no API call):        79,774
Errors:                         0

By event type:
Event Type                 New     409s    Skip
asset_status.updated        13      788    2047
incidents.deleted            0        5       7
incidents.updated            0      532   22060
intel_requests.created       1    41112      68
power_status.updated         0    62941   55574
search.expired              90   767135      18
TOTAL                      104   872513   79774
```
Verify: `{service_name="replay-from-loki-hour-18-4bht6"} "REPLAY SUMMARY" OR "TOTAL"` on Jan 17
ASU loses messages at a consistent low rate (~0.08% overall, spiking to ~1% on specific days). event-recorder and AIPT (same library, same exchange) get 100%. Unverified theories:
- Silent drop in ASU handler — early return without logging
- Backpressure — ASU is 3x slower per message (0.12-0.15s vs 0.05s)
- Queue config — different prefetch, replicas, or resource limits
- Deploy MR 17 on ASU with `LOG_ON_MESSAGE=true` — proves whether messages reach ASU's callback
- Audit ASU handler for silent return paths on `intel_requests.created` (see the sketch after this list)
- Compare ASU queue config to event-recorder (prefetch, replicas, limits)
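As a hypothetical illustration of what that audit is looking for (not ASU's actual code), a handler shaped like this drops a request without creating facts and without leaving any log evidence:

```python
# Hypothetical handler shape to hunt for during the audit: the early return
# creates no facts and logs nothing, so the request simply "disappears".
def create_facts(location: dict) -> None:
    ...  # placeholder for the normal fact-creation path

def handle_intel_request_created(payload: dict) -> None:
    locations = payload.get("locations") or []
    if not locations:
        return  # silent drop: no warning, no metric, no fact created

    for location in locations:
        create_facts(location)
```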
Verification after MR 17: compare "Received message" count on ASU vs event-recorder for same time window. Equal = handler bug. Lower = queue/delivery issue.