RCA: Missing intel_requests.created (ENG-3349) — Corrected

Status:         PROVISIONAL — ASU-specific loss confirmed, root cause unknown
Supersedes:     Previous RCA
Incident date:  2026-03-10
Analysis date:  2026-03-16
Reverted:       intel-requests-api !48 (batch_id change from !47)

TL;DR

2,798/300,839 intel requests on March 10 never had facts created (~0.93%).

The previous RCA said all consumers were losing messages due to a shutdown race in amqp-consumer. Wrong. Only ASU is losing them. event-recorder and AIPT received 100%. The previous analysis searched logs by intel_request_id — a field that doesn't exist on the log lines that prove delivery.


Findings

1. Only ASU drops messages

Counted intel_requests.created processing logs per batch per consumer. Each message = 2 log lines (Processing + Processed) from the shared amqp-consumer library (source).

batch_id   event-recorder   AIPT   ASU   ASU lost
c2a627b3   174              174    168   6 (3.4%)
215181db   22               22     21    1 (4.5%)
812f79b7   68               68     66    2 (2.9%)
d334c245   64               64     62    2 (3.1%)
2186cea6   22               22     20    2 (9.1%)
7438ef94   54               54     51    3 (5.6%)
Total      404              404    388   16 (3.96%)

Verify: ASU batch c2a627b3 (expect 336 lines) vs event-recorder (expect 348 lines) vs AIPT (expect 348 lines) — or ASU all 6 batches (expect 776 lines)

Evidence: Raw VictoriaLogs stats query + output

Query (run per service):

service_name:exact("{service}") AND message_type:exact("intel_requests.created")
AND (batch_id:exact("c2a627b3-3f7c-4c89-808c-4d79b19c69cb")
  OR batch_id:exact("215181db-45e2-474f-9c05-839eecf61cc5")
  OR batch_id:exact("812f79b7-c50d-4fd2-ac28-cc5d0318ee23")
  OR batch_id:exact("d334c245-898b-4caa-b96c-954b8be31414")
  OR batch_id:exact("2186cea6-3ae5-40c1-805e-878cc123ca5c")
  OR batch_id:exact("7438ef94-89a7-48ae-83d2-71eb8a138f52"))
_time:[2026-03-10T13:00:00Z, 2026-03-10T20:00:00Z]
| stats by (batch_id) count(*) as log_lines

Output:

service          c2a627b3   215181db   812f79b7   d334c245   2186cea6   7438ef94   total   ÷2
event-recorder   348        44         136        128        44         108        808     404
AIPT             348        44         136        128        44         108        808     404
ASU              336        42         132        124        40         102        776     388
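
As a sanity check on the ÷2 methodology, the message counts and loss rate can be re-derived from the raw line totals (a minimal script, values hardcoded from the output above):

# Re-derive Finding 1's message counts from raw log-line counts.
# Every delivered message yields exactly 2 lines (Processing + Processed),
# so messages = log_lines / 2. Totals are copied from the table above.
LINE_TOTALS = {"event-recorder": 808, "AIPT": 808, "ASU": 776}

messages = {svc: lines // 2 for svc, lines in LINE_TOTALS.items()}
expected = messages["event-recorder"]            # 404
lost = expected - messages["ASU"]                # 16
print(f"ASU lost {lost}/{expected} = {lost / expected:.2%}")
# -> ASU lost 16/404 = 3.96%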

2. The previous RCA searched by the wrong ID

Cody provided intel_request_id values. These don't appear on message-processing log lines — those lines only have correlation_id (CID) and batch_id. To trace, you must map through the DB:

intel_request_id → metadata->>'correlation_id' (per-location, logged as CID)
                 → metadata->>'batch_id'       (per-request, = original correlation_id)

The batch_id is set to self.correlation_id in intel-requests-api (source). The mapping query was run on production-app-db-2-restore partition intel_requests_p20260301.

Evidence: DB query and ID mapping results
SELECT intel_request_id::text,
       (metadata->>'correlation_id')::text as correlation_id,
       (metadata->>'batch_id')::text as batch_id,
       created::text
FROM intel_requests.intel_requests_p20260301
WHERE intel_request_id IN ('56a219a6-...', 'bc2d6cef-...', ...)

intel_request_id   correlation_id   batch_id   created (UTC)
56a219a6           11c84002         c2a627b3   13:55:46
bc2d6cef           e12e237a         c2a627b3   13:55:46
eac57da4           cec14379         215181db   13:55:50
27b6e33c           e64cd0a6         215181db   13:55:50
57007346           e5234837         812f79b7   15:18:44
819448ae           59355b93         812f79b7   15:18:44
55941d68           ab310e6b         812f79b7   15:18:44
9f3454be           af861b31         d334c245   15:19:14
2cfbd887           a578785d         2186cea6   15:20:27
90125c20           602b7ea5         7438ef94   15:20:27
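
With the mapping in hand, delivery can be verified per service by searching on batch_id rather than intel_request_id. A sketch against VictoriaLogs' HTTP query API (the base URL is an assumption about this deployment):

import requests

# Count the shared library's Processing/Processed lines for one batch_id
# per service; lines ÷ 2 = delivered messages. Base URL is an assumption.
VLOGS_URL = "http://victorialogs:9428/select/logsql/query"

def delivered_messages(service: str, batch_id: str) -> int:
    query = (
        f'service_name:exact("{service}") '
        f'AND message_type:exact("intel_requests.created") '
        f'AND batch_id:exact("{batch_id}") '
        '_time:[2026-03-10T13:00:00Z, 2026-03-10T20:00:00Z]'
    )
    resp = requests.get(VLOGS_URL, params={"query": query}, timeout=30)
    resp.raise_for_status()
    # the API returns one JSON object per matching log line
    lines = sum(1 for line in resp.text.splitlines() if line.strip())
    return lines // 2

for svc in ("event-recorder", "AIPT", "ASU"):
    n = delivered_messages(svc, "c2a627b3-3f7c-4c89-808c-4d79b19c69cb")
    print(svc, n)   # expect 174 / 174 / 168 per Finding 1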

3. Logging is inconsistent across services — and Vector adds a layer

Why searching by intel_request_id is misleading:

  • ASU — yes, it's in VictoriaLogs: Vector extracts it from (Intel Request {uuid}) in _msg (VRL, Python). But it never appears on intel_requests.created lines, only on power_status.updated discard lines.
  • AIPT — yes: Python logs it on the fetch-prediction step (source), but on a different log line, not Processing/Processed.
  • event-recorder — no: it only stores correlation_id and organization_id as metadata (source), so intel_request_id never reaches VictoriaLogs at all.

The shared amqp-consumer library (source) logs correlation_id, request_id, and batch_id on every Processing/Processed line. It never logs intel_request_id. These context fields are set at L632-642.
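
For intuition, a hypothetical sketch of how a consumer library can bind exactly those fields to both lines of every delivery (names and structure are invented; the real binding is in amqp-consumer at the lines cited above):

import logging

# Hypothetical sketch only; see the amqp-consumer source for the real code.
# The bound context is why every delivery produces exactly two lines that
# carry correlation_id/request_id/batch_id and never intel_request_id.
log = logging.getLogger("amqp-consumer")

def deliver(headers: dict, routing_key: str, body: bytes, callback) -> None:
    ctx = {
        "correlation_id": headers.get("correlation_id"),
        "request_id": headers.get("request_id"),
        "batch_id": headers.get("batch_id"),
        "message_type": routing_key,
    }
    log.info("Processing", extra=ctx)   # line 1 of 2 per message
    callback(body)                      # the service-specific handler
    log.info("Processed", extra=ctx)    # line 2 of 2 per message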

Vector's VRL pipeline (07-extract-ids.vrl) also extracts IDs from unstructured _msg text — (CID {uuid}) → correlation_id (VRL), (Batch ID {uuid}) → batch_id (VRL). All patterns use if !exists(...) guards so they don't conflict with structured JSON fields.
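
Rendered in Python for illustration (the pipeline itself is VRL, and the exact regexes are assumptions based on the quoted formats), the guard-then-extract behavior looks like this:

import re

# Illustration of 07-extract-ids.vrl's behavior; not the actual pipeline.
# UUID pattern and message formats are assumptions from the text above.
UUID = r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}"
PATTERNS = {
    "correlation_id": re.compile(rf"\(CID ({UUID})\)"),
    "batch_id": re.compile(rf"\(Batch ID ({UUID})\)"),
}

def extract_ids(event: dict) -> dict:
    for field, pattern in PATTERNS.items():
        if field in event:       # mirrors VRL's `if !exists(...)` guard:
            continue             # a structured JSON field always wins
        match = pattern.search(event.get("_msg", ""))
        if match:
            event[field] = match.group(1)
    return event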

Impact: The structured fields in VictoriaLogs are correct for our batch-counting methodology. But searching by intel_request_id will find the wrong log lines (discard messages, not delivery proof).

4. What the previous RCA got right

  • Lock contention between incidents.received and intel_requests.created in message-amplifier IS real (confirmed in message-amplifier logs for batch c2a627b3)
  • The _on_message diagnostic logging in MR 17 is still the right next step
  • Consumer-side loss IS happening — just only in ASU, not everywhere

Root Cause: Unknown

ASU loses ~4% of intel_requests.created messages consistently. event-recorder and AIPT (same library, same exchange) get 100%. Unverified theories:

  1. Silent drop in ASU handler — early return without logging (see the sketch after this list)
  2. Backpressure — ASU is 2-3x slower per message (0.12-0.15s vs 0.05s)
  3. Queue config — different prefetch, replicas, or resource limits
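
Theory 1 imagines handler code shaped roughly like the following; the guard and names are purely hypothetical until the real ASU handler is read (Next Step 2):

def on_intel_requests_created(payload: dict) -> None:
    # Hypothetical "silent return path" (theory 1). The guard condition
    # is invented; the point is that an early return leaves no log line
    # and no metric behind, so the message vanishes without a trace.
    if payload.get("site_id") is None:   # invented guard
        return                           # silent: acked, never logged
    ...  # create facts for the request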

Next Steps

  1. Deploy MR 17 on ASU with LOG_ON_MESSAGE=true — proves whether messages reach ASU's callback
  2. Audit ASU handler for silent return paths on intel_requests.created
  3. Compare ASU queue config to event-recorder (prefetch, replicas, limits)

Verification after MR 17: compare the "Received message" count on ASU vs event-recorder for the same time window. Equal = handler bug. Lower = queue/delivery issue.
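
The same decision rule as a trivial function, assuming both counts come from an identical time window:

def diagnose(asu_received: int, recorder_received: int) -> str:
    # Decision rule from the verification note above; inputs are the
    # per-service "Received message" totals for the same window.
    if asu_received == recorder_received:
        return "handler bug: messages reach ASU's callback, then vanish"
    if asu_received < recorder_received:
        return "queue/delivery issue: messages never reach ASU's callback"
    return "unexpected: ASU saw more; re-check the window and query"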

