
# RCA: Missing intel_requests.created (ENG-3349) — Corrected

| Field | Value |
|---|---|
| Status | PROVISIONAL — ASU-specific loss confirmed, root cause unknown |
| Supersedes | Previous RCA |
| Incident date | 2026-03-10 |
| Analysis date | 2026-03-16 |
| Reverted | intel-requests-api !48 (batch_id change from !47) |

## TL;DR

2,798/300,839 intel requests on March 10 never had facts created (~0.93%).

The previous RCA said all consumers were losing messages due to a shutdown race in amqp-consumer. Wrong. Only ASU is losing them. event-recorder and AIPT received 100%. The previous analysis searched logs by intel_request_id — a field that doesn't exist on the log lines that prove delivery.


## Findings

### 1. Only ASU drops messages

We counted intel_requests.created processing logs per batch per consumer. Each message produces two log lines (Processing + Processed) from the shared amqp-consumer library (source).

| batch_id | event-recorder | AIPT | ASU | ASU lost |
|---|---|---|---|---|
| c2a627b3 | 174 | 174 | 168 | 6 (3.4%) |
| 215181db | 22 | 22 | 21 | 1 (4.5%) |
| 812f79b7 | 68 | 68 | 66 | 2 (2.9%) |
| d334c245 | 64 | 64 | 62 | 2 (3.1%) |
| 2186cea6 | 22 | 22 | 20 | 2 (9.1%) |
| 7438ef94 | 54 | 54 | 51 | 3 (5.6%) |
| **Total** | **404** | **404** | **388** | **16 (3.96%)** |

Verify: ASU batch c2a627b3 (expect 336 lines) vs event-recorder (expect 348 lines) vs AIPT (expect 348 lines) — or ASU all 6 batches (expect 776 lines)

Raw VictoriaLogs stats output

Query (run per service):

```
service_name:exact("{service}") AND message_type:exact("intel_requests.created")
AND (batch_id:exact("c2a627b3-3f7c-4c89-808c-4d79b19c69cb")
  OR batch_id:exact("215181db-45e2-474f-9c05-839eecf61cc5")
  OR batch_id:exact("812f79b7-c50d-4fd2-ac28-cc5d0318ee23")
  OR batch_id:exact("d334c245-898b-4caa-b96c-954b8be31414")
  OR batch_id:exact("2186cea6-3ae5-40c1-805e-878cc123ca5c")
  OR batch_id:exact("7438ef94-89a7-48ae-83d2-71eb8a138f52"))
_time:[2026-03-10T13:00:00Z, 2026-03-10T20:00:00Z]
| stats by (batch_id) count(*) as log_lines
```

- event-recorder: c2a627b3=348, 215181db=44, 812f79b7=136, d334c245=128, 2186cea6=44, 7438ef94=108 (total 808 ÷ 2 = 404)
- AIPT: c2a627b3=348, 215181db=44, 812f79b7=136, d334c245=128, 2186cea6=44, 7438ef94=108 (total 808 ÷ 2 = 404)
- ASU: c2a627b3=336, 215181db=42, 812f79b7=132, d334c245=124, 2186cea6=40, 7438ef94=102 (total 776 ÷ 2 = 388)
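
To make the count reproducible, here is a sketch of the same tally as a script against the VictoriaLogs HTTP query API. The endpoint path and the lowercase service_name values are assumptions (the query above uses a {service} placeholder); everything else mirrors the query verbatim.

```python
# Hypothetical reproduction script. Assumes VictoriaLogs' /select/logsql/query
# endpoint is reachable at VLOGS_URL and returns newline-delimited JSON.
import json
import urllib.parse
import urllib.request

VLOGS_URL = "http://victorialogs:9428/select/logsql/query"  # assumed endpoint
BATCH_IDS = [
    "c2a627b3-3f7c-4c89-808c-4d79b19c69cb",
    "215181db-45e2-474f-9c05-839eecf61cc5",
    "812f79b7-c50d-4fd2-ac28-cc5d0318ee23",
    "d334c245-898b-4caa-b96c-954b8be31414",
    "2186cea6-3ae5-40c1-805e-878cc123ca5c",
    "7438ef94-89a7-48ae-83d2-71eb8a138f52",
]

def messages_per_batch(service: str) -> dict[str, int]:
    """Run the stats query for one service; return batch_id -> message count."""
    batch_filter = " OR ".join(f'batch_id:exact("{b}")' for b in BATCH_IDS)
    query = (
        f'service_name:exact("{service}") '
        'AND message_type:exact("intel_requests.created") '
        f"AND ({batch_filter}) "
        "_time:[2026-03-10T13:00:00Z, 2026-03-10T20:00:00Z] "
        "| stats by (batch_id) count(*) as log_lines"
    )
    data = urllib.parse.urlencode({"query": query}).encode()
    with urllib.request.urlopen(VLOGS_URL, data=data) as resp:
        rows = [json.loads(line) for line in resp if line.strip()]
    # Each message emits two lines (Processing + Processed), hence // 2.
    return {r["batch_id"][:8]: int(r["log_lines"]) // 2 for r in rows}

for service in ("event-recorder", "aipt", "asu"):  # assumed service_name values
    print(service, messages_per_batch(service))
```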

### 2. The previous RCA searched by the wrong ID

Cody provided intel_request_id values. These don't appear on message-processing log lines — those lines only have correlation_id (CID) and batch_id. To trace, you must map through the DB:

```
intel_request_id → metadata->>'correlation_id' (per-location, logged as CID)
                 → metadata->>'batch_id'       (per-request, = original correlation_id)
```

The batch_id is set to self.correlation_id in intel-requests-api (source). The mapping query was run on production-app-db-2-restore, partition intel_requests_p20260301.

DB query and results
```sql
SELECT intel_request_id::text,
       (metadata->>'correlation_id')::text AS correlation_id,
       (metadata->>'batch_id')::text AS batch_id,
       created::text
FROM intel_requests.intel_requests_p20260301
WHERE intel_request_id IN ('56a219a6-...', 'bc2d6cef-...', ...);
```
| intel_request_id | correlation_id | batch_id | created (UTC) |
|---|---|---|---|
| 56a219a6 | 11c84002 | c2a627b3 | 13:55:46 |
| bc2d6cef | e12e237a | c2a627b3 | 13:55:46 |
| eac57da4 | cec14379 | 215181db | 13:55:50 |
| 27b6e33c | e64cd0a6 | 215181db | 13:55:50 |
| 57007346 | e5234837 | 812f79b7 | 15:18:44 |
| 819448ae | 59355b93 | 812f79b7 | 15:18:44 |
| 55941d68 | ab310e6b | 812f79b7 | 15:18:44 |
| 9f3454be | af861b31 | d334c245 | 15:19:14 |
| 2cfbd887 | a578785d | 2186cea6 | 15:20:27 |
| 90125c20 | 602b7ea5 | 7438ef94 | 15:20:27 |
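
To trace end to end, the lookup has to go through the DB first. A sketch of that path, under stated assumptions (psycopg2, a placeholder connection string, and a text cast on the uuid column):

```python
# Hypothetical tracing helper; connection details are placeholders, not the
# real credentials or host configuration.
import psycopg2

SQL = """
SELECT intel_request_id::text,
       metadata->>'correlation_id' AS correlation_id,
       metadata->>'batch_id'       AS batch_id
FROM intel_requests.intel_requests_p20260301
WHERE intel_request_id::text = ANY(%s)
"""

def resolve_ids(intel_request_ids: list[str]) -> list[tuple[str, str, str]]:
    """Map each intel_request_id to the (correlation_id, batch_id) pair that
    actually appears on the Processing/Processed log lines."""
    with psycopg2.connect("host=production-app-db-2-restore dbname=app") as conn:
        with conn.cursor() as cur:
            cur.execute(SQL, (intel_request_ids,))
            return cur.fetchall()

# Delivery proof for one request is then a log search by correlation_id, e.g.:
#   correlation_id:exact("<cid>") AND message_type:exact("intel_requests.created")
```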

### 3. Logging is inconsistent across services — and Vector adds a layer

Why searching by intel_request_id is misleading:

| Service | intel_request_id in VictoriaLogs? | Source | Appears on intel_requests.created lines? |
|---|---|---|---|
| ASU | Yes | Vector extracts from `(Intel Request {uuid})` in `_msg` (VRL, Python) | No — only on power_status.updated discard lines |
| AIPT | Yes | Python logs it on the fetch prediction step (source) | No — on a different log line, not Processing/Processed |
| event-recorder | No | Only stores correlation_id, organization_id as metadata (source) | N/A |

The shared amqp-consumer library (source) logs correlation_id, request_id, and batch_id on every Processing/Processed line. It never logs intel_request_id. These context fields are set at L632-642.
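
For intuition, a minimal sketch of the pattern that paragraph describes: binding per-message IDs into log context so every line emitted while handling carries them as structured fields. This is not the amqp-consumer implementation; the contextvars approach and all names are assumptions.

```python
# Illustrative sketch only, not the actual amqp-consumer code.
import contextvars
import logging

_ctx: contextvars.ContextVar[dict] = contextvars.ContextVar("amqp_ctx", default={})

class ContextFilter(logging.Filter):
    """Copy the bound message IDs onto every record emitted while handling."""
    def filter(self, record: logging.LogRecord) -> bool:
        for key, value in _ctx.get().items():
            setattr(record, key, value)
        return True

log = logging.getLogger("amqp-consumer")
log.addFilter(ContextFilter())

def handle(message_type: str, headers: dict, body: bytes) -> None:
    # Bind the IDs once; note that intel_request_id is never among them,
    # which is why searching logs by it finds nothing here.
    _ctx.set({
        "correlation_id": headers.get("correlation_id"),
        "request_id": headers.get("request_id"),
        "batch_id": headers.get("batch_id"),
    })
    log.info("Processing %s", message_type)  # carries all three fields
    # ... dispatch to the service handler ...
    log.info("Processed %s", message_type)
```

A structured (JSON) formatter would then serialize those record attributes into the fields VictoriaLogs indexes.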

Vector's VRL pipeline (07-extract-ids.vrl) also extracts IDs from unstructured `_msg` text: `(CID {uuid})` → `correlation_id` (VRL), `(Batch ID {uuid})` → `batch_id` (VRL). All patterns use `if !exists(...)` guards so they don't conflict with structured JSON fields.
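
The guard behavior, approximated in Python for intuition (the production logic is VRL in 07-extract-ids.vrl; the regexes below are assumptions modeled on the patterns just named):

```python
# Python approximation of the VRL extraction guards; illustrative only.
import re

UUID = r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}"
PATTERNS = {
    "correlation_id": re.compile(rf"\(CID ({UUID})\)"),
    "batch_id": re.compile(rf"\(Batch ID ({UUID})\)"),
}

def extract_ids(event: dict) -> dict:
    """Fill ID fields from _msg only when no structured field already exists,
    mirroring VRL's `if !exists(...)` guards."""
    for field, pattern in PATTERNS.items():
        if field in event:  # structured JSON already set it: keep it
            continue
        match = pattern.search(event.get("_msg", ""))
        if match:
            event[field] = match.group(1)
    return event
```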

Impact: The structured fields in VictoriaLogs are correct for our batch-counting methodology. But searching by intel_request_id will find the wrong log lines (discard messages, not delivery proof).

### 4. What the previous RCA got right

- Lock contention between incidents.received and intel_requests.created in message-amplifier IS real (confirmed in message-amplifier logs for batch c2a627b3)
- The _on_message diagnostic logging in MR 17 is still the right next diagnostic step
- Consumer-side loss IS happening — but only in ASU, not everywhere

## Root Cause: Unknown

ASU loses ~4% of intel_requests.created messages consistently. event-recorder and AIPT (same library, same exchange) get 100%. Unverified theories:

  1. Silent drop in ASU handler — early return without logging (see the sketch after this list)
  2. Backpressure — ASU is 3x slower per message (0.12-0.15s vs 0.05s)
  3. Queue config — different prefetch, replicas, or resource limits
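
To make theory 1 concrete: for ASU's Processing counts to come up short, the drop has to happen before the library's Processing line is ever written, for example in a guard that acks and returns before dispatch. A hypothetical sketch of that shape follows; none of these names come from ASU's actual code.

```python
# Hypothetical "silent drop" shape; all names are invented for illustration.
import logging
from typing import Callable

log = logging.getLogger("asu")
HANDLERS: dict[str, Callable[[dict], None]] = {}  # registered at startup

def dispatch(message_type: str, body: dict, ack: Callable[[], None]) -> None:
    handler = HANDLERS.get(message_type)
    if handler is None:
        ack()    # broker sees a successful consume...
        return   # ...but no Processing line is written: a silent drop
    log.info("Processing %s", message_type)  # the line the batch counts rely on
    handler(body)
    log.info("Processed %s", message_type)
    ack()
```

An audit (next step 2) then amounts to searching for any path that reaches ack() or return without passing the Processing log call.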

## Next Steps

  1. Deploy MR 17 on ASU with LOG_ON_MESSAGE=true — proves whether messages reach ASU's callback
  2. Audit ASU handler for silent return paths on intel_requests.created
  3. Compare ASU queue config to event-recorder (prefetch, replicas, limits)

Verification after MR 17: compare the "Received message" count on ASU vs event-recorder for the same time window. Equal = handler bug. Lower = queue/delivery issue.

