
# RCA: Missing intel_requests.created (ENG-3349) — Corrected

| Field | Value |
|---|---|
| Status | PROVISIONAL — ASU-specific loss confirmed, root cause unknown |
| Supersedes | Previous RCA |
| Incident date | 2026-03-10 |
| Analysis date | 2026-03-16 |
| Reverted | intel-requests-api !48 (batch_id change from !47) |

## TL;DR

2,798/300,839 intel requests on March 10 never had facts created (~0.93%).

The previous RCA said all consumers were losing messages due to a shutdown race in amqp-consumer. Wrong. Only ASU is losing them. event-recorder and AIPT received 100%. The previous analysis searched logs by intel_request_id — a field that doesn't exist on the log lines that prove delivery.


## Findings

### 1. Only ASU drops messages

We counted intel_requests.created processing logs per batch per consumer. Each message produces two log lines (Processing + Processed) from the shared amqp-consumer library (source).

| batch_id | event-recorder | AIPT | ASU | ASU lost |
|---|---|---|---|---|
| c2a627b3 | 174 | 174 | 168 | 6 (3.4%) |
| 215181db | 22 | 22 | 21 | 1 (4.5%) |
| 812f79b7 | 68 | 68 | 66 | 2 (2.9%) |
| d334c245 | 64 | 64 | 62 | 2 (3.1%) |
| 2186cea6 | 22 | 22 | 20 | 2 (9.1%) |
| 7438ef94 | 54 | 54 | 51 | 3 (5.6%) |
| **Total** | **404** | **404** | **388** | **16 (3.96%)** |

Verify: ASU batch c2a627b3 (expect 336 lines) vs event-recorder (expect 348 lines) vs AIPT (expect 348 lines) — or ASU all 6 batches (expect 776 lines)

Raw VictoriaLogs stats output

Query (run per service):

```
service_name:exact("{service}") AND message_type:exact("intel_requests.created")
AND (batch_id:exact("c2a627b3-3f7c-4c89-808c-4d79b19c69cb")
  OR batch_id:exact("215181db-45e2-474f-9c05-839eecf61cc5")
  OR batch_id:exact("812f79b7-c50d-4fd2-ac28-cc5d0318ee23")
  OR batch_id:exact("d334c245-898b-4caa-b96c-954b8be31414")
  OR batch_id:exact("2186cea6-3ae5-40c1-805e-878cc123ca5c")
  OR batch_id:exact("7438ef94-89a7-48ae-83d2-71eb8a138f52"))
_time:[2026-03-10T13:00:00Z, 2026-03-10T20:00:00Z]
| stats by (batch_id) count(*) as log_lines
```

- event-recorder: c2a627b3=348, 215181db=44, 812f79b7=136, d334c245=128, 2186cea6=44, 7438ef94=108 (total 808 ÷ 2 = 404)
- AIPT: c2a627b3=348, 215181db=44, 812f79b7=136, d334c245=128, 2186cea6=44, 7438ef94=108 (total 808 ÷ 2 = 404)
- ASU: c2a627b3=336, 215181db=42, 812f79b7=132, d334c245=124, 2186cea6=40, 7438ef94=102 (total 776 ÷ 2 = 388)
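
To make the count reproducible, here is a sketch of the same tally as a script against the VictoriaLogs HTTP query API. The endpoint path and the lowercase service_name values are assumptions (the query above uses a {service} placeholder); everything else mirrors the query verbatim.

```python
# Hypothetical reproduction script. Assumes VictoriaLogs' /select/logsql/query
# endpoint is reachable at VLOGS_URL and returns newline-delimited JSON.
import json
import urllib.parse
import urllib.request

VLOGS_URL = "http://victorialogs:9428/select/logsql/query"  # assumed endpoint
BATCH_IDS = [
    "c2a627b3-3f7c-4c89-808c-4d79b19c69cb",
    "215181db-45e2-474f-9c05-839eecf61cc5",
    "812f79b7-c50d-4fd2-ac28-cc5d0318ee23",
    "d334c245-898b-4caa-b96c-954b8be31414",
    "2186cea6-3ae5-40c1-805e-878cc123ca5c",
    "7438ef94-89a7-48ae-83d2-71eb8a138f52",
]

def messages_per_batch(service: str) -> dict[str, int]:
    """Run the stats query for one service; return batch_id -> message count."""
    batch_filter = " OR ".join(f'batch_id:exact("{b}")' for b in BATCH_IDS)
    query = (
        f'service_name:exact("{service}") '
        'AND message_type:exact("intel_requests.created") '
        f"AND ({batch_filter}) "
        "_time:[2026-03-10T13:00:00Z, 2026-03-10T20:00:00Z] "
        "| stats by (batch_id) count(*) as log_lines"
    )
    data = urllib.parse.urlencode({"query": query}).encode()
    with urllib.request.urlopen(VLOGS_URL, data=data) as resp:
        rows = [json.loads(line) for line in resp if line.strip()]
    # Each message emits two lines (Processing + Processed), hence // 2.
    return {r["batch_id"][:8]: int(r["log_lines"]) // 2 for r in rows}

for service in ("event-recorder", "aipt", "asu"):  # assumed service_name values
    print(service, messages_per_batch(service))
```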

### 2. The previous RCA searched by the wrong ID

Cody provided intel_request_id values. These don't appear on message-processing log lines — those lines only have correlation_id (CID) and batch_id. To trace, you must map through the DB:

```
intel_request_id → metadata->>'correlation_id' (per-location, logged as CID)
                 → metadata->>'batch_id'       (per-request, = original correlation_id)
```

The batch_id is set to self.correlation_id in intel-requests-api (source). The mapping query was run on production-app-db-2-restore, partition intel_requests_p20260301.

DB query and results
```sql
SELECT intel_request_id::text,
       (metadata->>'correlation_id')::text AS correlation_id,
       (metadata->>'batch_id')::text AS batch_id,
       created::text
FROM intel_requests.intel_requests_p20260301
WHERE intel_request_id IN ('56a219a6-...', 'bc2d6cef-...', ...);
```
| intel_request_id | correlation_id | batch_id | created (UTC) |
|---|---|---|---|
| 56a219a6 | 11c84002 | c2a627b3 | 13:55:46 |
| bc2d6cef | e12e237a | c2a627b3 | 13:55:46 |
| eac57da4 | cec14379 | 215181db | 13:55:50 |
| 27b6e33c | e64cd0a6 | 215181db | 13:55:50 |
| 57007346 | e5234837 | 812f79b7 | 15:18:44 |
| 819448ae | 59355b93 | 812f79b7 | 15:18:44 |
| 55941d68 | ab310e6b | 812f79b7 | 15:18:44 |
| 9f3454be | af861b31 | d334c245 | 15:19:14 |
| 2cfbd887 | a578785d | 2186cea6 | 15:20:27 |
| 90125c20 | 602b7ea5 | 7438ef94 | 15:20:27 |
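
To trace end to end, the lookup has to go through the DB first. A sketch of that path, under stated assumptions (psycopg2, a placeholder connection string, and a text cast on the uuid column):

```python
# Hypothetical tracing helper; connection details are placeholders, not the
# real credentials or host configuration.
import psycopg2

SQL = """
SELECT intel_request_id::text,
       metadata->>'correlation_id' AS correlation_id,
       metadata->>'batch_id'       AS batch_id
FROM intel_requests.intel_requests_p20260301
WHERE intel_request_id::text = ANY(%s)
"""

def resolve_ids(intel_request_ids: list[str]) -> list[tuple[str, str, str]]:
    """Map each intel_request_id to the (correlation_id, batch_id) pair that
    actually appears on the Processing/Processed log lines."""
    with psycopg2.connect("host=production-app-db-2-restore dbname=app") as conn:
        with conn.cursor() as cur:
            cur.execute(SQL, (intel_request_ids,))
            return cur.fetchall()

# Delivery proof for one request is then a log search by correlation_id, e.g.:
#   correlation_id:exact("<cid>") AND message_type:exact("intel_requests.created")
```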

### 3. Logging is inconsistent across services — and Vector adds a layer

Why searching by intel_request_id is misleading:

| Service | intel_request_id in VictoriaLogs? | Source | Appears on intel_requests.created lines? |
|---|---|---|---|
| ASU | Yes | Vector extracts from `(Intel Request {uuid})` in `_msg` (VRL, Python) | No — only on power_status.updated discard lines |
| AIPT | Yes | Python logs it on the fetch prediction step (source) | No — on a different log line, not Processing/Processed |
| event-recorder | No | Only stores correlation_id, organization_id as metadata (source) | N/A |

The shared amqp-consumer library (source) logs correlation_id, request_id, and batch_id on every Processing/Processed line. It never logs intel_request_id. These context fields are set at L632-642.
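
For intuition, a minimal sketch of the pattern that paragraph describes: binding per-message IDs into log context so every line emitted while handling carries them as structured fields. This is not the amqp-consumer implementation; the contextvars approach and all names are assumptions.

```python
# Illustrative sketch only, not the actual amqp-consumer code.
import contextvars
import logging

_ctx: contextvars.ContextVar[dict] = contextvars.ContextVar("amqp_ctx", default={})

class ContextFilter(logging.Filter):
    """Copy the bound message IDs onto every record emitted while handling."""
    def filter(self, record: logging.LogRecord) -> bool:
        for key, value in _ctx.get().items():
            setattr(record, key, value)
        return True

log = logging.getLogger("amqp-consumer")
log.addFilter(ContextFilter())

def handle(message_type: str, headers: dict, body: bytes) -> None:
    # Bind the IDs once; note that intel_request_id is never among them,
    # which is why searching logs by it finds nothing here.
    _ctx.set({
        "correlation_id": headers.get("correlation_id"),
        "request_id": headers.get("request_id"),
        "batch_id": headers.get("batch_id"),
    })
    log.info("Processing %s", message_type)  # carries all three fields
    # ... dispatch to the service handler ...
    log.info("Processed %s", message_type)
```

A structured (JSON) formatter would then serialize those record attributes into the fields VictoriaLogs indexes.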

Vector's VRL pipeline (07-extract-ids.vrl) also extracts IDs from unstructured `_msg` text: `(CID {uuid})` → `correlation_id` (VRL), `(Batch ID {uuid})` → `batch_id` (VRL). All patterns use `if !exists(...)` guards so they don't conflict with structured JSON fields.
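
The guard behavior, approximated in Python for intuition (the production logic is VRL in 07-extract-ids.vrl; the regexes below are assumptions modeled on the patterns just named):

```python
# Python approximation of the VRL extraction guards; illustrative only.
import re

UUID = r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}"
PATTERNS = {
    "correlation_id": re.compile(rf"\(CID ({UUID})\)"),
    "batch_id": re.compile(rf"\(Batch ID ({UUID})\)"),
}

def extract_ids(event: dict) -> dict:
    """Fill ID fields from _msg only when no structured field already exists,
    mirroring VRL's `if !exists(...)` guards."""
    for field, pattern in PATTERNS.items():
        if field in event:  # structured JSON already set it: keep it
            continue
        match = pattern.search(event.get("_msg", ""))
        if match:
            event[field] = match.group(1)
    return event
```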

Impact: The structured fields in VictoriaLogs are correct for our batch-counting methodology. But searching by intel_request_id will find the wrong log lines (discard messages, not delivery proof).

### 4. What the previous RCA got right

- Lock contention between incidents.received and intel_requests.created in message-amplifier IS real (confirmed in message-amplifier logs for batch c2a627b3)
- The _on_message diagnostic logging in MR 17 is still the right next diagnostic step
- Consumer-side loss IS happening — but only in ASU, not everywhere

## Root Cause: Unknown

ASU loses ~4% of intel_requests.created messages consistently. event-recorder and AIPT (same library, same exchange) get 100%. Unverified theories:

  1. Silent drop in ASU handler — early return without logging (see the sketch after this list)
  2. Backpressure — ASU is 3x slower per message (0.12-0.15s vs 0.05s)
  3. Queue config — different prefetch, replicas, or resource limits
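
To make theory 1 concrete: for ASU's Processing counts to come up short, the drop has to happen before the library's Processing line is ever written, for example in a guard that acks and returns before dispatch. A hypothetical sketch of that shape follows; none of these names come from ASU's actual code.

```python
# Hypothetical "silent drop" shape; all names are invented for illustration.
import logging
from typing import Callable

log = logging.getLogger("asu")
HANDLERS: dict[str, Callable[[dict], None]] = {}  # registered at startup

def dispatch(message_type: str, body: dict, ack: Callable[[], None]) -> None:
    handler = HANDLERS.get(message_type)
    if handler is None:
        ack()    # broker sees a successful consume...
        return   # ...but no Processing line is written: a silent drop
    log.info("Processing %s", message_type)  # the line the batch counts rely on
    handler(body)
    log.info("Processed %s", message_type)
    ack()
```

An audit (next step 2) then amounts to searching for any path that reaches ack() or return without passing the Processing log call.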

## Next Steps

  1. Deploy MR 17 on ASU with LOG_ON_MESSAGE=true — proves whether messages reach ASU's callback
  2. Audit ASU handler for silent return paths on intel_requests.created
  3. Compare ASU queue config to event-recorder (prefetch, replicas, limits)

Verification after MR 17: compare the "Received message" count on ASU vs event-recorder for the same time window. Equal = handler bug. Lower = queue/delivery issue.

