RCA: Missing intel_requests.created (ENG-3349) — Corrected

Status:         PROVISIONAL — ASU-specific loss confirmed, root cause unknown
Supersedes:     Previous RCA
Incident date:  2026-03-10
Analysis date:  2026-03-16
Reverted:       intel-requests-api !48 (batch_id change from !47)

TL;DR

2,798/300,839 intel requests on March 10 never had facts created (~0.93%).

The previous RCA said all consumers were losing messages due to a shutdown race in amqp-consumer. Wrong. Only ASU is losing them. event-recorder and AIPT received 100%. The previous analysis searched logs by intel_request_id — a field that doesn't exist on the log lines that prove delivery.


Findings

1. Only ASU drops messages

Counted intel_requests.created processing logs per batch per consumer. Each message = 2 log lines (Processing + Processed) from the shared amqp-consumer library (source).

batch_id   event-recorder   AIPT   ASU   ASU lost
c2a627b3   174              174    168   6 (3.4%)
215181db   22               22     21    1 (4.5%)
812f79b7   68               68     66    2 (2.9%)
d334c245   64               64     62    2 (3.1%)
2186cea6   22               22     20    2 (9.1%)
7438ef94   54               54     51    3 (5.6%)
Total      404              404    388   16 (3.96%)

Verify: ASU batch c2a627b3 (expect 336 lines) vs event-recorder (expect 348 lines) vs AIPT (expect 348 lines) — or ASU all 6 batches (expect 776 lines)

Evidence: Raw VictoriaLogs stats query + output

Query (run per service):

service_name:exact("{service}") AND message_type:exact("intel_requests.created")
AND (batch_id:exact("c2a627b3-3f7c-4c89-808c-4d79b19c69cb")
  OR batch_id:exact("215181db-45e2-474f-9c05-839eecf61cc5")
  OR batch_id:exact("812f79b7-c50d-4fd2-ac28-cc5d0318ee23")
  OR batch_id:exact("d334c245-898b-4caa-b96c-954b8be31414")
  OR batch_id:exact("2186cea6-3ae5-40c1-805e-878cc123ca5c")
  OR batch_id:exact("7438ef94-89a7-48ae-83d2-71eb8a138f52"))
_time:[2026-03-10T13:00:00Z, 2026-03-10T20:00:00Z]
| stats by (batch_id) count(*) as log_lines

Output:

service          c2a627b3   215181db   812f79b7   d334c245   2186cea6   7438ef94   total   ÷2
event-recorder   348        44         136        128        44         108        808     404
AIPT             348        44         136        128        44         108        808     404
ASU              336        42         132        124        40         102        776     388
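
As a sanity check on the ÷2 methodology, the message counts and loss rate can be re-derived from the raw line totals (a minimal script, values hardcoded from the output above):

# Re-derive Finding 1's message counts from raw log-line counts.
# Every delivered message yields exactly 2 lines (Processing + Processed),
# so messages = log_lines / 2. Totals are copied from the table above.
LINE_TOTALS = {"event-recorder": 808, "AIPT": 808, "ASU": 776}

messages = {svc: lines // 2 for svc, lines in LINE_TOTALS.items()}
expected = messages["event-recorder"]            # 404
lost = expected - messages["ASU"]                # 16
print(f"ASU lost {lost}/{expected} = {lost / expected:.2%}")
# -> ASU lost 16/404 = 3.96%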

2. The previous RCA searched by the wrong ID

Cody provided intel_request_id values. These don't appear on message-processing log lines — those lines only have correlation_id (CID) and batch_id. To trace, you must map through the DB:

intel_request_id → metadata->>'correlation_id' (per-location, logged as CID)
                 → metadata->>'batch_id'       (per-request, = original correlation_id)

The batch_id is set to self.correlation_id in intel-requests-api (source). The mapping query was run on production-app-db-2-restore partition intel_requests_p20260301.

Evidence: DB query and ID mapping results
SELECT intel_request_id::text,
       (metadata->>'correlation_id')::text as correlation_id,
       (metadata->>'batch_id')::text as batch_id,
       created::text
FROM intel_requests.intel_requests_p20260301
WHERE intel_request_id IN ('56a219a6-...', 'bc2d6cef-...', ...)

intel_request_id   correlation_id   batch_id   created (UTC)
56a219a6           11c84002         c2a627b3   13:55:46
bc2d6cef           e12e237a         c2a627b3   13:55:46
eac57da4           cec14379         215181db   13:55:50
27b6e33c           e64cd0a6         215181db   13:55:50
57007346           e5234837         812f79b7   15:18:44
819448ae           59355b93         812f79b7   15:18:44
55941d68           ab310e6b         812f79b7   15:18:44
9f3454be           af861b31         d334c245   15:19:14
2cfbd887           a578785d         2186cea6   15:20:27
90125c20           602b7ea5         7438ef94   15:20:27
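
With the mapping in hand, delivery can be verified per service by searching on batch_id rather than intel_request_id. A sketch against VictoriaLogs' HTTP query API (the base URL is an assumption about this deployment):

import requests

# Count the shared library's Processing/Processed lines for one batch_id
# per service; lines ÷ 2 = delivered messages. Base URL is an assumption.
VLOGS_URL = "http://victorialogs:9428/select/logsql/query"

def delivered_messages(service: str, batch_id: str) -> int:
    query = (
        f'service_name:exact("{service}") '
        f'AND message_type:exact("intel_requests.created") '
        f'AND batch_id:exact("{batch_id}") '
        '_time:[2026-03-10T13:00:00Z, 2026-03-10T20:00:00Z]'
    )
    resp = requests.get(VLOGS_URL, params={"query": query}, timeout=30)
    resp.raise_for_status()
    # the API returns one JSON object per matching log line
    lines = sum(1 for line in resp.text.splitlines() if line.strip())
    return lines // 2

for svc in ("event-recorder", "AIPT", "ASU"):
    n = delivered_messages(svc, "c2a627b3-3f7c-4c89-808c-4d79b19c69cb")
    print(svc, n)   # expect 174 / 174 / 168 per Finding 1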

3. Logging is inconsistent across services — and Vector adds a layer

Why searching by intel_request_id is misleading:

  • ASU — yes, it's in VictoriaLogs: Vector extracts it from (Intel Request {uuid}) in _msg (VRL, Python). But it never appears on intel_requests.created lines, only on power_status.updated discard lines.
  • AIPT — yes: Python logs it on the fetch-prediction step (source), but on a different log line, not Processing/Processed.
  • event-recorder — no: it only stores correlation_id and organization_id as metadata (source), so intel_request_id never reaches VictoriaLogs at all.

The shared amqp-consumer library (source) logs correlation_id, request_id, and batch_id on every Processing/Processed line. It never logs intel_request_id. These context fields are set at L632-642.
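
For intuition, a hypothetical sketch of how a consumer library can bind exactly those fields to both lines of every delivery (names and structure are invented; the real binding is in amqp-consumer at the lines cited above):

import logging

# Hypothetical sketch only; see the amqp-consumer source for the real code.
# The bound context is why every delivery produces exactly two lines that
# carry correlation_id/request_id/batch_id and never intel_request_id.
log = logging.getLogger("amqp-consumer")

def deliver(headers: dict, routing_key: str, body: bytes, callback) -> None:
    ctx = {
        "correlation_id": headers.get("correlation_id"),
        "request_id": headers.get("request_id"),
        "batch_id": headers.get("batch_id"),
        "message_type": routing_key,
    }
    log.info("Processing", extra=ctx)   # line 1 of 2 per message
    callback(body)                      # the service-specific handler
    log.info("Processed", extra=ctx)    # line 2 of 2 per message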

Vector's VRL pipeline (07-extract-ids.vrl) also extracts IDs from unstructured _msg text — (CID {uuid}) → correlation_id (VRL), (Batch ID {uuid}) → batch_id (VRL). All patterns use if !exists(...) guards so they don't conflict with structured JSON fields.
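
Rendered in Python for illustration (the pipeline itself is VRL, and the exact regexes are assumptions based on the quoted formats), the guard-then-extract behavior looks like this:

import re

# Illustration of 07-extract-ids.vrl's behavior; not the actual pipeline.
# UUID pattern and message formats are assumptions from the text above.
UUID = r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}"
PATTERNS = {
    "correlation_id": re.compile(rf"\(CID ({UUID})\)"),
    "batch_id": re.compile(rf"\(Batch ID ({UUID})\)"),
}

def extract_ids(event: dict) -> dict:
    for field, pattern in PATTERNS.items():
        if field in event:       # mirrors VRL's `if !exists(...)` guard:
            continue             # a structured JSON field always wins
        match = pattern.search(event.get("_msg", ""))
        if match:
            event[field] = match.group(1)
    return event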

Impact: The structured fields in VictoriaLogs are correct for our batch-counting methodology. But searching by intel_request_id will find the wrong log lines (discard messages, not delivery proof).

4. What the previous RCA got right

  • Lock contention between incidents.received and intel_requests.created in message-amplifier IS real (confirmed in message-amplifier logs for batch c2a627b3)
  • The _on_message diagnostic logging in MR 17 is still the right next step
  • Consumer-side loss IS happening — just only in ASU, not everywhere

Root Cause: Unknown

ASU loses ~4% of intel_requests.created messages consistently. event-recorder and AIPT (same library, same exchange) get 100%. Unverified theories:

  1. Silent drop in ASU handler — early return without logging (see the sketch after this list)
  2. Backpressure — ASU is 2-3x slower per message (0.12-0.15s vs 0.05s)
  3. Queue config — different prefetch, replicas, or resource limits
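
Theory 1 imagines handler code shaped roughly like the following; the guard and names are purely hypothetical until the real ASU handler is read (Next Step 2):

def on_intel_requests_created(payload: dict) -> None:
    # Hypothetical "silent return path" (theory 1). The guard condition
    # is invented; the point is that an early return leaves no log line
    # and no metric behind, so the message vanishes without a trace.
    if payload.get("site_id") is None:   # invented guard
        return                           # silent: acked, never logged
    ...  # create facts for the request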

Next Steps

  1. Deploy MR 17 on ASU with LOG_ON_MESSAGE=true — proves whether messages reach ASU's callback
  2. Audit ASU handler for silent return paths on intel_requests.created
  3. Compare ASU queue config to event-recorder (prefetch, replicas, limits)

Verification after MR 17: compare the "Received message" count on ASU vs event-recorder for the same time window. Equal = handler bug. Lower = queue/delivery issue.
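
The same decision rule as a trivial function, assuming both counts come from an identical time window:

def diagnose(asu_received: int, recorder_received: int) -> str:
    # Decision rule from the verification note above; inputs are the
    # per-service "Received message" totals for the same window.
    if asu_received == recorder_received:
        return "handler bug: messages reach ASU's callback, then vanish"
    if asu_received < recorder_received:
        return "queue/delivery issue: messages never reach ASU's callback"
    return "unexpected: ASU saw more; re-check the window and query"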

