| Status | PROVISIONAL — ASU-specific loss confirmed, root cause unknown |
| Supersedes | Previous RCA |
| Incident date | 2026-03-10 |
| Analysis date | 2026-03-16 |
| Reverted | intel-requests-api !48 (batch_id change from !47) |
2,798/300,839 intel requests on March 10 never had facts created (~0.93%).
The previous RCA said all consumers were losing messages due to a shutdown race in amqp-consumer. Wrong. Only ASU is losing them. event-recorder and AIPT received 100%. The previous analysis searched logs by intel_request_id — a field that doesn't exist on the log lines that prove delivery.
Counted intel_requests.created processing logs per batch per consumer. Each message = 2 log lines (Processing + Processed) from the shared amqp-consumer library (source).
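The counting method can be sketched as follows. This is an illustrative Python sketch, not the actual tooling; it assumes log lines exported as JSON with the `service_name`, `batch_id`, and `message_type` fields used in the VictoriaLogs query:

```python
import json
from collections import Counter

def count_messages(log_lines):
    """Count delivered messages per (service, batch) pair.

    Each message produces exactly two log lines ("Processing" and
    "Processed") from the shared amqp-consumer library, so the
    message count is the matching line count divided by two.
    """
    lines = Counter()
    for raw in log_lines:
        rec = json.loads(raw)
        if rec.get("message_type") == "intel_requests.created":
            lines[(rec["service_name"], rec["batch_id"])] += 1
    # Two log lines per message: halve to get message counts.
    return {key: n // 2 for key, n in lines.items()}

# Toy example: one ASU message = 2 log lines.
sample = [
    '{"service_name": "ASU", "batch_id": "c2a627b3", "message_type": "intel_requests.created", "_msg": "Processing"}',
    '{"service_name": "ASU", "batch_id": "c2a627b3", "message_type": "intel_requests.created", "_msg": "Processed"}',
]
print(count_messages(sample))  # {('ASU', 'c2a627b3'): 1}
```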
| batch_id | event-recorder | AIPT | ASU | ASU lost |
|---|---|---|---|---|
| c2a627b3 | 174 | 174 | 168 | 6 (3.4%) |
| 215181db | 22 | 22 | 21 | 1 (4.5%) |
| 812f79b7 | 68 | 68 | 66 | 2 (2.9%) |
| d334c245 | 64 | 64 | 62 | 2 (3.1%) |
| 2186cea6 | 22 | 22 | 20 | 2 (9.1%) |
| 7438ef94 | 54 | 54 | 51 | 3 (5.6%) |
| Total | 404 | 404 | 388 | 16 (3.96%) |
Verify: ASU batch c2a627b3 (expect 336 lines) vs event-recorder (expect 348 lines) vs AIPT (expect 348 lines) — or ASU all 6 batches (expect 776 lines)
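The expected line counts are just the message counts from the table doubled; a quick arithmetic sanity check:

```python
# Messages per batch from the count table above.
er_aipt = {"c2a627b3": 174, "215181db": 22, "812f79b7": 68,
           "d334c245": 64, "2186cea6": 22, "7438ef94": 54}
asu = {"c2a627b3": 168, "215181db": 21, "812f79b7": 66,
       "d334c245": 62, "2186cea6": 20, "7438ef94": 51}

# Two log lines (Processing + Processed) per message.
assert er_aipt["c2a627b3"] * 2 == 348   # event-recorder / AIPT, one batch
assert asu["c2a627b3"] * 2 == 336       # ASU, same batch
assert sum(asu.values()) * 2 == 776     # ASU, all six batches
assert sum(er_aipt.values()) - sum(asu.values()) == 16  # ASU lost messages
```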
Evidence: Raw VictoriaLogs stats query + output
Query (run per service):
service_name:exact("{service}") AND message_type:exact("intel_requests.created")
AND (batch_id:exact("c2a627b3-3f7c-4c89-808c-4d79b19c69cb")
OR batch_id:exact("215181db-45e2-474f-9c05-839eecf61cc5")
OR batch_id:exact("812f79b7-c50d-4fd2-ac28-cc5d0318ee23")
OR batch_id:exact("d334c245-898b-4caa-b96c-954b8be31414")
OR batch_id:exact("2186cea6-3ae5-40c1-805e-878cc123ca5c")
OR batch_id:exact("7438ef94-89a7-48ae-83d2-71eb8a138f52"))
_time:[2026-03-10T13:00:00Z, 2026-03-10T20:00:00Z]
| stats by (batch_id) count(*) as log_lines
| service | c2a627b3 | 215181db | 812f79b7 | d334c245 | 2186cea6 | 7438ef94 | total | ÷2 |
|---|---|---|---|---|---|---|---|---|
| event-recorder | 348 | 44 | 136 | 128 | 44 | 108 | 808 | 404 |
| AIPT | 348 | 44 | 136 | 128 | 44 | 108 | 808 | 404 |
| ASU | 336 | 42 | 132 | 124 | 40 | 102 | 776 | 388 |
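The raw query output cross-checks against the message-count table (line totals, halved, must equal the message totals):

```python
# Raw log-line counts per batch from the VictoriaLogs stats output (ASU row).
asu_lines = {"c2a627b3": 336, "215181db": 42, "812f79b7": 132,
             "d334c245": 124, "2186cea6": 40, "7438ef94": 102}

assert sum(asu_lines.values()) == 776        # total column
assert sum(asu_lines.values()) // 2 == 388   # ÷2 column = messages
```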
Cody provided intel_request_id values. These don't appear on message-processing log lines — those lines only have correlation_id (CID) and batch_id. To trace, you must map through the DB:
intel_request_id → metadata->>'correlation_id' (per-location, logged as CID)
→ metadata->>'batch_id' (per-request, = original correlation_id)
The batch_id is set to self.correlation_id in intel-requests-api (source). The mapping query was run on production-app-db-2-restore partition intel_requests_p20260301.
Evidence: DB query and ID mapping results
SELECT intel_request_id::text,
(metadata->>'correlation_id')::text as correlation_id,
(metadata->>'batch_id')::text as batch_id,
created::text
FROM intel_requests.intel_requests_p20260301
WHERE intel_request_id IN ('56a219a6-...', 'bc2d6cef-...', ...)

| intel_request_id | correlation_id | batch_id | created (UTC) |
|---|---|---|---|
| 56a219a6 | 11c84002 | c2a627b3 | 13:55:46 |
| bc2d6cef | e12e237a | c2a627b3 | 13:55:46 |
| eac57da4 | cec14379 | 215181db | 13:55:50 |
| 27b6e33c | e64cd0a6 | 215181db | 13:55:50 |
| 57007346 | e5234837 | 812f79b7 | 15:18:44 |
| 819448ae | 59355b93 | 812f79b7 | 15:18:44 |
| 55941d68 | ab310e6b | 812f79b7 | 15:18:44 |
| 9f3454be | af861b31 | d334c245 | 15:19:14 |
| 2cfbd887 | a578785d | 2186cea6 | 15:20:27 |
| 90125c20 | 602b7ea5 | 7438ef94 | 15:20:27 |
Why searching by intel_request_id is misleading:
| Service | intel_request_id in VictoriaLogs? | Source | Appears on intel_requests.created lines? |
|---|---|---|---|
| ASU | Yes | Vector extracts from `(Intel Request {uuid})` in `_msg` (VRL, Python) | No — only on power_status.updated discard lines |
| AIPT | Yes | Python logs it on fetch prediction step (source) | No — on a different log line, not Processing/Processed |
| event-recorder | No | Only stores correlation_id, organization_id as metadata (source) | N/A |
The shared amqp-consumer library (source) logs correlation_id, request_id, and batch_id on every Processing/Processed line. It never logs intel_request_id. These context fields are set at L632-642.
Vector's VRL pipeline (07-extract-ids.vrl) also extracts IDs from unstructured _msg text — (CID {uuid}) → correlation_id (VRL), (Batch ID {uuid}) → batch_id (VRL). All patterns use if !exists(...) guards so they don't conflict with structured JSON fields.
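The guard behavior can be illustrated in Python (a sketch of the logic only; the real rules live in 07-extract-ids.vrl, and the regexes here are assumed, not copied from it):

```python
import re

# Patterns mirroring the VRL extraction rules (illustrative regexes).
PATTERNS = {
    "correlation_id": re.compile(r"\(CID ([0-9a-f-]{36})\)"),
    "batch_id": re.compile(r"\(Batch ID ([0-9a-f-]{36})\)"),
}

def extract_ids(event):
    """Backfill IDs from _msg text, like VRL's `if !exists(...)` guards:
    a structured JSON field always wins over a value parsed from text."""
    for field, pattern in PATTERNS.items():
        if field not in event:  # the !exists(...) guard
            m = pattern.search(event.get("_msg", ""))
            if m:
                event[field] = m.group(1)
    return event

# A structured field is never overwritten by text extraction.
ev = {"_msg": "(CID 11111111-1111-1111-1111-111111111111)",
      "correlation_id": "from-json"}
assert extract_ids(ev)["correlation_id"] == "from-json"
```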
Impact: The structured fields in VictoriaLogs are correct for our batch-counting methodology. But searching by intel_request_id will find the wrong log lines (discard messages, not delivery proof).
- Lock contention between `incidents.received` and `intel_requests.created` in message-amplifier IS real (confirmed in message-amplifier logs for batch c2a627b3)
- The `_on_message` diagnostic logging in MR 17 is still the right next diagnostic step
- Consumer-side loss IS happening — just only in ASU, not everywhere
ASU loses ~4% of intel_requests.created messages consistently. event-recorder and AIPT (same library, same exchange) get 100%. Unverified theories:
- Silent drop in ASU handler — early return without logging
- Backpressure — ASU is 3x slower per message (0.12-0.15s vs 0.05s)
- Queue config — different prefetch, replicas, or resource limits
- Deploy MR 17 on ASU with `LOG_ON_MESSAGE=true` — proves whether messages reach ASU's callback
- Audit ASU handler for silent return paths on `intel_requests.created`
- Compare ASU queue config to event-recorder (prefetch, replicas, limits)
Verification after MR 17: compare "Received message" count on ASU vs event-recorder for same time window. Equal = handler bug. Lower = queue/delivery issue.
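The decision rule can be written down explicitly (a hypothetical helper, not part of MR 17):

```python
def classify(asu_received, event_recorder_received):
    """Interpret 'Received message' counts for the same time window.

    Equal counts mean messages reached ASU's callback and were dropped
    inside the handler; a lower ASU count means the loss happens before
    the callback, i.e. in the queue or delivery layer.
    """
    if asu_received == event_recorder_received:
        return "handler bug"
    if asu_received < event_recorder_received:
        return "queue/delivery issue"
    return "unexpected: ASU received more than event-recorder"

assert classify(404, 404) == "handler bug"
assert classify(388, 404) == "queue/delivery issue"
```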