| Status | PROVISIONAL — ASU-specific loss confirmed, root cause unknown |
| Supersedes | Previous RCA |
| Incident date | 2026-03-10 |
| Analysis date | 2026-03-16 |
| Reverted | intel-requests-api !48 (batch_id change from !47) |
2,798/300,839 intel requests on March 10 never had facts created (~0.93%).
The previous RCA said all consumers were losing messages due to a shutdown race in amqp-consumer. Wrong. Only ASU is losing them. event-recorder and AIPT received 100%. The previous analysis searched logs by intel_request_id — a field that doesn't exist on the log lines that prove delivery.
Counted intel_requests.created processing logs per batch per consumer. Each message = 2 log lines (Processing + Processed) from the shared amqp-consumer library (source).
| batch_id | event-recorder | AIPT | ASU | ASU lost |
|---|---|---|---|---|
| c2a627b3 | 174 | 174 | 168 | 6 (3.4%) |
| 215181db | 22 | 22 | 21 | 1 (4.5%) |
| 812f79b7 | 68 | 68 | 66 | 2 (2.9%) |
| d334c245 | 64 | 64 | 62 | 2 (3.1%) |
| 2186cea6 | 22 | 22 | 20 | 2 (9.1%) |
| 7438ef94 | 54 | 54 | 51 | 3 (5.6%) |
| Total | 404 | 404 | 388 | 16 (3.96%) |
Verify: ASU batch c2a627b3 (expect 336 lines) vs event-recorder (expect 348 lines) vs AIPT (expect 348 lines) — or ASU all 6 batches (expect 776 lines)
Evidence: Raw VictoriaLogs stats query + output
Query (run per service — replace {service} with service name):
```
service_name:exact("{service}") AND message_type:exact("intel_requests.created")
AND (batch_id:exact("c2a627b3-3f7c-4c89-808c-4d79b19c69cb")
  OR batch_id:exact("215181db-45e2-474f-9c05-839eecf61cc5")
  OR batch_id:exact("812f79b7-c50d-4fd2-ac28-cc5d0318ee23")
  OR batch_id:exact("d334c245-898b-4caa-b96c-954b8be31414")
  OR batch_id:exact("2186cea6-3ae5-40c1-805e-878cc123ca5c")
  OR batch_id:exact("7438ef94-89a7-48ae-83d2-71eb8a138f52"))
_time:[2026-03-10T13:00:00Z, 2026-03-10T20:00:00Z]
| stats by (batch_id) count(*) as log_lines
```
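For repeatability, the query can be run once per consumer from a small script. A minimal sketch, assuming the standard VictoriaLogs `/select/logsql/query` HTTP endpoint; the base URL and the exact `service_name` values are placeholders:

```python
# Sketch only: run the stats query above once per consumer and print the raw
# per-batch counts. VLOGS base URL and the service_name values are placeholders.
import requests

VLOGS = "http://victorialogs:9428"  # placeholder

QUERY = '''service_name:exact("{service}") AND message_type:exact("intel_requests.created")
AND (batch_id:exact("c2a627b3-3f7c-4c89-808c-4d79b19c69cb")
  OR batch_id:exact("215181db-45e2-474f-9c05-839eecf61cc5")
  OR batch_id:exact("812f79b7-c50d-4fd2-ac28-cc5d0318ee23")
  OR batch_id:exact("d334c245-898b-4caa-b96c-954b8be31414")
  OR batch_id:exact("2186cea6-3ae5-40c1-805e-878cc123ca5c")
  OR batch_id:exact("7438ef94-89a7-48ae-83d2-71eb8a138f52"))
_time:[2026-03-10T13:00:00Z, 2026-03-10T20:00:00Z]
| stats by (batch_id) count(*) as log_lines'''

for service in ("event-recorder", "aipt", "asu"):  # placeholder service_name values
    resp = requests.post(f"{VLOGS}/select/logsql/query",
                         data={"query": QUERY.format(service=service)})
    resp.raise_for_status()
    print(f"--- {service} ---")
    print(resp.text)  # one result line per batch_id with its log_lines count
```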
View raw logs: event-recorder | AIPT | ASU
| service | c2a627b3 | 215181db | 812f79b7 | d334c245 | 2186cea6 | 7438ef94 | total | ÷2 |
|---|---|---|---|---|---|---|---|---|
| event-recorder | 348 | 44 | 136 | 128 | 44 | 108 | 808 | 404 |
| AIPT | 348 | 44 | 136 | 128 | 44 | 108 | 808 | 404 |
| ASU | 336 | 42 | 132 | 124 | 40 | 102 | 776 | 388 |
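The ÷2 step and the loss rate follow directly from those line counts; a quick sanity check of the arithmetic:

```python
# Two log lines (Processing + Processed) per message, so messages = log_lines / 2.
line_counts = {"event-recorder": 808, "AIPT": 808, "ASU": 776}  # totals from the table above

messages = {svc: lines // 2 for svc, lines in line_counts.items()}
lost = messages["event-recorder"] - messages["ASU"]

print(messages)                                           # {'event-recorder': 404, 'AIPT': 404, 'ASU': 388}
print(lost, f"{lost / messages['event-recorder']:.2%}")   # 16 3.96%
```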
Cody provided intel_request_id values. These don't appear on message-processing log lines — those lines only have correlation_id (CID) and batch_id. To trace, you must map through the DB:
intel_request_id → metadata->>'correlation_id' (per-location, logged as CID)
→ metadata->>'batch_id' (per-request, = original correlation_id)
The batch_id is set to self.correlation_id in intel-requests-api (source). The mapping query was run on production-app-db-2-restore partition intel_requests_p20260301.
Evidence: DB query and ID mapping results
```sql
SELECT intel_request_id::text,
       (metadata->>'correlation_id')::text as correlation_id,
       (metadata->>'batch_id')::text as batch_id,
       created::text
FROM intel_requests.intel_requests_p20260301
WHERE intel_request_id IN ('56a219a6-...', 'bc2d6cef-...', ...);
```

| intel_request_id | correlation_id | batch_id | created (UTC) |
|---|---|---|---|
| 56a219a6 | 11c84002 | c2a627b3 | 13:55:46 |
| bc2d6cef | e12e237a | c2a627b3 | 13:55:46 |
| eac57da4 | cec14379 | 215181db | 13:55:50 |
| 27b6e33c | e64cd0a6 | 215181db | 13:55:50 |
| 57007346 | e5234837 | 812f79b7 | 15:18:44 |
| 819448ae | 59355b93 | 812f79b7 | 15:18:44 |
| 55941d68 | ab310e6b | 812f79b7 | 15:18:44 |
| 9f3454be | af861b31 | d334c245 | 15:19:14 |
| 2cfbd887 | a578785d | 2186cea6 | 15:20:27 |
| 90125c20 | 602b7ea5 | 7438ef94 | 15:20:27 |
Why searching by intel_request_id is misleading:
| Service | intel_request_id in VictoriaLogs? | Source | On intel_requests.created lines? | Proof |
|---|---|---|---|---|
| ASU | Yes | Vector extracts from (Intel Request {uuid}) in _msg (VRL, Python) | No — only on power_status.updated discard lines | search returns 0 results, free-text finds only discard line |
| AIPT | Yes | Python logs it on fetch prediction step (source) | No — different log line | finds fetch prediction, not Processing/Processed |
| event-recorder | No | Only stores correlation_id, organization_id as metadata (source) | N/A | — |
The shared amqp-consumer library (source) logs correlation_id, request_id, and batch_id on every Processing/Processed line. It never logs intel_request_id. These context fields are set at L632-642.
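For illustration only (this is not the library's actual code), the shape of those lines is roughly as follows; the point is that `intel_request_id` is simply never part of the bound context:

```python
# Hypothetical sketch of the context bound to every Processing/Processed line.
# Field names match what appears in VictoriaLogs; the function itself is illustrative.
import json
import logging

logger = logging.getLogger("amqp-consumer")

def log_stage(stage: str, message_type: str, context: dict) -> None:
    logger.info(json.dumps({
        "_msg": f"{stage} message",
        "message_type": message_type,                     # e.g. "intel_requests.created"
        "correlation_id": context.get("correlation_id"),  # per-location CID
        "request_id": context.get("request_id"),
        "batch_id": context.get("batch_id"),              # = original correlation_id
        # intel_request_id is never included, so searching by it misses these lines
    }))
```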
Vector's VRL pipeline (07-extract-ids.vrl) also extracts IDs from unstructured _msg text — (CID {uuid}) → correlation_id (VRL), (Batch ID {uuid}) → batch_id (VRL). All patterns use if !exists(...) guards so they don't conflict with structured JSON fields.
Impact: The structured fields in VictoriaLogs are correct for our batch-counting methodology. But searching by intel_request_id will find the wrong log lines (discard messages, not delivery proof).
- Lock contention between `incidents.received` and `intel_requests.created` in message-amplifier IS real (confirmed in message-amplifier logs for batch c2a627b3)
- The `_on_message` diagnostic logging in MR 17 is still the right next diagnostic step
- Consumer-side loss IS happening, but only in ASU, not everywhere
A replay tool was built into the ASU image and run as K8s Jobs in production. It reads events from Loki (the same events event-recorder stored), replays them through ASU's Processor, and counts: 409 = already processed (ASU had it), 2xx = new (ASU missed it). The tool intentionally does NOT add 409 to allowed_errors so duplicates are counted separately (source).
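A minimal sketch of that accounting, assuming a hypothetical `replay_event()` that returns the HTTP status ASU's Processor produced (or `None` when the event is discarded without an API call):

```python
# Sketch only: how new vs already-processed events are tallied during replay.
from collections import Counter

def summarize(events, replay_event) -> Counter:
    counts = Counter()
    for event in events:
        status = replay_event(event)       # hypothetical: push one Loki event through ASU's Processor
        if status is None:
            counts["discarded"] += 1       # no API call made
        elif status == 409:
            counts["duplicates_409"] += 1  # ASU already had it; deliberately not an allowed_error
        elif 200 <= status < 300:
            counts["new"] += 1             # ASU missed it the first time around
        else:
            counts["errors"] += 1
    return counts
```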
200 replay jobs ran across 4 dates, covering multiple days of production data:
| Run date | Jobs | New (missed) | 409s (had it) | Loss rate |
|---|---|---|---|---|
| 2026-01-15 | 35 | 4,438 | 428,977 | 1.024% |
| 2026-01-16 | 112 | 608 | 2,968,829 | 0.021% |
| 2026-01-17 | 5 | 272 | 3,024,879 | 0.009% |
| 2026-02-16 | 48 | 183 | 522,793 | 0.035% |
| Total | 200 | 5,501 | 6,945,478 | 0.079% |
Verify: `{service_name=~"replay-from-loki-hour-.*"} "TOTAL"` over Jan 15 – Mar 16
The Jan 15 run had the highest rate (1.02%) because it was the first replay — subsequent runs found fewer new events as the replay itself backfilled gaps. The steady-state rate of ~0.02-0.04% on reruns suggests the loss is ongoing but at a lower rate than Cody's Mar 10 sample (0.93%).
Evidence: Sample replay output (hour 18, Dec 29 data)
Pod: replay-from-loki-hour-18-4bht6, ran 2026-01-17T13:47Z, replaying Dec 29 18:00-19:00 UTC
```
REPLAY SUMMARY
Total events fetched from Loki: 952,391
Successfully processed (new):   104
Duplicates (409):               872,513
Discarded (no API call):        79,774
Errors:                         0

By event type:
Event Type                 New     409s    Skip
asset_status.updated        13      788    2047
incidents.deleted            0        5       7
incidents.updated            0      532   22060
intel_requests.created       1    41112      68
power_status.updated         0    62941   55574
search.expired              90   767135      18
TOTAL                      104   872513   79774
```
Verify: `{service_name="replay-from-loki-hour-18-4bht6"} "REPLAY SUMMARY" OR "TOTAL"` on Jan 17
ASU loses messages at a consistent low rate (~0.08% overall, spiking to ~1% on specific days). event-recorder and AIPT (same library, same exchange) get 100%. Unverified theories:
- Silent drop in ASU handler — early return without logging
- Backpressure — ASU is 3x slower per message (0.12-0.15s vs 0.05s)
- Queue config — different prefetch, replicas, or resource limits
- Deploy MR 17 on ASU with `LOG_ON_MESSAGE=true` — proves whether messages reach ASU's callback
- Audit ASU handler for silent return paths on `intel_requests.created` (see the sketch after this list)
- Compare ASU queue config to event-recorder (prefetch, replicas, limits)
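As a hypothetical illustration of what that audit is looking for (not ASU's actual code), a handler shaped like this drops a request without creating facts and without leaving any log evidence:

```python
# Hypothetical handler shape to hunt for during the audit: the early return
# creates no facts and logs nothing, so the request simply "disappears".
def create_facts(location: dict) -> None:
    ...  # placeholder for the normal fact-creation path

def handle_intel_request_created(payload: dict) -> None:
    locations = payload.get("locations") or []
    if not locations:
        return  # silent drop: no warning, no metric, no fact created

    for location in locations:
        create_facts(location)
```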
Verification after MR 17: compare "Received message" count on ASU vs event-recorder for same time window. Equal = handler bug. Lower = queue/delivery issue.