| Status | PROVISIONAL — ASU-specific loss confirmed, root cause unknown |
|---|---|
| Supersedes | Previous RCA |
| Incident date | 2026-03-10 |
| Analysis date | 2026-03-16 |
| Reverted | intel-requests-api !48 (batch_id change from !47) |
2,798/300,839 intel requests on March 10 never had facts created (~0.93%).
The previous RCA said all consumers were losing messages due to a shutdown race in amqp-consumer. Wrong. Only ASU is losing them. event-recorder and AIPT received 100%. The previous analysis searched logs by intel_request_id — a field that doesn't exist on the log lines that prove delivery.
Counted intel_requests.created processing logs per batch per consumer. Each message = 2 log lines (Processing + Processed) from the shared amqp-consumer library (source).
| batch_id | event-recorder | AIPT | ASU | ASU lost |
|---|---|---|---|---|
| c2a627b3 | 174 | 174 | 168 | 6 (3.4%) |
| 215181db | 22 | 22 | 21 | 1 (4.5%) |
| 812f79b7 | 68 | 68 | 66 | 2 (2.9%) |
| d334c245 | 64 | 64 | 62 | 2 (3.1%) |
| 2186cea6 | 22 | 22 | 20 | 2 (9.1%) |
| 7438ef94 | 54 | 54 | 51 | 3 (5.6%) |
| Total | 404 | 404 | 388 | 16 (3.96%) |
Verify: ASU batch c2a627b3 (expect 336 lines) vs event-recorder (expect 348 lines) vs AIPT (expect 348 lines) — or ASU all 6 batches (expect 776 lines)
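The arithmetic behind these expected line counts can be checked mechanically. A minimal sketch, with the per-batch message counts hard-coded from the table above:

```python
# Per-batch message counts from the table above:
# (event-recorder, AIPT, ASU).
COUNTS = {
    "c2a627b3": (174, 174, 168),
    "215181db": (22, 22, 21),
    "812f79b7": (68, 68, 66),
    "d334c245": (64, 64, 62),
    "2186cea6": (22, 22, 20),
    "7438ef94": (54, 54, 51),
}

def expected_log_lines(messages: int) -> int:
    """Each delivered message produces 2 log lines: Processing + Processed."""
    return 2 * messages

def asu_lost(batch: str) -> int:
    """Messages event-recorder received but ASU did not, for one batch."""
    event_recorder, _aipt, asu = COUNTS[batch]
    return event_recorder - asu
```

For batch c2a627b3 this gives the 336-vs-348 line counts quoted above, and summing `asu_lost` across all six batches reproduces the 16-message total.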
Evidence: Raw VictoriaLogs stats query + output
Query (run per service — replace {service} with service name):
```
service_name:exact("{service}") AND message_type:exact("intel_requests.created")
AND (batch_id:exact("c2a627b3-3f7c-4c89-808c-4d79b19c69cb")
  OR batch_id:exact("215181db-45e2-474f-9c05-839eecf61cc5")
  OR batch_id:exact("812f79b7-c50d-4fd2-ac28-cc5d0318ee23")
  OR batch_id:exact("d334c245-898b-4caa-b96c-954b8be31414")
  OR batch_id:exact("2186cea6-3ae5-40c1-805e-878cc123ca5c")
  OR batch_id:exact("7438ef94-89a7-48ae-83d2-71eb8a138f52"))
_time:[2026-03-10T13:00:00Z, 2026-03-10T20:00:00Z]
| stats by (batch_id) count(*) as log_lines
```
View raw logs: event-recorder | AIPT | ASU
| service | c2a627b3 | 215181db | 812f79b7 | d334c245 | 2186cea6 | 7438ef94 | total | ÷2 |
|---|---|---|---|---|---|---|---|---|
| event-recorder | 348 | 44 | 136 | 128 | 44 | 108 | 808 | 404 |
| AIPT | 348 | 44 | 136 | 128 | 44 | 108 | 808 | 404 |
| ASU | 336 | 42 | 132 | 124 | 40 | 102 | 776 | 388 |
Cody provided intel_request_id values. These don't appear on message-processing log lines — those lines only have correlation_id (CID) and batch_id. To trace, you must map through the DB:
intel_request_id → metadata->>'correlation_id' (per-location, logged as CID)
→ metadata->>'batch_id' (per-request, = original correlation_id)
The batch_id is set to self.correlation_id in intel-requests-api (source). The mapping query was run on production-app-db-2-restore partition intel_requests_p20260301.
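The two-hop mapping can be sketched as a lookup over the DB rows (the dict shape here is an illustration of the `metadata` JSONB column, not actual application code):

```python
def map_to_batch(intel_request_id: str, rows: dict) -> tuple[str, str]:
    """Two-hop mapping through intel_requests metadata:
    intel_request_id -> correlation_id (per-location, logged as CID)
                     -> batch_id (per-request, = original correlation_id)."""
    meta = rows[intel_request_id]["metadata"]
    return meta["correlation_id"], meta["batch_id"]
```

Given the first row of the mapping table below, `map_to_batch("56a219a6", rows)` yields the CID `11c84002` and batch `c2a627b3` used in the log counts.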
Evidence: DB query and ID mapping results
```sql
SELECT intel_request_id::text,
       (metadata->>'correlation_id')::text AS correlation_id,
       (metadata->>'batch_id')::text AS batch_id,
       created::text
FROM intel_requests.intel_requests_p20260301
WHERE intel_request_id IN ('56a219a6-...', 'bc2d6cef-...', ...)
```

| intel_request_id | correlation_id | batch_id | created (UTC) |
|---|---|---|---|
| 56a219a6 | 11c84002 | c2a627b3 | 13:55:46 |
| bc2d6cef | e12e237a | c2a627b3 | 13:55:46 |
| eac57da4 | cec14379 | 215181db | 13:55:50 |
| 27b6e33c | e64cd0a6 | 215181db | 13:55:50 |
| 57007346 | e5234837 | 812f79b7 | 15:18:44 |
| 819448ae | 59355b93 | 812f79b7 | 15:18:44 |
| 55941d68 | ab310e6b | 812f79b7 | 15:18:44 |
| 9f3454be | af861b31 | d334c245 | 15:19:14 |
| 2cfbd887 | a578785d | 2186cea6 | 15:20:27 |
| 90125c20 | 602b7ea5 | 7438ef94 | 15:20:27 |
Why searching by intel_request_id is misleading:
| Service | intel_request_id in VictoriaLogs? | Source | On intel_requests.created lines? | Proof |
|---|---|---|---|---|
| ASU | Yes | Vector extracts from `(Intel Request {uuid})` in `_msg` (VRL, Python) | No — only on power_status.updated discard lines | search returns 0 results, free-text finds only discard line |
| AIPT | Yes | Python logs it on fetch prediction step (source) | No — different log line | finds fetch prediction, not Processing/Processed |
| event-recorder | No | Only stores correlation_id, organization_id as metadata (source) | N/A | — |
The shared amqp-consumer library (source) logs correlation_id, request_id, and batch_id on every Processing/Processed line. It never logs intel_request_id. These context fields are set at L632-642.
Vector's VRL pipeline (07-extract-ids.vrl) also extracts IDs from unstructured _msg text — (CID {uuid}) → correlation_id (VRL), (Batch ID {uuid}) → batch_id (VRL). All patterns use if !exists(...) guards so they don't conflict with structured JSON fields.
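The guarded-extraction behavior can be approximated in Python. The regexes below are illustrative reconstructions of the patterns named above, not the actual contents of 07-extract-ids.vrl:

```python
import re

# Illustrative approximations of the VRL extraction patterns.
# The authoritative rules live in 07-extract-ids.vrl.
PATTERNS = {
    "correlation_id": re.compile(r"\(CID ([0-9a-f-]{36})\)"),
    "batch_id": re.compile(r"\(Batch ID ([0-9a-f-]{36})\)"),
    "intel_request_id": re.compile(r"\(Intel Request ([0-9a-f-]{36})\)"),
}

def extract_ids(event: dict) -> dict:
    """Mimic the `if !exists(...)` guards: a structured JSON field wins;
    only fall back to parsing the unstructured _msg text when absent."""
    for field, pattern in PATTERNS.items():
        if field not in event:  # VRL: if !exists(.field)
            match = pattern.search(event.get("_msg", ""))
            if match:
                event[field] = match.group(1)
    return event
```

Because of the guards, a log line that already carries a structured `batch_id` keeps it even if its `_msg` text also contains a `(Batch ID …)` marker.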
Impact: The structured fields in VictoriaLogs are correct for our batch-counting methodology. But searching by intel_request_id will find the wrong log lines (discard messages, not delivery proof).
- Lock contention between `incidents.received` and `intel_requests.created` in message-amplifier IS real (confirmed in message-amplifier logs for batch c2a627b3)
- The `_on_message` diagnostic logging in MR 17 is still the right next diagnostic step
- Consumer-side loss IS happening — just only in ASU, not everywhere
A replay tool was built into the ASU image and run as K8s Jobs in production. It reads events from Loki (the same events event-recorder stored), replays them through ASU's Processor, and counts: 409 = already processed (ASU had it), 2xx = new (ASU missed it). The tool intentionally does NOT add 409 to allowed_errors so duplicates are counted separately (source).
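The replay tool's counting rule can be sketched as follows (function and bucket names are assumptions for illustration, not the tool's actual code):

```python
from collections import Counter

def classify(status_code: int, counts: Counter) -> None:
    """Replay counting rule: a 409 means ASU already processed the event
    in production (duplicate); a 2xx means the replay created it, i.e.
    ASU missed it. 409 is deliberately NOT an allowed error, so
    duplicates are tallied separately instead of being swallowed."""
    if status_code == 409:
        counts["had_it"] += 1   # duplicate: ASU saw it in production
    elif 200 <= status_code < 300:
        counts["missed"] += 1   # new: ASU never processed it
    else:
        counts["errors"] += 1

def loss_rate(counts: Counter) -> float:
    """Fraction of delivered events that ASU missed."""
    delivered = counts["missed"] + counts["had_it"]
    return counts["missed"] / delivered if delivered else 0.0
```

This is the rate reported in the tables below: missed / (missed + 409s).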
200 replay jobs ran across 4 dates, covering multiple days of production data:
| Run date | Jobs | New (missed) | 409s (had it) | Loss rate |
|---|---|---|---|---|
| 2026-01-15 | 35 | 4,438 | 428,977 | 1.024% |
| 2026-01-16 | 112 | 608 | 2,968,829 | 0.021% |
| 2026-01-17 | 5 | 272 | 3,024,879 | 0.009% |
| 2026-02-16 | 48 | 183 | 522,793 | 0.035% |
| Total | 200 | 5,501 | 6,945,478 | 0.079% |
Verify: `{service_name=~"replay-from-loki-hour-.*"} "TOTAL"` over Jan 15 – Mar 16
The Jan 15 run had the highest rate (1.02%) because it was the first replay — subsequent runs found fewer new events as the replay itself backfilled gaps. The steady-state rate of ~0.02-0.04% on reruns suggests the loss is ongoing but at a lower rate than Cody's Mar 10 sample (0.93%).
Evidence: Sample replay output (hour 18, Dec 29 data)
Pod: replay-from-loki-hour-18-4bht6, ran 2026-01-17T13:47Z, replaying Dec 29 18:00-19:00 UTC
```
REPLAY SUMMARY
Total events fetched from Loki: 952,391
Successfully processed (new):   104
Duplicates (409):               872,513
Discarded (no API call):        79,774
Errors:                         0

By event type:
Event Type               New    409s     Skip
asset_status.updated     13     788      2047
incidents.deleted        0      5        7
incidents.updated        0      532      22060
intel_requests.created   1      41112    68
power_status.updated     0      62941    55574
search.expired           90     767135   18
TOTAL                    104    872513   79774
```
Verify: `{service_name="replay-from-loki-hour-18-4bht6"} "REPLAY SUMMARY" OR "TOTAL"` on Jan 17
Acute — AMQP connection failures: On Jan 13 at 22:00 UTC, assets-api returned 500s which escalated into AMQP connection death (CHANNEL_ERROR, PRECONDITION_FAILED - unknown delivery tag, Failed to ACK message ... 'NoneType'). This single hour accounts for 1,975 of ~2,000 missed intel_requests.created events on that day (VLogs).
Evidence: Per-hour intel_requests.created replay results (Jan 13 data)
| Hour (UTC) | New (missed) | 409s | Loss rate | Note |
|---|---|---|---|---|
| 00-21 | 0-4 each | 5,000-16,000 | <0.05% | Steady state |
| 22 | 1,975 | 6,553 | 23.2% | AMQP connection death |
| 23 | 13 | 6,711 | 0.19% | Tail of recovery |
Verify: `{service_name=~"replay-from-loki-hour-.*"} " intel_requests.created"` on Jan 15
Chronic — asset_status.updated code path: The _process_asset_status_updated_message handler has a name-to-UUID resolution path (processor.py:554-634) that uses a Redis cache with 24h TTL. When the cache is cold or the asset doesn't exist yet, _get_assets_by_name returns [] → zero API calls → event silently discarded. Other event types (power_status.updated, intel_requests.created) use stable UUIDs from the message body and never hit this cache — explaining why they have 0% loss at steady state.
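The silent-discard shape described above can be sketched as follows. Names mirror the handler, but the bodies are simplified assumptions (a plain dict stands in for the Redis cache):

```python
def _get_assets_by_name(name: str, cache: dict) -> list:
    """Simplified stand-in for the Redis-backed name->UUID lookup.
    A cold cache and a not-yet-created asset both yield an empty list."""
    return cache.get(name, [])

def process_asset_status_updated(message: dict, cache: dict) -> int:
    """Returns the number of API calls made. When name resolution
    yields [], the loop body never runs: zero API calls, no error log,
    and the event is silently discarded."""
    api_calls = 0
    for _asset in _get_assets_by_name(message["asset_name"], cache):
        api_calls += 1  # stand-in for the real per-asset API call
    return api_calls
```

By contrast, handlers keyed on UUIDs from the message body never enter this lookup, which matches the 0% steady-state loss for the other event types.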
| Event type | Asset identity | Cache dependency | Steady-state loss |
|---|---|---|---|
| power_status.updated | intel_request_id UUID in message | None | 0% |
| intel_requests.created | intel_request_id UUID in message | None | ~0% (only during AMQP failures) |
| asset_status.updated | Name→UUID via Redis cache (source) | 24h TTL | ~5% |
| search.expired | intel_request_id UUID in message | None | ~0% (sporadic spikes) |
- AMQP connection failures cause bulk loss of all event types during the failure window. Messages in-flight at connection death may fail to ACK, and RabbitMQ's `PRECONDITION_FAILED` on the new channel means the redelivered messages can't be acknowledged either. This is the dominant cause for `intel_requests.created` loss (Cody's original report).
- Redis cache misses in `_get_assets_by_name` cause chronic `asset_status.updated` loss. When the cache returns empty (asset not yet created, or stale TTL), the handler returns `[]` and the event is silently discarded with no API call and no error log.
- Replay March 10 data with enhanced logging (MR 29) — the `NEW:` lines now include `correlation_id` and `batch_id`, letting us cross-reference against ASU production logs to confirm whether each missed event was a delivery failure (CID absent from ASU logs) or a processing failure (CID present but no API call)
- Deploy MR 17 on ASU with `LOG_ON_MESSAGE=true` — proves whether messages reach ASU's AMQP callback at all
- Investigate AMQP reconnection behavior — the `PRECONDITION_FAILED - unknown delivery tag` error suggests messages processed on the old connection can't be ACKed after reconnect, potentially losing the entire in-flight batch
- Consider adding logging to `_get_assets_by_name` when it returns `[]` — currently this is a silent discard
Verification after replay: grep `NEW:` lines from replay logs, extract correlation_ids, then:

`{service_name="asset-status-updater"} correlation_id:exact("{cid}") AND "Processing"`
Present = ASU received it but processing failed silently. Absent = RabbitMQ never delivered it.
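That cross-reference step can be scripted. The `NEW:` line format below is an assumption based on the MR 29 description above; adjust the regex to the actual replay output:

```python
import re

# Assumed shape of a replay "NEW:" line carrying the MR 29 fields.
NEW_LINE = re.compile(r"NEW: .*correlation_id=([0-9a-f-]{36})")

def extract_cids(replay_log: str) -> list:
    """Pull the correlation_ids of missed events out of replay output."""
    return NEW_LINE.findall(replay_log)

def classify_miss(cid: str, asu_processing_cids: set) -> str:
    """CID present in ASU's Processing logs = received but processing
    failed silently; absent = RabbitMQ never delivered it."""
    if cid in asu_processing_cids:
        return "processing_failure"
    return "delivery_failure"
```

The second argument is the set of CIDs returned by the VictoriaLogs query above, one lookup per missed event.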