| Status | PROVISIONAL — ASU-specific loss confirmed, root cause unknown |
| Supersedes | Previous RCA |
| Incident date | 2026-03-10 |
| Analysis date | 2026-03-16 |
| Reverted | intel-requests-api !48 (batch_id change from !47) |
2,798/300,839 intel requests on March 10 never had facts created (~0.93%).
The previous RCA said all consumers were losing messages due to a shutdown race in amqp-consumer. Wrong. Only ASU is losing them. event-recorder and AIPT received 100%. The previous analysis searched logs by intel_request_id — a field that doesn't exist on the log lines that prove delivery.
Counted intel_requests.created processing logs per batch per consumer. Each message = 2 log lines (Processing + Processed) from the shared amqp-consumer library (source).
| batch_id | event-recorder | AIPT | ASU | ASU lost |
|---|---|---|---|---|
| c2a627b3 | 174 | 174 | 168 | 6 (3.4%) |
| 215181db | 22 | 22 | 21 | 1 (4.5%) |
| 812f79b7 | 68 | 68 | 66 | 2 (2.9%) |
| d334c245 | 64 | 64 | 62 | 2 (3.1%) |
| 2186cea6 | 22 | 22 | 20 | 2 (9.1%) |
| 7438ef94 | 54 | 54 | 51 | 3 (5.6%) |
| Total | 404 | 404 | 388 | 16 (3.96%) |
Verify: ASU batch c2a627b3 (expect 336 lines) vs event-recorder (expect 348 lines) vs AIPT (expect 348 lines) — or ASU all 6 batches (expect 776 lines)
Evidence: Raw VictoriaLogs stats query + output
Query (run per service — replace {service} with service name):

```
{service_name="{service}"} message_type:"intel_requests.created"
AND batch_id:~"c2a627b3|215181db|812f79b7|d334c245|2186cea6|7438ef94"
_time:[2026-03-10T13:00:00Z, 2026-03-10T20:00:00Z]
| stats by (batch_id) count(*) as log_lines
```
View raw logs: event-recorder | AIPT | ASU
| service | c2a627b3 | 215181db | 812f79b7 | d334c245 | 2186cea6 | 7438ef94 | total | ÷2 |
|---|---|---|---|---|---|---|---|---|
| event-recorder | 348 | 44 | 136 | 128 | 44 | 108 | 808 | 404 |
| AIPT | 348 | 44 | 136 | 128 | 44 | 108 | 808 | 404 |
| ASU | 336 | 42 | 132 | 124 | 40 | 102 | 776 | 388 |
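To re-derive this table programmatically, here is a minimal sketch against the VictoriaLogs LogsQL endpoint used later in this doc. The `service_name` values for event-recorder and AIPT are assumptions (ASU's is confirmed below as `asset-status-updater`); substitute the real ones.

```python
# Re-run the batch-count comparison per consumer. The // 2 reflects the
# Processing + Processed pair that amqp-consumer logs per delivered message.
import json
import requests

VLOGS = "http://vlogs-production.monitoring.svc.cluster.local/select/logsql/query"
BATCHES = "c2a627b3|215181db|812f79b7|d334c245|2186cea6|7438ef94"

def messages_delivered(service: str) -> int:
    query = (
        f'{{service_name="{service}"}} message_type:"intel_requests.created" '
        f'AND batch_id:~"{BATCHES}" '
        '_time:[2026-03-10T13:00:00Z, 2026-03-10T20:00:00Z] '
        '| stats count(*) as log_lines'
    )
    resp = requests.post(VLOGS, data={"query": query})
    resp.raise_for_status()
    # The endpoint streams one JSON object per line; the stats pipe yields one row.
    row = json.loads(resp.text.splitlines()[0])
    return int(row["log_lines"]) // 2

# "event-recorder" and "aipt" are assumed service_name values.
for svc in ("event-recorder", "aipt", "asset-status-updater"):
    print(svc, messages_delivered(svc))
```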
Cody provided intel_request_id values. These don't appear on message-processing log lines — those lines only have correlation_id (CID) and batch_id. To trace, you must map through the DB:
intel_request_id → metadata->>'correlation_id' (per-location, logged as CID)
→ metadata->>'batch_id' (per-request, = original correlation_id)
The batch_id is set to self.correlation_id in intel-requests-api (source). The mapping query was run on production-app-db-2-restore partition intel_requests_p20260301.
Evidence: DB query and ID mapping results
```sql
SELECT intel_request_id::text,
       (metadata->>'correlation_id')::text as correlation_id,
       (metadata->>'batch_id')::text as batch_id,
       created::text
FROM intel_requests.intel_requests_p20260301
WHERE intel_request_id IN ('56a219a6-...', 'bc2d6cef-...', ...)
```

| intel_request_id | correlation_id | batch_id | created (UTC) |
|---|---|---|---|
| 56a219a6 | 11c84002 | c2a627b3 | 13:55:46 |
| bc2d6cef | e12e237a | c2a627b3 | 13:55:46 |
| eac57da4 | cec14379 | 215181db | 13:55:50 |
| 27b6e33c | e64cd0a6 | 215181db | 13:55:50 |
| 57007346 | e5234837 | 812f79b7 | 15:18:44 |
| 819448ae | 59355b93 | 812f79b7 | 15:18:44 |
| 55941d68 | ab310e6b | 812f79b7 | 15:18:44 |
| 9f3454be | af861b31 | d334c245 | 15:19:14 |
| 2cfbd887 | a578785d | 2186cea6 | 15:20:27 |
| 90125c20 | 602b7ea5 | 7438ef94 | 15:20:27 |
Why searching by intel_request_id is misleading:
| Service | intel_request_id in VictoriaLogs? | Source | On intel_requests.created lines? | Proof |
|---|---|---|---|---|
| ASU | Yes | Vector extracts from (Intel Request {uuid}) in _msg (VRL, Python) | No — only on power_status.updated discard lines | search returns 0 results, free-text finds only discard line |
| AIPT | Yes | Python logs it on fetch prediction step (source) | No — different log line | finds fetch prediction, not Processing/Processed |
| event-recorder | No | Only stores correlation_id, organization_id as metadata (source) | N/A | — |
The shared amqp-consumer library (source) logs correlation_id, request_id, and batch_id on every Processing/Processed line. It never logs intel_request_id. These context fields are set at L632-642.
Vector's VRL pipeline (07-extract-ids.vrl) also extracts IDs from unstructured _msg text — (CID {uuid}) → correlation_id (VRL), (Batch ID {uuid}) → batch_id (VRL). All patterns use if !exists(...) guards so they don't conflict with structured JSON fields.
Impact: The structured fields in VictoriaLogs are correct for our batch-counting methodology. But searching by intel_request_id will find the wrong log lines (discard messages, not delivery proof).
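For reference, a rough Python equivalent of that extraction step. The real logic lives in 07-extract-ids.vrl; the exact regexes here are assumptions based on the patterns quoted above.

```python
# Mirror of the VRL ID-extraction described above (illustrative only).
import re

UUID = r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}"
PATTERNS = {
    "correlation_id": re.compile(rf"\(CID ({UUID})\)"),
    "batch_id": re.compile(rf"\(Batch ID ({UUID})\)"),
    "intel_request_id": re.compile(rf"\(Intel Request ({UUID})\)"),
}

def extract_ids(event: dict) -> dict:
    msg = event.get("_msg", "")
    for field, pattern in PATTERNS.items():
        if field not in event:  # mirrors VRL's `if !exists(...)` guard
            m = pattern.search(msg)
            if m:
                event[field] = m.group(1)
    return event
```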
- Lock contention between `incidents.received` and `intel_requests.created` in message-amplifier IS real (confirmed in message-amplifier logs for batch c2a627b3)
- The `_on_message` diagnostic logging in MR 17 is still the right next diagnostic step
- Consumer-side loss IS happening — but only in ASU, not everywhere
A replay tool was built into the ASU image and run as K8s Jobs in production. It reads events from Loki (the same events event-recorder stored), replays them through ASU's Processor, and counts: 409 = already processed (ASU had it), 2xx = new (ASU missed it). The tool intentionally does NOT add 409 to allowed_errors so duplicates are counted separately (source).
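A minimal sketch of that accounting, assuming a `processor.process()` call that surfaces the HTTP status (the real tool lives in the ASU image and reads events from Loki):

```python
# Replay accounting as described above. `processor.process()` and its return
# value are assumptions; the real tool posts each event through ASU's
# Processor and inspects the API response.
from collections import Counter

def replay(events, processor) -> Counter:
    counts = Counter()
    for event in events:
        status = processor.process(event)  # hypothetical: HTTP status, or None if no API call
        if status is None:
            counts["discarded"] += 1   # handler made no API call
        elif status == 409:
            # 409 is deliberately NOT in allowed_errors, so duplicates
            # (ASU already had it) are counted separately.
            counts["duplicate_409"] += 1
        elif 200 <= status < 300:
            counts["new_missed"] += 1  # ASU missed it in production
        else:
            counts["errors"] += 1
    return counts
```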
200 replay jobs ran across 4 dates, covering multiple days of production data:
| Run date | Jobs | New (missed) | 409s (had it) | Loss rate |
|---|---|---|---|---|
| 2026-01-15 | 35 | 4,438 | 428,977 | 1.024% |
| 2026-01-16 | 112 | 608 | 2,968,829 | 0.021% |
| 2026-01-17 | 5 | 272 | 3,024,879 | 0.009% |
| 2026-02-16 | 48 | 183 | 522,793 | 0.035% |
| Total | 200 | 5,501 | 6,945,478 | 0.079% |
Verify: `{service_name=~"replay-from-loki-hour-.*"} "TOTAL"` over Jan 15 – Mar 16
The Jan 15 run had the highest rate (1.02%) because it was the first replay — subsequent runs found fewer new events as the replay itself backfilled gaps. The steady-state rate of ~0.02-0.04% on reruns suggests the loss is ongoing but at a lower rate than Cody's Mar 10 sample (0.93%).
Evidence: Sample replay output (hour 18, Dec 29 data)
Pod: replay-from-loki-hour-18-4bht6, ran 2026-01-17T13:47Z, replaying Dec 29 18:00-19:00 UTC
```
REPLAY SUMMARY
Total events fetched from Loki: 952,391
Successfully processed (new):   104
Duplicates (409):               872,513
Discarded (no API call):        79,774
Errors:                         0

By event type:
Event Type              New   409s     Skip
asset_status.updated    13    788      2047
incidents.deleted       0     5        7
incidents.updated       0     532      22060
intel_requests.created  1     41112    68
power_status.updated    0     62941    55574
search.expired          90    767135   18
TOTAL                   104   872513   79774
```
Verify: `{service_name="replay-from-loki-hour-18-4bht6"} "REPLAY SUMMARY" OR "TOTAL"` on Jan 17
Acute — AMQP connection failures: On Jan 13 at 22:00 UTC, assets-api returned 500s which escalated into AMQP connection death (`CHANNEL_ERROR`, `PRECONDITION_FAILED - unknown delivery tag`, `Failed to ACK message ... 'NoneType'`). This single hour accounts for 1,975 of ~2,000 missed `intel_requests.created` events on that day (VLogs).
Evidence: Per-hour intel_requests.created replay results (Jan 13 data)
| Hour (UTC) | New (missed) | 409s | Loss rate | Note |
|---|---|---|---|---|
| 00-21 | 0-4 each | 5,000-16,000 | <0.05% | Steady state |
| 22 | 1,975 | 6,553 | 23.2% | AMQP connection death |
| 23 | 13 | 6,711 | 0.19% | Tail of recovery |
Verify: `{service_name=~"replay-from-loki-hour-.*"} " intel_requests.created"` on Jan 15
Chronic — asset_status.updated code path: The `_process_asset_status_updated_message` handler has a name-to-UUID resolution path (processor.py:554-634) that uses a Redis cache with 24h TTL. When the cache is cold or the asset doesn't exist yet, `_get_assets_by_name` returns `[]` → zero API calls → event silently discarded. Other event types (`power_status.updated`, `intel_requests.created`) use stable UUIDs from the message body and never hit this cache — explaining why they have ~0% loss at steady state.
| Event type | Asset identity | Cache dependency | Steady-state loss |
|---|---|---|---|
| power_status.updated | intel_request_id UUID in message | None | 0% |
| intel_requests.created | intel_request_id UUID in message | None | ~0% (only during AMQP failures) |
| asset_status.updated | Name→UUID via Redis cache (source) | 24h TTL | ~5% |
| search.expired | intel_request_id UUID in message | None | ~0% (sporadic spikes) |
- AMQP connection failures cause bulk loss of all event types during the failure window. Messages in flight at connection death may fail to ACK, and RabbitMQ's `PRECONDITION_FAILED` on the new channel means the redelivered messages can't be acknowledged either. This is the dominant cause for `intel_requests.created` loss (Cody's original report).
- Redis cache misses in `_get_assets_by_name` cause chronic `asset_status.updated` loss. When the cache returns empty (asset not yet created, or stale TTL), the handler returns `[]` and the event is silently discarded with no API call and no error log.
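A minimal sketch of that silent-discard path. The handler and helper names come from processor.py as cited above; the message fields and the update call are illustrative assumptions, not the actual implementation.

```python
class Processor:
    """Sketch only — handler/helper names from processor.py; bodies assumed."""

    async def _process_asset_status_updated_message(self, message: dict) -> None:
        # Name→UUID resolution backed by a Redis cache with a 24h TTL.
        assets = await self._get_assets_by_name(message["asset_name"])  # assumed field
        if not assets:
            # Cold cache or asset not yet created: [] falls through with
            # zero API calls and no error log — a silent discard.
            return
        for asset in assets:
            await self._update_asset_status(asset["id"], message)  # hypothetical call

    async def _get_assets_by_name(self, name: str) -> list[dict]:
        ...  # Redis-cached name→UUID lookup; returns [] on miss

    async def _update_asset_status(self, asset_id: str, message: dict) -> None:
        ...  # POST to assets-api (assumed)
```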
- Replay March 10 data with enhanced logging (MR 29) — the `NEW:` lines now include `correlation_id` and `batch_id`, letting us cross-reference against ASU production logs to confirm whether each missed event was a delivery failure (CID absent from ASU logs) or a processing failure (CID present but no API call)
- Deploy MR 17 on ASU with `LOG_ON_MESSAGE=true` — proves whether messages reach ASU's AMQP callback at all
- Investigate AMQP reconnection behavior — the `PRECONDITION_FAILED - unknown delivery tag` error suggests messages processed on the old connection can't be ACKed after reconnect, potentially losing the entire in-flight batch
- Consider adding logging to `_get_assets_by_name` when it returns `[]` — currently this is a silent discard
- MR 29 merged and image built (`docker.gisual.net/backend/consumers/asset-status-updater:latest`)
- March 10 data still in Loki retention (verify with step 1)
- `kubectl` access to production cluster
```bash
# Epoch range = 2026-03-10T00:00:00Z to 01:00:00Z
kubectl run --rm -it loki-check --image=curlimages/curl --restart=Never -- \
  curl -s 'http://loki-gateway.monitoring.svc.cluster.local/loki/api/v1/query_range' \
  --data-urlencode 'query={source="events", name="intel_requests.created"}' \
  --data-urlencode 'start=1773100800' \
  --data-urlencode 'end=1773104400' \
  --data-urlencode 'limit=1' | head -c 200
```

If empty/error → data has aged out, stop here.
```bash
cd ~/src/gisual/backend/consumers/asset-status-updater/scripts

# Dry run first — prints manifests without applying
./generate-parallel-jobs.sh 2026-03-10 --dry-run | head -50

# Apply (ensure IMAGE_TAG points to the build with MR 29)
IMAGE_TAG=latest ./generate-parallel-jobs.sh 2026-03-10
```

```bash
# Watch job completion
kubectl get jobs -l app.kubernetes.io/name=replay-from-loki -w

# Tail a specific hour's logs
kubectl logs -f job/replay-from-loki-hour-13

# Check for errors across all jobs
kubectl logs -l app.kubernetes.io/name=replay-from-loki --tail=5 | grep -E "^(ERROR|REPLAY SUMMARY)"
```

Wait for all 24 jobs to complete, then:
```bash
# Dump all summaries
for hour in $(seq -w 0 23); do
  echo "=== Hour $hour ==="
  kubectl logs job/replay-from-loki-hour-$hour 2>/dev/null | grep -A 20 "REPLAY SUMMARY"
  echo
done > /tmp/replay-march10-summaries.txt
```

Or via VictoriaLogs (after logs are shipped):

```
{service_name=~"replay-from-loki-hour-.*"} "TOTAL"
```

Time range: from job start to +4 hours.
```bash
# Get all NEW lines (events ASU missed)
for hour in $(seq -w 0 23); do
  kubectl logs job/replay-from-loki-hour-$hour 2>/dev/null | grep "^.*NEW:"
done > /tmp/replay-march10-new-events.txt

# Count by event type
cat /tmp/replay-march10-new-events.txt | grep -oP 'NEW: \S+' | sort | uniq -c | sort -rn
```

```bash
# Clean up the jobs when done
kubectl delete jobs -l app.kubernetes.io/name=replay-from-loki
```

Once the replay completes and `NEW:` lines are collected:

```bash
# Parse correlation_ids from the enhanced log format
grep "NEW:" /tmp/replay-march10-new-events.txt \
  | grep -oP 'correlation_id=\K[a-f0-9-]{36}' \
  | sort -u > /tmp/new-correlation-ids.txt

wc -l /tmp/new-correlation-ids.txt  # How many missed events
```

For each correlation_id, search ASU's production logs to see if it logged "Processing message type" for that CID:
{service_name="asset-status-updater"} correlation_id:"{cid}" AND "Processing"
Time range: March 10 00:00–24:00 UTC (start=1773100800000&end=1773187200000).
Batch check (sample 10 CIDs):

```bash
head -10 /tmp/new-correlation-ids.txt | while read cid; do
  echo -n "$cid: "
  curl -s 'http://vlogs-production.monitoring.svc.cluster.local/select/logsql/query' \
    --data-urlencode "query={service_name=\"asset-status-updater\"} correlation_id:\"$cid\" AND \"Processing\"" \
    --data-urlencode 'start=2026-03-10T00:00:00Z' \
    --data-urlencode 'end=2026-03-11T00:00:00Z' \
    --data-urlencode 'limit=1' | wc -c
  # >2 bytes = found, 0-2 bytes = not found
done
```

| ASU logged "Processing" for CID? | Meaning | Root cause |
|---|---|---|
| No | RabbitMQ never delivered the message to ASU | AMQP connection failure or queue-level loss |
| Yes, and handler made API calls | ASU processed it but the POST failed (transient error) and retries exhausted | Retry/error-exchange exhaustion |
| Yes, but 0 API calls (DISCARDED in replay) | `_get_assets_by_name` returned empty at original time | Redis cache miss or asset not yet created |
Check if the missed events cluster around AMQP error windows:
{service_name="asset-status-updater"} level:"error" _time:[2026-03-10T00:00:00Z, 2026-03-11T00:00:00Z] | stats by (_time:15m) count(*) as errors
If NEW events cluster in the same 15-minute windows as errors → AMQP connection failure is the dominant cause. If they're spread evenly → application-layer (cache/handler) issue.
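A small sketch for that cross-check, assuming the MR 29 `NEW:` lines carry a leading ISO-8601 timestamp (adjust the parsing to the actual format):

```python
# Bucket missed events into the same 15-minute windows as the error counts
# from the query above. File path comes from the runbook; the timestamp
# position/format on NEW: lines is an assumption.
from collections import Counter
from datetime import datetime

def window(ts: str) -> str:
    dt = datetime.fromisoformat(ts.replace("Z", "+00:00"))
    dt = dt.replace(minute=dt.minute - dt.minute % 15, second=0, microsecond=0)
    return dt.isoformat()

missed = Counter()
with open("/tmp/replay-march10-new-events.txt") as f:
    for line in f:
        if "NEW:" in line:
            missed[window(line.split()[0])] += 1  # assumes leading timestamp

# If the busiest windows here line up with the busiest error windows,
# delivery-layer (AMQP) loss dominates; an even spread points at the handler.
for w, n in missed.most_common(10):
    print(w, n)
```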
Update this RCA with:
- Total NEW count for March 10 (compare to Cody's 2,798)
- Breakdown: delivery-layer vs application-layer loss
- Whether the pattern matches Jan 13 (hour-22 spike) or is different