| Status | PROVISIONAL — ASU-specific loss confirmed, root cause unknown |
| Supersedes | Previous RCA |
| Incident date | 2026-03-10 |
| Analysis date | 2026-03-16 |
| Reverted | intel-requests-api !48 (batch_id change from !47) |
2,798/300,839 intel requests on March 10 never had facts created (~0.93%).
The previous RCA said all consumers were losing messages due to a shutdown race in amqp-consumer. Wrong. Only ASU is losing them. event-recorder and AIPT received 100%. The previous analysis searched logs by intel_request_id — a field that doesn't exist on the log lines that prove delivery.
Counted intel_requests.created processing logs per batch per consumer. Each message = 2 log lines (Processing + Processed) from the shared amqp-consumer library (source).
| batch_id | event-recorder | AIPT | ASU | ASU lost |
|---|---|---|---|---|
| c2a627b3 | 174 | 174 | 168 | 6 (3.4%) |
| 215181db | 22 | 22 | 21 | 1 (4.5%) |
| 812f79b7 | 68 | 68 | 66 | 2 (2.9%) |
| d334c245 | 64 | 64 | 62 | 2 (3.1%) |
| 2186cea6 | 22 | 22 | 20 | 2 (9.1%) |
| 7438ef94 | 54 | 54 | 51 | 3 (5.6%) |
| Total | 404 | 404 | 388 | 16 (3.96%) |
Verify: ASU batch c2a627b3 (expect 336 lines) vs event-recorder (expect 348 lines) vs AIPT (expect 348 lines) — or ASU all 6 batches (expect 776 lines)
Evidence: Raw VictoriaLogs stats query + output
Query (run per service — replace {service} with service name):

```
{service_name="{service}"} message_type:"intel_requests.created"
AND batch_id:~"c2a627b3|215181db|812f79b7|d334c245|2186cea6|7438ef94"
_time:[2026-03-10T13:00:00Z, 2026-03-10T20:00:00Z]
| stats by (batch_id) count(*) as log_lines
```
View raw logs: event-recorder | AIPT | ASU
| service | c2a627b3 | 215181db | 812f79b7 | d334c245 | 2186cea6 | 7438ef94 | total | ÷2 |
|---|---|---|---|---|---|---|---|---|
| event-recorder | 348 | 44 | 136 | 128 | 44 | 108 | 808 | 404 |
| AIPT | 348 | 44 | 136 | 128 | 44 | 108 | 808 | 404 |
| ASU | 336 | 42 | 132 | 124 | 40 | 102 | 776 | 388 |
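To re-derive this table programmatically, here is a minimal sketch against the VictoriaLogs LogsQL endpoint used later in this doc. The `service_name` values for event-recorder and AIPT are assumptions (ASU's is confirmed below as `asset-status-updater`); substitute the real ones.

```python
# Re-run the batch-count comparison per consumer. The // 2 reflects the
# Processing + Processed pair that amqp-consumer logs per delivered message.
import json
import requests

VLOGS = "http://vlogs-production.monitoring.svc.cluster.local/select/logsql/query"
BATCHES = "c2a627b3|215181db|812f79b7|d334c245|2186cea6|7438ef94"

def messages_delivered(service: str) -> int:
    query = (
        f'{{service_name="{service}"}} message_type:"intel_requests.created" '
        f'AND batch_id:~"{BATCHES}" '
        '_time:[2026-03-10T13:00:00Z, 2026-03-10T20:00:00Z] '
        '| stats count(*) as log_lines'
    )
    resp = requests.post(VLOGS, data={"query": query})
    resp.raise_for_status()
    # The endpoint streams one JSON object per line; the stats pipe yields one row.
    row = json.loads(resp.text.splitlines()[0])
    return int(row["log_lines"]) // 2

# "event-recorder" and "aipt" are assumed service_name values.
for svc in ("event-recorder", "aipt", "asset-status-updater"):
    print(svc, messages_delivered(svc))
```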
Cody provided intel_request_id values. These don't appear on message-processing log lines — those lines only have correlation_id (CID) and batch_id. To trace, you must map through the DB:
intel_request_id → metadata->>'correlation_id' (per-location, logged as CID)
→ metadata->>'batch_id' (per-request, = original correlation_id)
The batch_id is set to self.correlation_id in intel-requests-api (source). The mapping query was run on production-app-db-2-restore partition intel_requests_p20260301.
Evidence: DB query and ID mapping results
```sql
SELECT intel_request_id::text,
       (metadata->>'correlation_id')::text as correlation_id,
       (metadata->>'batch_id')::text as batch_id,
       created::text
FROM intel_requests.intel_requests_p20260301
WHERE intel_request_id IN ('56a219a6-...', 'bc2d6cef-...', ...)
```

| intel_request_id | correlation_id | batch_id | created (UTC) |
|---|---|---|---|
| 56a219a6 | 11c84002 | c2a627b3 | 13:55:46 |
| bc2d6cef | e12e237a | c2a627b3 | 13:55:46 |
| eac57da4 | cec14379 | 215181db | 13:55:50 |
| 27b6e33c | e64cd0a6 | 215181db | 13:55:50 |
| 57007346 | e5234837 | 812f79b7 | 15:18:44 |
| 819448ae | 59355b93 | 812f79b7 | 15:18:44 |
| 55941d68 | ab310e6b | 812f79b7 | 15:18:44 |
| 9f3454be | af861b31 | d334c245 | 15:19:14 |
| 2cfbd887 | a578785d | 2186cea6 | 15:20:27 |
| 90125c20 | 602b7ea5 | 7438ef94 | 15:20:27 |
Why searching by intel_request_id is misleading:
| Service | intel_request_id in VictoriaLogs? | Source | On intel_requests.created lines? | Proof |
|---|---|---|---|---|
| ASU | Yes | Vector extracts from (Intel Request {uuid}) in _msg (VRL, Python) | No — only on power_status.updated discard lines | search returns 0 results, free-text finds only discard line |
| AIPT | Yes | Python logs it on fetch prediction step (source) | No — different log line | finds fetch prediction, not Processing/Processed |
| event-recorder | No | Only stores correlation_id, organization_id as metadata (source) | N/A | — |
The shared amqp-consumer library (source) logs correlation_id, request_id, and batch_id on every Processing/Processed line. It never logs intel_request_id. These context fields are set at L632-642.
Vector's VRL pipeline (07-extract-ids.vrl) also extracts IDs from unstructured _msg text — (CID {uuid}) → correlation_id (VRL), (Batch ID {uuid}) → batch_id (VRL). All patterns use if !exists(...) guards so they don't conflict with structured JSON fields.
Impact: The structured fields in VictoriaLogs are correct for our batch-counting methodology. But searching by intel_request_id will find the wrong log lines (discard messages, not delivery proof).
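For reference, a rough Python equivalent of that extraction step. The real logic lives in 07-extract-ids.vrl; the exact regexes here are assumptions based on the patterns quoted above.

```python
# Mirror of the VRL ID-extraction described above (illustrative only).
import re

UUID = r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}"
PATTERNS = {
    "correlation_id": re.compile(rf"\(CID ({UUID})\)"),
    "batch_id": re.compile(rf"\(Batch ID ({UUID})\)"),
    "intel_request_id": re.compile(rf"\(Intel Request ({UUID})\)"),
}

def extract_ids(event: dict) -> dict:
    msg = event.get("_msg", "")
    for field, pattern in PATTERNS.items():
        if field not in event:  # mirrors VRL's `if !exists(...)` guard
            m = pattern.search(msg)
            if m:
                event[field] = m.group(1)
    return event
```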
- Lock contention between `incidents.received` and `intel_requests.created` in message-amplifier IS real (confirmed in message-amplifier logs for batch c2a627b3)
- The `_on_message` diagnostic logging in MR 17 is still the right next diagnostic step
- Consumer-side loss IS happening — but only in ASU, not everywhere
A replay tool was built into the ASU image and run as K8s Jobs in production. It reads events from Loki (the same events event-recorder stored), replays them through ASU's Processor, and counts: 409 = already processed (ASU had it), 2xx = new (ASU missed it). The tool intentionally does NOT add 409 to allowed_errors so duplicates are counted separately (source).
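A minimal sketch of that accounting, assuming a `processor.process()` call that surfaces the HTTP status (the real tool lives in the ASU image and reads events from Loki):

```python
# Replay accounting as described above. `processor.process()` and its return
# value are assumptions; the real tool posts each event through ASU's
# Processor and inspects the API response.
from collections import Counter

def replay(events, processor) -> Counter:
    counts = Counter()
    for event in events:
        status = processor.process(event)  # hypothetical: HTTP status, or None if no API call
        if status is None:
            counts["discarded"] += 1   # handler made no API call
        elif status == 409:
            # 409 is deliberately NOT in allowed_errors, so duplicates
            # (ASU already had it) are counted separately.
            counts["duplicate_409"] += 1
        elif 200 <= status < 300:
            counts["new_missed"] += 1  # ASU missed it in production
        else:
            counts["errors"] += 1
    return counts
```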
200 replay jobs ran across 4 dates, covering multiple days of production data:
| Run date | Jobs | New (missed) | 409s (had it) | Loss rate |
|---|---|---|---|---|
| 2026-01-15 | 35 | 4,438 | 428,977 | 1.024% |
| 2026-01-16 | 112 | 608 | 2,968,829 | 0.021% |
| 2026-01-17 | 5 | 272 | 3,024,879 | 0.009% |
| 2026-02-16 | 48 | 183 | 522,793 | 0.035% |
| Total | 200 | 5,501 | 6,945,478 | 0.079% |
Verify: `{service_name=~"replay-from-loki-hour-.*"} "TOTAL"` over Jan 15 – Mar 16
The Jan 15 run had the highest rate (1.02%) because it was the first replay — subsequent runs found fewer new events as the replay itself backfilled gaps. The steady-state rate of ~0.02-0.04% on reruns suggests the loss is ongoing but at a lower rate than Cody's Mar 10 sample (0.93%).
Evidence: Sample replay output (hour 18, Dec 29 data)
Pod: replay-from-loki-hour-18-4bht6, ran 2026-01-17T13:47Z, replaying Dec 29 18:00-19:00 UTC
```
REPLAY SUMMARY
Total events fetched from Loki: 952,391
Successfully processed (new):   104
Duplicates (409):               872,513
Discarded (no API call):        79,774
Errors:                         0

By event type:
Event Type              New   409s     Skip
asset_status.updated    13    788      2047
incidents.deleted       0     5        7
incidents.updated       0     532      22060
intel_requests.created  1     41112    68
power_status.updated    0     62941    55574
search.expired          90    767135   18
TOTAL                   104   872513   79774
```
Verify: `{service_name="replay-from-loki-hour-18-4bht6"} "REPLAY SUMMARY" OR "TOTAL"` on Jan 17
Acute — AMQP connection failures: On Jan 13 at 22:00 UTC, assets-api returned 500s which escalated into AMQP connection death (`CHANNEL_ERROR`, `PRECONDITION_FAILED - unknown delivery tag`, `Failed to ACK message ... 'NoneType'`). This single hour accounts for 1,975 of ~2,000 missed `intel_requests.created` events on that day (VLogs).
Evidence: Per-hour intel_requests.created replay results (Jan 13 data)
| Hour (UTC) | New (missed) | 409s | Loss rate | Note |
|---|---|---|---|---|
| 00-21 | 0-4 each | 5,000-16,000 | <0.05% | Steady state |
| 22 | 1,975 | 6,553 | 23.2% | AMQP connection death |
| 23 | 13 | 6,711 | 0.19% | Tail of recovery |
Verify: `{service_name=~"replay-from-loki-hour-.*"} " intel_requests.created"` on Jan 15
Chronic — asset_status.updated code path: The `_process_asset_status_updated_message` handler has a name-to-UUID resolution path (processor.py:554-634) that uses a Redis cache with 24h TTL. When the cache is cold or the asset doesn't exist yet, `_get_assets_by_name` returns `[]` → zero API calls → event silently discarded. Other event types (`power_status.updated`, `intel_requests.created`) use stable UUIDs from the message body and never hit this cache — explaining why they have ~0% loss at steady state.
| Event type | Asset identity | Cache dependency | Steady-state loss |
|---|---|---|---|
| power_status.updated | intel_request_id UUID in message | None | 0% |
| intel_requests.created | intel_request_id UUID in message | None | ~0% (only during AMQP failures) |
| asset_status.updated | Name→UUID via Redis cache (source) | 24h TTL | ~5% |
| search.expired | intel_request_id UUID in message | None | ~0% (sporadic spikes) |
- AMQP connection failures cause bulk loss of all event types during the failure window. Messages in flight at connection death may fail to ACK, and RabbitMQ's `PRECONDITION_FAILED` on the new channel means the redelivered messages can't be acknowledged either. This is the dominant cause for `intel_requests.created` loss (Cody's original report).
- Redis cache misses in `_get_assets_by_name` cause chronic `asset_status.updated` loss. When the cache returns empty (asset not yet created, or stale TTL), the handler returns `[]` and the event is silently discarded with no API call and no error log.
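A minimal sketch of that silent-discard path. The handler and helper names come from processor.py as cited above; the message fields and the update call are illustrative assumptions, not the actual implementation.

```python
class Processor:
    """Sketch only — handler/helper names from processor.py; bodies assumed."""

    async def _process_asset_status_updated_message(self, message: dict) -> None:
        # Name→UUID resolution backed by a Redis cache with a 24h TTL.
        assets = await self._get_assets_by_name(message["asset_name"])  # assumed field
        if not assets:
            # Cold cache or asset not yet created: [] falls through with
            # zero API calls and no error log — a silent discard.
            return
        for asset in assets:
            await self._update_asset_status(asset["id"], message)  # hypothetical call

    async def _get_assets_by_name(self, name: str) -> list[dict]:
        ...  # Redis-cached name→UUID lookup; returns [] on miss

    async def _update_asset_status(self, asset_id: str, message: dict) -> None:
        ...  # POST to assets-api (assumed)
```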
- Replay March 10 data with enhanced logging (MR 29) — the `NEW:` lines now include `correlation_id` and `batch_id`, letting us cross-reference against ASU production logs to confirm whether each missed event was a delivery failure (CID absent from ASU logs) or a processing failure (CID present but no API call)
- Deploy MR 17 on ASU with `LOG_ON_MESSAGE=true` — proves whether messages reach ASU's AMQP callback at all
- Investigate AMQP reconnection behavior — the `PRECONDITION_FAILED - unknown delivery tag` error suggests messages processed on the old connection can't be ACKed after reconnect, potentially losing the entire in-flight batch
- Consider adding logging to `_get_assets_by_name` when it returns `[]` — currently this is a silent discard
- MR 29 merged and image built (`docker.gisual.net/backend/consumers/asset-status-updater:latest`)
- March 10 data still in Loki retention (verify with step 1)
- `kubectl` access to production cluster
```bash
# Epoch range = 2026-03-10T00:00:00Z to 01:00:00Z
kubectl run --rm -it loki-check --image=curlimages/curl --restart=Never -- \
  curl -s 'http://loki-gateway.monitoring.svc.cluster.local/loki/api/v1/query_range' \
  --data-urlencode 'query={source="events", name="intel_requests.created"}' \
  --data-urlencode 'start=1773100800' \
  --data-urlencode 'end=1773104400' \
  --data-urlencode 'limit=1' | head -c 200
```

If empty/error → data has aged out, stop here.
```bash
cd ~/src/gisual/backend/consumers/asset-status-updater/scripts

# Dry run first — prints manifests without applying
./generate-parallel-jobs.sh 2026-03-10 --dry-run | head -50

# Apply (ensure IMAGE_TAG points to the build with MR 29)
IMAGE_TAG=latest ./generate-parallel-jobs.sh 2026-03-10
```

```bash
# Watch job completion
kubectl get jobs -l app.kubernetes.io/name=replay-from-loki -w

# Tail a specific hour's logs
kubectl logs -f job/replay-from-loki-hour-13

# Check for errors across all jobs
kubectl logs -l app.kubernetes.io/name=replay-from-loki --tail=5 | grep -E "^(ERROR|REPLAY SUMMARY)"
```

Wait for all 24 jobs to complete, then:
```bash
# Dump all summaries
for hour in $(seq -w 0 23); do
  echo "=== Hour $hour ==="
  kubectl logs job/replay-from-loki-hour-$hour 2>/dev/null | grep -A 20 "REPLAY SUMMARY"
  echo
done > /tmp/replay-march10-summaries.txt
```

Or via VictoriaLogs (after logs are shipped):

```
{service_name=~"replay-from-loki-hour-.*"} "TOTAL"
```

Time range: from job start to +4 hours.
```bash
# Get all NEW lines (events ASU missed)
for hour in $(seq -w 0 23); do
  kubectl logs job/replay-from-loki-hour-$hour 2>/dev/null | grep "^.*NEW:"
done > /tmp/replay-march10-new-events.txt

# Count by event type
cat /tmp/replay-march10-new-events.txt | grep -oP 'NEW: \S+' | sort | uniq -c | sort -rn
```

```bash
# Clean up the jobs when done
kubectl delete jobs -l app.kubernetes.io/name=replay-from-loki
```

Once the replay completes and `NEW:` lines are collected:

```bash
# Parse correlation_ids from the enhanced log format
grep "NEW:" /tmp/replay-march10-new-events.txt \
  | grep -oP 'correlation_id=\K[a-f0-9-]{36}' \
  | sort -u > /tmp/new-correlation-ids.txt

wc -l /tmp/new-correlation-ids.txt  # How many missed events
```

For each correlation_id, search ASU's production logs to see if it logged "Processing message type" for that CID:
{service_name="asset-status-updater"} correlation_id:"{cid}" AND "Processing"
Time range: March 10 00:00–24:00 UTC (start=1773100800000&end=1773187200000).
Batch check (sample 10 CIDs):

```bash
head -10 /tmp/new-correlation-ids.txt | while read cid; do
  echo -n "$cid: "
  curl -s 'http://vlogs-production.monitoring.svc.cluster.local/select/logsql/query' \
    --data-urlencode "query={service_name=\"asset-status-updater\"} correlation_id:\"$cid\" AND \"Processing\"" \
    --data-urlencode 'start=2026-03-10T00:00:00Z' \
    --data-urlencode 'end=2026-03-11T00:00:00Z' \
    --data-urlencode 'limit=1' | wc -c
  # >2 bytes = found, 0-2 bytes = not found
done
```

| ASU logged "Processing" for CID? | Meaning | Root cause |
|---|---|---|
| No | RabbitMQ never delivered the message to ASU | AMQP connection failure or queue-level loss |
| Yes, and handler made API calls | ASU processed it but the POST failed (transient error) and retries exhausted | Retry/error-exchange exhaustion |
| Yes, but 0 API calls (DISCARDED in replay) | `_get_assets_by_name` returned empty at original time | Redis cache miss or asset not yet created |
Check if the missed events cluster around AMQP error windows:
{service_name="asset-status-updater"} level:"error" _time:[2026-03-10T00:00:00Z, 2026-03-11T00:00:00Z] | stats by (_time:15m) count(*) as errors
If NEW events cluster in the same 15-minute windows as errors → AMQP connection failure is the dominant cause. If they're spread evenly → application-layer (cache/handler) issue.
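A small sketch for that cross-check, assuming the MR 29 `NEW:` lines carry a leading ISO-8601 timestamp (adjust the parsing to the actual format):

```python
# Bucket missed events into the same 15-minute windows as the error counts
# from the query above. File path comes from the runbook; the timestamp
# position/format on NEW: lines is an assumption.
from collections import Counter
from datetime import datetime

def window(ts: str) -> str:
    dt = datetime.fromisoformat(ts.replace("Z", "+00:00"))
    dt = dt.replace(minute=dt.minute - dt.minute % 15, second=0, microsecond=0)
    return dt.isoformat()

missed = Counter()
with open("/tmp/replay-march10-new-events.txt") as f:
    for line in f:
        if "NEW:" in line:
            missed[window(line.split()[0])] += 1  # assumes leading timestamp

# If the busiest windows here line up with the busiest error windows,
# delivery-layer (AMQP) loss dominates; an even spread points at the handler.
for w, n in missed.most_common(10):
    print(w, n)
```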
Update this RCA with:
- Total NEW count for March 10 (compare to Cody's 2,798)
- Breakdown: delivery-layer vs application-layer loss
- Whether the pattern matches Jan 13 (hour-22 spike) or is different