Skip to content

Instantly share code, notes, and snippets.

@jmealo
Last active March 16, 2026 23:16
Show Gist options
  • Select an option

  • Save jmealo/63d5c83972e1b9c3d74c1305fd0ce974 to your computer and use it in GitHub Desktop.

Select an option

Save jmealo/63d5c83972e1b9c3d74c1305fd0ce974 to your computer and use it in GitHub Desktop.
RCA: Missing intel_requests.created — Corrected Analysis (ENG-3349)

RCA: Missing intel_requests.created — Corrected Analysis (ENG-3349)

Status: PROVISIONAL — ASU-specific loss confirmed, root cause unknown Date: 2026-03-10 incident, 2026-03-16 re-analysis JIRA: ENG-3349 Previous RCA: Gist d81250648d (superseded by this document)


Summary

2,798 of 300,839 intel requests on March 10 never had facts created (~0.93%). Re-analysis of 6 sample batches (404 IRs) using correct identifiers shows:

  • event-recorder: 404/404 — 0% loss1
  • AIPT: 404/404 — 0% loss2
  • ASU: 388/404 — 3.96% loss, consistent across every batch3

The loss is ASU-specific. The previous RCA incorrectly attributed loss to all consumers due to a search methodology error4.

What the Previous RCA Got Wrong

Claim Correction
"event-recorder missed 8,146" 0% loss — received every message in all sampled batches1
"AIPT received but ASU didn't" AIPT also received 100%2
"Independent consumer-side shutdown race" Only ASU loses messages — shared library bug would affect all equally
MR 47: separate batch_id fixes lock contention Broke traceability, didn't address ASU loss — reverted5

The error: consumer logs index messages by correlation_id and batch_id, not intel_request_id. The original searched by intel_request_id, found nothing, and concluded messages were lost. In reality, the messages were there under different identifiers6.

Tracing Methodology

ID Mapping

Cody provided intel_request_id values. Consumer logs don't contain this field on processing lines7. The mapping is:

intel_request_id  →  DB metadata column[^db-query]
  └→ correlation_id  (unique per location, logged as CID)
  └→ batch_id        (shared per API request, = original correlation_id)[^batch-id-code]

10 sample IRs mapped to 6 distinct batches, all org 9de5d8018.

Counting Method

Each processed message produces exactly 2 log lines (Processing... + Processed...) from the shared amqp-consumer library9. Counting message_type:exact("intel_requests.created") lines per batch_id per service, divided by 2, gives IRs processed10.

Results

batch_id event-recorder AIPT ASU ASU missing
c2a627b3 174 174 168 6 (3.4%)
215181db 22 22 21 1 (4.5%)
812f79b7 68 68 66 2 (2.9%)
d334c245 64 64 62 2 (3.1%)
2186cea6 22 22 20 2 (9.1%)
7438ef94 54 54 51 3 (5.6%)
Total 404 404 388 16 (3.96%)

Why intel_request_id Appears as a Structured Field (But Misleads)

Vector's VRL pipeline11 extracts IDs from unstructured log text into structured VictoriaLogs fields:

Log pattern Extracted field VRL source
(Intel Request {uuid}) intel_request_id 12
(CID {uuid}) correlation_id 13
(Batch ID {uuid}) batch_id 14

ASU logs (Intel Request {id}): Received power_status.updated...15 on discard lines — Vector extracts this into intel_request_id. But the generic Processing/Processed message type: intel_requests.created lines9 don't contain that pattern, so intel_request_id is absent on the lines that prove message delivery.

This creates a trap: searching intel_request_id:exact("56a219a6...") finds ASU's power_status.updated discard (hours later, different batch) but misses the intel_requests.created processing entirely.

Event-recorder never logs individual IR IDs at all16 — it only stores event_type, correlation_id, and organization_id as structured metadata.

What We Don't Know

The root cause of ASU's ~4% loss is unidentified. Theories (all unverified):

  1. Silent discard in ASU handler — an early return path that doesn't log15
  2. Backpressure — ASU processing time (0.12-0.15s/msg) is 3x slower than event-recorder (0.05s/msg)
  3. Queue configuration — different prefetch/QoS or replica count

Next Steps

  1. Deploy _on_message logging (MR 1717) on ASU — determines whether messages reach ASU's callback at all
  2. Audit ASU handler for silent drops — check all return paths in intel_requests.created processing
  3. Compare ASU queue config to event-recorder and AIPT (prefetch, replicas, resource limits)

Verification Query (post MR 17 deployment)

service_name:exact("asset-status-updater") AND "Received message" AND "intel_requests.created"

Compare count to event-recorder's count for the same time window. If equal → loss is post-delivery (handler bug). If lower → loss is pre-delivery (RabbitMQ/queue issue).

Actions Taken

  • Reverted MR 475 — restored correlation_id as batch_id (traceability fix)
  • MR 1717 still open — diagnostic logging, pending Diego's review of shutdown protection

Grounding

Database

VictoriaLogs Queries

Grafana Verification Links

Source Code

Vector VRL (Log Field Extraction)

MRs

Footnotes

  1. event-recorder raw: c2a627b3=348, 215181db=44, 812f79b7=136, d334c245=128, 2186cea6=44, 7438ef94=108. Total=808 lines ÷2 = 404 IRs. 2

  2. AIPT raw: c2a627b3=348, 215181db=44, 812f79b7=136, d334c245=128, 2186cea6=44, 7438ef94=108. Total=808 lines ÷2 = 404 IRs. 2

  3. ASU raw: c2a627b3=336, 215181db=42, 812f79b7=132, d334c245=124, 2186cea6=40, 7438ef94=102. Total=776 lines ÷2 = 388 IRs.

  4. Original RCA gist did not preserve exact queries used. The gist states findings like "No log for IR 56a219a6" without query syntax. Based on field analysis, the search likely used intel_request_id or free-text IR UUID, which doesn't match intel_requests.created processing lines.

  5. Revert MR — intel-requests-api !48 (merged 2026-03-16). Restores self.correlation_id as batch_id. 2

  6. intel_request_id is stored in the DB row. correlation_id and batch_id are in metadata JSONB. Consumer logs only contain the latter two on processing lines.

  7. victorialogs_field_names for message_type:exact("intel_requests.created") per service, 2026-03-10T13:00-14:00Z. ASU: intel_request_id present (27,478 hits on co-stored lines, zero hits when filtered to intel_requests.created + specific IR ID). event-recorder: field absent entirely.

  8. SELECT intel_request_id::text, (metadata->>'correlation_id')::text, (metadata->>'batch_id')::text FROM intel_requests.intel_requests_p20260301 WHERE intel_request_id IN (...) on production-app-db-2-restore. 10 IRs → 6 batches, all org 9de5d801.

  9. amqp-consumer generic logging — gisual-amqp-consumer/app.py:657-664 (Processing) and L717-725 (Processed). Logs correlation_id, request_id, batch_id — never intel_request_id. Logger context fields set at L632-642. 2

  10. victorialogs_stats_query with service_name:exact("{service}") AND message_type:exact("intel_requests.created") AND (batch_id:exact("c2a627b3-...") OR ...) _time:[2026-03-10T13:00:00Z, 2026-03-10T20:00:00Z] | stats by (batch_id) count(*) as log_lines — run per service at time=2026-03-10T20:00:00Z.

  11. Vector VRL ID extraction pipeline — vector/vrl/07-extract-ids.vrl. All patterns use if !exists(...) guards (only extract from text when field absent from structured JSON).

  12. intel_request_id from (Intel Request {uuid})07-extract-ids.vrl:198-203.

  13. correlation_id from (CID {uuid})07-extract-ids.vrl:205-211.

  14. batch_id from (Batch ID {uuid})07-extract-ids.vrl:213-219.

  15. ASU power_on discard logging — processor.py:248-252. Logs intel_request_id in _msg text as (Intel Request %s). 2

  16. event-recorder Loki structured metadata — app.py:372-387. Only event_type, correlation_id, organization_id, organization_short_name.

  17. Diagnostic logging MR — gisual-amqp-consumer !17 (open, pending review). 2

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment