Last active
March 16, 2026 23:16
-
-
Save jmealo/63d5c83972e1b9c3d74c1305fd0ce974 to your computer and use it in GitHub Desktop.
Revisions
-
jmealo revised this gist
Mar 16, 2026 . 1 changed file with 6 additions and 4 deletions.There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters. Learn more about bidirectional Unicode charactersOriginal file line number Diff line number Diff line change @@ -162,7 +162,7 @@ By event type: --- ### 6. Two candidate loss mechanisms (not yet verified) **Acute — AMQP connection failures**: On Jan 13 at 22:00 UTC, assets-api returned 500s which escalated into AMQP connection death (`CHANNEL_ERROR`, `PRECONDITION_FAILED - unknown delivery tag`, `Failed to ACK message ... 'NoneType'`). This single hour accounts for **1,975 of ~2,000 missed `intel_requests.created` events** on that day ([VLogs][vlogs-asu-errors-jan13]). @@ -187,11 +187,13 @@ By event type: | `asset_status.updated` | Name→UUID via Redis cache ([source][asu-cache-code]) | 24h TTL | **~5%** | | `search.expired` | `intel_request_id` UUID in message | None | **~0%** (sporadic spikes) | ## Root Cause: Two Candidates (Not Yet Verified) The following are **correlated with the loss, not proven to cause it**. The March 10 replay (next step) will confirm or rule them out. 1. **AMQP connection failures** correlate with bulk loss during failure windows. On Jan 13 hour 22, assets-api 500s triggered `PRECONDITION_FAILED` and failed ACKs in the same hour that 1,975 `intel_requests.created` events went missing. But correlation ≠ causation — we haven't confirmed that ASU never received those specific messages (vs received but failed to process them). 2. **`_get_assets_by_name` Redis cache returning empty** can silently discard `asset_status.updated` events — the code path exists ([source][asu-cache-code]) and the replay shows ~5% "new" for that event type. But "new on replay" could also mean the original POST failed for other reasons, not just an empty cache. No production logging currently captures when this code path returns `[]`. ## Next Steps -
jmealo revised this gist
Mar 16, 2026 . 1 changed file with 15 additions and 15 deletions.There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters. Learn more about bidirectional Unicode charactersOriginal file line number Diff line number Diff line change @@ -310,7 +310,7 @@ For each correlation_id, search ASU's production logs to see if it logged "Proce {service_name="asset-status-updater"} correlation_id:"{cid}" AND "Processing" ``` Time range: March 10 00:00–24:00 UTC (`start=1773100800000&end=1773187200000`). **Batch check** (sample 10 CIDs): ```bash @@ -370,17 +370,17 @@ Update this RCA with: [asu-cache-code]: https://gitlab.com/gisual/backend/consumers/asset-status-updater/-/blob/d70ce19a11b0cf28f51b65c26584e5b98bb6fea9/asset_status_updater/processor.py#L554-634 [replay-code]: https://gitlab.com/gisual/backend/consumers/asset-status-updater/-/blob/main/asset_status_updater/cli/replay_from_loki.py [replay-409]: https://gitlab.com/gisual/backend/consumers/asset-status-updater/-/blob/main/asset_status_updater/cli/replay_from_loki.py#L126-197 [vlogs-asu]: http://grafana-production.monitoring.svc.cluster.local/explore?schemaVersion=1&panes=%7B%22rca%22%3A%7B%22datasource%22%3A%22PD775F2863313E6C7%22%2C%22queries%22%3A%5B%7B%22refId%22%3A%22A%22%2C%22datasource%22%3A%7B%22type%22%3A%22victoriametrics-logs-datasource%22%2C%22uid%22%3A%22PD775F2863313E6C7%22%7D%2C%22editorMode%22%3A%22code%22%2C%22expr%22%3A%22%7Bservice_name%3D%5C%22asset-status-updater%5C%22%7D%20message_type%3A%5C%22intel_requests.created%5C%22%20AND%20batch_id%3A%5C%22c2a627b3-3f7c-4c89-808c-4d79b19c69cb%5C%22%22%2C%22queryType%22%3A%22instant%22%7D%5D%2C%22range%22%3A%7B%22from%22%3A%221773150600000%22%2C%22to%22%3A%221773151800000%22%7D%2C%22compact%22%3Afalse%7D%7D [vlogs-er]: http://grafana-production.monitoring.svc.cluster.local/explore?schemaVersion=1&panes=%7B%22rca%22%3A%7B%22datasource%22%3A%22PD775F2863313E6C7%22%2C%22queries%22%3A%5B%7B%22refId%22%3A%22A%22%2C%22datasource%22%3A%7B%22type%22%3A%22victoriametrics-logs-datasource%22%2C%22uid%22%3A%22PD775F2863313E6C7%22%7D%2C%22editorMode%22%3A%22code%22%2C%22expr%22%3A%22%7Bservice_name%3D%5C%22event-recorder%5C%22%7D%20message_type%3A%5C%22intel_requests.created%5C%22%20AND%20batch_id%3A%5C%22c2a627b3-3f7c-4c89-808c-4d79b19c69cb%5C%22%22%2C%22queryType%22%3A%22instant%22%7D%5D%2C%22range%22%3A%7B%22from%22%3A%221773150600000%22%2C%22to%22%3A%221773151800000%22%7D%2C%22compact%22%3Afalse%7D%7D [vlogs-aipt]: http://grafana-production.monitoring.svc.cluster.local/explore?schemaVersion=1&panes=%7B%22rca%22%3A%7B%22datasource%22%3A%22PD775F2863313E6C7%22%2C%22queries%22%3A%5B%7B%22refId%22%3A%22A%22%2C%22datasource%22%3A%7B%22type%22%3A%22victoriametrics-logs-datasource%22%2C%22uid%22%3A%22PD775F2863313E6C7%22%7D%2C%22editorMode%22%3A%22code%22%2C%22expr%22%3A%22%7Bservice_name%3D%5C%22ai-prediction-tester%5C%22%7D%20message_type%3A%5C%22intel_requests.created%5C%22%20AND%20batch_id%3A%5C%22c2a627b3-3f7c-4c89-808c-4d79b19c69cb%5C%22%22%2C%22queryType%22%3A%22instant%22%7D%5D%2C%22range%22%3A%7B%22from%22%3A%221773150600000%22%2C%22to%22%3A%221773151800000%22%7D%2C%22compact%22%3Afalse%7D%7D [vlogs-asu-all]: http://grafana-production.monitoring.svc.cluster.local/explore?schemaVersion=1&panes=%7B%22rca%22%3A%7B%22datasource%22%3A%22PD775F2863313E6C7%22%2C%22queries%22%3A%5B%7B%22refId%22%3A%22A%22%2C%22datasource%22%3A%7B%22type%22%3A%22victoriametrics-logs-datasource%22%2C%22uid%22%3A%22PD775F2863313E6C7%22%7D%2C%22editorMode%22%3A%22code%22%2C%22expr%22%3A%22%7Bservice_name%3D%5C%22asset-status-updater%5C%22%7D%20message_type%3A%5C%22intel_requests.created%5C%22%20AND%20batch_id%3A~%5C%22c2a627b3%7C215181db%7C812f79b7%7Cd334c245%7C2186cea6%7C7438ef94%5C%22%22%2C%22queryType%22%3A%22instant%22%7D%5D%2C%22range%22%3A%7B%22from%22%3A%221773144000000%22%2C%22to%22%3A%221773172800000%22%7D%2C%22compact%22%3Afalse%7D%7D [vlogs-er-all]: http://grafana-production.monitoring.svc.cluster.local/explore?schemaVersion=1&panes=%7B%22rca%22%3A%7B%22datasource%22%3A%22PD775F2863313E6C7%22%2C%22queries%22%3A%5B%7B%22refId%22%3A%22A%22%2C%22datasource%22%3A%7B%22type%22%3A%22victoriametrics-logs-datasource%22%2C%22uid%22%3A%22PD775F2863313E6C7%22%7D%2C%22editorMode%22%3A%22code%22%2C%22expr%22%3A%22%7Bservice_name%3D%5C%22event-recorder%5C%22%7D%20message_type%3A%5C%22intel_requests.created%5C%22%20AND%20batch_id%3A~%5C%22c2a627b3%7C215181db%7C812f79b7%7Cd334c245%7C2186cea6%7C7438ef94%5C%22%22%2C%22queryType%22%3A%22instant%22%7D%5D%2C%22range%22%3A%7B%22from%22%3A%221773144000000%22%2C%22to%22%3A%221773172800000%22%7D%2C%22compact%22%3Afalse%7D%7D [vlogs-aipt-all]: http://grafana-production.monitoring.svc.cluster.local/explore?schemaVersion=1&panes=%7B%22rca%22%3A%7B%22datasource%22%3A%22PD775F2863313E6C7%22%2C%22queries%22%3A%5B%7B%22refId%22%3A%22A%22%2C%22datasource%22%3A%7B%22type%22%3A%22victoriametrics-logs-datasource%22%2C%22uid%22%3A%22PD775F2863313E6C7%22%7D%2C%22editorMode%22%3A%22code%22%2C%22expr%22%3A%22%7Bservice_name%3D%5C%22ai-prediction-tester%5C%22%7D%20message_type%3A%5C%22intel_requests.created%5C%22%20AND%20batch_id%3A~%5C%22c2a627b3%7C215181db%7C812f79b7%7Cd334c245%7C2186cea6%7C7438ef94%5C%22%22%2C%22queryType%22%3A%22instant%22%7D%5D%2C%22range%22%3A%7B%22from%22%3A%221773144000000%22%2C%22to%22%3A%221773172800000%22%7D%2C%22compact%22%3Afalse%7D%7D [vlogs-msgamp]: http://grafana-production.monitoring.svc.cluster.local/explore?schemaVersion=1&panes=%7B%22rca%22%3A%7B%22datasource%22%3A%22PD775F2863313E6C7%22%2C%22queries%22%3A%5B%7B%22refId%22%3A%22A%22%2C%22datasource%22%3A%7B%22type%22%3A%22victoriametrics-logs-datasource%22%2C%22uid%22%3A%22PD775F2863313E6C7%22%7D%2C%22editorMode%22%3A%22code%22%2C%22expr%22%3A%22%7Bservice_name%3D%5C%22message-amplifier%5C%22%7D%20%5C%22c2a627b3%5C%22%22%2C%22queryType%22%3A%22instant%22%7D%5D%2C%22range%22%3A%7B%22from%22%3A%221773150600000%22%2C%22to%22%3A%221773151800000%22%7D%2C%22compact%22%3Afalse%7D%7D [vlogs-asu-ir-field]: http://grafana-production.monitoring.svc.cluster.local/explore?schemaVersion=1&panes=%7B%22rca%22%3A%7B%22datasource%22%3A%22PD775F2863313E6C7%22%2C%22queries%22%3A%5B%7B%22refId%22%3A%22A%22%2C%22datasource%22%3A%7B%22type%22%3A%22victoriametrics-logs-datasource%22%2C%22uid%22%3A%22PD775F2863313E6C7%22%7D%2C%22editorMode%22%3A%22code%22%2C%22expr%22%3A%22%7Bservice_name%3D%5C%22asset-status-updater%5C%22%7D%20message_type%3A%5C%22intel_requests.created%5C%22%20AND%20intel_request_id%3A%5C%2256a219a6-7597-42c7-85ea-1636cdfa8e53%5C%22%22%2C%22queryType%22%3A%22instant%22%7D%5D%2C%22range%22%3A%7B%22from%22%3A%221773144000000%22%2C%22to%22%3A%221773172800000%22%7D%2C%22compact%22%3Afalse%7D%7D [vlogs-asu-ir-any]: http://grafana-production.monitoring.svc.cluster.local/explore?schemaVersion=1&panes=%7B%22rca%22%3A%7B%22datasource%22%3A%22PD775F2863313E6C7%22%2C%22queries%22%3A%5B%7B%22refId%22%3A%22A%22%2C%22datasource%22%3A%7B%22type%22%3A%22victoriametrics-logs-datasource%22%2C%22uid%22%3A%22PD775F2863313E6C7%22%7D%2C%22editorMode%22%3A%22code%22%2C%22expr%22%3A%22%7Bservice_name%3D%5C%22asset-status-updater%5C%22%7D%20%5C%2256a219a6%5C%22%22%2C%22queryType%22%3A%22instant%22%7D%5D%2C%22range%22%3A%7B%22from%22%3A%221773144000000%22%2C%22to%22%3A%221773172800000%22%7D%2C%22compact%22%3Afalse%7D%7D [vlogs-aipt-ir]: http://grafana-production.monitoring.svc.cluster.local/explore?schemaVersion=1&panes=%7B%22rca%22%3A%7B%22datasource%22%3A%22PD775F2863313E6C7%22%2C%22queries%22%3A%5B%7B%22refId%22%3A%22A%22%2C%22datasource%22%3A%7B%22type%22%3A%22victoriametrics-logs-datasource%22%2C%22uid%22%3A%22PD775F2863313E6C7%22%7D%2C%22editorMode%22%3A%22code%22%2C%22expr%22%3A%22%7Bservice_name%3D%5C%22ai-prediction-tester%5C%22%7D%20intel_request_id%3A%5C%2256a219a6-7597-42c7-85ea-1636cdfa8e53%5C%22%22%2C%22queryType%22%3A%22instant%22%7D%5D%2C%22range%22%3A%7B%22from%22%3A%221773144000000%22%2C%22to%22%3A%221773172800000%22%7D%2C%22compact%22%3Afalse%7D%7D [vlogs-asu-errors-jan13]: http://grafana-production.monitoring.svc.cluster.local/explore?schemaVersion=1&panes=%7B%22rca%22%3A%7B%22datasource%22%3A%22PD775F2863313E6C7%22%2C%22queries%22%3A%5B%7B%22refId%22%3A%22A%22%2C%22datasource%22%3A%7B%22type%22%3A%22victoriametrics-logs-datasource%22%2C%22uid%22%3A%22PD775F2863313E6C7%22%7D%2C%22editorMode%22%3A%22code%22%2C%22expr%22%3A%22%7Bservice_name%3D%5C%22asset-status-updater%5C%22%7D%20level%3A%5C%22error%5C%22%22%2C%22queryType%22%3A%22instant%22%7D%5D%2C%22range%22%3A%7B%22from%22%3A%221768338000000%22%2C%22to%22%3A%221768345200000%22%7D%2C%22compact%22%3Afalse%7D%7D [vlogs-replay-ir-breakdown]: http://grafana-production.monitoring.svc.cluster.local/explore?schemaVersion=1&panes=%7B%22rca%22%3A%7B%22datasource%22%3A%22PD775F2863313E6C7%22%2C%22queries%22%3A%5B%7B%22refId%22%3A%22A%22%2C%22datasource%22%3A%7B%22type%22%3A%22victoriametrics-logs-datasource%22%2C%22uid%22%3A%22PD775F2863313E6C7%22%7D%2C%22editorMode%22%3A%22code%22%2C%22expr%22%3A%22%7Bservice_name%3D~%5C%22replay-from-loki-hour-.%2A%5C%22%7D%20%5C%22%20%20intel_requests.created%5C%22%22%2C%22queryType%22%3A%22instant%22%7D%5D%2C%22range%22%3A%7B%22from%22%3A%221768435200000%22%2C%22to%22%3A%221768521600000%22%7D%2C%22compact%22%3Afalse%7D%7D [vlogs-replay-totals]: http://grafana-production.monitoring.svc.cluster.local/explore?schemaVersion=1&panes=%7B%22rca%22%3A%7B%22datasource%22%3A%22PD775F2863313E6C7%22%2C%22queries%22%3A%5B%7B%22refId%22%3A%22A%22%2C%22datasource%22%3A%7B%22type%22%3A%22victoriametrics-logs-datasource%22%2C%22uid%22%3A%22PD775F2863313E6C7%22%7D%2C%22editorMode%22%3A%22code%22%2C%22expr%22%3A%22%7Bservice_name%3D~%5C%22replay-from-loki-hour-.%2A%5C%22%7D%20%5C%22TOTAL%5C%22%22%2C%22queryType%22%3A%22instant%22%7D%5D%2C%22range%22%3A%7B%22from%22%3A%221768435200000%22%2C%22to%22%3A%221773705599000%22%7D%2C%22compact%22%3Afalse%7D%7D [vlogs-replay-sample]: http://grafana-production.monitoring.svc.cluster.local/explore?schemaVersion=1&panes=%7B%22rca%22%3A%7B%22datasource%22%3A%22PD775F2863313E6C7%22%2C%22queries%22%3A%5B%7B%22refId%22%3A%22A%22%2C%22datasource%22%3A%7B%22type%22%3A%22victoriametrics-logs-datasource%22%2C%22uid%22%3A%22PD775F2863313E6C7%22%7D%2C%22editorMode%22%3A%22code%22%2C%22expr%22%3A%22%7Bservice_name%3D%5C%22replay-from-loki-hour-18-4bht6%5C%22%7D%20%5C%22REPLAY%20SUMMARY%5C%22%20OR%20%5C%22TOTAL%5C%22%22%2C%22queryType%22%3A%22instant%22%7D%5D%2C%22range%22%3A%7B%22from%22%3A%221768608000000%22%2C%22to%22%3A%221768694400000%22%7D%2C%22compact%22%3Afalse%7D%7D -
jmealo revised this gist
Mar 16, 2026 . 1 changed file with 14 additions and 14 deletions.There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters. Learn more about bidirectional Unicode charactersOriginal file line number Diff line number Diff line change @@ -370,17 +370,17 @@ Update this RCA with: [asu-cache-code]: https://gitlab.com/gisual/backend/consumers/asset-status-updater/-/blob/d70ce19a11b0cf28f51b65c26584e5b98bb6fea9/asset_status_updater/processor.py#L554-634 [replay-code]: https://gitlab.com/gisual/backend/consumers/asset-status-updater/-/blob/main/asset_status_updater/cli/replay_from_loki.py [replay-409]: https://gitlab.com/gisual/backend/consumers/asset-status-updater/-/blob/main/asset_status_updater/cli/replay_from_loki.py#L126-197 [vlogs-asu]: http://grafana-production.monitoring.svc.cluster.local/explore?schemaVersion=1&panes=%7B%22rca%22%3A%7B%22datasource%22%3A%22PD775F2863313E6C7%22%2C%22queries%22%3A%5B%7B%22refId%22%3A%22A%22%2C%22datasource%22%3A%7B%22type%22%3A%22victoriametrics-logs-datasource%22%2C%22uid%22%3A%22PD775F2863313E6C7%22%7D%2C%22editorMode%22%3A%22code%22%2C%22expr%22%3A%22%7Bservice_name%3D%5C%22asset-status-updater%5C%22%7D%20message_type%3A%5C%22intel_requests.created%5C%22%20AND%20batch_id%3A%5C%22c2a627b3-3f7c-4c89-808c-4d79b19c69cb%5C%22%22%2C%22queryType%22%3A%22instant%22%7D%5D%2C%22range%22%3A%7B%22from%22%3A%221741614600000%22%2C%22to%22%3A%221741615800000%22%7D%2C%22compact%22%3Afalse%7D%7D [vlogs-er]: http://grafana-production.monitoring.svc.cluster.local/explore?schemaVersion=1&panes=%7B%22rca%22%3A%7B%22datasource%22%3A%22PD775F2863313E6C7%22%2C%22queries%22%3A%5B%7B%22refId%22%3A%22A%22%2C%22datasource%22%3A%7B%22type%22%3A%22victoriametrics-logs-datasource%22%2C%22uid%22%3A%22PD775F2863313E6C7%22%7D%2C%22editorMode%22%3A%22code%22%2C%22expr%22%3A%22%7Bservice_name%3D%5C%22event-recorder%5C%22%7D%20message_type%3A%5C%22intel_requests.created%5C%22%20AND%20batch_id%3A%5C%22c2a627b3-3f7c-4c89-808c-4d79b19c69cb%5C%22%22%2C%22queryType%22%3A%22instant%22%7D%5D%2C%22range%22%3A%7B%22from%22%3A%221741614600000%22%2C%22to%22%3A%221741615800000%22%7D%2C%22compact%22%3Afalse%7D%7D [vlogs-aipt]: http://grafana-production.monitoring.svc.cluster.local/explore?schemaVersion=1&panes=%7B%22rca%22%3A%7B%22datasource%22%3A%22PD775F2863313E6C7%22%2C%22queries%22%3A%5B%7B%22refId%22%3A%22A%22%2C%22datasource%22%3A%7B%22type%22%3A%22victoriametrics-logs-datasource%22%2C%22uid%22%3A%22PD775F2863313E6C7%22%7D%2C%22editorMode%22%3A%22code%22%2C%22expr%22%3A%22%7Bservice_name%3D%5C%22ai-prediction-tester%5C%22%7D%20message_type%3A%5C%22intel_requests.created%5C%22%20AND%20batch_id%3A%5C%22c2a627b3-3f7c-4c89-808c-4d79b19c69cb%5C%22%22%2C%22queryType%22%3A%22instant%22%7D%5D%2C%22range%22%3A%7B%22from%22%3A%221741614600000%22%2C%22to%22%3A%221741615800000%22%7D%2C%22compact%22%3Afalse%7D%7D [vlogs-asu-all]: http://grafana-production.monitoring.svc.cluster.local/explore?schemaVersion=1&panes=%7B%22rca%22%3A%7B%22datasource%22%3A%22PD775F2863313E6C7%22%2C%22queries%22%3A%5B%7B%22refId%22%3A%22A%22%2C%22datasource%22%3A%7B%22type%22%3A%22victoriametrics-logs-datasource%22%2C%22uid%22%3A%22PD775F2863313E6C7%22%7D%2C%22editorMode%22%3A%22code%22%2C%22expr%22%3A%22%7Bservice_name%3D%5C%22asset-status-updater%5C%22%7D%20message_type%3A%5C%22intel_requests.created%5C%22%20AND%20batch_id%3A~%5C%22c2a627b3%7C215181db%7C812f79b7%7Cd334c245%7C2186cea6%7C7438ef94%5C%22%22%2C%22queryType%22%3A%22instant%22%7D%5D%2C%22range%22%3A%7B%22from%22%3A%221741608000000%22%2C%22to%22%3A%221741636800000%22%7D%2C%22compact%22%3Afalse%7D%7D [vlogs-er-all]: http://grafana-production.monitoring.svc.cluster.local/explore?schemaVersion=1&panes=%7B%22rca%22%3A%7B%22datasource%22%3A%22PD775F2863313E6C7%22%2C%22queries%22%3A%5B%7B%22refId%22%3A%22A%22%2C%22datasource%22%3A%7B%22type%22%3A%22victoriametrics-logs-datasource%22%2C%22uid%22%3A%22PD775F2863313E6C7%22%7D%2C%22editorMode%22%3A%22code%22%2C%22expr%22%3A%22%7Bservice_name%3D%5C%22event-recorder%5C%22%7D%20message_type%3A%5C%22intel_requests.created%5C%22%20AND%20batch_id%3A~%5C%22c2a627b3%7C215181db%7C812f79b7%7Cd334c245%7C2186cea6%7C7438ef94%5C%22%22%2C%22queryType%22%3A%22instant%22%7D%5D%2C%22range%22%3A%7B%22from%22%3A%221741608000000%22%2C%22to%22%3A%221741636800000%22%7D%2C%22compact%22%3Afalse%7D%7D [vlogs-aipt-all]: http://grafana-production.monitoring.svc.cluster.local/explore?schemaVersion=1&panes=%7B%22rca%22%3A%7B%22datasource%22%3A%22PD775F2863313E6C7%22%2C%22queries%22%3A%5B%7B%22refId%22%3A%22A%22%2C%22datasource%22%3A%7B%22type%22%3A%22victoriametrics-logs-datasource%22%2C%22uid%22%3A%22PD775F2863313E6C7%22%7D%2C%22editorMode%22%3A%22code%22%2C%22expr%22%3A%22%7Bservice_name%3D%5C%22ai-prediction-tester%5C%22%7D%20message_type%3A%5C%22intel_requests.created%5C%22%20AND%20batch_id%3A~%5C%22c2a627b3%7C215181db%7C812f79b7%7Cd334c245%7C2186cea6%7C7438ef94%5C%22%22%2C%22queryType%22%3A%22instant%22%7D%5D%2C%22range%22%3A%7B%22from%22%3A%221741608000000%22%2C%22to%22%3A%221741636800000%22%7D%2C%22compact%22%3Afalse%7D%7D [vlogs-msgamp]: http://grafana-production.monitoring.svc.cluster.local/explore?schemaVersion=1&panes=%7B%22rca%22%3A%7B%22datasource%22%3A%22PD775F2863313E6C7%22%2C%22queries%22%3A%5B%7B%22refId%22%3A%22A%22%2C%22datasource%22%3A%7B%22type%22%3A%22victoriametrics-logs-datasource%22%2C%22uid%22%3A%22PD775F2863313E6C7%22%7D%2C%22editorMode%22%3A%22code%22%2C%22expr%22%3A%22%7Bservice_name%3D%5C%22message-amplifier%5C%22%7D%20%5C%22c2a627b3%5C%22%22%2C%22queryType%22%3A%22instant%22%7D%5D%2C%22range%22%3A%7B%22from%22%3A%221741614600000%22%2C%22to%22%3A%221741615800000%22%7D%2C%22compact%22%3Afalse%7D%7D [vlogs-asu-ir-field]: http://grafana-production.monitoring.svc.cluster.local/explore?schemaVersion=1&panes=%7B%22rca%22%3A%7B%22datasource%22%3A%22PD775F2863313E6C7%22%2C%22queries%22%3A%5B%7B%22refId%22%3A%22A%22%2C%22datasource%22%3A%7B%22type%22%3A%22victoriametrics-logs-datasource%22%2C%22uid%22%3A%22PD775F2863313E6C7%22%7D%2C%22editorMode%22%3A%22code%22%2C%22expr%22%3A%22%7Bservice_name%3D%5C%22asset-status-updater%5C%22%7D%20message_type%3A%5C%22intel_requests.created%5C%22%20AND%20intel_request_id%3A%5C%2256a219a6-7597-42c7-85ea-1636cdfa8e53%5C%22%22%2C%22queryType%22%3A%22instant%22%7D%5D%2C%22range%22%3A%7B%22from%22%3A%221741608000000%22%2C%22to%22%3A%221741636800000%22%7D%2C%22compact%22%3Afalse%7D%7D [vlogs-asu-ir-any]: http://grafana-production.monitoring.svc.cluster.local/explore?schemaVersion=1&panes=%7B%22rca%22%3A%7B%22datasource%22%3A%22PD775F2863313E6C7%22%2C%22queries%22%3A%5B%7B%22refId%22%3A%22A%22%2C%22datasource%22%3A%7B%22type%22%3A%22victoriametrics-logs-datasource%22%2C%22uid%22%3A%22PD775F2863313E6C7%22%7D%2C%22editorMode%22%3A%22code%22%2C%22expr%22%3A%22%7Bservice_name%3D%5C%22asset-status-updater%5C%22%7D%20%5C%2256a219a6%5C%22%22%2C%22queryType%22%3A%22instant%22%7D%5D%2C%22range%22%3A%7B%22from%22%3A%221741608000000%22%2C%22to%22%3A%221741636800000%22%7D%2C%22compact%22%3Afalse%7D%7D [vlogs-aipt-ir]: http://grafana-production.monitoring.svc.cluster.local/explore?schemaVersion=1&panes=%7B%22rca%22%3A%7B%22datasource%22%3A%22PD775F2863313E6C7%22%2C%22queries%22%3A%5B%7B%22refId%22%3A%22A%22%2C%22datasource%22%3A%7B%22type%22%3A%22victoriametrics-logs-datasource%22%2C%22uid%22%3A%22PD775F2863313E6C7%22%7D%2C%22editorMode%22%3A%22code%22%2C%22expr%22%3A%22%7Bservice_name%3D%5C%22ai-prediction-tester%5C%22%7D%20intel_request_id%3A%5C%2256a219a6-7597-42c7-85ea-1636cdfa8e53%5C%22%22%2C%22queryType%22%3A%22instant%22%7D%5D%2C%22range%22%3A%7B%22from%22%3A%221741608000000%22%2C%22to%22%3A%221741636800000%22%7D%2C%22compact%22%3Afalse%7D%7D [vlogs-asu-errors-jan13]: http://grafana-production.monitoring.svc.cluster.local/explore?schemaVersion=1&panes=%7B%22rca%22%3A%7B%22datasource%22%3A%22PD775F2863313E6C7%22%2C%22queries%22%3A%5B%7B%22refId%22%3A%22A%22%2C%22datasource%22%3A%7B%22type%22%3A%22victoriametrics-logs-datasource%22%2C%22uid%22%3A%22PD775F2863313E6C7%22%7D%2C%22editorMode%22%3A%22code%22%2C%22expr%22%3A%22%7Bservice_name%3D%5C%22asset-status-updater%5C%22%7D%20level%3A%5C%22error%5C%22%22%2C%22queryType%22%3A%22instant%22%7D%5D%2C%22range%22%3A%7B%22from%22%3A%221736805600000%22%2C%22to%22%3A%221736816400000%22%7D%2C%22compact%22%3Afalse%7D%7D [vlogs-replay-ir-breakdown]: http://grafana-production.monitoring.svc.cluster.local/explore?schemaVersion=1&panes=%7B%22rca%22%3A%7B%22datasource%22%3A%22PD775F2863313E6C7%22%2C%22queries%22%3A%5B%7B%22refId%22%3A%22A%22%2C%22datasource%22%3A%7B%22type%22%3A%22victoriametrics-logs-datasource%22%2C%22uid%22%3A%22PD775F2863313E6C7%22%7D%2C%22editorMode%22%3A%22code%22%2C%22expr%22%3A%22%7Bservice_name%3D~%5C%22replay-from-loki-hour-.%2A%5C%22%7D%20%5C%22%20%20intel_requests.created%5C%22%22%2C%22queryType%22%3A%22instant%22%7D%5D%2C%22range%22%3A%7B%22from%22%3A%221736899200000%22%2C%22to%22%3A%221736985600000%22%7D%2C%22compact%22%3Afalse%7D%7D [vlogs-replay-totals]: http://grafana-production.monitoring.svc.cluster.local/explore?schemaVersion=1&panes=%7B%22rca%22%3A%7B%22datasource%22%3A%22PD775F2863313E6C7%22%2C%22queries%22%3A%5B%7B%22refId%22%3A%22A%22%2C%22datasource%22%3A%7B%22type%22%3A%22victoriametrics-logs-datasource%22%2C%22uid%22%3A%22PD775F2863313E6C7%22%7D%2C%22editorMode%22%3A%22code%22%2C%22expr%22%3A%22%7Bservice_name%3D~%5C%22replay-from-loki-hour-.%2A%5C%22%7D%20%5C%22TOTAL%5C%22%22%2C%22queryType%22%3A%22instant%22%7D%5D%2C%22range%22%3A%7B%22from%22%3A%221736899200000%22%2C%22to%22%3A%221742169599000%22%7D%2C%22compact%22%3Afalse%7D%7D [vlogs-replay-sample]: http://grafana-production.monitoring.svc.cluster.local/explore?schemaVersion=1&panes=%7B%22rca%22%3A%7B%22datasource%22%3A%22PD775F2863313E6C7%22%2C%22queries%22%3A%5B%7B%22refId%22%3A%22A%22%2C%22datasource%22%3A%7B%22type%22%3A%22victoriametrics-logs-datasource%22%2C%22uid%22%3A%22PD775F2863313E6C7%22%7D%2C%22editorMode%22%3A%22code%22%2C%22expr%22%3A%22%7Bservice_name%3D%5C%22replay-from-loki-hour-18-4bht6%5C%22%7D%20%5C%22REPLAY%20SUMMARY%5C%22%20OR%20%5C%22TOTAL%5C%22%22%2C%22queryType%22%3A%22instant%22%7D%5D%2C%22range%22%3A%7B%22from%22%3A%221737072000000%22%2C%22to%22%3A%221737158400000%22%7D%2C%22compact%22%3Afalse%7D%7D -
jmealo revised this gist
Mar 16, 2026 . 1 changed file with 20 additions and 25 deletions.There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters. Learn more about bidirectional Unicode charactersOriginal file line number Diff line number Diff line change @@ -41,13 +41,8 @@ Counted `intel_requests.created` processing logs per batch per consumer. Each me Query (run per service — replace `{service}` with service name): ``` {service_name="{service}"} message_type:"intel_requests.created" AND batch_id:~"c2a627b3|215181db|812f79b7|d334c245|2186cea6|7438ef94" _time:[2026-03-10T13:00:00Z, 2026-03-10T20:00:00Z] | stats by (batch_id) count(*) as log_lines ``` @@ -312,7 +307,7 @@ wc -l /tmp/new-correlation-ids.txt # How many missed events For each correlation_id, search ASU's production logs to see if it logged "Processing message type" for that CID: ``` {service_name="asset-status-updater"} correlation_id:"{cid}" AND "Processing" ``` Time range: March 10 00:00–24:00 UTC (`start=1741564800000&end=1741651200000`). @@ -322,7 +317,7 @@ Time range: March 10 00:00–24:00 UTC (`start=1741564800000&end=1741651200000`) head -10 /tmp/new-correlation-ids.txt | while read cid; do echo -n "$cid: " curl -s 'http://vlogs-production.monitoring.svc.cluster.local/select/logsql/query' \ --data-urlencode "query={service_name=\"asset-status-updater\"} correlation_id:\"$cid\" AND \"Processing\"" \ --data-urlencode 'start=2026-03-10T00:00:00Z' \ --data-urlencode 'end=2026-03-11T00:00:00Z' \ --data-urlencode 'limit=1' | wc -c @@ -343,7 +338,7 @@ done Check if the missed events cluster around AMQP error windows: ``` {service_name="asset-status-updater"} level:"error" _time:[2026-03-10T00:00:00Z, 2026-03-11T00:00:00Z] | stats by (_time:15m) count(*) as errors ``` If NEW events cluster in the same 15-minute windows as errors → AMQP connection failure is the dominant cause. If they're spread evenly → application-layer (cache/handler) issue. @@ -357,7 +352,7 @@ Update this RCA with: --- <!-- Reference-style links — Grafana Explore with VictoriaLogs datasource, time ranges set via from/to epoch ms --> [lib-log]: https://gitlab.com/gisual/backend/libraries/gisual-amqp-consumer/-/blob/f2b9dc354f360c7489a2e0aef3f044cc6e1b3f69/gisual_amqp_consumer/app.py#L657-725 [lib-context]: https://gitlab.com/gisual/backend/libraries/gisual-amqp-consumer/-/blob/f2b9dc354f360c7489a2e0aef3f044cc6e1b3f69/gisual_amqp_consumer/app.py#L632-642 @@ -375,17 +370,17 @@ Update this RCA with: [asu-cache-code]: https://gitlab.com/gisual/backend/consumers/asset-status-updater/-/blob/d70ce19a11b0cf28f51b65c26584e5b98bb6fea9/asset_status_updater/processor.py#L554-634 [replay-code]: https://gitlab.com/gisual/backend/consumers/asset-status-updater/-/blob/main/asset_status_updater/cli/replay_from_loki.py [replay-409]: https://gitlab.com/gisual/backend/consumers/asset-status-updater/-/blob/main/asset_status_updater/cli/replay_from_loki.py#L126-197 [vlogs-asu]: http://grafana-production.monitoring.svc.cluster.local:80/explore?left=%7B%22datasource%22%3A%22PD775F2863313E6C7%22%7D&from=1741614600000&to=1741615800000&expr=%7Bservice_name%3D%22asset-status-updater%22%7D%20message_type%3A%22intel_requests.created%22%20AND%20batch_id%3A%22c2a627b3-3f7c-4c89-808c-4d79b19c69cb%22 [vlogs-er]: http://grafana-production.monitoring.svc.cluster.local:80/explore?left=%7B%22datasource%22%3A%22PD775F2863313E6C7%22%7D&from=1741614600000&to=1741615800000&expr=%7Bservice_name%3D%22event-recorder%22%7D%20message_type%3A%22intel_requests.created%22%20AND%20batch_id%3A%22c2a627b3-3f7c-4c89-808c-4d79b19c69cb%22 [vlogs-aipt]: http://grafana-production.monitoring.svc.cluster.local:80/explore?left=%7B%22datasource%22%3A%22PD775F2863313E6C7%22%7D&from=1741614600000&to=1741615800000&expr=%7Bservice_name%3D%22ai-prediction-tester%22%7D%20message_type%3A%22intel_requests.created%22%20AND%20batch_id%3A%22c2a627b3-3f7c-4c89-808c-4d79b19c69cb%22 [vlogs-asu-all]: http://grafana-production.monitoring.svc.cluster.local:80/explore?left=%7B%22datasource%22%3A%22PD775F2863313E6C7%22%7D&from=1741608000000&to=1741636800000&expr=%7Bservice_name%3D%22asset-status-updater%22%7D%20message_type%3A%22intel_requests.created%22%20AND%20batch_id%3A~%22c2a627b3%7C215181db%7C812f79b7%7Cd334c245%7C2186cea6%7C7438ef94%22 [vlogs-er-all]: http://grafana-production.monitoring.svc.cluster.local:80/explore?left=%7B%22datasource%22%3A%22PD775F2863313E6C7%22%7D&from=1741608000000&to=1741636800000&expr=%7Bservice_name%3D%22event-recorder%22%7D%20message_type%3A%22intel_requests.created%22%20AND%20batch_id%3A~%22c2a627b3%7C215181db%7C812f79b7%7Cd334c245%7C2186cea6%7C7438ef94%22 [vlogs-aipt-all]: http://grafana-production.monitoring.svc.cluster.local:80/explore?left=%7B%22datasource%22%3A%22PD775F2863313E6C7%22%7D&from=1741608000000&to=1741636800000&expr=%7Bservice_name%3D%22ai-prediction-tester%22%7D%20message_type%3A%22intel_requests.created%22%20AND%20batch_id%3A~%22c2a627b3%7C215181db%7C812f79b7%7Cd334c245%7C2186cea6%7C7438ef94%22 [vlogs-msgamp]: http://grafana-production.monitoring.svc.cluster.local:80/explore?left=%7B%22datasource%22%3A%22PD775F2863313E6C7%22%7D&from=1741614600000&to=1741615800000&expr=%7Bservice_name%3D%22message-amplifier%22%7D%20%22c2a627b3%22 [vlogs-asu-ir-field]: http://grafana-production.monitoring.svc.cluster.local:80/explore?left=%7B%22datasource%22%3A%22PD775F2863313E6C7%22%7D&from=1741608000000&to=1741636800000&expr=%7Bservice_name%3D%22asset-status-updater%22%7D%20message_type%3A%22intel_requests.created%22%20AND%20intel_request_id%3A%2256a219a6-7597-42c7-85ea-1636cdfa8e53%22 [vlogs-asu-ir-any]: http://grafana-production.monitoring.svc.cluster.local:80/explore?left=%7B%22datasource%22%3A%22PD775F2863313E6C7%22%7D&from=1741608000000&to=1741636800000&expr=%7Bservice_name%3D%22asset-status-updater%22%7D%20%2256a219a6%22 [vlogs-aipt-ir]: http://grafana-production.monitoring.svc.cluster.local:80/explore?left=%7B%22datasource%22%3A%22PD775F2863313E6C7%22%7D&from=1741608000000&to=1741636800000&expr=%7Bservice_name%3D%22ai-prediction-tester%22%7D%20intel_request_id%3A%2256a219a6-7597-42c7-85ea-1636cdfa8e53%22 [vlogs-asu-errors-jan13]: http://grafana-production.monitoring.svc.cluster.local:80/explore?left=%7B%22datasource%22%3A%22PD775F2863313E6C7%22%7D&from=1736805600000&to=1736816400000&expr=%7Bservice_name%3D%22asset-status-updater%22%7D%20level%3A%22error%22 [vlogs-replay-ir-breakdown]: http://grafana-production.monitoring.svc.cluster.local:80/explore?left=%7B%22datasource%22%3A%22PD775F2863313E6C7%22%7D&from=1736899200000&to=1736985600000&expr=%7Bservice_name%3D~%22replay-from-loki-hour-.%2A%22%7D%20%22%20%20intel_requests.created%22 [vlogs-replay-totals]: http://grafana-production.monitoring.svc.cluster.local:80/explore?left=%7B%22datasource%22%3A%22PD775F2863313E6C7%22%7D&from=1736899200000&to=1742169599000&expr=%7Bservice_name%3D~%22replay-from-loki-hour-.%2A%22%7D%20%22TOTAL%22 [vlogs-replay-sample]: http://grafana-production.monitoring.svc.cluster.local:80/explore?left=%7B%22datasource%22%3A%22PD775F2863313E6C7%22%7D&from=1737072000000&to=1737158400000&expr=%7Bservice_name%3D%22replay-from-loki-hour-18-4bht6%22%7D%20%22REPLAY%20SUMMARY%22%20OR%20%22TOTAL%22 -
jmealo revised this gist
Mar 16, 2026 . 1 changed file with 24 additions and 27 deletions.There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters. Learn more about bidirectional Unicode charactersOriginal file line number Diff line number Diff line change @@ -41,7 +41,7 @@ Counted `intel_requests.created` processing logs per batch per consumer. Each me Query (run per service — replace `{service}` with service name): ``` {service_name="{service}"} message_type:exact("intel_requests.created") AND (batch_id:exact("c2a627b3-3f7c-4c89-808c-4d79b19c69cb") OR batch_id:exact("215181db-45e2-474f-9c05-839eecf61cc5") OR batch_id:exact("812f79b7-c50d-4fd2-ac28-cc5d0318ee23") @@ -312,21 +312,20 @@ wc -l /tmp/new-correlation-ids.txt # How many missed events For each correlation_id, search ASU's production logs to see if it logged "Processing message type" for that CID: ``` {service_name="asset-status-updater"} correlation_id:exact("{cid}") AND "Processing" ``` Time range: March 10 00:00–24:00 UTC (`start=1741564800000&end=1741651200000`). **Batch check** (sample 10 CIDs): ```bash head -10 /tmp/new-correlation-ids.txt | while read cid; do echo -n "$cid: " curl -s 'http://vlogs-production.monitoring.svc.cluster.local/select/logsql/query' \ --data-urlencode "query={service_name=\"asset-status-updater\"} correlation_id:exact(\"$cid\") AND \"Processing\"" \ --data-urlencode 'start=2026-03-10T00:00:00Z' \ --data-urlencode 'end=2026-03-11T00:00:00Z' \ --data-urlencode 'limit=1' | wc -c # >2 bytes = found, 0-2 bytes = not found done ``` @@ -344,9 +343,7 @@ done Check if the missed events cluster around AMQP error windows: ``` {service_name="asset-status-updater"} level:exact("error") _time:[2026-03-10T00:00:00Z, 2026-03-11T00:00:00Z] | stats by (_time:15m) count(*) as errors ``` If NEW events cluster in the same 15-minute windows as errors → AMQP connection failure is the dominant cause. If they're spread evenly → application-layer (cache/handler) issue. @@ -360,7 +357,7 @@ Update this RCA with: --- <!-- Reference-style links — Grafana Explore (VictoriaLogs datasource) with time ranges --> [lib-log]: https://gitlab.com/gisual/backend/libraries/gisual-amqp-consumer/-/blob/f2b9dc354f360c7489a2e0aef3f044cc6e1b3f69/gisual_amqp_consumer/app.py#L657-725 [lib-context]: https://gitlab.com/gisual/backend/libraries/gisual-amqp-consumer/-/blob/f2b9dc354f360c7489a2e0aef3f044cc6e1b3f69/gisual_amqp_consumer/app.py#L632-642 @@ -374,21 +371,21 @@ Update this RCA with: [vrl-batch]: https://gitlab.com/gisual/configuration-management/kubernetes/-/blob/main/applications/infrastructure/charts/vector/vrl/07-extract-ids.vrl#L213-219 [mr17]: https://gitlab.com/gisual/backend/libraries/gisual-amqp-consumer/-/merge_requests/17 [mr48]: https://gitlab.com/gisual/backend/apis/intel-requests-api/-/merge_requests/48 [mr29]: https://gitlab.com/gisual/backend/consumers/asset-status-updater/-/merge_requests/29 [asu-cache-code]: https://gitlab.com/gisual/backend/consumers/asset-status-updater/-/blob/d70ce19a11b0cf28f51b65c26584e5b98bb6fea9/asset_status_updater/processor.py#L554-634 [replay-code]: https://gitlab.com/gisual/backend/consumers/asset-status-updater/-/blob/main/asset_status_updater/cli/replay_from_loki.py [replay-409]: https://gitlab.com/gisual/backend/consumers/asset-status-updater/-/blob/main/asset_status_updater/cli/replay_from_loki.py#L126-197 [vlogs-asu]: http://grafana-production.monitoring.svc.cluster.local/explore?left=%7B%22datasource%22%3A%22PD775F2863313E6C7%22%2C%22queries%22%3A%5B%7B%22refId%22%3A%22A%22%2C%22expr%22%3A%22%7Bservice_name%3D%5C%22asset-status-updater%5C%22%7D+message_type%3Aexact%28%5C%22intel_requests.created%5C%22%29+AND+batch_id%3Aexact%28%5C%22c2a627b3-3f7c-4c89-808c-4d79b19c69cb%5C%22%29%22%7D%5D%7D&from=1741614600000&to=1741615800000 [vlogs-er]: http://grafana-production.monitoring.svc.cluster.local/explore?left=%7B%22datasource%22%3A%22PD775F2863313E6C7%22%2C%22queries%22%3A%5B%7B%22refId%22%3A%22A%22%2C%22expr%22%3A%22%7Bservice_name%3D%5C%22event-recorder%5C%22%7D+message_type%3Aexact%28%5C%22intel_requests.created%5C%22%29+AND+batch_id%3Aexact%28%5C%22c2a627b3-3f7c-4c89-808c-4d79b19c69cb%5C%22%29%22%7D%5D%7D&from=1741614600000&to=1741615800000 [vlogs-aipt]: http://grafana-production.monitoring.svc.cluster.local/explore?left=%7B%22datasource%22%3A%22PD775F2863313E6C7%22%2C%22queries%22%3A%5B%7B%22refId%22%3A%22A%22%2C%22expr%22%3A%22%7Bservice_name%3D%5C%22ai-prediction-tester%5C%22%7D+message_type%3Aexact%28%5C%22intel_requests.created%5C%22%29+AND+batch_id%3Aexact%28%5C%22c2a627b3-3f7c-4c89-808c-4d79b19c69cb%5C%22%29%22%7D%5D%7D&from=1741614600000&to=1741615800000 [vlogs-asu-all]: http://grafana-production.monitoring.svc.cluster.local/explore?left=%7B%22datasource%22%3A%22PD775F2863313E6C7%22%2C%22queries%22%3A%5B%7B%22refId%22%3A%22A%22%2C%22expr%22%3A%22%7Bservice_name%3D%5C%22asset-status-updater%5C%22%7D+message_type%3Aexact%28%5C%22intel_requests.created%5C%22%29+AND+%28batch_id%3Aexact%28%5C%22c2a627b3-3f7c-4c89-808c-4d79b19c69cb%5C%22%29+OR+batch_id%3Aexact%28%5C%22215181db-45e2-474f-9c05-839eecf61cc5%5C%22%29+OR+batch_id%3Aexact%28%5C%22812f79b7-c50d-4fd2-ac28-cc5d0318ee23%5C%22%29+OR+batch_id%3Aexact%28%5C%22d334c245-898b-4caa-b96c-954b8be31414%5C%22%29+OR+batch_id%3Aexact%28%5C%222186cea6-3ae5-40c1-805e-878cc123ca5c%5C%22%29+OR+batch_id%3Aexact%28%5C%227438ef94-89a7-48ae-83d2-71eb8a138f52%5C%22%29%29%22%7D%5D%7D&from=1741608000000&to=1741636800000 [vlogs-er-all]: http://grafana-production.monitoring.svc.cluster.local/explore?left=%7B%22datasource%22%3A%22PD775F2863313E6C7%22%2C%22queries%22%3A%5B%7B%22refId%22%3A%22A%22%2C%22expr%22%3A%22%7Bservice_name%3D%5C%22event-recorder%5C%22%7D+message_type%3Aexact%28%5C%22intel_requests.created%5C%22%29+AND+%28batch_id%3Aexact%28%5C%22c2a627b3-3f7c-4c89-808c-4d79b19c69cb%5C%22%29+OR+batch_id%3Aexact%28%5C%22215181db-45e2-474f-9c05-839eecf61cc5%5C%22%29+OR+batch_id%3Aexact%28%5C%22812f79b7-c50d-4fd2-ac28-cc5d0318ee23%5C%22%29+OR+batch_id%3Aexact%28%5C%22d334c245-898b-4caa-b96c-954b8be31414%5C%22%29+OR+batch_id%3Aexact%28%5C%222186cea6-3ae5-40c1-805e-878cc123ca5c%5C%22%29+OR+batch_id%3Aexact%28%5C%227438ef94-89a7-48ae-83d2-71eb8a138f52%5C%22%29%29%22%7D%5D%7D&from=1741608000000&to=1741636800000 [vlogs-aipt-all]: http://grafana-production.monitoring.svc.cluster.local/explore?left=%7B%22datasource%22%3A%22PD775F2863313E6C7%22%2C%22queries%22%3A%5B%7B%22refId%22%3A%22A%22%2C%22expr%22%3A%22%7Bservice_name%3D%5C%22ai-prediction-tester%5C%22%7D+message_type%3Aexact%28%5C%22intel_requests.created%5C%22%29+AND+%28batch_id%3Aexact%28%5C%22c2a627b3-3f7c-4c89-808c-4d79b19c69cb%5C%22%29+OR+batch_id%3Aexact%28%5C%22215181db-45e2-474f-9c05-839eecf61cc5%5C%22%29+OR+batch_id%3Aexact%28%5C%22812f79b7-c50d-4fd2-ac28-cc5d0318ee23%5C%22%29+OR+batch_id%3Aexact%28%5C%22d334c245-898b-4caa-b96c-954b8be31414%5C%22%29+OR+batch_id%3Aexact%28%5C%222186cea6-3ae5-40c1-805e-878cc123ca5c%5C%22%29+OR+batch_id%3Aexact%28%5C%227438ef94-89a7-48ae-83d2-71eb8a138f52%5C%22%29%29%22%7D%5D%7D&from=1741608000000&to=1741636800000 [vlogs-msgamp]: http://grafana-production.monitoring.svc.cluster.local/explore?left=%7B%22datasource%22%3A%22PD775F2863313E6C7%22%2C%22queries%22%3A%5B%7B%22refId%22%3A%22A%22%2C%22expr%22%3A%22%7Bservice_name%3D%5C%22message-amplifier%5C%22%7D+%5C%22c2a627b3%5C%22%22%7D%5D%7D&from=1741614600000&to=1741615800000 [vlogs-asu-ir-field]: http://grafana-production.monitoring.svc.cluster.local/explore?left=%7B%22datasource%22%3A%22PD775F2863313E6C7%22%2C%22queries%22%3A%5B%7B%22refId%22%3A%22A%22%2C%22expr%22%3A%22%7Bservice_name%3D%5C%22asset-status-updater%5C%22%7D+message_type%3Aexact%28%5C%22intel_requests.created%5C%22%29+AND+intel_request_id%3Aexact%28%5C%2256a219a6-7597-42c7-85ea-1636cdfa8e53%5C%22%29%22%7D%5D%7D&from=1741608000000&to=1741636800000 [vlogs-asu-ir-any]: http://grafana-production.monitoring.svc.cluster.local/explore?left=%7B%22datasource%22%3A%22PD775F2863313E6C7%22%2C%22queries%22%3A%5B%7B%22refId%22%3A%22A%22%2C%22expr%22%3A%22%7Bservice_name%3D%5C%22asset-status-updater%5C%22%7D+%5C%2256a219a6%5C%22%22%7D%5D%7D&from=1741608000000&to=1741636800000 [vlogs-aipt-ir]: http://grafana-production.monitoring.svc.cluster.local/explore?left=%7B%22datasource%22%3A%22PD775F2863313E6C7%22%2C%22queries%22%3A%5B%7B%22refId%22%3A%22A%22%2C%22expr%22%3A%22%7Bservice_name%3D%5C%22ai-prediction-tester%5C%22%7D+intel_request_id%3Aexact%28%5C%2256a219a6-7597-42c7-85ea-1636cdfa8e53%5C%22%29%22%7D%5D%7D&from=1741608000000&to=1741636800000 [vlogs-asu-errors-jan13]: http://grafana-production.monitoring.svc.cluster.local/explore?left=%7B%22datasource%22%3A%22PD775F2863313E6C7%22%2C%22queries%22%3A%5B%7B%22refId%22%3A%22A%22%2C%22expr%22%3A%22%7Bservice_name%3D%5C%22asset-status-updater%5C%22%7D+level%3Aexact%28%5C%22error%5C%22%29%22%7D%5D%7D&from=1736805600000&to=1736816400000 [vlogs-replay-ir-breakdown]: http://grafana-production.monitoring.svc.cluster.local/explore?left=%7B%22datasource%22%3A%22PD775F2863313E6C7%22%2C%22queries%22%3A%5B%7B%22refId%22%3A%22A%22%2C%22expr%22%3A%22%7Bservice_name%3D~%5C%22replay-from-loki-hour-.%2A%5C%22%7D+%5C%22++intel_requests.created%5C%22%22%7D%5D%7D&from=1736899200000&to=1736985600000 [vlogs-replay-totals]: http://grafana-production.monitoring.svc.cluster.local/explore?left=%7B%22datasource%22%3A%22PD775F2863313E6C7%22%2C%22queries%22%3A%5B%7B%22refId%22%3A%22A%22%2C%22expr%22%3A%22%7Bservice_name%3D~%5C%22replay-from-loki-hour-.%2A%5C%22%7D+%5C%22TOTAL%5C%22%22%7D%5D%7D&from=1736899200000&to=1742169599000 [vlogs-replay-sample]: http://grafana-production.monitoring.svc.cluster.local/explore?left=%7B%22datasource%22%3A%22PD775F2863313E6C7%22%2C%22queries%22%3A%5B%7B%22refId%22%3A%22A%22%2C%22expr%22%3A%22%7Bservice_name%3D%5C%22replay-from-loki-hour-18-4bht6%5C%22%7D+%5C%22REPLAY+SUMMARY%5C%22+OR+%5C%22TOTAL%5C%22%22%7D%5D%7D&from=1737072000000&to=1737158400000 -
jmealo revised this gist
Mar 16, 2026 . 1 changed file with 150 additions and 3 deletions.There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters. Learn more about bidirectional Unicode charactersOriginal file line number Diff line number Diff line change @@ -205,11 +205,158 @@ By event type: 3. **Investigate AMQP reconnection behavior** — the `PRECONDITION_FAILED - unknown delivery tag` error suggests messages processed on the old connection can't be ACKed after reconnect, potentially losing the entire in-flight batch 4. **Consider adding logging to `_get_assets_by_name`** when it returns `[]` — currently this is a silent discard --- ## Runbook: Replay March 10 Data ### Prerequisites - [MR 29][mr29] merged and image built (`docker.gisual.net/backend/consumers/asset-status-updater:latest`) - March 10 data still in Loki retention (verify with step 1) - `kubectl` access to production cluster ### Step 1: Verify Loki has March 10 data ```bash kubectl run --rm -it loki-check --image=curlimages/curl --restart=Never -- \ curl -s 'http://loki-gateway.monitoring.svc.cluster.local/loki/api/v1/query_range' \ --data-urlencode 'query={source="events", name="intel_requests.created"}' \ --data-urlencode 'start=1741564800' \ --data-urlencode 'end=1741568400' \ --data-urlencode 'limit=1' | head -c 200 ``` If empty/error → data has aged out, stop here. ### Step 2: Run replay for March 10 (24 parallel hourly jobs) ```bash cd ~/src/gisual/backend/consumers/asset-status-updater/scripts # Dry run first — prints manifests without applying ./generate-parallel-jobs.sh 2026-03-10 --dry-run | head -50 # Apply (ensure IMAGE_TAG points to the build with MR 29) IMAGE_TAG=latest ./generate-parallel-jobs.sh 2026-03-10 ``` ### Step 3: Monitor progress ```bash # Watch job completion kubectl get jobs -l app.kubernetes.io/name=replay-from-loki -w # Tail a specific hour's logs kubectl logs -f job/replay-from-loki-hour-13 # Check for errors across all jobs kubectl logs -l app.kubernetes.io/name=replay-from-loki --tail=5 | grep -E "^(ERROR|REPLAY SUMMARY)" ``` ### Step 4: Collect REPLAY SUMMARY from all jobs Wait for all 24 jobs to complete, then: ```bash # Dump all summaries for hour in $(seq -w 0 23); do echo "=== Hour $hour ===" kubectl logs job/replay-from-loki-hour-$hour 2>/dev/null | grep -A 20 "REPLAY SUMMARY" echo done > /tmp/replay-march10-summaries.txt ``` Or via VictoriaLogs (after logs are shipped): ``` {service_name=~"replay-from-loki-hour-.*"} "TOTAL" ``` Time range: from job start to +4 hours. ### Step 5: Extract NEW event IDs ```bash # Get all NEW lines (events ASU missed) for hour in $(seq -w 0 23); do kubectl logs job/replay-from-loki-hour-$hour 2>/dev/null | grep "^.*NEW:" done > /tmp/replay-march10-new-events.txt # Count by event type cat /tmp/replay-march10-new-events.txt | grep -oP 'NEW: \S+' | sort | uniq -c | sort -rn ``` ### Step 6: Clean up jobs ```bash kubectl delete jobs -l app.kubernetes.io/name=replay-from-loki ``` --- ## Runbook: Cross-Reference NEW Events Against ASU Production Logs Once the replay completes and `NEW:` lines are collected: ### Step 1: Extract correlation_ids from NEW lines ```bash # Parse correlation_ids from the enhanced log format grep "NEW:" /tmp/replay-march10-new-events.txt \ | grep -oP 'correlation_id=\K[a-f0-9-]{36}' \ | sort -u > /tmp/new-correlation-ids.txt wc -l /tmp/new-correlation-ids.txt # How many missed events ``` ### Step 2: Check if ASU ever received these messages For each correlation_id, search ASU's production logs to see if it logged "Processing message type" for that CID: ``` {service_name="asset-status-updater"} correlation_id:exact("{cid}") "Processing" ``` Time range: March 10 00:00–24:00 UTC. **Batch check** (sample 10 CIDs): ```bash head -10 /tmp/new-correlation-ids.txt | while read cid; do echo -n "$cid: " # Use the MCP tool or curl to VictoriaLogs curl -s "http://vlogs-production.monitoring.svc.cluster.local/select/logsql/query" \ --data-urlencode "query={service_name=\"asset-status-updater\"} correlation_id:exact(\"$cid\") \"Processing\"" \ --data-urlencode "start=2026-03-10T00:00:00Z" \ --data-urlencode "end=2026-03-11T00:00:00Z" \ --data-urlencode "limit=1" | wc -c # >2 bytes = found, 0-2 bytes = not found done ``` ### Step 3: Interpret results | ASU logged "Processing" for CID? | Meaning | Root cause | |---|---|---| | **No** | RabbitMQ never delivered the message to ASU | AMQP connection failure or queue-level loss | | **Yes**, and handler made API calls | ASU processed it but the POST failed (transient error) and retries exhausted | Retry/error-exchange exhaustion | | **Yes**, but 0 API calls (DISCARDED in replay) | `_get_assets_by_name` returned empty at original time | Redis cache miss or asset not yet created | ### Step 4: Correlate with AMQP errors Check if the missed events cluster around AMQP error windows: ``` {service_name="asset-status-updater"} level:exact("error") _time:[2026-03-10T00:00:00Z, 2026-03-11T00:00:00Z] | stats by (_time:15m) count(*) as errors ``` If NEW events cluster in the same 15-minute windows as errors → AMQP connection failure is the dominant cause. If they're spread evenly → application-layer (cache/handler) issue. ### Step 5: Document findings Update this RCA with: - Total NEW count for March 10 (compare to Cody's 2,798) - Breakdown: delivery-layer vs application-layer loss - Whether the pattern matches Jan 13 (hour-22 spike) or is different --- -
jmealo revised this gist
Mar 16, 2026 . 1 changed file with 41 additions and 9 deletions.There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters. Learn more about bidirectional Unicode charactersOriginal file line number Diff line number Diff line change @@ -167,21 +167,49 @@ By event type: --- ### 6. Two distinct loss mechanisms identified **Acute — AMQP connection failures**: On Jan 13 at 22:00 UTC, assets-api returned 500s which escalated into AMQP connection death (`CHANNEL_ERROR`, `PRECONDITION_FAILED - unknown delivery tag`, `Failed to ACK message ... 'NoneType'`). This single hour accounts for **1,975 of ~2,000 missed `intel_requests.created` events** on that day ([VLogs][vlogs-asu-errors-jan13]). <details> <summary><strong>Evidence:</strong> Per-hour intel_requests.created replay results (Jan 13 data)</summary> | Hour (UTC) | New (missed) | 409s | Loss rate | Note | |---|---|---|---|---| | 00-21 | 0-4 each | 5,000-16,000 | <0.05% | Steady state | | **22** | **1,975** | **6,553** | **23.2%** | AMQP connection death | | 23 | 13 | 6,711 | 0.19% | Tail of recovery | > **Verify**: `{service_name=~"replay-from-loki-hour-.*"} " intel_requests.created"` on [Jan 15][vlogs-replay-ir-breakdown] </details> **Chronic — `asset_status.updated` code path**: The `_process_asset_status_updated_message` handler has a name-to-UUID resolution path ([processor.py:554-634][asu-cache-code]) that uses a Redis cache with 24h TTL. When the cache is cold or the asset doesn't exist yet, `_get_assets_by_name` returns `[]` → zero API calls → event silently discarded. Other event types (`power_status.updated`, `intel_requests.created`) use stable UUIDs from the message body and never hit this cache — explaining why they have 0% loss at steady state. | Event type | Asset identity | Cache dependency | Steady-state loss | |---|---|---|---| | `power_status.updated` | `intel_request_id` UUID in message | None | **0%** | | `intel_requests.created` | `intel_request_id` UUID in message | None | **~0%** (only during AMQP failures) | | `asset_status.updated` | Name→UUID via Redis cache ([source][asu-cache-code]) | 24h TTL | **~5%** | | `search.expired` | `intel_request_id` UUID in message | None | **~0%** (sporadic spikes) | ## Root Cause: Two Mechanisms 1. **AMQP connection failures** cause bulk loss of all event types during the failure window. Messages in-flight at connection death may fail to ACK, and RabbitMQ's `PRECONDITION_FAILED` on the new channel means the redelivered messages can't be acknowledged either. This is the dominant cause for `intel_requests.created` loss (Cody's original report). 2. **Redis cache misses in `_get_assets_by_name`** cause chronic `asset_status.updated` loss. When the cache returns empty (asset not yet created, or stale TTL), the handler returns `[]` and the event is silently discarded with no API call and no error log. ## Next Steps 1. **Replay March 10 data** with enhanced logging ([MR 29][mr29]) — the `NEW:` lines now include `correlation_id` and `batch_id`, letting us cross-reference against ASU production logs to confirm whether each missed event was a delivery failure (CID absent from ASU logs) or a processing failure (CID present but no API call) 2. **Deploy [MR 17][mr17] on ASU** with `LOG_ON_MESSAGE=true` — proves whether messages reach ASU's AMQP callback at all 3. **Investigate AMQP reconnection behavior** — the `PRECONDITION_FAILED - unknown delivery tag` error suggests messages processed on the old connection can't be ACKed after reconnect, potentially losing the entire in-flight batch 4. **Consider adding logging to `_get_assets_by_name`** when it returns `[]` — currently this is a silent discard **Verification after replay**: grep `NEW:` lines from replay logs, extract `correlation_id`s, then: ``` {service_name="asset-status-updater"} correlation_id:exact("{cid}") AND "Processing" ``` Present = ASU received it but processing failed silently. Absent = RabbitMQ never delivered it. --- @@ -209,6 +237,10 @@ ASU loses messages at a consistent low rate (~0.08% overall, spiking to ~1% on s [vlogs-asu-ir-field]: http://vlogs-production.monitoring.svc.cluster.local/select/vmui/#/?query=service_name%3Aexact%28%22asset-status-updater%22%29%20AND%20message_type%3Aexact%28%22intel_requests.created%22%29%20AND%20intel_request_id%3Aexact%28%2256a219a6-7597-42c7-85ea-1636cdfa8e53%22%29&start=1741608000000&end=1741636800000 [vlogs-asu-ir-any]: http://vlogs-production.monitoring.svc.cluster.local/select/vmui/#/?query=service_name%3Aexact%28%22asset-status-updater%22%29%20AND%20%2256a219a6%22&start=1741608000000&end=1741636800000 [vlogs-aipt-ir]: http://vlogs-production.monitoring.svc.cluster.local/select/vmui/#/?query=service_name%3Aexact%28%22ai-prediction-tester%22%29%20AND%20intel_request_id%3Aexact%28%2256a219a6-7597-42c7-85ea-1636cdfa8e53%22%29&start=1741608000000&end=1741636800000 [mr29]: https://gitlab.com/gisual/backend/consumers/asset-status-updater/-/merge_requests/29 [asu-cache-code]: https://gitlab.com/gisual/backend/consumers/asset-status-updater/-/blob/d70ce19a11b0cf28f51b65c26584e5b98bb6fea9/asset_status_updater/processor.py#L554-634 [vlogs-asu-errors-jan13]: http://vlogs-production.monitoring.svc.cluster.local/select/vmui/#/?query=%7Bservice_name%3D%22asset-status-updater%22%7D%20level%3Aexact%28%22error%22%29&start=1736805600000&end=1736816400000 [vlogs-replay-ir-breakdown]: http://vlogs-production.monitoring.svc.cluster.local/select/vmui/#/?query=%7Bservice_name%3D~%22replay-from-loki-hour-.%2A%22%7D%20%22%20%20intel_requests.created%22&start=1736899200000&end=1736985600000 [replay-code]: https://gitlab.com/gisual/backend/consumers/asset-status-updater/-/blob/main/asset_status_updater/cli/replay_from_loki.py [replay-409]: https://gitlab.com/gisual/backend/consumers/asset-status-updater/-/blob/main/asset_status_updater/cli/replay_from_loki.py#L126-197 [vlogs-replay-totals]: http://vlogs-production.monitoring.svc.cluster.local/select/vmui/#/?query=%7Bservice_name%3D~%22replay-from-loki-hour-.%2A%22%7D%20%22TOTAL%22&start=1736899200000&end=1742169599000 -
jmealo revised this gist
Mar 16, 2026 . 1 changed file with 50 additions and 1 deletion.There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters. Learn more about bidirectional Unicode charactersOriginal file line number Diff line number Diff line change @@ -120,11 +120,56 @@ Vector's VRL pipeline ([`07-extract-ids.vrl`][vrl-file]) also extracts IDs from - The `_on_message` diagnostic logging in [MR 17][mr17] is still the right next diagnostic step - Consumer-side loss IS happening — just only in ASU, not everywhere ### 5. Replay experiment independently confirms ASU-specific loss A [replay tool][replay-code] was built into the ASU image and run as K8s Jobs in production. It reads events from Loki (the same events event-recorder stored), replays them through ASU's `Processor`, and counts: **409 = already processed** (ASU had it), **2xx = new** (ASU missed it). The tool intentionally does NOT add 409 to `allowed_errors` so duplicates are counted separately ([source][replay-409]). 200 replay jobs ran across 4 dates, covering multiple days of production data: | Run date | Jobs | New (missed) | 409s (had it) | Loss rate | |---|---|---|---|---| | 2026-01-15 | 35 | 4,438 | 428,977 | 1.024% | | 2026-01-16 | 112 | 608 | 2,968,829 | 0.021% | | 2026-01-17 | 5 | 272 | 3,024,879 | 0.009% | | 2026-02-16 | 48 | 183 | 522,793 | 0.035% | | **Total** | **200** | **5,501** | **6,945,478** | **0.079%** | > **Verify**: `{service_name=~"replay-from-loki-hour-.*"} "TOTAL"` over [Jan 15 – Mar 16][vlogs-replay-totals] The Jan 15 run had the highest rate (1.02%) because it was the first replay — subsequent runs found fewer new events as the replay itself backfilled gaps. The steady-state rate of ~0.02-0.04% on reruns suggests the loss is ongoing but at a lower rate than Cody's Mar 10 sample (0.93%). <details> <summary><strong>Evidence:</strong> Sample replay output (hour 18, Dec 29 data)</summary> Pod: `replay-from-loki-hour-18-4bht6`, ran 2026-01-17T13:47Z, replaying Dec 29 18:00-19:00 UTC ``` REPLAY SUMMARY Total events fetched from Loki: 952,391 Successfully processed (new): 104 Duplicates (409): 872,513 Discarded (no API call): 79,774 Errors: 0 By event type: Event Type New 409s Skip asset_status.updated 13 788 2047 incidents.deleted 0 5 7 incidents.updated 0 532 22060 intel_requests.created 1 41112 68 power_status.updated 0 62941 55574 search.expired 90 767135 18 TOTAL 104 872513 79774 ``` > **Verify**: `{service_name="replay-from-loki-hour-18-4bht6"} "REPLAY SUMMARY" OR "TOTAL"` on [Jan 17][vlogs-replay-sample] </details> --- ## Root Cause: Unknown ASU loses messages at a consistent low rate (~0.08% overall, spiking to ~1% on specific days). event-recorder and AIPT (same library, same exchange) get 100%. Unverified theories: 1. **Silent drop in ASU handler** — early return without logging 2. **Backpressure** — ASU is 3x slower per message (0.12-0.15s vs 0.05s) @@ -164,3 +209,7 @@ ASU loses ~4% of `intel_requests.created` messages consistently. event-recorder [vlogs-asu-ir-field]: http://vlogs-production.monitoring.svc.cluster.local/select/vmui/#/?query=service_name%3Aexact%28%22asset-status-updater%22%29%20AND%20message_type%3Aexact%28%22intel_requests.created%22%29%20AND%20intel_request_id%3Aexact%28%2256a219a6-7597-42c7-85ea-1636cdfa8e53%22%29&start=1741608000000&end=1741636800000 [vlogs-asu-ir-any]: http://vlogs-production.monitoring.svc.cluster.local/select/vmui/#/?query=service_name%3Aexact%28%22asset-status-updater%22%29%20AND%20%2256a219a6%22&start=1741608000000&end=1741636800000 [vlogs-aipt-ir]: http://vlogs-production.monitoring.svc.cluster.local/select/vmui/#/?query=service_name%3Aexact%28%22ai-prediction-tester%22%29%20AND%20intel_request_id%3Aexact%28%2256a219a6-7597-42c7-85ea-1636cdfa8e53%22%29&start=1741608000000&end=1741636800000 [replay-code]: https://gitlab.com/gisual/backend/consumers/asset-status-updater/-/blob/main/asset_status_updater/cli/replay_from_loki.py [replay-409]: https://gitlab.com/gisual/backend/consumers/asset-status-updater/-/blob/main/asset_status_updater/cli/replay_from_loki.py#L126-197 [vlogs-replay-totals]: http://vlogs-production.monitoring.svc.cluster.local/select/vmui/#/?query=%7Bservice_name%3D~%22replay-from-loki-hour-.%2A%22%7D%20%22TOTAL%22&start=1736899200000&end=1742169599000 [vlogs-replay-sample]: http://vlogs-production.monitoring.svc.cluster.local/select/vmui/#/?query=%7Bservice_name%3D%22replay-from-loki-hour-18-4bht6%22%7D%20%22REPLAY%20SUMMARY%22%20OR%20%22TOTAL%22&start=1737072000000&end=1737158400000 -
jmealo revised this gist
Mar 16, 2026 . 1 changed file with 10 additions and 10 deletions.There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters. Learn more about bidirectional Unicode charactersOriginal file line number Diff line number Diff line change @@ -154,13 +154,13 @@ ASU loses ~4% of `intel_requests.created` messages consistently. event-recorder [vrl-batch]: https://gitlab.com/gisual/configuration-management/kubernetes/-/blob/main/applications/infrastructure/charts/vector/vrl/07-extract-ids.vrl#L213-219 [mr17]: https://gitlab.com/gisual/backend/libraries/gisual-amqp-consumer/-/merge_requests/17 [mr48]: https://gitlab.com/gisual/backend/apis/intel-requests-api/-/merge_requests/48 [vlogs-asu]: http://vlogs-production.monitoring.svc.cluster.local/select/vmui/#/?query=service_name%3Aexact%28%22asset-status-updater%22%29%20AND%20message_type%3Aexact%28%22intel_requests.created%22%29%20AND%20batch_id%3Aexact%28%22c2a627b3-3f7c-4c89-808c-4d79b19c69cb%22%29&start=1741614600000&end=1741615800000 [vlogs-er]: http://vlogs-production.monitoring.svc.cluster.local/select/vmui/#/?query=service_name%3Aexact%28%22event-recorder%22%29%20AND%20message_type%3Aexact%28%22intel_requests.created%22%29%20AND%20batch_id%3Aexact%28%22c2a627b3-3f7c-4c89-808c-4d79b19c69cb%22%29&start=1741614600000&end=1741615800000 [vlogs-aipt]: http://vlogs-production.monitoring.svc.cluster.local/select/vmui/#/?query=service_name%3Aexact%28%22ai-prediction-tester%22%29%20AND%20message_type%3Aexact%28%22intel_requests.created%22%29%20AND%20batch_id%3Aexact%28%22c2a627b3-3f7c-4c89-808c-4d79b19c69cb%22%29&start=1741614600000&end=1741615800000 [vlogs-asu-all]: http://vlogs-production.monitoring.svc.cluster.local/select/vmui/#/?query=service_name%3Aexact%28%22asset-status-updater%22%29%20AND%20message_type%3Aexact%28%22intel_requests.created%22%29%20AND%20%28batch_id%3Aexact%28%22c2a627b3-3f7c-4c89-808c-4d79b19c69cb%22%29%20OR%20batch_id%3Aexact%28%22215181db-45e2-474f-9c05-839eecf61cc5%22%29%20OR%20batch_id%3Aexact%28%22812f79b7-c50d-4fd2-ac28-cc5d0318ee23%22%29%20OR%20batch_id%3Aexact%28%22d334c245-898b-4caa-b96c-954b8be31414%22%29%20OR%20batch_id%3Aexact%28%222186cea6-3ae5-40c1-805e-878cc123ca5c%22%29%20OR%20batch_id%3Aexact%28%227438ef94-89a7-48ae-83d2-71eb8a138f52%22%29%29&start=1741608000000&end=1741636800000 [vlogs-msgamp]: http://vlogs-production.monitoring.svc.cluster.local/select/vmui/#/?query=service_name%3Aexact%28%22message-amplifier%22%29%20AND%20%22c2a627b3%22&start=1741614600000&end=1741615800000 [vlogs-er-all]: http://vlogs-production.monitoring.svc.cluster.local/select/vmui/#/?query=service_name%3Aexact%28%22event-recorder%22%29%20AND%20message_type%3Aexact%28%22intel_requests.created%22%29%20AND%20%28batch_id%3Aexact%28%22c2a627b3-3f7c-4c89-808c-4d79b19c69cb%22%29%20OR%20batch_id%3Aexact%28%22215181db-45e2-474f-9c05-839eecf61cc5%22%29%20OR%20batch_id%3Aexact%28%22812f79b7-c50d-4fd2-ac28-cc5d0318ee23%22%29%20OR%20batch_id%3Aexact%28%22d334c245-898b-4caa-b96c-954b8be31414%22%29%20OR%20batch_id%3Aexact%28%222186cea6-3ae5-40c1-805e-878cc123ca5c%22%29%20OR%20batch_id%3Aexact%28%227438ef94-89a7-48ae-83d2-71eb8a138f52%22%29%29&start=1741608000000&end=1741636800000 [vlogs-aipt-all]: http://vlogs-production.monitoring.svc.cluster.local/select/vmui/#/?query=service_name%3Aexact%28%22ai-prediction-tester%22%29%20AND%20message_type%3Aexact%28%22intel_requests.created%22%29%20AND%20%28batch_id%3Aexact%28%22c2a627b3-3f7c-4c89-808c-4d79b19c69cb%22%29%20OR%20batch_id%3Aexact%28%22215181db-45e2-474f-9c05-839eecf61cc5%22%29%20OR%20batch_id%3Aexact%28%22812f79b7-c50d-4fd2-ac28-cc5d0318ee23%22%29%20OR%20batch_id%3Aexact%28%22d334c245-898b-4caa-b96c-954b8be31414%22%29%20OR%20batch_id%3Aexact%28%222186cea6-3ae5-40c1-805e-878cc123ca5c%22%29%20OR%20batch_id%3Aexact%28%227438ef94-89a7-48ae-83d2-71eb8a138f52%22%29%29&start=1741608000000&end=1741636800000 [vlogs-asu-ir-field]: http://vlogs-production.monitoring.svc.cluster.local/select/vmui/#/?query=service_name%3Aexact%28%22asset-status-updater%22%29%20AND%20message_type%3Aexact%28%22intel_requests.created%22%29%20AND%20intel_request_id%3Aexact%28%2256a219a6-7597-42c7-85ea-1636cdfa8e53%22%29&start=1741608000000&end=1741636800000 [vlogs-asu-ir-any]: http://vlogs-production.monitoring.svc.cluster.local/select/vmui/#/?query=service_name%3Aexact%28%22asset-status-updater%22%29%20AND%20%2256a219a6%22&start=1741608000000&end=1741636800000 [vlogs-aipt-ir]: http://vlogs-production.monitoring.svc.cluster.local/select/vmui/#/?query=service_name%3Aexact%28%22ai-prediction-tester%22%29%20AND%20intel_request_id%3Aexact%28%2256a219a6-7597-42c7-85ea-1636cdfa8e53%22%29&start=1741608000000&end=1741636800000 -
jmealo revised this gist
Mar 16, 2026 . 1 changed file with 13 additions and 6 deletions.There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters. Learn more about bidirectional Unicode charactersOriginal file line number Diff line number Diff line change @@ -39,7 +39,7 @@ Counted `intel_requests.created` processing logs per batch per consumer. Each me <details> <summary><strong>Evidence:</strong> Raw VictoriaLogs stats query + output</summary> Query (run per service — replace `{service}` with service name): ``` service_name:exact("{service}") AND message_type:exact("intel_requests.created") AND (batch_id:exact("c2a627b3-3f7c-4c89-808c-4d79b19c69cb") @@ -52,6 +52,8 @@ _time:[2026-03-10T13:00:00Z, 2026-03-10T20:00:00Z] | stats by (batch_id) count(*) as log_lines ``` View raw logs: [event-recorder][vlogs-er-all] | [AIPT][vlogs-aipt-all] | [ASU][vlogs-asu-all] | service | `c2a627b3` | `215181db` | `812f79b7` | `d334c245` | `2186cea6` | `7438ef94` | **total** | **÷2** | |---|---|---|---|---|---|---|---|---| | **event-recorder** | 348 | 44 | 136 | 128 | 44 | 108 | 808 | **404** | @@ -100,11 +102,11 @@ WHERE intel_request_id IN ('56a219a6-...', 'bc2d6cef-...', ...) Why searching by `intel_request_id` is misleading: | Service | `intel_request_id` in VictoriaLogs? | Source | On `intel_requests.created` lines? | Proof | |---|---|---|---|---| | **ASU** | Yes | Vector extracts from `(Intel Request {uuid})` in `_msg` ([VRL][vrl-ir], [Python][asu-discard]) | **No** — only on `power_status.updated` discard lines | [search returns 0 results][vlogs-asu-ir-field], [free-text finds only discard line][vlogs-asu-ir-any] | | **AIPT** | Yes | Python logs it on `fetch prediction` step ([source][aipt-fetch]) | **No** — different log line | [finds `fetch prediction`, not Processing/Processed][vlogs-aipt-ir] | | **event-recorder** | **No** | Only stores `correlation_id`, `organization_id` as metadata ([source][er-metadata]) | N/A | — | The shared amqp-consumer library ([source][lib-log]) logs `correlation_id`, `request_id`, and `batch_id` on every Processing/Processed line. It never logs `intel_request_id`. These context fields are set at [L632-642][lib-context]. @@ -157,3 +159,8 @@ ASU loses ~4% of `intel_requests.created` messages consistently. event-recorder [vlogs-aipt]: http://vlogs-production.monitoring.svc.cluster.local/select/vmui/#/logsql?query=service_name%3Aexact%28%22ai-prediction-tester%22%29%20AND%20message_type%3Aexact%28%22intel_requests.created%22%29%20AND%20batch_id%3Aexact%28%22c2a627b3-3f7c-4c89-808c-4d79b19c69cb%22%29&start=1741614600000&end=1741615800000 [vlogs-asu-all]: http://vlogs-production.monitoring.svc.cluster.local/select/vmui/#/logsql?query=service_name%3Aexact%28%22asset-status-updater%22%29%20AND%20message_type%3Aexact%28%22intel_requests.created%22%29%20AND%20%28batch_id%3Aexact%28%22c2a627b3-3f7c-4c89-808c-4d79b19c69cb%22%29%20OR%20batch_id%3Aexact%28%22215181db-45e2-474f-9c05-839eecf61cc5%22%29%20OR%20batch_id%3Aexact%28%22812f79b7-c50d-4fd2-ac28-cc5d0318ee23%22%29%20OR%20batch_id%3Aexact%28%22d334c245-898b-4caa-b96c-954b8be31414%22%29%20OR%20batch_id%3Aexact%28%222186cea6-3ae5-40c1-805e-878cc123ca5c%22%29%20OR%20batch_id%3Aexact%28%227438ef94-89a7-48ae-83d2-71eb8a138f52%22%29%29&start=1741608000000&end=1741636800000 [vlogs-msgamp]: http://vlogs-production.monitoring.svc.cluster.local/select/vmui/#/logsql?query=service_name%3Aexact%28%22message-amplifier%22%29%20AND%20%22c2a627b3%22&start=1741614600000&end=1741615800000 [vlogs-er-all]: http://vlogs-production.monitoring.svc.cluster.local/select/vmui/#/logsql?query=service_name%3Aexact%28%22event-recorder%22%29%20AND%20message_type%3Aexact%28%22intel_requests.created%22%29%20AND%20%28batch_id%3Aexact%28%22c2a627b3-3f7c-4c89-808c-4d79b19c69cb%22%29%20OR%20batch_id%3Aexact%28%22215181db-45e2-474f-9c05-839eecf61cc5%22%29%20OR%20batch_id%3Aexact%28%22812f79b7-c50d-4fd2-ac28-cc5d0318ee23%22%29%20OR%20batch_id%3Aexact%28%22d334c245-898b-4caa-b96c-954b8be31414%22%29%20OR%20batch_id%3Aexact%28%222186cea6-3ae5-40c1-805e-878cc123ca5c%22%29%20OR%20batch_id%3Aexact%28%227438ef94-89a7-48ae-83d2-71eb8a138f52%22%29%29&start=1741608000000&end=1741636800000 [vlogs-aipt-all]: http://vlogs-production.monitoring.svc.cluster.local/select/vmui/#/logsql?query=service_name%3Aexact%28%22ai-prediction-tester%22%29%20AND%20message_type%3Aexact%28%22intel_requests.created%22%29%20AND%20%28batch_id%3Aexact%28%22c2a627b3-3f7c-4c89-808c-4d79b19c69cb%22%29%20OR%20batch_id%3Aexact%28%22215181db-45e2-474f-9c05-839eecf61cc5%22%29%20OR%20batch_id%3Aexact%28%22812f79b7-c50d-4fd2-ac28-cc5d0318ee23%22%29%20OR%20batch_id%3Aexact%28%22d334c245-898b-4caa-b96c-954b8be31414%22%29%20OR%20batch_id%3Aexact%28%222186cea6-3ae5-40c1-805e-878cc123ca5c%22%29%20OR%20batch_id%3Aexact%28%227438ef94-89a7-48ae-83d2-71eb8a138f52%22%29%29&start=1741608000000&end=1741636800000 [vlogs-asu-ir-field]: http://vlogs-production.monitoring.svc.cluster.local/select/vmui/#/logsql?query=service_name%3Aexact%28%22asset-status-updater%22%29%20AND%20message_type%3Aexact%28%22intel_requests.created%22%29%20AND%20intel_request_id%3Aexact%28%2256a219a6-7597-42c7-85ea-1636cdfa8e53%22%29&start=1741608000000&end=1741636800000 [vlogs-asu-ir-any]: http://vlogs-production.monitoring.svc.cluster.local/select/vmui/#/logsql?query=service_name%3Aexact%28%22asset-status-updater%22%29%20AND%20%2256a219a6%22&start=1741608000000&end=1741636800000 [vlogs-aipt-ir]: http://vlogs-production.monitoring.svc.cluster.local/select/vmui/#/logsql?query=service_name%3Aexact%28%22ai-prediction-tester%22%29%20AND%20intel_request_id%3Aexact%28%2256a219a6-7597-42c7-85ea-1636cdfa8e53%22%29&start=1741608000000&end=1741636800000 -
jmealo revised this gist
Mar 16, 2026 . 1 changed file with 2 additions and 2 deletions.There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters. Learn more about bidirectional Unicode charactersOriginal file line number Diff line number Diff line change @@ -37,7 +37,7 @@ Counted `intel_requests.created` processing logs per batch per consumer. Each me > **Verify**: [ASU batch c2a627b3][vlogs-asu] (expect 336 lines) vs [event-recorder][vlogs-er] (expect 348 lines) vs [AIPT][vlogs-aipt] (expect 348 lines) — or [ASU all 6 batches][vlogs-asu-all] (expect 776 lines) <details> <summary><strong>Evidence:</strong> Raw VictoriaLogs stats query + output</summary> Query (run per service): ``` @@ -71,7 +71,7 @@ intel_request_id → metadata->>'correlation_id' (per-location, logged as CID) The `batch_id` is set to `self.correlation_id` in intel-requests-api ([source][batch-id-code]). The mapping query was run on `production-app-db-2-restore` partition `intel_requests_p20260301`. <details> <summary><strong>Evidence:</strong> DB query and ID mapping results</summary> ```sql SELECT intel_request_id::text, -
jmealo revised this gist
Mar 16, 2026 . 1 changed file with 5 additions and 3 deletions.There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters. Learn more about bidirectional Unicode charactersOriginal file line number Diff line number Diff line change @@ -52,9 +52,11 @@ _time:[2026-03-10T13:00:00Z, 2026-03-10T20:00:00Z] | stats by (batch_id) count(*) as log_lines ``` | service | `c2a627b3` | `215181db` | `812f79b7` | `d334c245` | `2186cea6` | `7438ef94` | **total** | **÷2** | |---|---|---|---|---|---|---|---|---| | **event-recorder** | 348 | 44 | 136 | 128 | 44 | 108 | 808 | **404** | | **AIPT** | 348 | 44 | 136 | 128 | 44 | 108 | 808 | **404** | | **ASU** | 336 | 42 | 132 | 124 | 40 | 102 | 776 | **388** | </details> ### 2. The previous RCA searched by the wrong ID -
jmealo revised this gist
Mar 16, 2026 . 1 changed file with 7 additions and 4 deletions.There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters. Learn more about bidirectional Unicode charactersOriginal file line number Diff line number Diff line change @@ -1,9 +1,12 @@ # RCA: Missing intel_requests.created (ENG-3349) — Corrected | | | |---|---| | **Status** | PROVISIONAL — ASU-specific loss confirmed, root cause unknown | | **Supersedes** | [Previous RCA](https://gist.github.com/jmealo/d81250648d67089111676b13de6120e7) | | **Incident date** | 2026-03-10 | | **Analysis date** | 2026-03-16 | | **Reverted** | [intel-requests-api !48](https://gitlab.com/gisual/backend/apis/intel-requests-api/-/merge_requests/48) (batch_id change from !47) | --- -
jmealo revised this gist
Mar 16, 2026 . 1 changed file with 104 additions and 107 deletions.There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters. Learn more about bidirectional Unicode charactersOriginal file line number Diff line number Diff line change @@ -1,54 +1,27 @@ # RCA: Missing intel_requests.created (ENG-3349) — Corrected **Status**: PROVISIONAL — ASU-specific loss confirmed, root cause unknown **Supersedes**: [Previous RCA](https://gist.github.com/jmealo/d81250648d67089111676b13de6120e7) **Incident date**: 2026-03-10 | **Analysis date**: 2026-03-16 **Reverted**: [intel-requests-api !48](https://gitlab.com/gisual/backend/apis/intel-requests-api/-/merge_requests/48) (batch_id change from !47) --- ## TL;DR 2,798/300,839 intel requests on March 10 never had facts created (~0.93%). The previous RCA said all consumers were losing messages due to a shutdown race in amqp-consumer. **Wrong.** Only ASU is losing them. event-recorder and AIPT received 100%. The previous analysis searched logs by `intel_request_id` — a field that doesn't exist on the log lines that prove delivery. --- ## Findings ### 1. Only ASU drops messages Counted `intel_requests.created` processing logs per batch per consumer. Each message = 2 log lines (Processing + Processed) from the shared amqp-consumer library ([source][lib-log]). | batch_id | event-recorder | AIPT | ASU | ASU lost | |---|---|---|---|---| | `c2a627b3` | 174 | 174 | 168 | 6 (3.4%) | | `215181db` | 22 | 22 | 21 | 1 (4.5%) | @@ -58,100 +31,124 @@ Each processed message produces exactly 2 log lines (`Processing...` + `Processe | `7438ef94` | 54 | 54 | 51 | 3 (5.6%) | | **Total** | **404** | **404** | **388** | **16 (3.96%)** | > **Verify**: [ASU batch c2a627b3][vlogs-asu] (expect 336 lines) vs [event-recorder][vlogs-er] (expect 348 lines) vs [AIPT][vlogs-aipt] (expect 348 lines) — or [ASU all 6 batches][vlogs-asu-all] (expect 776 lines) <details> <summary>Raw VictoriaLogs stats output</summary> Query (run per service): ``` service_name:exact("{service}") AND message_type:exact("intel_requests.created") AND (batch_id:exact("c2a627b3-3f7c-4c89-808c-4d79b19c69cb") OR batch_id:exact("215181db-45e2-474f-9c05-839eecf61cc5") OR batch_id:exact("812f79b7-c50d-4fd2-ac28-cc5d0318ee23") OR batch_id:exact("d334c245-898b-4caa-b96c-954b8be31414") OR batch_id:exact("2186cea6-3ae5-40c1-805e-878cc123ca5c") OR batch_id:exact("7438ef94-89a7-48ae-83d2-71eb8a138f52")) _time:[2026-03-10T13:00:00Z, 2026-03-10T20:00:00Z] | stats by (batch_id) count(*) as log_lines ``` - **event-recorder**: c2a627b3=348, 215181db=44, 812f79b7=136, d334c245=128, 2186cea6=44, 7438ef94=108 (total 808 ÷ 2 = 404) - **AIPT**: c2a627b3=348, 215181db=44, 812f79b7=136, d334c245=128, 2186cea6=44, 7438ef94=108 (total 808 ÷ 2 = 404) - **ASU**: c2a627b3=336, 215181db=42, 812f79b7=132, d334c245=124, 2186cea6=40, 7438ef94=102 (total 776 ÷ 2 = 388) </details> ### 2. The previous RCA searched by the wrong ID Cody provided `intel_request_id` values. These don't appear on message-processing log lines — those lines only have `correlation_id` (CID) and `batch_id`. To trace, you must map through the DB: ``` intel_request_id → metadata->>'correlation_id' (per-location, logged as CID) → metadata->>'batch_id' (per-request, = original correlation_id) ``` The `batch_id` is set to `self.correlation_id` in intel-requests-api ([source][batch-id-code]). The mapping query was run on `production-app-db-2-restore` partition `intel_requests_p20260301`. <details> <summary>DB query and results</summary> ```sql SELECT intel_request_id::text, (metadata->>'correlation_id')::text as correlation_id, (metadata->>'batch_id')::text as batch_id, created::text FROM intel_requests.intel_requests_p20260301 WHERE intel_request_id IN ('56a219a6-...', 'bc2d6cef-...', ...) ``` | intel_request_id | correlation_id | batch_id | created (UTC) | |---|---|---|---| | `56a219a6` | `11c84002` | `c2a627b3` | 13:55:46 | | `bc2d6cef` | `e12e237a` | `c2a627b3` | 13:55:46 | | `eac57da4` | `cec14379` | `215181db` | 13:55:50 | | `27b6e33c` | `e64cd0a6` | `215181db` | 13:55:50 | | `57007346` | `e5234837` | `812f79b7` | 15:18:44 | | `819448ae` | `59355b93` | `812f79b7` | 15:18:44 | | `55941d68` | `ab310e6b` | `812f79b7` | 15:18:44 | | `9f3454be` | `af861b31` | `d334c245` | 15:19:14 | | `2cfbd887` | `a578785d` | `2186cea6` | 15:20:27 | | `90125c20` | `602b7ea5` | `7438ef94` | 15:20:27 | </details> ### 3. Logging is inconsistent across services — and Vector adds a layer Why searching by `intel_request_id` is misleading: | Service | `intel_request_id` in VictoriaLogs? | Source | Appears on `intel_requests.created` lines? | |---|---|---|---| | **ASU** | Yes | Vector extracts from `(Intel Request {uuid})` in `_msg` ([VRL][vrl-ir], [Python][asu-discard]) | **No** — only on `power_status.updated` discard lines | | **AIPT** | Yes | Python logs it on `fetch prediction` step ([source][aipt-fetch]) | **No** — on a different log line, not Processing/Processed | | **event-recorder** | **No** | Only stores `correlation_id`, `organization_id` as metadata ([source][er-metadata]) | N/A | The shared amqp-consumer library ([source][lib-log]) logs `correlation_id`, `request_id`, and `batch_id` on every Processing/Processed line. It never logs `intel_request_id`. These context fields are set at [L632-642][lib-context]. Vector's VRL pipeline ([`07-extract-ids.vrl`][vrl-file]) also extracts IDs from unstructured `_msg` text — `(CID {uuid})` → `correlation_id` ([VRL][vrl-cid]), `(Batch ID {uuid})` → `batch_id` ([VRL][vrl-batch]). All patterns use `if !exists(...)` guards so they don't conflict with structured JSON fields. **Impact**: The structured fields in VictoriaLogs are correct for our batch-counting methodology. But searching by `intel_request_id` will find the wrong log lines (discard messages, not delivery proof). ### 4. What the previous RCA got right - Lock contention between `incidents.received` and `intel_requests.created` in message-amplifier IS real (confirmed in [message-amplifier logs for batch c2a627b3][vlogs-msgamp]) - The `_on_message` diagnostic logging in [MR 17][mr17] is still the right next diagnostic step - Consumer-side loss IS happening — just only in ASU, not everywhere --- ## Root Cause: Unknown ASU loses ~4% of `intel_requests.created` messages consistently. event-recorder and AIPT (same library, same exchange) get 100%. Unverified theories: 1. **Silent drop in ASU handler** — early return without logging 2. **Backpressure** — ASU is 3x slower per message (0.12-0.15s vs 0.05s) 3. **Queue config** — different prefetch, replicas, or resource limits ## Next Steps 1. **Deploy [MR 17][mr17] on ASU** with `LOG_ON_MESSAGE=true` — proves whether messages reach ASU's callback 2. **Audit ASU handler** for silent return paths on `intel_requests.created` 3. **Compare ASU queue config** to event-recorder (prefetch, replicas, limits) **Verification after MR 17**: compare `"Received message"` count on ASU vs event-recorder for same time window. Equal = handler bug. Lower = queue/delivery issue. --- <!-- Reference-style links — kept at bottom, GitHub renders them as clickable --> [lib-log]: https://gitlab.com/gisual/backend/libraries/gisual-amqp-consumer/-/blob/f2b9dc354f360c7489a2e0aef3f044cc6e1b3f69/gisual_amqp_consumer/app.py#L657-725 [lib-context]: https://gitlab.com/gisual/backend/libraries/gisual-amqp-consumer/-/blob/f2b9dc354f360c7489a2e0aef3f044cc6e1b3f69/gisual_amqp_consumer/app.py#L632-642 [batch-id-code]: https://gitlab.com/gisual/backend/apis/intel-requests-api/-/blob/f8d1813/intel_requests_api/intel_requests.py#L1079 [asu-discard]: https://gitlab.com/gisual/backend/consumers/asset-status-updater/-/blob/d70ce19a11b0cf28f51b65c26584e5b98bb6fea9/asset_status_updater/processor.py#L248-252 [aipt-fetch]: https://gitlab.com/gisual/backend/consumers/ai-prediction-tester/-/blob/1ed1fc570cb3904bc79373caf5f5b1008c93621e/ai_prediction_tester/app.py#L219-224 [er-metadata]: https://gitlab.com/gisual/backend/consumers/event-recorder/-/blob/d8e53d560ef911d55a1029f95853ded12b47beed/event_recorder/app.py#L372-387 [vrl-file]: https://gitlab.com/gisual/configuration-management/kubernetes/-/blob/main/applications/infrastructure/charts/vector/vrl/07-extract-ids.vrl [vrl-ir]: https://gitlab.com/gisual/configuration-management/kubernetes/-/blob/main/applications/infrastructure/charts/vector/vrl/07-extract-ids.vrl#L198-203 [vrl-cid]: https://gitlab.com/gisual/configuration-management/kubernetes/-/blob/main/applications/infrastructure/charts/vector/vrl/07-extract-ids.vrl#L205-211 [vrl-batch]: https://gitlab.com/gisual/configuration-management/kubernetes/-/blob/main/applications/infrastructure/charts/vector/vrl/07-extract-ids.vrl#L213-219 [mr17]: https://gitlab.com/gisual/backend/libraries/gisual-amqp-consumer/-/merge_requests/17 [mr48]: https://gitlab.com/gisual/backend/apis/intel-requests-api/-/merge_requests/48 [vlogs-asu]: http://vlogs-production.monitoring.svc.cluster.local/select/vmui/#/logsql?query=service_name%3Aexact%28%22asset-status-updater%22%29%20AND%20message_type%3Aexact%28%22intel_requests.created%22%29%20AND%20batch_id%3Aexact%28%22c2a627b3-3f7c-4c89-808c-4d79b19c69cb%22%29&start=1741614600000&end=1741615800000 [vlogs-er]: http://vlogs-production.monitoring.svc.cluster.local/select/vmui/#/logsql?query=service_name%3Aexact%28%22event-recorder%22%29%20AND%20message_type%3Aexact%28%22intel_requests.created%22%29%20AND%20batch_id%3Aexact%28%22c2a627b3-3f7c-4c89-808c-4d79b19c69cb%22%29&start=1741614600000&end=1741615800000 [vlogs-aipt]: http://vlogs-production.monitoring.svc.cluster.local/select/vmui/#/logsql?query=service_name%3Aexact%28%22ai-prediction-tester%22%29%20AND%20message_type%3Aexact%28%22intel_requests.created%22%29%20AND%20batch_id%3Aexact%28%22c2a627b3-3f7c-4c89-808c-4d79b19c69cb%22%29&start=1741614600000&end=1741615800000 [vlogs-asu-all]: http://vlogs-production.monitoring.svc.cluster.local/select/vmui/#/logsql?query=service_name%3Aexact%28%22asset-status-updater%22%29%20AND%20message_type%3Aexact%28%22intel_requests.created%22%29%20AND%20%28batch_id%3Aexact%28%22c2a627b3-3f7c-4c89-808c-4d79b19c69cb%22%29%20OR%20batch_id%3Aexact%28%22215181db-45e2-474f-9c05-839eecf61cc5%22%29%20OR%20batch_id%3Aexact%28%22812f79b7-c50d-4fd2-ac28-cc5d0318ee23%22%29%20OR%20batch_id%3Aexact%28%22d334c245-898b-4caa-b96c-954b8be31414%22%29%20OR%20batch_id%3Aexact%28%222186cea6-3ae5-40c1-805e-878cc123ca5c%22%29%20OR%20batch_id%3Aexact%28%227438ef94-89a7-48ae-83d2-71eb8a138f52%22%29%29&start=1741608000000&end=1741636800000 [vlogs-msgamp]: http://vlogs-production.monitoring.svc.cluster.local/select/vmui/#/logsql?query=service_name%3Aexact%28%22message-amplifier%22%29%20AND%20%22c2a627b3%22&start=1741614600000&end=1741615800000 -
jmealo created this gist
Mar 16, 2026 .There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters. Learn more about bidirectional Unicode charactersOriginal file line number Diff line number Diff line change @@ -0,0 +1,157 @@ # RCA: Missing intel_requests.created — Corrected Analysis (ENG-3349) **Status**: PROVISIONAL — ASU-specific loss confirmed, root cause unknown **Date**: 2026-03-10 incident, 2026-03-16 re-analysis **JIRA**: [ENG-3349](https://gisual.atlassian.net/browse/ENG-3349) **Previous RCA**: [Gist d81250648d](https://gist.github.com/jmealo/d81250648d67089111676b13de6120e7) (superseded by this document) --- ## Summary 2,798 of 300,839 intel requests on March 10 never had facts created (~0.93%). Re-analysis of 6 sample batches (404 IRs) using correct identifiers shows: - **event-recorder**: 404/404 — **0% loss**[^stats-er] - **AIPT**: 404/404 — **0% loss**[^stats-aipt] - **ASU**: 388/404 — **3.96% loss**, consistent across every batch[^stats-asu] **The loss is ASU-specific.** The previous RCA incorrectly attributed loss to all consumers due to a search methodology error[^prev-error]. ## What the Previous RCA Got Wrong | Claim | Correction | |---|---| | "event-recorder missed 8,146" | 0% loss — received every message in all sampled batches[^stats-er] | | "AIPT received but ASU didn't" | AIPT also received 100%[^stats-aipt] | | "Independent consumer-side shutdown race" | Only ASU loses messages — shared library bug would affect all equally | | MR 47: separate batch_id fixes lock contention | Broke traceability, didn't address ASU loss — reverted[^mr48] | The error: consumer logs index messages by `correlation_id` and `batch_id`, not `intel_request_id`. The original searched by `intel_request_id`, found nothing, and concluded messages were lost. In reality, the messages were there under different identifiers[^id-mapping]. ## Tracing Methodology ### ID Mapping Cody provided `intel_request_id` values. Consumer logs don't contain this field on processing lines[^field-proof]. The mapping is: ``` intel_request_id → DB metadata column[^db-query] └→ correlation_id (unique per location, logged as CID) └→ batch_id (shared per API request, = original correlation_id)[^batch-id-code] ``` 10 sample IRs mapped to 6 distinct batches, all org `9de5d801`[^db-query]. ### Counting Method Each processed message produces exactly 2 log lines (`Processing...` + `Processed...`) from the shared amqp-consumer library[^lib-logging]. Counting `message_type:exact("intel_requests.created")` lines per `batch_id` per service, divided by 2, gives IRs processed[^stats-query]. ## Results | batch_id | event-recorder | AIPT | ASU | ASU missing | |---|---|---|---|---| | `c2a627b3` | 174 | 174 | 168 | 6 (3.4%) | | `215181db` | 22 | 22 | 21 | 1 (4.5%) | | `812f79b7` | 68 | 68 | 66 | 2 (2.9%) | | `d334c245` | 64 | 64 | 62 | 2 (3.1%) | | `2186cea6` | 22 | 22 | 20 | 2 (9.1%) | | `7438ef94` | 54 | 54 | 51 | 3 (5.6%) | | **Total** | **404** | **404** | **388** | **16 (3.96%)** | ## Why `intel_request_id` Appears as a Structured Field (But Misleads) Vector's VRL pipeline[^vrl-extract-ids] extracts IDs from unstructured log text into structured VictoriaLogs fields: | Log pattern | Extracted field | VRL source | |---|---|---| | `(Intel Request {uuid})` | `intel_request_id` | [^vrl-ir] | | `(CID {uuid})` | `correlation_id` | [^vrl-cid] | | `(Batch ID {uuid})` | `batch_id` | [^vrl-batch] | ASU logs `(Intel Request {id}): Received power_status.updated...`[^asu-discard-code] on discard lines — Vector extracts this into `intel_request_id`. But the generic `Processing/Processed message type: intel_requests.created` lines[^lib-logging] don't contain that pattern, so **`intel_request_id` is absent on the lines that prove message delivery**. This creates a trap: searching `intel_request_id:exact("56a219a6...")` finds ASU's `power_status.updated` discard (hours later, different batch) but misses the `intel_requests.created` processing entirely. Event-recorder never logs individual IR IDs at all[^er-metadata] — it only stores `event_type`, `correlation_id`, and `organization_id` as structured metadata. ## What We Don't Know The root cause of ASU's ~4% loss is unidentified. Theories (all unverified): 1. **Silent discard in ASU handler** — an early return path that doesn't log[^asu-discard-code] 2. **Backpressure** — ASU processing time (0.12-0.15s/msg) is 3x slower than event-recorder (0.05s/msg) 3. **Queue configuration** — different prefetch/QoS or replica count ## Next Steps 1. **Deploy `_on_message` logging (MR 17[^mr17]) on ASU** — determines whether messages reach ASU's callback at all 2. **Audit ASU handler for silent drops** — check all return paths in `intel_requests.created` processing 3. **Compare ASU queue config** to event-recorder and AIPT (prefetch, replicas, resource limits) ### Verification Query (post MR 17 deployment) ``` service_name:exact("asset-status-updater") AND "Received message" AND "intel_requests.created" ``` Compare count to event-recorder's count for the same time window. If equal → loss is post-delivery (handler bug). If lower → loss is pre-delivery (RabbitMQ/queue issue). ## Actions Taken - **Reverted MR 47**[^mr48] — restored `correlation_id` as `batch_id` (traceability fix) - MR 17[^mr17] still open — diagnostic logging, pending Diego's review of shutdown protection --- ## Grounding ### Database [^db-query]: `SELECT intel_request_id::text, (metadata->>'correlation_id')::text, (metadata->>'batch_id')::text FROM intel_requests.intel_requests_p20260301 WHERE intel_request_id IN (...)` on `production-app-db-2-restore`. 10 IRs → 6 batches, all org `9de5d801`. [^id-mapping]: `intel_request_id` is stored in the DB row. `correlation_id` and `batch_id` are in `metadata` JSONB. Consumer logs only contain the latter two on processing lines. ### VictoriaLogs Queries [^stats-query]: `victorialogs_stats_query` with `service_name:exact("{service}") AND message_type:exact("intel_requests.created") AND (batch_id:exact("c2a627b3-...") OR ...) _time:[2026-03-10T13:00:00Z, 2026-03-10T20:00:00Z] | stats by (batch_id) count(*) as log_lines` — run per service at `time=2026-03-10T20:00:00Z`. [^stats-er]: event-recorder raw: `c2a627b3`=348, `215181db`=44, `812f79b7`=136, `d334c245`=128, `2186cea6`=44, `7438ef94`=108. Total=808 lines ÷2 = 404 IRs. [^stats-aipt]: AIPT raw: `c2a627b3`=348, `215181db`=44, `812f79b7`=136, `d334c245`=128, `2186cea6`=44, `7438ef94`=108. Total=808 lines ÷2 = 404 IRs. [^stats-asu]: ASU raw: `c2a627b3`=336, `215181db`=42, `812f79b7`=132, `d334c245`=124, `2186cea6`=40, `7438ef94`=102. Total=776 lines ÷2 = 388 IRs. [^field-proof]: `victorialogs_field_names` for `message_type:exact("intel_requests.created")` per service, 2026-03-10T13:00-14:00Z. ASU: `intel_request_id` present (27,478 hits on co-stored lines, zero hits when filtered to `intel_requests.created` + specific IR ID). event-recorder: field absent entirely. ### Grafana Verification Links - [ASU batch c2a627b3](http://grafana-production.monitoring.svc.cluster.local/explore?left=%7B%22datasource%22%3A%22PD775F2863313E6C7%22%7D&from=2026-03-10T13%3A50%3A00Z&to=2026-03-10T14%3A10%3A00Z&expr=service_name%3Aexact%28%22asset-status-updater%22%29+AND+message_type%3Aexact%28%22intel_requests.created%22%29+AND+batch_id%3Aexact%28%22c2a627b3-3f7c-4c89-808c-4d79b19c69cb%22%29) — expect 336 lines - [event-recorder batch c2a627b3](http://grafana-production.monitoring.svc.cluster.local/explore?left=%7B%22datasource%22%3A%22PD775F2863313E6C7%22%7D&from=2026-03-10T13%3A50%3A00Z&to=2026-03-10T14%3A10%3A00Z&expr=service_name%3Aexact%28%22event-recorder%22%29+AND+message_type%3Aexact%28%22intel_requests.created%22%29+AND+batch_id%3Aexact%28%22c2a627b3-3f7c-4c89-808c-4d79b19c69cb%22%29) — expect 348 lines ### Source Code [^lib-logging]: amqp-consumer generic logging — [`gisual-amqp-consumer/app.py:657-664`](https://gitlab.com/gisual/backend/libraries/gisual-amqp-consumer/-/blob/f2b9dc354f360c7489a2e0aef3f044cc6e1b3f69/gisual_amqp_consumer/app.py#L657-664) (Processing) and [`L717-725`](https://gitlab.com/gisual/backend/libraries/gisual-amqp-consumer/-/blob/f2b9dc354f360c7489a2e0aef3f044cc6e1b3f69/gisual_amqp_consumer/app.py#L717-725) (Processed). Logs `correlation_id`, `request_id`, `batch_id` — never `intel_request_id`. Logger context fields set at [`L632-642`](https://gitlab.com/gisual/backend/libraries/gisual-amqp-consumer/-/blob/f2b9dc354f360c7489a2e0aef3f044cc6e1b3f69/gisual_amqp_consumer/app.py#L632-642). [^batch-id-code]: `batch_id = self.correlation_id` in intel-requests-api — [`intel_requests.py:1079`](https://gitlab.com/gisual/backend/apis/intel-requests-api/-/blob/f8d1813/intel_requests_api/intel_requests.py#L1079) (post-revert). [^asu-discard-code]: ASU `power_on` discard logging — [`processor.py:248-252`](https://gitlab.com/gisual/backend/consumers/asset-status-updater/-/blob/d70ce19a11b0cf28f51b65c26584e5b98bb6fea9/asset_status_updater/processor.py#L248-252). Logs `intel_request_id` in `_msg` text as `(Intel Request %s)`. [^er-metadata]: event-recorder Loki structured metadata — [`app.py:372-387`](https://gitlab.com/gisual/backend/consumers/event-recorder/-/blob/d8e53d560ef911d55a1029f95853ded12b47beed/event_recorder/app.py#L372-387). Only `event_type`, `correlation_id`, `organization_id`, `organization_short_name`. ### Vector VRL (Log Field Extraction) [^vrl-extract-ids]: Vector VRL ID extraction pipeline — [`vector/vrl/07-extract-ids.vrl`](https://gitlab.com/gisual/configuration-management/kubernetes/-/blob/main/applications/infrastructure/charts/vector/vrl/07-extract-ids.vrl). All patterns use `if !exists(...)` guards (only extract from text when field absent from structured JSON). [^vrl-ir]: `intel_request_id` from `(Intel Request {uuid})` — [`07-extract-ids.vrl:198-203`](https://gitlab.com/gisual/configuration-management/kubernetes/-/blob/main/applications/infrastructure/charts/vector/vrl/07-extract-ids.vrl#L198-203). [^vrl-cid]: `correlation_id` from `(CID {uuid})` — [`07-extract-ids.vrl:205-211`](https://gitlab.com/gisual/configuration-management/kubernetes/-/blob/main/applications/infrastructure/charts/vector/vrl/07-extract-ids.vrl#L205-211). [^vrl-batch]: `batch_id` from `(Batch ID {uuid})` — [`07-extract-ids.vrl:213-219`](https://gitlab.com/gisual/configuration-management/kubernetes/-/blob/main/applications/infrastructure/charts/vector/vrl/07-extract-ids.vrl#L213-219). ### MRs [^mr48]: Revert MR — [intel-requests-api !48](https://gitlab.com/gisual/backend/apis/intel-requests-api/-/merge_requests/48) (merged 2026-03-16). Restores `self.correlation_id` as `batch_id`. [^mr17]: Diagnostic logging MR — [gisual-amqp-consumer !17](https://gitlab.com/gisual/backend/libraries/gisual-amqp-consumer/-/merge_requests/17) (open, pending review). [^prev-error]: Original RCA gist did not preserve exact queries used. The gist states findings like "No log for IR 56a219a6" without query syntax. Based on field analysis, the search likely used `intel_request_id` or free-text IR UUID, which doesn't match `intel_requests.created` processing lines.