Skip to content

Instantly share code, notes, and snippets.

Show Gist options
  • Select an option

  • Save escherize/9513c0a8ce5d91bd7026557e1f4444cf to your computer and use it in GitHub Desktop.

Select an option

Save escherize/9513c0a8ce5d91bd7026557e1f4444cf to your computer and use it in GitHub Desktop.
Enhance Usage Analytics for Embedding - AI Report

Enhance Usage Analytics for Embedding

Branch: enhance-usage-analytics-for-embedding Date: 2026-03-24 Stakeholders: Roman (spec owner), Vamsi (PM), John Swanson (tech design), Ben Grabow (implementation), Alessio Laiso (hackathon demo/mockup), Eliot (optional) Tech doc: Linear doc

What and Why

Metabase's embedding customers lack visibility into how their embedded content is being used. The view_log and query_execution tables already capture some embedding context (embedding_client, embedding_version since v51), but don't track auth method, tenant, GDPR-sensitive request metadata, or security enforcement flags.

This project extends both tables with richer context columns, creates/updates analytics views to expose them, and adds a GDPR toggle setting. The goal is to power an "Embedded Analytics" dashboard (demoed at hackathon by Alessio) showing embedding views, usage trends, tenant activity, and content popularity.

Project Context (from kickoff meeting 2026-03-25)

Collaboration model: This is a pairing project between Bryan and Ben Grabow. The embedding team is frontend-only, so backend help is needed (similar to past SDK admin settings work). Vamsi framed this as a full SDLC exercise - not just writing code, but the entire lifecycle: how to start, track, announce, provide updates in the project channel, ask people to test, etc.

Ben's personal goal: Wants to use this project to dive deep and understand things in detail rather than surface-level. Vamsi acknowledged this and noted the project is straightforward enough to be a good container for that.

Scope boundary: Our deliverable is the server-side work - migrations, populating the new columns, updating analytics views. Once our PR merges to master, a separate PR (done by the embedding/frontend team) will enhance the UI dashboards to consume this new data. We don't need to build the dashboards.

Timeline: Vamsi's rough expectation is 3-5 days. Reality is reality - communicate if it takes longer.

Testing: Should test with actual embedding scenarios (SDK, iframe, public links) and verify context is being captured into the database. Loom video updates encouraged in the project Slack channel.

Usage analytics system: The analytics views system was recently overhauled (by Bronza/someone). There's a mesh command that creates a local database and pulls YAML from resources, making it easy to change views and serialize back. Vamsi will share the relevant doc in the project Slack channel.

Data flow: Need to get HTTP header data (embedding client, user agent, IP, etc.) from request context and propagate it to where view_log entries are created. Bryan noted there's already infrastructure for this (metabase.analytics.sdk middleware, (metabase.request.current/current-request)).

Key decisions from meeting with Roman (2026-03-26)

  • Surface derivation priority: Endpoint-based surface wins over X-Metabase-Client header. If no endpoint match, check header. If no header, default to "internal". (Roman confirmed explicitly.)
  • Embedding client expansion: Currently only 3 values exist (embedding-sdk-react, embedding-simple, embedding-iframe). Embedding team will add embedding-iframe-full-app, embedding-iframe-static, embedding-iframe-public, and embedding-mcp as new client-side header values. Backend should handle unknown values as pass-through.
  • Auth method: Roman described the need to track how sessions were generated (JWT, SAML, etc.). Confirmed that auth_identity.provider (which already exists) provides this mapping. May be cut from first PR if complex; NULL auth_method is acceptable as a starting point.
  • is_download: Already derivable from query_execution.context (ends-with "download"). No new column needed.
  • Active tenants/users: Roman prefers views with actual rows (not just counts) so users can build their own count queries.
  • Tenant audit event: Roman demoed that tenant creation has empty details in audit log. Needs fix (Ticket 8).
  • Scope: Our deliverable is server-side only (migrations, data population, views). Embedding team does the dashboards/UX separately.

Architecture Overview

What exists today

  • view_log: Records content views (dashboards, questions, collections). Has embedding_client and embedding_version but not exposed in the v_view_log analytics view. Has has_access and context columns also unexposed.
  • query_execution: Records query runs. Has embedding_client, embedding_version, is_sandboxed, and context (which already distinguishes download types). Also not fully exposed in v_query_log.
  • SDK middleware (metabase.analytics.sdk): Binds *client* and *version* from X-Metabase-Client / X-Metabase-Client-Version headers. Currently only covers SDK/iframe clients - doesn't cover public, guest-embed, metabot, or agent-api surfaces.
  • Session middleware (metabase.server.middleware.session): Resolves current user from session or API key. Joins to auth_identity table (which has a provider column with values like "password", "jwt", "saml", "google", "ldap", "oidc"). Currently does NOT select the provider during session resolution.
  • Tenant model: EE-only. core_user.tenant_id exists and is loaded via @api/*current-user*. Tenant table has: id, name, slug, is_active, attributes, created_at, updated_at, tenant_collection_id.

What we're adding

Area Change Tables
Auth tracking auth_method varchar(32) view_log, query_execution
Tenant tracking tenant_id integer view_log, query_execution
Security flags is_impersonated, is_db_routed booleans query_execution only
Query parameters parameters text (JSON) query_execution only
GDPR-gated embedding_hostname, embedding_path, sanitized_user_agent, ip_address view_log, query_execution
Surface derivation Computed from embedding_client at read time via CASE analytics views only
Admin setting collect-analytics-pii boolean (default false, no feature gate) settings table
New view v_tenants analytics views
View updates v_view_log v4, v_query_log v5, v_users v5 analytics views

Design decisions (from tech doc)

  • Extend existing tables rather than new tables - simpler, no joins needed
  • Both tables get context columns because downloads/API calls only hit query_execution (no view_log entry)
  • Surface is derived at read time from embedding_client via CASE expression - if mapping evolves, historical data updates automatically
  • Auth method determined at request time and bound as dynamic var - avoids per-event DB queries for stateless auth
  • GDPR columns always exist but only populated when admin setting enabled
  • Analytics view SQL lives in resources/migrations/instance_analytics_views/ - one subdirectory per view, versioned (v1/, v2/, ...), with per-database SQL files (postgres, mysql, h2). MySQL views need SQL SECURITY INVOKER and use concat() instead of ||.

Implementation Tickets

Ticket 1: Database Migrations - Add Columns

Description: Add new nullable columns to view_log and query_execution tables, plus indexes on tenant_id.

Files to create/modify:

  • resources/migrations/060/<timestamp>_embedding_analytics.yaml - new migration file

Columns for view_log: auth_method (varchar 32), tenant_id (int), embedding_hostname (varchar 512), embedding_path (varchar 2048), sanitized_user_agent (varchar 512), ip_address (varchar 45)

Columns for query_execution: auth_method (varchar 32), tenant_id (int), is_impersonated (boolean), is_db_routed (boolean), parameters (text), embedding_hostname (varchar 512), embedding_path (varchar 2048), sanitized_user_agent (varchar 512), ip_address (varchar 45)

Indexes: idx_view_log_tenant_id, idx_query_execution_tenant_id

Notes:

  • Follow existing pattern of one changeset per column per table (see v51.2024-07-22 for embedding_client/version precedent)
  • Adding nullable columns is metadata-only on Postgres/MySQL - no table rewrite
  • Use ${boolean.type} for booleans (maps to bit(1) on MySQL)
  • Latest v60 timestamp is ~2026-03-17, so use 2026-03-24+

Ticket 2: Admin Setting - collect-analytics-pii

Description: Add a boolean admin setting that gates collection of GDPR-sensitive fields.

Files to modify:

  • src/metabase/embedding/settings.clj - add defsetting (or a more general namespace, since this is not embedding-specific)

Implementation:

(defsetting collect-analytics-pii
  (deferred-tru "Controls whether analytics collects personally identifiable information (hostname, path, user agent, IP).")
  :type       :boolean
  :default    false
  :visibility :admin
  :export?    true)

Notes: Per Roman, this is NOT embedding-specific - PII collection applies to all surfaces. No :feature gate (available to all instances). Default false means GDPR columns won't be populated until an admin explicitly enables it. Reading a feature-gated setting on non-EE instances returns the default safely (no exception), but since there's no gate here, this is moot.

Ticket 3: Auth Method Detection and Dynamic Var

Description: Determine auth_method during request processing and make it available as a dynamic var for view_log/query_execution population.

Files to modify:

  • src/metabase/request/current.clj - add *auth-method* dynamic var (NOT api/common.clj which has TODOs to migrate vars away; request.current is the intended home for new request-scoped vars)
  • src/metabase/server/middleware/session.clj - add auth_identity.provider to session query; set auth-method in merge-current-user-info
  • src/metabase/request/session.clj - bind *auth-method* in middleware layer OUTSIDE do-with-current-user (so as-admin doesn't touch it - Clojure binding only shadows explicitly named vars)

Auth method values and where each is detected:

All session-based methods flow through auth_identity.provider via the LEFT JOIN in session-with-id-query. The session is created with an auth_identity_id FK during the login flow for each provider.

Value How user authenticates Where session/identity is created Where detected at request time
password POST /api/session with email+password auth-identity.provider/login! :provider/password -> auth-identity.session/create-session-with-auth-tracking! LEFT JOIN auth_identity.provider in session-with-id-query
jwt GET /auth/sso?jwt=<token> metabase-enterprise.sso.integrations.jwt/sso-get -> login! :provider/jwt -> create-session-with-auth-tracking! LEFT JOIN auth_identity.provider
saml GET /auth/sso (redirect) -> POST /auth/sso (IdP callback) metabase-enterprise.sso.integrations.saml/sso-post -> login! :provider/saml -> create-session-with-auth-tracking! LEFT JOIN auth_identity.provider
google POST /auth/sso with Google OAuth token metabase.sso.providers.google -> login! :provider/google -> create-session-with-auth-tracking! LEFT JOIN auth_identity.provider
ldap POST /api/session (LDAP enabled, checked before password) session.api/login -> login! :provider/ldap -> create-session-with-auth-tracking! LEFT JOIN auth_identity.provider
oidc Enterprise OIDC flow login! :provider/oidc -> create-session-with-auth-tracking! LEFT JOIN auth_identity.provider
custom-oidc Enterprise custom OIDC login! :provider/custom-oidc -> create-session-with-auth-tracking! LEFT JOIN auth_identity.provider
slack-connect Slack Connect auth login! :provider/slack-connect -> create-session-with-auth-tracking! LEFT JOIN auth_identity.provider
support-access-grant Support access grant flow login! :provider/support-access-grant -> create-session-with-auth-tracking! LEFT JOIN auth_identity.provider
api-key X-Api-Key header on any request No session created. Key created via advanced-config.file.api-keys/init-from-config-file! or API. Dedicated :type :api-key user. Hardcoded in current-user-info-for-api-key -> (assoc :auth-method "api-key")
guest /api/embed/:token (signed JWT in URL) No session. Stateless - JWT verified per request by metabase.embedding-rest.api.embed. No user identity established. TODO: Bind *auth-method* in embed route middleware (Ticket 4 scope)
public /api/public/:uuid (public UUID in URL) No session. Stateless - UUID verified per request by metabase.public-sharing-rest.api. No user identity. TODO: Bind *auth-method* in public route middleware (Ticket 4 scope)
NULL Legacy sessions (pre-auth_identity migration), pulse/background jobs Pre-existing core_session rows without auth_identity_id LEFT JOIN returns NULL (correct)

Request-time flow for session-based methods:

  1. User authenticates via one of the provider login flows above
  2. auth-identity.session/create-session-with-auth-tracking! creates core_session row with auth_identity_id FK
  3. Session key returned as cookie (metabase.SESSION) or X-Metabase-Session header
  4. On subsequent requests: wrap-session-key extracts session key -> wrap-current-user-info runs session-with-id-query which LEFT JOINs auth_identity to get provider -> merged onto request as :auth-method
  5. bind-current-user wraps handler in (sdk/with-auth-method! (:auth-method request) ...) to bind the dynamic var

Request-time flow for API keys:

  1. wrap-current-user-info -> current-user-info-for-api-key matches key prefix, verifies bcrypt hash
  2. Returns (assoc :auth-method "api-key") - no auth_identity table involved
  3. Same bind-current-user binding as above

Request-time flow for embed/public (TODO - not yet implemented):

  1. These routes don't go through session resolution at all
  2. Need server-side (sdk/with-auth-method! "guest" ...) or "public" binding in route middleware
  3. Will be done alongside Ticket 4's +embedding-client route wrapper

Implementation approach:

  • Add LEFT JOIN to auth_identity in the session-with-id-query (compiled, memoized query at line ~99-140 of session.clj). This adds provider to every session lookup with zero extra queries.
  • For API keys: return {:auth-method "api-key" ...} from current-user-info-for-api-key
  • For guest/public: these don't go through session resolution. Bind *auth-method* in the respective API route handlers or wrapping middleware (Ticket 4).

Critical file references:

  • Session query: src/metabase/server/middleware/session.clj lines 99-140
  • merge-current-user-info: same file lines 235-243
  • do-with-current-user: src/metabase/request/session.clj lines 38-61
  • Current user vars: src/metabase/api/common.clj lines 104-131

Ticket 4: Expand embedding_client Vocabulary for Surface Derivation

Description: Set embedding_client server-side for surfaces that don't send the X-Metabase-Client header (public, guest-embed, metabot, agent-api).

Files to modify:

  • src/metabase/public_sharing_rest/api.clj (or middleware) - bind analytics.sdk/*client* to "public"
  • src/metabase/embedding_rest/api/embed.clj (or middleware) - bind *client* to "guest-embed"
  • src/metabase/metabot/api.clj (or middleware) - bind *client* to "metabot"
  • src/metabase/agent_api/api.clj (or middleware) - bind *client* to "agent-api"

Implementation: Each uses an unconditional with-client! binding. On conflicts (e.g., SDK iframe hitting a public route), the endpoint's server-side value wins - the endpoint determines which surface the request is actually being served through. Create a thin middleware wrapper in api_routes/routes.clj:

(defn- +embedding-client [client handler]
  (fn [request respond raise]
    (sdk/with-client! [client]
      (handler request respond raise))))

Applied to routes:

"/public"  (+public-exceptions (+embedding-client "public" metabase.public-sharing-rest.api/routes))
"/embed"   (+message-only-exceptions (+embedding-client "guest-embed" metabase.embedding-rest.api/embedding-routes))
"/metabot" (+embedding-client "metabot" metabase.metabot.api/routes)
"/agent"   (+embedding-client "agent-api" metabase.agent-api.api/routes)

Surface mapping (computed at read time in analytics views):

embedding_client value surface source
NULL internal no header, no server binding
embedding-sdk-react sdk SDK header
embedding-sdk-react-preview sdk-preview SDK + X-Metabase-Embedded-Preview header
embedding-simple modular Simple/modular embedding header
embedding-simple-preview modular-preview Simple + preview header
embedding-iframe iframe iframe detection header
embedding-iframe-preview iframe-preview iframe + preview header
public public server-side binding (Ticket 4)
guest-embed guest server-side binding (Ticket 4)
metabot metabot server-side binding (Ticket 4)
agent-api agent-api server-side binding (Ticket 4)
(anything else) pass through raw value future-proofing

Ticket 5: Populate New Columns on view_log

Description: When creating view_log entries, include auth_method, tenant_id, and GDPR-gated fields.

Files to modify:

  • src/metabase/view_log/events/view_log.clj - modify generate-view to include new fields
  • src/metabase/view_log/models/view_log.clj - update before-insert if needed

Implementation:

  • auth_method: from request.current/*auth-method* dynamic var (Ticket 3)
  • tenant_id: from (:tenant_id @api/*current-user*) - already available
  • GDPR fields: check (collect-analytics-pii) setting, if true extract from (metabase.request.current/current-request):
    • embedding_hostname: Origin header, extract hostname
    • embedding_path: Referer header, extract path (strip query params, truncate to 2048)
    • sanitized_user_agent: User-Agent header (store raw)
    • ip_address: request.util/ip-address

Ticket 6: Populate New Columns on query_execution

Description: When creating query_execution entries, include auth_method, tenant_id, security flags, parameters, and GDPR-gated fields.

Files to modify:

  • src/metabase/query_processor/middleware/process_userland_query.clj - modify query-execution-info (line 125-159) and save-successful-execution-metadata! (line 72-78)
  • src/metabase/queries/models/query_execution.clj - update before-insert if needed

Implementation (two categories of fields based on when data is available):

Category A - Captured in query :info map via qp/userland-query (Ring thread, before async boundary):

  • auth_method: from *auth-method* dynamic var (bound in request.current)
  • tenant_id: from (:tenant_id @api/*current-user*)
  • parameters: JSON-encode (:parameters query) (top-level filter values)
  • GDPR fields: from (request.current/current-request) headers, gated by (collect-analytics-pii) setting

Best injection point: qp/userland-query in src/metabase/query_processor.clj (lines 101-108) - the single centralized function ALL userland query entry points call. Add a helper that reads dynamic vars and merges into :info:

(defn userland-query [query info]
  (-> query
      (assoc-in [:middleware :userland-query?] true)
      (update :info merge info (embedding-analytics-info))))

Category B - Set on result metadata (inside QP pipeline, extracted in add-and-save-execution-metadata-xform!):

  • is_sandboxed: already implemented via this pattern
  • is_impersonated: :impersonation/role is set by EE impersonation middleware which runs INSIDE the QP pipeline, AFTER query-execution-info captures data. Must follow is_sandboxed pattern - have impersonation middleware set value on result metadata.
  • is_db_routed: :destination-database/id is set by attach-destination-db-middleware during preprocessing, also INSIDE the QP pipeline. Same timing issue as is_impersonated. Must follow is_sandboxed pattern - have routing middleware set value on result metadata via swap-destination-db execution middleware in database_routing/middleware.clj.

Why Category B can't use query-time capture: query-execution-info runs in the around middleware BEFORE the inner qp call that triggers preprocessing. Both :impersonation/role and :destination-database/id are set during preprocessing, so they're not yet on the query when query-execution-info reads it.

Failed queries: For failed queries, save-failed-query-execution! saves the execution-info map built before the query ran. Category A fields will be present. Category B fields (is_sandboxed, is_impersonated, is_db_routed) will naturally be absent since the query never produced results. This matches existing is_sandboxed behavior.

Ticket 7: Analytics Views - Update v_view_log, v_query_log, Create v_tenants, Update v_users

Description: Update analytics views to expose new columns and derived surface field.

Files to create:

  • resources/migrations/instance_analytics_views/view_log/v4/postgres-view_log.sql (+ mysql, h2)
  • resources/migrations/instance_analytics_views/query_log/v5/postgres-query_log.sql (+ mysql, h2)
  • resources/migrations/instance_analytics_views/tenants/v1/postgres-tenants.sql (+ mysql, h2)
  • resources/migrations/instance_analytics_views/users/v5/postgres-users.sql (+ mysql, h2)

Migration file: Add changesets to resources/migrations/060/<timestamp>_analytics_views.yaml

v_view_log v4 additions: has_access, context, embedding_client, embedding_version, surface (CASE), auth_method, tenant_id, embedding_hostname, embedding_path, sanitized_user_agent, ip_address

v_query_log v5 additions: auth_method, tenant_id, is_sandboxed, is_impersonated, is_db_routed, parameters, embedding_client, embedding_version, surface (CASE), embedding_hostname, embedding_path, sanitized_user_agent, ip_address

v_tenants v1: New view exposing tenant_id, name, slug, is_active, attributes, created_at, updated_at, tenant_collection_id (qualified). No active_user_count - breaks pattern of analytics views exposing raw data. Aggregate can be done in dashboard layer via JOIN to v_users (which gets tenant_id in v5).

v_users v5: Add tenant_id and tenant_qualified_id columns

Notes:

  • MySQL views need SQL SECURITY INVOKER
  • MySQL uses concat() instead of || for string concatenation
  • Must register v_tenants in audit-db-view-names set in enterprise/backend/src/metabase_enterprise/audit_app/permissions.clj
  • Use ${boolean.type} (not bare boolean) for boolean columns in YAML - MySQL needs bit(1)
  • Use ${text.type} for the parameters column (maps to TEXT/LONGTEXT/CLOB per database)

Rollback targets (view versions diverge between databases due to MySQL-only v3 fixes):

New Version PostgreSQL rollback MySQL rollback H2 rollback
v_view_log v4 v2 v3 v2
v_query_log v5 v4 v4 v4
v_users v5 v3 v4 v3
v_tenants v1 DROP VIEW IF EXISTS v_tenants DROP VIEW IF EXISTS v_tenants DROP VIEW IF EXISTS v_tenants

Ticket 8: Enrich Tenant Create Audit Event

Description: Include tenant name and attributes in the audit_log details when a tenant is created.

Files to modify:

  • enterprise/backend/src/metabase_enterprise/tenants/model.clj - update model-details for :model/Tenant to include :attributes

Current code (line 17-19):

(defmethod audit-app/model-details :model/Tenant
  [entity _event-type]
  (select-keys entity [:name :slug :is_active]))

Change: Add :attributes to the select-keys vector. The tenant object passed to the event already contains attributes (via present-tenant), so this is a one-line change.

Implementation Order

# Linear Description Status Dependencies
1 EMB-1500 Database migrations In Review none
2 EMB-1501 Admin setting: analytics-pii-retension-enabled In Review none
3 EMB-1502 Auth method detection and dynamic var Paused none
4 EMB-1503 Expand embedding_client vocabulary In Review none
5 EMB-1504 Populate new columns on view_log Todo EMB-1500, 1501, 1502, 1503, 1512
6 EMB-1505 Populate new columns on query_execution Todo EMB-1500, 1501, 1502, 1503, 1512
7 EMB-1506 Analytics views: v_view_log v4, v_query_log v5, v_tenants v1 Todo EMB-1500
8 EMB-1507 Enrich tenant create audit event In Review none
- EMB-1508 Test: PII setting controls gathering Todo EMB-1501
- EMB-1512 Collect and populate GDPR-gated request metadata In Review EMB-1501

EMB-1500, 1501, 1502, 1503, 1507, 1512 can all be worked in parallel. EMB-1504 and EMB-1505 can be parallelized once their dependencies are met. EMB-1506 only needs the migrations.

Verification Plan

  1. Migrations: Start Metabase against H2/Postgres, verify columns exist via REPL query
  2. Auth method: Log in via different methods (password, API key), check *auth-method* is correctly bound
  3. embedding_client expansion: Hit public/embed/metabot/agent endpoints, verify embedding_client is populated
  4. view_log population: View a dashboard via public link, verify new columns are populated
  5. query_execution population: Run a query via embedded SDK, verify auth_method, tenant_id, security flags
  6. GDPR toggle: Enable setting, verify hostname/path/user-agent/IP are captured. Disable, verify they're NULL
  7. Analytics views: Query v_view_log, v_query_log, v_tenants, v_users and verify derived surface column works correctly
  8. Audit event: Create a tenant, check audit_log details includes attributes

Test Methodology

This section maps each ticket to the existing test file it should extend, the pattern to follow, concrete test cases, and the helpers/vars needed.

Ticket 1: Migrations

File: test/metabase/app_db/schema_migrations_test.clj

Pattern to follow: workspace-input-normalization-migration-test (line 2872). Uses impl/test-migrations with a migration range vector, inserts rows before (migrate!), then verifies new columns exist and are NULL on pre-existing rows.

Test cases:

  1. view-log-new-columns-migration-test - Insert a view_log row before migration, run (migrate!), verify auth_method, tenant_id, hostname, path, user_agent, ip_address are all NULL on the pre-existing row.
  2. query-execution-new-columns-migration-test - Same pattern for query_execution. Insert a row with required fields (hash, started_at, running_time, result_rows, native, executor_id, database_id, context), migrate, verify new columns are NULL.
  3. tenant-id-index-migration-test - After migration, verify the index works by querying with a WHERE tenant_id = ? clause on both tables (implicit: if migration succeeded, index was created).

Helpers needed: impl/test-migrations, t2/insert-returning-pks!, mdb.query/query

Note: Migration IDs are placeholders until the Liquibase YAML is written. Replace "v60.YYYY-MM-DDTHH:MM:00" with actual IDs.

Ticket 2: collect-analytics-pii Admin Setting

File: test/metabase/analytics/usage_analytics_test.clj (new file - this setting has no natural home in existing test files)

Pattern to follow: Any defsetting test, e.g., mt/with-temporary-setting-values. The setting doesn't exist yet - it needs to be created first.

Test cases:

  1. collect-analytics-pii-default-test - Verify default is false
  2. collect-analytics-pii-toggle-test - Use mt/with-temporary-setting-values [collect-analytics-pii true] / false, verify round-trip
  3. collect-analytics-pii-no-feature-gate-test - Wrap in mt/with-premium-features #{}, verify the setting still works (no EE gating)
  4. collect-analytics-pii-admin-only-test - mt/user-http-request :rasta :put 403 "setting/collect-analytics-pii" verifies non-admin can't change it

Helpers needed: mt/with-temporary-setting-values, mt/with-premium-features, mt/user-http-request

Note: collect-analytics-pii does NOT exist yet. Define it (likely in metabase.analytics.core or a new metabase.analytics.settings ns) before writing tests.

Ticket 3: Auth Method Detection

File: test/metabase/analytics/usage_analytics_test.clj (same new file as Ticket 2)

Pattern to follow: Integration tests that make HTTP requests via mt/user-http-request or client/client, then inspect side effects.

Test cases:

  1. auth-method-from-session-test - Log in via password, make a request, verify *auth-method* was bound to the auth_identity.provider value (e.g., "password")
  2. auth-method-api-key-test - Create an API key with mt/with-temp [:model/ApiKey ...], make a request using it, verify *auth-method* = "api-key"
  3. auth-method-nil-when-unauthenticated-test - Hit a public route, verify *auth-method* is nil

Helpers needed: mt/with-temp, mt/user-http-request, client/client

Note: *auth-method* does NOT exist yet. It needs to be defined as a dynamic var (likely in metabase.request.current or session middleware) and bound during request processing. Tests will need to observe the bound value indirectly - e.g., by checking the auth_method column written to view_log or query_execution (which means Tickets 3+5 or 3+6 test together in practice).

Ticket 4: Expanded embedding_client

File: test/metabase/view_log/events/view_log_test.clj (extend existing public/embed tests)

Pattern to follow: get-public-card-logs-view-test (line 250). This test already:

  1. Wraps in mt/with-premium-features #{:audit-app}
  2. Enables public sharing with mt/with-temporary-setting-values [enable-public-sharing true]
  3. Creates a public card with public-test/with-temp-public-card
  4. Hits the public endpoint via client/client
  5. Asserts on latest-view

Extend these assertions to check the embedding_client value.

Test cases:

  1. Extend get-public-card-logs-view-test - After public card query, verify (:embedding_client (latest-view ...)) = "public"
  2. Extend get-embedded-card-embedding-logs-view-test (line 280) - Verify embedding_client = "guest-embed"
  3. server-side-binding-wins-over-client-header-test - Send X-Metabase-Client: embedding-sdk-react header to a public route, verify embedding_client is still "public" (server-side wins)

Current state: embedding_client already exists on view_log and query_execution. The include-sdk-info function in analytics/sdk.clj (line 34) sets it from *client* dynamic var. Currently *client* is only bound by embedding-mw (for SDK requests). For public/embed/metabot/agent routes, binding needs to happen in the route middleware.

Helpers needed: public-test/with-temp-public-card, embed-test/with-embedding-enabled-and-new-secret-key!, latest-view, client/client

Ticket 5: Populate view_log New Columns

File: test/metabase/view_log/events/view_log_test.clj

Pattern to follow: card-read-ee-test (line 27). The pattern is:

(mt/with-premium-features #{:audit-app}
  (mt/with-temp [:model/User user {} :model/Card card {:creator_id (u/id user)}]
    (events/publish-event! :event/card-read {:object-id (u/id card) :user-id (u/the-id user) :context :question})
    (is (partial= {...} (latest-view (u/id user) (u/id card))))))

Test cases:

  1. view-log-auth-method-populated-test - Bind *auth-method* (once it exists), publish a card-read event, verify :auth_method on the view_log row
  2. view-log-tenant-id-populated-test - Create a user with :tenant_id, publish event as that user, verify :tenant_id on the view_log row
  3. view-log-gdpr-fields-null-when-pii-disabled-test - With collect-analytics-pii = false, publish event via HTTP request (to get request context), verify GDPR fields are NULL
  4. view-log-gdpr-fields-populated-when-pii-enabled-test - With collect-analytics-pii = true, publish event via HTTP request, verify hostname, path, user_agent, ip_address are populated
  5. view-log-null-for-unauthenticated-requests-test - Hit public route, verify auth_method and tenant_id are NULL

Important: View logging requires #{:audit-app} premium feature. All tests must wrap in mt/with-premium-features #{:audit-app}. Without it, view events are silently dropped (see card-read-oss-no-view-logging-test at line 41 for proof).

Helpers needed: mt/with-premium-features, events/publish-event!, latest-view, mt/with-temporary-setting-values, public-test/with-temp-public-card, client/client

Note: tenant_id exists on core_user but NOT on view_log yet - the migration (Ticket 1) adds it. The view_log event handler will need to read it from the current user and write it to the new column.

Ticket 6: Populate query_execution New Columns

File: test/metabase/query_processor/middleware/process_userland_query_test.clj

Pattern to follow: success-test (line 69). The pattern uses with-query-execution! macro (line 51) which:

  1. Intercepts save-execution-metadata!* via with-redefs
  2. Captures the query-execution map in a promise
  3. Returns a function (qe) that dereferences the result

The test then asserts on specific fields of the captured map, including :is_sandboxed false (line 97).

Category A fields (captured on the Ring thread, passed via :info map):

  • auth_method - from *auth-method* dynamic var
  • tenant_id - from current user
  • GDPR fields (hostname, path, user_agent, ip_address) - from current request, gated by collect-analytics-pii

Category B fields (set on result metadata inside the QP pipeline, same thread as is_sandboxed):

  • is_impersonated - set by impersonation middleware on [:data :is_impersonated]
  • is_db_routed - set by database routing middleware on [:data :is_db_routed]

Test cases for Category A:

  1. Extend success-test assertions to include :auth_method nil, :tenant_id nil (baseline - no auth context in unit test)
  2. query-execution-auth-method-populated-test - Bind *auth-method*, run query via mt/user-http-request, check QE row
  3. query-execution-tenant-id-populated-test - Create user with tenant, run query, check QE row
  4. query-execution-gdpr-fields-gated-by-setting-test - Toggle collect-analytics-pii, run query, verify fields are NULL vs populated

Test cases for Category B: 5. query-execution-is-impersonated-test - Follow is_sandboxed pattern: check (get-in result [:data :is_impersonated]) is picked up and saved. Requires EE test setup with impersonation. 6. query-execution-is-db-routed-test - Same pattern for is_db_routed. Requires EE test setup with database routing.

Important - async save: Query execution metadata is saved asynchronously. The with-query-execution! macro handles this by intercepting save-execution-metadata!* (defined at process_userland_query.clj:43) with with-redefs and using a promise. Integration tests (Category A via HTTP) will need a different approach - either:

  • Use with-redefs on save-execution-metadata!* to capture the map, or
  • Query t2/select-one :model/QueryExecution after the request completes (with a small wait or mt/wait-for)

Helpers needed: with-query-execution! macro, mt/user-http-request, mt/with-temporary-setting-values, t2/select-one

Ticket 7: Analytics Views

File: test/metabase/app_db/schema_migrations_test.clj (for the migration that creates the views) plus a new test/metabase/analytics/analytics_views_test.clj for query-level tests.

Key reference - current view SQL: resources/migrations/instance_analytics_views/. The current views are simple:

  • view_log/v3/ - Only exposes id, timestamp, user_id, entity_type, entity_id, entity_qualified_id. Many existing columns (has_access, context, embedding_client) are NOT yet in the view.
  • query_log/v4/ - Exposes core fields plus transform_id, lens_id, lens_params, joined to query table. Does not yet expose embedding_client, is_sandboxed, etc.
  • users/v4/ - Does not yet have tenant_id.
  • tenants/ - Does not exist yet.

This means Ticket 7 is adding a LOT of new columns to these views, not just tweaking them. The surface CASE expression is entirely new logic.

Test cases:

  1. Migration test: After running migration, verify views v_view_log, v_query_log, v_tenants, v_users exist and return results
  2. v_view_log surface derivation: Insert view_log rows with different embedding_client values, query v_view_log, verify surface CASE expression maps correctly (e.g., "embedding-sdk-react" -> "sdk", "public" -> "public-link", etc.)
  3. v_tenants view: Create tenants with attributes, verify the view exposes them
  4. v_users view: Verify tenant_id column appears in the view

Ticket 8: Audit Event for Tenant CRUD

File: test/metabase/events/audit_log_test.clj or the tenant API test file

Pattern to follow: Existing audit log event tests that check (mt/latest-audit-log-entry ...).

Test cases:

  1. tenant-create-audit-event-test - Create a tenant, verify audit_log entry has :details including :attributes
  2. tenant-update-audit-event-test - Update tenant attributes, verify the audit event captures the change

Summary: Category A vs Category B

This distinction matters for query_execution (Ticket 6) because:

  • Category A fields are values available on the Ring request thread: auth_method, tenant_id, GDPR fields. They're passed into the QP via the :info map on the query, and written to the query_execution row by save-execution-metadata!.
  • Category B fields are set inside the QP middleware pipeline on result metadata: is_sandboxed (existing), is_impersonated (new), is_db_routed (new). They follow the pattern at process_userland_query.clj:116 where (get-in acc [:data :is_sandboxed]) is read from the accumulated result.

The thread boundary means Category A fields must be threaded through :info -> query-execution map, while Category B fields are picked up from the QP result. Tests for each category use different setup patterns:

  • Category A: bind dynamic vars / set request context, then run query
  • Category B: set up EE middleware (sandboxing/impersonation/routing), run query, verify the middleware set the flag on result metadata AND it was persisted

HTTP Contract

Attribute Header/Cookie Value type Currently represented in HTTP requests? Clojure representation Currently stored in DB? table.column(s)
embedding hostname Origin (header) string yes (header exists, not captured) (get-in (metabase.request.current/current-request) [:headers "origin"]) no new: view_log.embedding_hostname, query_execution.embedding_hostname
embedding path Referer (header) string yes (header exists, not captured) (get-in (metabase.request.current/current-request) [:headers "referer"]) no new: view_log.embedding_path, query_execution.embedding_path
user id metabase.SESSION (cookie) / X-Metabase-Session (header) int yes metabase.api.common/*current-user-id* yes view_log.user_id, query_execution.executor_id
tenant_id metabase.SESSION (cookie) / X-Metabase-Session (header) int yes (:tenant_id @metabase.api.common/*current-user*) yes (on core_user), not on analytics tables core_user.tenant_id; new: view_log.tenant_id, query_execution.tenant_id
surface X-Metabase-Client (header) string partial (SDK/iframe only, not public/embed/metabot/agent) metabase.analytics.sdk/*client* partial (as embedding_client) view_log.embedding_client, query_execution.embedding_client; surface derived at read time via CASE in analytics views
authentication method metabase.SESSION (cookie) / X-Metabase-Session (header) / X-Api-Key (header) / ?jwt=<token> (URL param for guest embed) / request URI (for public/embed routes) string yes (headers/cookies exist, not captured as auth_method) new *auth-method* var in request.current; from auth_identity.provider for sessions, "api-key" for API keys, "guest"/"public" for embed/public routes no new: view_log.auth_method, query_execution.auth_method
date n/a (server timestamp) timestamp n/a (t/zoned-date-time) yes view_log.timestamp, query_execution.started_at
sanitized user agent User-Agent (header) string yes (header exists, not captured) (get-in (metabase.request.current/current-request) [:headers "user-agent"]) no new: view_log.sanitized_user_agent, query_execution.sanitized_user_agent
ip address X-Forwarded-For (header) / remote-addr string yes (metabase.request.util/ip-address request) yes (login_history.ip_address only) login_history.ip_address; new: view_log.ip_address, query_execution.ip_address
query filters n/a (query info) text (JSON) n/a (:parameters query) (top-level on the query map) no new: query_execution.parameters
is download? n/a (query context) boolean (derived) n/a query context keyword (:csv-download, :embedded-csv-download, etc.); derive boolean via (str/ends-with? context "download") yes query_execution.context (already distinguishes all download types). No new column - derive in analytics view CASE.
row and column security? n/a (query middleware) boolean n/a set in sandboxing middleware result metadata yes query_execution.is_sandboxed
db routing? n/a (query middleware) boolean n/a set on result metadata by swap-destination-db middleware (Category B - same pattern as is_sandboxed) no new: query_execution.is_db_routed
connection impersonation? n/a (query middleware) boolean n/a set on result metadata by impersonation middleware (Category B - same pattern as is_sandboxed) no new: query_execution.is_impersonated
tenant attributes n/a JSON text n/a (:attributes (t2/select-one :model/Tenant :id tenant-id)) yes tenant.attributes; exposed via new v_tenants view
active tenants n/a n/a n/a (t2/select :model/Tenant :is_active true) yes tenant.is_active; exposed via new v_tenants view
active tenant users n/a n/a n/a join core_user on tenant_id yes core_user.tenant_id + core_user.is_active; queryable via v_users (new tenant_id column) joined to v_tenants
published dashboards/questions n/a n/a n/a already queryable yes v_content.made_public_by_user (non-null = published). No changes needed.
public embeds n/a n/a n/a already queryable yes v_content.is_embedding_enabled. No changes needed.

HTTP routes that generate analytics data

HTTP Verb HTTP Route What it populates Context value
GET /api/card/:id view_log :question
GET /api/dashboard/:id view_log :dashboard
GET /api/collection/:id view_log :collection
POST /api/dataset query_execution :ad-hoc
POST /api/card/:id/query query_execution :question
POST /api/dashboard/:id/dashcard/:id/card/:id/query query_execution :dashboard
GET /api/public/card/:uuid/query query_execution :public-question
GET /api/public/dashboard/:uuid/... query_execution :public-dashboard
GET /api/embed/card/:token/query query_execution :embedded-question
GET /api/embed/dashboard/:token/... query_execution :embedded-dashboard
POST /api/card/:id/query/:export-format query_execution :csv-download / :xlsx-download / :json-download
POST /api/embed/card/:token/query/:export-format query_execution :embedded-csv-download / etc.
POST /api/public/card/:uuid/query/:export-format query_execution :public-csv-download / etc.

Open Questions (from tech doc)

  1. User-Agent sanitization: Current decision is to store raw and display safely. No stripping needed unless injection risk is identified.
  2. embedding_path truncation: Strip query params first, then truncate to 2048. Implementation detail for Ticket 5/6.
  3. Unknown X-Metabase-Client values: Pass through as-is in the surface CASE expression (the ELSE clause).

Settled Decisions

  • Auth identity reliability: core_session.auth_identity_id reliably points to the auth identity used to create that session (set at creation time in create-session-with-auth-tracking!). Safe to LEFT JOIN for auth_method.
  • Tenant context for unauthenticated requests: NULL tenant_id for public/embed requests is fine.
  • No backfill: Don't backfill auth_method or tenant_id for historical rows.
  • OIDC provider granularity: No need to distinguish between multiple OIDC providers. Raw "oidc" string is sufficient.
  • Index strategy: Out of scope for this project. Only the planned tenant_id indexes.

Settled Decisions (from Slack 2026-03-30)

Preview handling (EMB-1503)

Decision: Keep the -preview suffix on embedding_client values.

Context: The embedding-mw appends -preview to the client value when X-Metabase-Embedded-Preview: true is set (e.g., embedding-sdk-react becomes embedding-sdk-react-preview). The question was whether preview requests should be grouped with their non-preview counterpart or kept distinct.

Resolution (Poom + Nicolo):

  • Preview requests must be distinguished from production usage for auditing - admins can query sensitive data via the embed preview wizard, and this should appear in audit trails (Nicolo)
  • Usage analytics views (Ticket 7) should de-emphasize preview but still surface it - the main use case is admins viewing embedded user behavior, preview is the edge case (Poom)
  • Preview requests should NOT be ignored/skipped entirely because of the auditing concern

Implementation: Two TODOs in sdk.clj resolved with comments documenting this decision. No behavioral changes - the -preview suffix was already being appended.

Tracer Bullet Strategy

Approach

Rather than building all inputs, then all DB writes, then all views in sequence, we wire each value end-to-end (request -> DB table -> analytics view) to prove the pipeline works, then repeat for each additional value. This catches integration issues early.

Value Matrix

Values dictated by 1 input:

Value Input Source DB Column(s) View Status
embedding_client Route prefix OR X-Metabase-Client header view_log + query_execution v_query_log (needs surface CASE) Wired - middleware binds, tests verify DB values
embedding_version X-Metabase-Client-Version header view_log + query_execution v_query_log Wired
preview suffix X-Metabase-Embedded-Preview header appended to embedding_client same as above Wired - tested
tenant_id current-user tenant view_log + query_execution needs view col Migration done, not yet populated
embedding_hostname Origin header view_log + query_execution needs view col Extraction fn done (pii-request-info), not wired to writes
embedding_path Referer header (path only) view_log + query_execution needs view col Extraction fn done, not wired to writes
sanitized_user_agent User-Agent header view_log + query_execution needs view col Extraction fn done, not wired to writes
ip_address Request remote addr view_log + query_execution needs view col Extraction fn done, not wired to writes
is_impersonated Connection impersonation flag (Category B - QP result metadata) query_execution only needs view col Migration done, not yet populated
is_db_routed DB routing flag (Category B - QP result metadata) query_execution only needs view col Migration done, not yet populated
parameters Query filter values (JSON) query_execution only needs view col Migration done, not yet populated

Values dictated by multiple inputs:

Value Input Sources DB Column Status
auth_method Session type (auth_identity.provider) + route (public, guest-embed) + API key header view_log + query_execution Dynamic var + include-sdk-info done, not yet bound in auth middleware

Auth Method Wiring (from permissions-expert analysis)

Where auth method is already known: At session creation time via auth_identity.provider. The core_session table has auth_identity_id FK to auth_identity, which has a provider column.

Recommended binding approach: Derive at request time by LEFT JOINing auth_identity in session-with-id-query (session.clj:99-140). This adds provider to every session lookup with zero extra queries (indexed FK lookup on a single row).

  • Modify session-with-id-query to add [:auth_identity.provider :auth-method] to SELECT
  • For API keys: hardcode "api-key" in current-user-info-for-api-key
  • For public/embed routes: bind based on route prefix in embedding-mw alongside *client*
  • Bind *auth-method* in bind-current-user from the request map

PII + Tenant Wiring (from platform-expert analysis)

view_log write path: Request -> event published -> view_log/events/view_log.clj:generate-view -> record-views! -> t2/insert! -> before-insert hook calls analytics/include-sdk-info

query_execution write path: process-userland-query-middleware -> save-execution-metadata! -> include-sdk-info called at line 56 (BEFORE async dispatch) -> async agent thread does t2/insert-returning-pk!

tenant_id: Available via (:tenant_id @api/*current-user*). No *current-tenant* var. Must capture before async boundary.

PII fields: Best bound in embedding-mw alongside *client* and *version*. No need for separate middleware - same request, same tables, same dynamic-var-to-DB-write pattern. PII columns are named embedding_* so they're embedding-scoped.

is_impersonated / is_db_routed: Category B - set inside QP pipeline on result metadata. Must follow is_sandboxed pattern. perms/impersonated-user? exists for impersonation detection. is_db_routed has no existing flag - needs to be surfaced from routing middleware.

Analytics Views Upgrade (from enterprise-expert analysis)

View upgrade process: Create new version directory under resources/migrations/instance_analytics_views/<view_name>/v<N>/ with postgres/mysql/h2 SQL files. Add migration changeset to 060/ YAML. Rollback section points to previous version's SQL files.

Surface CASE expression: Goes directly in the SQL view definition (no separate derived view). Consistent with how views already do column renaming, type casting, and coalescing. No precedent for layered views.

PII in views: Include columns in views, gate collection at write time (Option A). SQL views can't conditionally include columns based on runtime settings. The pii-request-info function already handles the extraction; the GDPR setting gates whether it's called.

Two separate analytics systems:

  • internal_stats/query_executions.clj feeds Snowplow/phone-home telemetry (aggregate counts sent to Metabase HQ)
  • Analytics views (v_query_log, v_view_log) feed instance-level usage dashboards for admins

These should use consistent classification logic but serve different purposes. The query_executions.clj CASE expressions may also need updating for newer client values (public, guest-embed, metabot, agent-api).

Auth Method Testing Feasibility

Auth Method Test Infrastructure Feasibility
password mt/user-http-request - standard test helper Easy
API key Tests in api_keys/api_test.clj, header-based Easy
LDAP with-ldap-server! macro - in-memory LDAP server Feasible
SAML with-saml-default-setup! macro, test cert at test_resources/sso/auth0-public-idp.cert Feasible - mature infra
JWT with-jwt-default-setup! macro (random secret per run) Feasible - mature infra
Google Google OAuth test setup exists Feasible
OIDC Enterprise OIDC test setup Feasible
public Existing view_log tests verify embedding_client "public" in DB Easy
guest-embed Existing embed test helpers Easy

Testing approach: Use existing SSO test infra to perform login, make subsequent request with session, assert auth_method written to view_log/query_execution (or captured via test helper).

Recommended First Tracer Bullets

  1. auth_method for password + API key - bind *auth-method* in session middleware, verify it lands in view_log/query_execution. Easiest to test, proves the full pipeline.
  2. PII fields - wire pii-request-info into view_log/query_execution write path (EMB-1504/1505), GDPR-gated
  3. tenant_id - bind from current user context, verify in both tables

Open Questions (for team discussion)

  1. Should parameters be GDPR-gated? Filter parameter values can contain user-identifiable information (e.g., filtering by email, customer name). Should the parameters column on query_execution also be gated by the collect-analytics-pii setting? Or are parameter values considered non-PII?

  2. Embedding client precedence. RESOLVED: The endpoint's server-side binding wins (unconditional with-client!). The endpoint determines which surface the request is actually served through, regardless of what client header was sent.

  3. Parameters column redundancy. The full query (including parameters) is already stored as json_query on query_execution. Storing parameters separately adds queryability but also doubles that data. Worth it?

  4. NULL auth_method semantics. When auth_method is NULL (old sessions before auth_identity migration, background jobs, pulse rendering), should analytics treat this as "unknown" or "internal"? Should we explicitly bind "internal" for background jobs?

  5. as-admin escalation. RESOLVED: Safe. as-admin calls do-with-current-user which only rebinds the 7 vars it explicitly names (*current-user-id*, *is-superuser?*, etc.). A separately-bound *auth-method* (in middleware, outside do-with-current-user) is NOT touched by as-admin - Clojure binding only shadows explicitly named vars. The original auth method is correctly retained.

Review Feedback (from expert subagents, 2026-03-24)

Changes to adopt

  1. Setting name: Use embedding-analytics-collect-pii (not embedded-) to match existing embedding-* prefix convention in embedding/settings.clj.

OUTCOME For this: we actually should just call it collect-analytics-pii since it is not tied to embedding. see:

[3:16 PM]Bryan  Question about setting name PII: Our product/tech docs have it called embedded-analytics-collect-pii.

1. embedding-analytics-collect-pii (not embedded-) better matches the existing embedding-* prefix convention in embedding/settings.clj

but the bigger thing is: is PII collection actually embedding specific? Should we capture that info for non-embedding requests too? if so we will change the name even more
[7:35 AM]roman  I think this isn’t embedding-specific
  1. Feature gate: Add :feature :embedding from the start. OVERRIDDEN: No feature gate. Setting is not embedding-specific (per Roman), available to all instances.

  2. Auth method values - use raw provider strings: Don't map "password" -> "cookie". Use the raw auth_identity.provider value directly. This gives us "password", "google", "ldap", "oidc", "custom-oidc", "slack-connect", "support-access-grant" for free without a mapping table. The provider describes how the user proved their identity - the transport mechanism (cookie vs header) is orthogonal.

  3. embedding_client binding precedence: Use (or *client* "public") / (or *client* "guest-embed") instead of unconditional rebinding. OVERRIDDEN: Use unconditional with-client! binding. The endpoint's server-side value wins on conflicts - the endpoint determines which surface the request is served through.

  4. Guest/Public binding location: Create thin wrapper functions in routes.clj (e.g., with-analytics-bindings) rather than modifying security-oriented wrappers (+public-exceptions, +message-only-exceptions). Keeps analytics concerns separate from security concerns.

  5. Drop active_user_count from v_tenants: It breaks the pattern of analytics views exposing raw data. The join can be done in the dashboard layer using the tenant_id being added to v_users. No other analytics view computes aggregates inline.

  6. Use ${text.type} not text: The YAML should use ${text.type} for the parameters column (maps to TEXT/LONGTEXT/CLOB per database). Same pattern as lens_params.

  7. View rollback correctness: v_view_log rollback is tricky - current version is v2 for postgres/h2 but v3 for mysql. Rollbacks must reference the correct per-database previous version.

Confirmed safe (no changes needed)

  • Session query JOIN performance: Adding LEFT JOIN to auth_identity is negligible - PK lookup on one row, both sides indexed.
  • Impersonation detection: (boolean (:impersonation/role query)) is reliable - key only set when impersonation genuinely applied.
  • Database routing detection: (boolean destination-database-id) is reliable. CORRECTED: is_db_routed has the same timing issue as is_impersonated - must use result-metadata pattern. See Critical Issue #3 above.
  • Tenant table on OSS: Table exists on all instances (not EE-gated). v_tenants view will work everywhere (zero rows on OSS).
  • (:tenant_id @api/*current-user*) without defenterprise: Works fine - column is in default-user-columns, naturally NULL on OSS.
  • Qualified ID format: 'tenant_' || tenant_id is consistent with all other views. tenant_collection_id is NOT NULL so no nullable concat concern.
  • v_tenants in audit-db-view-names: Safe to add even before view is synced - permission checks simply ignore unknown tables.
  • Scheduled tasks (pulses): *auth-method* will be nil/NULL for pulse queries. Combined with context = 'pulse', this is unambiguous and correct.
  • Auth method in analytics views: Low security risk - only visible to admins who already have access to login history and user details.

Critical issues found (2026-03-25 review)

  1. CRITICAL: Two thread boundaries lose dynamic vars for query_execution. The streaming response body runs on a different thread (streaming thread pool) from the Ring request thread, and save-execution-metadata! then submits work to yet another thread (pooled executor). Dynamic vars like *current-user*, *auth-method*, and (current-request) are NOT available on either thread.

    Fix: All new column values must be captured as data in the query :info map BEFORE entering the streaming response - following the existing executed-by pattern. Do NOT read dynamic vars in query-execution-info. Instead:

    • Best injection point: qp/userland-query in src/metabase/query_processor.clj (lines 101-108) - the single centralized function ALL userland query entry points call
    • Add a helper embedding-analytics-info that reads dynamic vars on the Ring thread and returns a map of :auth-method, :tenant_id, and GDPR fields
    • Merge this into :info inside userland-query
    • Read them from (:info query) in query-execution-info
    • Also add these keys to the ::lib.schema.info/info Malli schema in src/metabase/lib/schema/info.cljc
  2. CRITICAL: is_impersonated will always be false if set in query-execution-info. The :impersonation/role key is set by EE impersonation middleware which runs INSIDE the QP pipeline, AFTER process-userland-query-middleware has already captured query-execution-info.

    Fix: Follow the is_sandboxed pattern - have the impersonation middleware set the value on result metadata, then extract it in add-and-save-execution-metadata-xform! alongside is_sandboxed.

  3. CRITICAL: is_db_routed has the same timing issue as is_impersonated. :destination-database/id is set by attach-destination-db-middleware during preprocessing, which runs INSIDE the QP pipeline, AFTER query-execution-info captures data. The original report incorrectly stated this was set before the pipeline.

    Fix: Same as is_impersonated - set :is_db_routed on result metadata in the swap-destination-db execution middleware (database_routing/middleware.clj), then extract in add-and-save-execution-metadata-xform!.

Risks to be aware of

  • GDPR data retention: Once PII is collected and the setting is later disabled, historical PII remains visible through the views. The setting only controls future writes.

OUTCOME: Inline columns, purge deferred. All PII columns are nullable, so future purge = batched UPDATE SET column = NULL. Costly on large tables (tens of millions of rows) but acceptable since PII collection is off by default this phase. Existing PII precedent: login_history stores ip_address and device_description inline with no cleanup. Separate FK'd PII tables (purge = TRUNCATE, sub-second) were considered but deferred. Slack consensus: setting stays off this phase; Roman says don't paint into a corner; Ben confirms nullable columns = no technical limitation to future deletion.

Slack thread for reference:

[11:11 AM]Bryan  Something we should be aware of for GDPR: Once PII is collected and the setting is later disabled, historical PII will remain visible through the views. The setting only controls future writes. Is that something we should handle  as part of this project?
[11:15 AM]sameer  good point. I Think flipping the bit should be considered a pretty serious action.
For actual compliance, we'd need to re-write history and scrub PII[11:46 AM]roman  How difficult is it to delete this data when toggling off? Straightforward, I guess, if so let's do it for the first iteration 
[11:48 AM]sameer  depends on how many views there are
[4:39 PM]Ben Grabow  My understanding is that for this phase of the project we're building under the assumption that the setting will be turned off (by default off and remains off). At some later point we'll start another project/phase of project where we start turning the setting on, and we'll need to tackle the GDPR issues then. Not saying we shouldn't think about them but pragmatically speaking I don't think we need to address these concerns in the branch we're working on this week. (edited) 
[7:04 AM]roman  @Ben Grabow here it feels more important not to rush into building something, but to make sure we’re not doing anything that would make it hard to remove this data later. So if we have a plan and a clear understanding of what to do, that’s fine too.
So ideally we’d want to answer the hardest questions first, rather than getting stuck on them at the end[8:35 AM]Ben Grabow  I agree, we shouldn't paint ourselves into a corner and we should make sure that historical PII can be removed. I just wanted to clarify the scope of this phase of the project.

When we get to the phase when we are building a way to remove historical PII, it seems that we know which columns will contain PII, we will make those columns nullable, so in principle there is no technical limitation to deleting the historical PII.

There may be some UX and performance questions to answer. If we put the PII columns in their own table FK'd to the main tables, then deleting the PII would be as simple as dropping the table. Do we have precedent for storing PII elsewhere in the system or is this the first time?
  • Legacy sessions: auth_identity_id can be NULL on core_session for pre-migration sessions. LEFT JOIN handles this (returns NULL provider). Analytics should treat NULL auth_method from session requests as "unknown/legacy".
  • Index creation on large tables: CREATE INDEX takes a SHARE lock on Postgres which blocks writes. For very large view_log/query_execution tables in production, consider CREATE INDEX CONCURRENTLY (requires raw SQL changeset for Postgres, standard createIndex for MySQL/H2).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment