Branch: enhance-usage-analytics-for-embedding
Date: 2026-03-24
Stakeholders: Roman (spec owner), Vamsi (PM), John Swanson (tech design), Ben Grabow (implementation), Alessio Laiso (hackathon demo/mockup), Eliot (optional)
Tech doc: Linear doc
Metabase's embedding customers lack visibility into how their embedded content is being used. The view_log and query_execution tables already capture some embedding context (embedding_client, embedding_version since v51), but don't track auth method, tenant, GDPR-sensitive request metadata, or security enforcement flags.
This project extends both tables with richer context columns, creates/updates analytics views to expose them, and adds a GDPR toggle setting. The goal is to power an "Embedded Analytics" dashboard (demoed at hackathon by Alessio) showing embedding views, usage trends, tenant activity, and content popularity.
Collaboration model: This is a pairing project between Bryan and Ben Grabow. The embedding team is frontend-only, so backend help is needed (similar to past SDK admin settings work). Vamsi framed this as a full SDLC exercise - not just writing code, but the entire lifecycle: how to start, track, announce, provide updates in the project channel, ask people to test, etc.
Ben's personal goal: Wants to use this project to dive deep and understand things in detail rather than surface-level. Vamsi acknowledged this and noted the project is straightforward enough to be a good container for that.
Scope boundary: Our deliverable is the server-side work - migrations, populating the new columns, updating analytics views. Once our PR merges to master, a separate PR (done by the embedding/frontend team) will enhance the UI dashboards to consume this new data. We don't need to build the dashboards.
Timeline: Vamsi's rough expectation is 3-5 days. Reality is reality - communicate if it takes longer.
Testing: Should test with actual embedding scenarios (SDK, iframe, public links) and verify context is being captured into the database. Loom video updates encouraged in the project Slack channel.
Usage analytics system: The analytics views system was recently overhauled (by Bronza/someone). There's a mesh command that creates a local database and pulls YAML from resources, making it easy to change views and serialize back. Vamsi will share the relevant doc in the project Slack channel.
Data flow: Need to get HTTP header data (embedding client, user agent, IP, etc.) from request context and propagate it to where view_log entries are created. Bryan noted there's already infrastructure for this (metabase.analytics.sdk middleware, (metabase.request.current/current-request)).
- Surface derivation priority: Endpoint-based surface wins over X-Metabase-Client header. If no endpoint match, check header. If no header, default to "internal". (Roman confirmed explicitly.)
- Embedding client expansion: Currently only 3 values exist (
embedding-sdk-react,embedding-simple,embedding-iframe). Embedding team will addembedding-iframe-full-app,embedding-iframe-static,embedding-iframe-public, andembedding-mcpas new client-side header values. Backend should handle unknown values as pass-through. - Auth method: Roman described the need to track how sessions were generated (JWT, SAML, etc.). Confirmed that
auth_identity.provider(which already exists) provides this mapping. May be cut from first PR if complex; NULL auth_method is acceptable as a starting point. - is_download: Already derivable from
query_execution.context(ends-with "download"). No new column needed. - Active tenants/users: Roman prefers views with actual rows (not just counts) so users can build their own count queries.
- Tenant audit event: Roman demoed that tenant creation has empty details in audit log. Needs fix (Ticket 8).
- Scope: Our deliverable is server-side only (migrations, data population, views). Embedding team does the dashboards/UX separately.
- view_log: Records content views (dashboards, questions, collections). Has
embedding_clientandembedding_versionbut not exposed in thev_view_loganalytics view. Hashas_accessandcontextcolumns also unexposed. - query_execution: Records query runs. Has
embedding_client,embedding_version,is_sandboxed, andcontext(which already distinguishes download types). Also not fully exposed inv_query_log. - SDK middleware (
metabase.analytics.sdk): Binds*client*and*version*fromX-Metabase-Client/X-Metabase-Client-Versionheaders. Currently only covers SDK/iframe clients - doesn't cover public, guest-embed, metabot, or agent-api surfaces. - Session middleware (
metabase.server.middleware.session): Resolves current user from session or API key. Joins toauth_identitytable (which has aprovidercolumn with values like "password", "jwt", "saml", "google", "ldap", "oidc"). Currently does NOT select the provider during session resolution. - Tenant model: EE-only.
core_user.tenant_idexists and is loaded via@api/*current-user*. Tenant table has: id, name, slug, is_active, attributes, created_at, updated_at, tenant_collection_id.
| Area | Change | Tables |
|---|---|---|
| Auth tracking | auth_method varchar(32) |
view_log, query_execution |
| Tenant tracking | tenant_id integer |
view_log, query_execution |
| Security flags | is_impersonated, is_db_routed booleans |
query_execution only |
| Query parameters | parameters text (JSON) |
query_execution only |
| GDPR-gated | embedding_hostname, embedding_path, sanitized_user_agent, ip_address |
view_log, query_execution |
| Surface derivation | Computed from embedding_client at read time via CASE |
analytics views only |
| Admin setting | collect-analytics-pii boolean (default false, no feature gate) |
settings table |
| New view | v_tenants |
analytics views |
| View updates | v_view_log v4, v_query_log v5, v_users v5 |
analytics views |
- Extend existing tables rather than new tables - simpler, no joins needed
- Both tables get context columns because downloads/API calls only hit query_execution (no view_log entry)
- Surface is derived at read time from
embedding_clientvia CASE expression - if mapping evolves, historical data updates automatically - Auth method determined at request time and bound as dynamic var - avoids per-event DB queries for stateless auth
- GDPR columns always exist but only populated when admin setting enabled
- Analytics view SQL lives in
resources/migrations/instance_analytics_views/- one subdirectory per view, versioned (v1/,v2/, ...), with per-database SQL files (postgres, mysql, h2). MySQL views needSQL SECURITY INVOKERand useconcat()instead of||.
Description: Add new nullable columns to view_log and query_execution tables, plus indexes on tenant_id.
Files to create/modify:
resources/migrations/060/<timestamp>_embedding_analytics.yaml- new migration file
Columns for view_log: auth_method (varchar 32), tenant_id (int), embedding_hostname (varchar 512), embedding_path (varchar 2048), sanitized_user_agent (varchar 512), ip_address (varchar 45)
Columns for query_execution: auth_method (varchar 32), tenant_id (int), is_impersonated (boolean), is_db_routed (boolean), parameters (text), embedding_hostname (varchar 512), embedding_path (varchar 2048), sanitized_user_agent (varchar 512), ip_address (varchar 45)
Indexes: idx_view_log_tenant_id, idx_query_execution_tenant_id
Notes:
- Follow existing pattern of one changeset per column per table (see v51.2024-07-22 for embedding_client/version precedent)
- Adding nullable columns is metadata-only on Postgres/MySQL - no table rewrite
- Use
${boolean.type}for booleans (maps tobit(1)on MySQL) - Latest v60 timestamp is ~2026-03-17, so use 2026-03-24+
Description: Add a boolean admin setting that gates collection of GDPR-sensitive fields.
Files to modify:
src/metabase/embedding/settings.clj- add defsetting (or a more general namespace, since this is not embedding-specific)
Implementation:
(defsetting collect-analytics-pii
(deferred-tru "Controls whether analytics collects personally identifiable information (hostname, path, user agent, IP).")
:type :boolean
:default false
:visibility :admin
:export? true)Notes: Per Roman, this is NOT embedding-specific - PII collection applies to all surfaces. No :feature gate (available to all instances). Default false means GDPR columns won't be populated until an admin explicitly enables it. Reading a feature-gated setting on non-EE instances returns the default safely (no exception), but since there's no gate here, this is moot.
Description: Determine auth_method during request processing and make it available as a dynamic var for view_log/query_execution population.
Files to modify:
src/metabase/request/current.clj- add*auth-method*dynamic var (NOTapi/common.cljwhich has TODOs to migrate vars away;request.currentis the intended home for new request-scoped vars)src/metabase/server/middleware/session.clj- add auth_identity.provider to session query; set auth-method inmerge-current-user-infosrc/metabase/request/session.clj- bind*auth-method*in middleware layer OUTSIDEdo-with-current-user(soas-admindoesn't touch it - Clojurebindingonly shadows explicitly named vars)
Auth method values and where each is detected:
All session-based methods flow through auth_identity.provider via the LEFT JOIN in session-with-id-query. The session is created with an auth_identity_id FK during the login flow for each provider.
| Value | How user authenticates | Where session/identity is created | Where detected at request time |
|---|---|---|---|
password |
POST /api/session with email+password |
auth-identity.provider/login! :provider/password -> auth-identity.session/create-session-with-auth-tracking! |
LEFT JOIN auth_identity.provider in session-with-id-query |
jwt |
GET /auth/sso?jwt=<token> |
metabase-enterprise.sso.integrations.jwt/sso-get -> login! :provider/jwt -> create-session-with-auth-tracking! |
LEFT JOIN auth_identity.provider |
saml |
GET /auth/sso (redirect) -> POST /auth/sso (IdP callback) |
metabase-enterprise.sso.integrations.saml/sso-post -> login! :provider/saml -> create-session-with-auth-tracking! |
LEFT JOIN auth_identity.provider |
google |
POST /auth/sso with Google OAuth token |
metabase.sso.providers.google -> login! :provider/google -> create-session-with-auth-tracking! |
LEFT JOIN auth_identity.provider |
ldap |
POST /api/session (LDAP enabled, checked before password) |
session.api/login -> login! :provider/ldap -> create-session-with-auth-tracking! |
LEFT JOIN auth_identity.provider |
oidc |
Enterprise OIDC flow | login! :provider/oidc -> create-session-with-auth-tracking! |
LEFT JOIN auth_identity.provider |
custom-oidc |
Enterprise custom OIDC | login! :provider/custom-oidc -> create-session-with-auth-tracking! |
LEFT JOIN auth_identity.provider |
slack-connect |
Slack Connect auth | login! :provider/slack-connect -> create-session-with-auth-tracking! |
LEFT JOIN auth_identity.provider |
support-access-grant |
Support access grant flow | login! :provider/support-access-grant -> create-session-with-auth-tracking! |
LEFT JOIN auth_identity.provider |
api-key |
X-Api-Key header on any request |
No session created. Key created via advanced-config.file.api-keys/init-from-config-file! or API. Dedicated :type :api-key user. |
Hardcoded in current-user-info-for-api-key -> (assoc :auth-method "api-key") |
guest |
/api/embed/:token (signed JWT in URL) |
No session. Stateless - JWT verified per request by metabase.embedding-rest.api.embed. No user identity established. |
TODO: Bind *auth-method* in embed route middleware (Ticket 4 scope) |
public |
/api/public/:uuid (public UUID in URL) |
No session. Stateless - UUID verified per request by metabase.public-sharing-rest.api. No user identity. |
TODO: Bind *auth-method* in public route middleware (Ticket 4 scope) |
| NULL | Legacy sessions (pre-auth_identity migration), pulse/background jobs | Pre-existing core_session rows without auth_identity_id |
LEFT JOIN returns NULL (correct) |
Request-time flow for session-based methods:
- User authenticates via one of the provider login flows above
auth-identity.session/create-session-with-auth-tracking!createscore_sessionrow withauth_identity_idFK- Session key returned as cookie (
metabase.SESSION) orX-Metabase-Sessionheader - On subsequent requests:
wrap-session-keyextracts session key ->wrap-current-user-inforunssession-with-id-querywhich LEFT JOINsauth_identityto get provider -> merged onto request as:auth-method bind-current-userwraps handler in(sdk/with-auth-method! (:auth-method request) ...)to bind the dynamic var
Request-time flow for API keys:
wrap-current-user-info->current-user-info-for-api-keymatches key prefix, verifies bcrypt hash- Returns
(assoc :auth-method "api-key")- noauth_identitytable involved - Same
bind-current-userbinding as above
Request-time flow for embed/public (TODO - not yet implemented):
- These routes don't go through session resolution at all
- Need server-side
(sdk/with-auth-method! "guest" ...)or"public"binding in route middleware - Will be done alongside Ticket 4's
+embedding-clientroute wrapper
Implementation approach:
- Add LEFT JOIN to
auth_identityin thesession-with-id-query(compiled, memoized query at line ~99-140 of session.clj). This adds provider to every session lookup with zero extra queries. - For API keys: return
{:auth-method "api-key" ...}fromcurrent-user-info-for-api-key - For guest/public: these don't go through session resolution. Bind
*auth-method*in the respective API route handlers or wrapping middleware (Ticket 4).
Critical file references:
- Session query:
src/metabase/server/middleware/session.cljlines 99-140 merge-current-user-info: same file lines 235-243do-with-current-user:src/metabase/request/session.cljlines 38-61- Current user vars:
src/metabase/api/common.cljlines 104-131
Description: Set embedding_client server-side for surfaces that don't send the X-Metabase-Client header (public, guest-embed, metabot, agent-api).
Files to modify:
src/metabase/public_sharing_rest/api.clj(or middleware) - bindanalytics.sdk/*client*to"public"src/metabase/embedding_rest/api/embed.clj(or middleware) - bind*client*to"guest-embed"src/metabase/metabot/api.clj(or middleware) - bind*client*to"metabot"src/metabase/agent_api/api.clj(or middleware) - bind*client*to"agent-api"
Implementation: Each uses an unconditional with-client! binding. On conflicts (e.g., SDK iframe hitting a public route), the endpoint's server-side value wins - the endpoint determines which surface the request is actually being served through. Create a thin middleware wrapper in api_routes/routes.clj:
(defn- +embedding-client [client handler]
(fn [request respond raise]
(sdk/with-client! [client]
(handler request respond raise))))Applied to routes:
"/public" (+public-exceptions (+embedding-client "public" metabase.public-sharing-rest.api/routes))
"/embed" (+message-only-exceptions (+embedding-client "guest-embed" metabase.embedding-rest.api/embedding-routes))
"/metabot" (+embedding-client "metabot" metabase.metabot.api/routes)
"/agent" (+embedding-client "agent-api" metabase.agent-api.api/routes)Surface mapping (computed at read time in analytics views):
| embedding_client value | surface | source |
|---|---|---|
| NULL | internal | no header, no server binding |
| embedding-sdk-react | sdk | SDK header |
| embedding-sdk-react-preview | sdk-preview | SDK + X-Metabase-Embedded-Preview header |
| embedding-simple | modular | Simple/modular embedding header |
| embedding-simple-preview | modular-preview | Simple + preview header |
| embedding-iframe | iframe | iframe detection header |
| embedding-iframe-preview | iframe-preview | iframe + preview header |
| public | public | server-side binding (Ticket 4) |
| guest-embed | guest | server-side binding (Ticket 4) |
| metabot | metabot | server-side binding (Ticket 4) |
| agent-api | agent-api | server-side binding (Ticket 4) |
| (anything else) | pass through raw value | future-proofing |
Description: When creating view_log entries, include auth_method, tenant_id, and GDPR-gated fields.
Files to modify:
src/metabase/view_log/events/view_log.clj- modifygenerate-viewto include new fieldssrc/metabase/view_log/models/view_log.clj- update before-insert if needed
Implementation:
auth_method: fromrequest.current/*auth-method*dynamic var (Ticket 3)tenant_id: from(:tenant_id @api/*current-user*)- already available- GDPR fields: check
(collect-analytics-pii)setting, if true extract from(metabase.request.current/current-request):embedding_hostname: Origin header, extract hostnameembedding_path: Referer header, extract path (strip query params, truncate to 2048)sanitized_user_agent: User-Agent header (store raw)ip_address:request.util/ip-address
Description: When creating query_execution entries, include auth_method, tenant_id, security flags, parameters, and GDPR-gated fields.
Files to modify:
src/metabase/query_processor/middleware/process_userland_query.clj- modifyquery-execution-info(line 125-159) andsave-successful-execution-metadata!(line 72-78)src/metabase/queries/models/query_execution.clj- update before-insert if needed
Implementation (two categories of fields based on when data is available):
Category A - Captured in query :info map via qp/userland-query (Ring thread, before async boundary):
auth_method: from*auth-method*dynamic var (bound inrequest.current)tenant_id: from(:tenant_id @api/*current-user*)parameters: JSON-encode(:parameters query)(top-level filter values)- GDPR fields: from
(request.current/current-request)headers, gated by(collect-analytics-pii)setting
Best injection point: qp/userland-query in src/metabase/query_processor.clj (lines 101-108) - the single centralized function ALL userland query entry points call. Add a helper that reads dynamic vars and merges into :info:
(defn userland-query [query info]
(-> query
(assoc-in [:middleware :userland-query?] true)
(update :info merge info (embedding-analytics-info))))Category B - Set on result metadata (inside QP pipeline, extracted in add-and-save-execution-metadata-xform!):
is_sandboxed: already implemented via this patternis_impersonated::impersonation/roleis set by EE impersonation middleware which runs INSIDE the QP pipeline, AFTERquery-execution-infocaptures data. Must followis_sandboxedpattern - have impersonation middleware set value on result metadata.is_db_routed::destination-database/idis set byattach-destination-db-middlewareduring preprocessing, also INSIDE the QP pipeline. Same timing issue asis_impersonated. Must followis_sandboxedpattern - have routing middleware set value on result metadata viaswap-destination-dbexecution middleware indatabase_routing/middleware.clj.
Why Category B can't use query-time capture: query-execution-info runs in the around middleware BEFORE the inner qp call that triggers preprocessing. Both :impersonation/role and :destination-database/id are set during preprocessing, so they're not yet on the query when query-execution-info reads it.
Failed queries: For failed queries, save-failed-query-execution! saves the execution-info map built before the query ran. Category A fields will be present. Category B fields (is_sandboxed, is_impersonated, is_db_routed) will naturally be absent since the query never produced results. This matches existing is_sandboxed behavior.
Description: Update analytics views to expose new columns and derived surface field.
Files to create:
resources/migrations/instance_analytics_views/view_log/v4/postgres-view_log.sql(+ mysql, h2)resources/migrations/instance_analytics_views/query_log/v5/postgres-query_log.sql(+ mysql, h2)resources/migrations/instance_analytics_views/tenants/v1/postgres-tenants.sql(+ mysql, h2)resources/migrations/instance_analytics_views/users/v5/postgres-users.sql(+ mysql, h2)
Migration file: Add changesets to resources/migrations/060/<timestamp>_analytics_views.yaml
v_view_log v4 additions: has_access, context, embedding_client, embedding_version, surface (CASE), auth_method, tenant_id, embedding_hostname, embedding_path, sanitized_user_agent, ip_address
v_query_log v5 additions: auth_method, tenant_id, is_sandboxed, is_impersonated, is_db_routed, parameters, embedding_client, embedding_version, surface (CASE), embedding_hostname, embedding_path, sanitized_user_agent, ip_address
v_tenants v1: New view exposing tenant_id, name, slug, is_active, attributes, created_at, updated_at, tenant_collection_id (qualified). No active_user_count - breaks pattern of analytics views exposing raw data. Aggregate can be done in dashboard layer via JOIN to v_users (which gets tenant_id in v5).
v_users v5: Add tenant_id and tenant_qualified_id columns
Notes:
- MySQL views need
SQL SECURITY INVOKER - MySQL uses
concat()instead of||for string concatenation - Must register
v_tenantsinaudit-db-view-namesset inenterprise/backend/src/metabase_enterprise/audit_app/permissions.clj - Use
${boolean.type}(not bareboolean) for boolean columns in YAML - MySQL needsbit(1) - Use
${text.type}for the parameters column (maps to TEXT/LONGTEXT/CLOB per database)
Rollback targets (view versions diverge between databases due to MySQL-only v3 fixes):
| New Version | PostgreSQL rollback | MySQL rollback | H2 rollback |
|---|---|---|---|
| v_view_log v4 | v2 | v3 | v2 |
| v_query_log v5 | v4 | v4 | v4 |
| v_users v5 | v3 | v4 | v3 |
| v_tenants v1 | DROP VIEW IF EXISTS v_tenants | DROP VIEW IF EXISTS v_tenants | DROP VIEW IF EXISTS v_tenants |
Description: Include tenant name and attributes in the audit_log details when a tenant is created.
Files to modify:
enterprise/backend/src/metabase_enterprise/tenants/model.clj- updatemodel-detailsfor:model/Tenantto include:attributes
Current code (line 17-19):
(defmethod audit-app/model-details :model/Tenant
[entity _event-type]
(select-keys entity [:name :slug :is_active]))Change: Add :attributes to the select-keys vector. The tenant object passed to the event already contains attributes (via present-tenant), so this is a one-line change.
| # | Linear | Description | Status | Dependencies |
|---|---|---|---|---|
| 1 | EMB-1500 | Database migrations | In Review | none |
| 2 | EMB-1501 | Admin setting: analytics-pii-retension-enabled | In Review | none |
| 3 | EMB-1502 | Auth method detection and dynamic var | Paused | none |
| 4 | EMB-1503 | Expand embedding_client vocabulary | In Review | none |
| 5 | EMB-1504 | Populate new columns on view_log | Todo | EMB-1500, 1501, 1502, 1503, 1512 |
| 6 | EMB-1505 | Populate new columns on query_execution | Todo | EMB-1500, 1501, 1502, 1503, 1512 |
| 7 | EMB-1506 | Analytics views: v_view_log v4, v_query_log v5, v_tenants v1 | Todo | EMB-1500 |
| 8 | EMB-1507 | Enrich tenant create audit event | In Review | none |
| - | EMB-1508 | Test: PII setting controls gathering | Todo | EMB-1501 |
| - | EMB-1512 | Collect and populate GDPR-gated request metadata | In Review | EMB-1501 |
EMB-1500, 1501, 1502, 1503, 1507, 1512 can all be worked in parallel. EMB-1504 and EMB-1505 can be parallelized once their dependencies are met. EMB-1506 only needs the migrations.
- Migrations: Start Metabase against H2/Postgres, verify columns exist via REPL query
- Auth method: Log in via different methods (password, API key), check
*auth-method*is correctly bound - embedding_client expansion: Hit public/embed/metabot/agent endpoints, verify embedding_client is populated
- view_log population: View a dashboard via public link, verify new columns are populated
- query_execution population: Run a query via embedded SDK, verify auth_method, tenant_id, security flags
- GDPR toggle: Enable setting, verify hostname/path/user-agent/IP are captured. Disable, verify they're NULL
- Analytics views: Query v_view_log, v_query_log, v_tenants, v_users and verify derived
surfacecolumn works correctly - Audit event: Create a tenant, check audit_log details includes attributes
This section maps each ticket to the existing test file it should extend, the pattern to follow, concrete test cases, and the helpers/vars needed.
File: test/metabase/app_db/schema_migrations_test.clj
Pattern to follow: workspace-input-normalization-migration-test (line 2872). Uses impl/test-migrations with a migration range vector, inserts rows before (migrate!), then verifies new columns exist and are NULL on pre-existing rows.
Test cases:
view-log-new-columns-migration-test- Insert aview_logrow before migration, run(migrate!), verifyauth_method,tenant_id,hostname,path,user_agent,ip_addressare all NULL on the pre-existing row.query-execution-new-columns-migration-test- Same pattern forquery_execution. Insert a row with required fields (hash,started_at,running_time,result_rows,native,executor_id,database_id,context), migrate, verify new columns are NULL.tenant-id-index-migration-test- After migration, verify the index works by querying with aWHERE tenant_id = ?clause on both tables (implicit: if migration succeeded, index was created).
Helpers needed: impl/test-migrations, t2/insert-returning-pks!, mdb.query/query
Note: Migration IDs are placeholders until the Liquibase YAML is written. Replace "v60.YYYY-MM-DDTHH:MM:00" with actual IDs.
File: test/metabase/analytics/usage_analytics_test.clj (new file - this setting has no natural home in existing test files)
Pattern to follow: Any defsetting test, e.g., mt/with-temporary-setting-values. The setting doesn't exist yet - it needs to be created first.
Test cases:
collect-analytics-pii-default-test- Verify default isfalsecollect-analytics-pii-toggle-test- Usemt/with-temporary-setting-values [collect-analytics-pii true]/false, verify round-tripcollect-analytics-pii-no-feature-gate-test- Wrap inmt/with-premium-features #{}, verify the setting still works (no EE gating)collect-analytics-pii-admin-only-test-mt/user-http-request :rasta :put 403 "setting/collect-analytics-pii"verifies non-admin can't change it
Helpers needed: mt/with-temporary-setting-values, mt/with-premium-features, mt/user-http-request
Note: collect-analytics-pii does NOT exist yet. Define it (likely in metabase.analytics.core or a new metabase.analytics.settings ns) before writing tests.
File: test/metabase/analytics/usage_analytics_test.clj (same new file as Ticket 2)
Pattern to follow: Integration tests that make HTTP requests via mt/user-http-request or client/client, then inspect side effects.
Test cases:
auth-method-from-session-test- Log in via password, make a request, verify*auth-method*was bound to theauth_identity.providervalue (e.g.,"password")auth-method-api-key-test- Create an API key withmt/with-temp [:model/ApiKey ...], make a request using it, verify*auth-method*="api-key"auth-method-nil-when-unauthenticated-test- Hit a public route, verify*auth-method*is nil
Helpers needed: mt/with-temp, mt/user-http-request, client/client
Note: *auth-method* does NOT exist yet. It needs to be defined as a dynamic var (likely in metabase.request.current or session middleware) and bound during request processing. Tests will need to observe the bound value indirectly - e.g., by checking the auth_method column written to view_log or query_execution (which means Tickets 3+5 or 3+6 test together in practice).
File: test/metabase/view_log/events/view_log_test.clj (extend existing public/embed tests)
Pattern to follow: get-public-card-logs-view-test (line 250). This test already:
- Wraps in
mt/with-premium-features #{:audit-app} - Enables public sharing with
mt/with-temporary-setting-values [enable-public-sharing true] - Creates a public card with
public-test/with-temp-public-card - Hits the public endpoint via
client/client - Asserts on
latest-view
Extend these assertions to check the embedding_client value.
Test cases:
- Extend
get-public-card-logs-view-test- After public card query, verify(:embedding_client (latest-view ...))="public" - Extend
get-embedded-card-embedding-logs-view-test(line 280) - Verifyembedding_client="guest-embed" server-side-binding-wins-over-client-header-test- SendX-Metabase-Client: embedding-sdk-reactheader to a public route, verifyembedding_clientis still"public"(server-side wins)
Current state: embedding_client already exists on view_log and query_execution. The include-sdk-info function in analytics/sdk.clj (line 34) sets it from *client* dynamic var. Currently *client* is only bound by embedding-mw (for SDK requests). For public/embed/metabot/agent routes, binding needs to happen in the route middleware.
Helpers needed: public-test/with-temp-public-card, embed-test/with-embedding-enabled-and-new-secret-key!, latest-view, client/client
File: test/metabase/view_log/events/view_log_test.clj
Pattern to follow: card-read-ee-test (line 27). The pattern is:
(mt/with-premium-features #{:audit-app}
(mt/with-temp [:model/User user {} :model/Card card {:creator_id (u/id user)}]
(events/publish-event! :event/card-read {:object-id (u/id card) :user-id (u/the-id user) :context :question})
(is (partial= {...} (latest-view (u/id user) (u/id card))))))Test cases:
view-log-auth-method-populated-test- Bind*auth-method*(once it exists), publish a card-read event, verify:auth_methodon the view_log rowview-log-tenant-id-populated-test- Create a user with:tenant_id, publish event as that user, verify:tenant_idon the view_log rowview-log-gdpr-fields-null-when-pii-disabled-test- Withcollect-analytics-pii=false, publish event via HTTP request (to get request context), verify GDPR fields are NULLview-log-gdpr-fields-populated-when-pii-enabled-test- Withcollect-analytics-pii=true, publish event via HTTP request, verifyhostname,path,user_agent,ip_addressare populatedview-log-null-for-unauthenticated-requests-test- Hit public route, verifyauth_methodandtenant_idare NULL
Important: View logging requires #{:audit-app} premium feature. All tests must wrap in mt/with-premium-features #{:audit-app}. Without it, view events are silently dropped (see card-read-oss-no-view-logging-test at line 41 for proof).
Helpers needed: mt/with-premium-features, events/publish-event!, latest-view, mt/with-temporary-setting-values, public-test/with-temp-public-card, client/client
Note: tenant_id exists on core_user but NOT on view_log yet - the migration (Ticket 1) adds it. The view_log event handler will need to read it from the current user and write it to the new column.
File: test/metabase/query_processor/middleware/process_userland_query_test.clj
Pattern to follow: success-test (line 69). The pattern uses with-query-execution! macro (line 51) which:
- Intercepts
save-execution-metadata!*viawith-redefs - Captures the query-execution map in a promise
- Returns a function
(qe)that dereferences the result
The test then asserts on specific fields of the captured map, including :is_sandboxed false (line 97).
Category A fields (captured on the Ring thread, passed via :info map):
auth_method- from*auth-method*dynamic vartenant_id- from current user- GDPR fields (
hostname,path,user_agent,ip_address) - from current request, gated bycollect-analytics-pii
Category B fields (set on result metadata inside the QP pipeline, same thread as is_sandboxed):
is_impersonated- set by impersonation middleware on[:data :is_impersonated]is_db_routed- set by database routing middleware on[:data :is_db_routed]
Test cases for Category A:
- Extend
success-testassertions to include:auth_method nil,:tenant_id nil(baseline - no auth context in unit test) query-execution-auth-method-populated-test- Bind*auth-method*, run query viamt/user-http-request, check QE rowquery-execution-tenant-id-populated-test- Create user with tenant, run query, check QE rowquery-execution-gdpr-fields-gated-by-setting-test- Togglecollect-analytics-pii, run query, verify fields are NULL vs populated
Test cases for Category B:
5. query-execution-is-impersonated-test - Follow is_sandboxed pattern: check (get-in result [:data :is_impersonated]) is picked up and saved. Requires EE test setup with impersonation.
6. query-execution-is-db-routed-test - Same pattern for is_db_routed. Requires EE test setup with database routing.
Important - async save: Query execution metadata is saved asynchronously. The with-query-execution! macro handles this by intercepting save-execution-metadata!* (defined at process_userland_query.clj:43) with with-redefs and using a promise. Integration tests (Category A via HTTP) will need a different approach - either:
- Use
with-redefsonsave-execution-metadata!*to capture the map, or - Query
t2/select-one :model/QueryExecutionafter the request completes (with a small wait ormt/wait-for)
Helpers needed: with-query-execution! macro, mt/user-http-request, mt/with-temporary-setting-values, t2/select-one
File: test/metabase/app_db/schema_migrations_test.clj (for the migration that creates the views) plus a new test/metabase/analytics/analytics_views_test.clj for query-level tests.
Key reference - current view SQL: resources/migrations/instance_analytics_views/. The current views are simple:
view_log/v3/- Only exposesid,timestamp,user_id,entity_type,entity_id,entity_qualified_id. Many existing columns (has_access,context,embedding_client) are NOT yet in the view.query_log/v4/- Exposes core fields plustransform_id,lens_id,lens_params, joined toquerytable. Does not yet exposeembedding_client,is_sandboxed, etc.users/v4/- Does not yet havetenant_id.tenants/- Does not exist yet.
This means Ticket 7 is adding a LOT of new columns to these views, not just tweaking them. The surface CASE expression is entirely new logic.
Test cases:
- Migration test: After running migration, verify views
v_view_log,v_query_log,v_tenants,v_usersexist and return results v_view_logsurfacederivation: Insert view_log rows with differentembedding_clientvalues, queryv_view_log, verifysurfaceCASE expression maps correctly (e.g.,"embedding-sdk-react"->"sdk","public"->"public-link", etc.)v_tenantsview: Create tenants with attributes, verify the view exposes themv_usersview: Verifytenant_idcolumn appears in the view
File: test/metabase/events/audit_log_test.clj or the tenant API test file
Pattern to follow: Existing audit log event tests that check (mt/latest-audit-log-entry ...).
Test cases:
tenant-create-audit-event-test- Create a tenant, verify audit_log entry has:detailsincluding:attributestenant-update-audit-event-test- Update tenant attributes, verify the audit event captures the change
This distinction matters for query_execution (Ticket 6) because:
- Category A fields are values available on the Ring request thread:
auth_method,tenant_id, GDPR fields. They're passed into the QP via the:infomap on the query, and written to thequery_executionrow bysave-execution-metadata!. - Category B fields are set inside the QP middleware pipeline on result metadata:
is_sandboxed(existing),is_impersonated(new),is_db_routed(new). They follow the pattern atprocess_userland_query.clj:116where(get-in acc [:data :is_sandboxed])is read from the accumulated result.
The thread boundary means Category A fields must be threaded through :info -> query-execution map, while Category B fields are picked up from the QP result. Tests for each category use different setup patterns:
- Category A: bind dynamic vars / set request context, then run query
- Category B: set up EE middleware (sandboxing/impersonation/routing), run query, verify the middleware set the flag on result metadata AND it was persisted
| Attribute | Header/Cookie | Value type | Currently represented in HTTP requests? | Clojure representation | Currently stored in DB? | table.column(s) |
|---|---|---|---|---|---|---|
| embedding hostname | Origin (header) |
string | yes (header exists, not captured) | (get-in (metabase.request.current/current-request) [:headers "origin"]) |
no | new: view_log.embedding_hostname, query_execution.embedding_hostname |
| embedding path | Referer (header) |
string | yes (header exists, not captured) | (get-in (metabase.request.current/current-request) [:headers "referer"]) |
no | new: view_log.embedding_path, query_execution.embedding_path |
| user id | metabase.SESSION (cookie) / X-Metabase-Session (header) |
int | yes | metabase.api.common/*current-user-id* |
yes | view_log.user_id, query_execution.executor_id |
| tenant_id | metabase.SESSION (cookie) / X-Metabase-Session (header) |
int | yes | (:tenant_id @metabase.api.common/*current-user*) |
yes (on core_user), not on analytics tables | core_user.tenant_id; new: view_log.tenant_id, query_execution.tenant_id |
| surface | X-Metabase-Client (header) |
string | partial (SDK/iframe only, not public/embed/metabot/agent) | metabase.analytics.sdk/*client* |
partial (as embedding_client) | view_log.embedding_client, query_execution.embedding_client; surface derived at read time via CASE in analytics views |
| authentication method | metabase.SESSION (cookie) / X-Metabase-Session (header) / X-Api-Key (header) / ?jwt=<token> (URL param for guest embed) / request URI (for public/embed routes) |
string | yes (headers/cookies exist, not captured as auth_method) | new *auth-method* var in request.current; from auth_identity.provider for sessions, "api-key" for API keys, "guest"/"public" for embed/public routes |
no | new: view_log.auth_method, query_execution.auth_method |
| date | n/a (server timestamp) | timestamp | n/a | (t/zoned-date-time) |
yes | view_log.timestamp, query_execution.started_at |
| sanitized user agent | User-Agent (header) |
string | yes (header exists, not captured) | (get-in (metabase.request.current/current-request) [:headers "user-agent"]) |
no | new: view_log.sanitized_user_agent, query_execution.sanitized_user_agent |
| ip address | X-Forwarded-For (header) / remote-addr |
string | yes | (metabase.request.util/ip-address request) |
yes (login_history.ip_address only) | login_history.ip_address; new: view_log.ip_address, query_execution.ip_address |
| query filters | n/a (query info) | text (JSON) | n/a | (:parameters query) (top-level on the query map) |
no | new: query_execution.parameters |
| is download? | n/a (query context) | boolean (derived) | n/a | query context keyword (:csv-download, :embedded-csv-download, etc.); derive boolean via (str/ends-with? context "download") |
yes | query_execution.context (already distinguishes all download types). No new column - derive in analytics view CASE. |
| row and column security? | n/a (query middleware) | boolean | n/a | set in sandboxing middleware result metadata | yes | query_execution.is_sandboxed |
| db routing? | n/a (query middleware) | boolean | n/a | set on result metadata by swap-destination-db middleware (Category B - same pattern as is_sandboxed) |
no | new: query_execution.is_db_routed |
| connection impersonation? | n/a (query middleware) | boolean | n/a | set on result metadata by impersonation middleware (Category B - same pattern as is_sandboxed) | no | new: query_execution.is_impersonated |
| tenant attributes | n/a | JSON text | n/a | (:attributes (t2/select-one :model/Tenant :id tenant-id)) |
yes | tenant.attributes; exposed via new v_tenants view |
| active tenants | n/a | n/a | n/a | (t2/select :model/Tenant :is_active true) |
yes | tenant.is_active; exposed via new v_tenants view |
| active tenant users | n/a | n/a | n/a | join core_user on tenant_id | yes | core_user.tenant_id + core_user.is_active; queryable via v_users (new tenant_id column) joined to v_tenants |
| published dashboards/questions | n/a | n/a | n/a | already queryable | yes | v_content.made_public_by_user (non-null = published). No changes needed. |
| public embeds | n/a | n/a | n/a | already queryable | yes | v_content.is_embedding_enabled. No changes needed. |
| HTTP Verb | HTTP Route | What it populates | Context value |
|---|---|---|---|
| GET | /api/card/:id |
view_log | :question |
| GET | /api/dashboard/:id |
view_log | :dashboard |
| GET | /api/collection/:id |
view_log | :collection |
| POST | /api/dataset |
query_execution | :ad-hoc |
| POST | /api/card/:id/query |
query_execution | :question |
| POST | /api/dashboard/:id/dashcard/:id/card/:id/query |
query_execution | :dashboard |
| GET | /api/public/card/:uuid/query |
query_execution | :public-question |
| GET | /api/public/dashboard/:uuid/... |
query_execution | :public-dashboard |
| GET | /api/embed/card/:token/query |
query_execution | :embedded-question |
| GET | /api/embed/dashboard/:token/... |
query_execution | :embedded-dashboard |
| POST | /api/card/:id/query/:export-format |
query_execution | :csv-download / :xlsx-download / :json-download |
| POST | /api/embed/card/:token/query/:export-format |
query_execution | :embedded-csv-download / etc. |
| POST | /api/public/card/:uuid/query/:export-format |
query_execution | :public-csv-download / etc. |
- User-Agent sanitization: Current decision is to store raw and display safely. No stripping needed unless injection risk is identified.
- embedding_path truncation: Strip query params first, then truncate to 2048. Implementation detail for Ticket 5/6.
- Unknown X-Metabase-Client values: Pass through as-is in the surface CASE expression (the ELSE clause).
- Auth identity reliability:
core_session.auth_identity_idreliably points to the auth identity used to create that session (set at creation time increate-session-with-auth-tracking!). Safe to LEFT JOIN for auth_method. - Tenant context for unauthenticated requests: NULL tenant_id for public/embed requests is fine.
- No backfill: Don't backfill auth_method or tenant_id for historical rows.
- OIDC provider granularity: No need to distinguish between multiple OIDC providers. Raw "oidc" string is sufficient.
- Index strategy: Out of scope for this project. Only the planned tenant_id indexes.
Decision: Keep the -preview suffix on embedding_client values.
Context: The embedding-mw appends -preview to the client value when X-Metabase-Embedded-Preview: true is set (e.g., embedding-sdk-react becomes embedding-sdk-react-preview). The question was whether preview requests should be grouped with their non-preview counterpart or kept distinct.
Resolution (Poom + Nicolo):
- Preview requests must be distinguished from production usage for auditing - admins can query sensitive data via the embed preview wizard, and this should appear in audit trails (Nicolo)
- Usage analytics views (Ticket 7) should de-emphasize preview but still surface it - the main use case is admins viewing embedded user behavior, preview is the edge case (Poom)
- Preview requests should NOT be ignored/skipped entirely because of the auditing concern
Implementation: Two TODOs in sdk.clj resolved with comments documenting this decision. No behavioral changes - the -preview suffix was already being appended.
Rather than building all inputs, then all DB writes, then all views in sequence, we wire each value end-to-end (request -> DB table -> analytics view) to prove the pipeline works, then repeat for each additional value. This catches integration issues early.
Values dictated by 1 input:
| Value | Input Source | DB Column(s) | View | Status |
|---|---|---|---|---|
| embedding_client | Route prefix OR X-Metabase-Client header | view_log + query_execution | v_query_log (needs surface CASE) | Wired - middleware binds, tests verify DB values |
| embedding_version | X-Metabase-Client-Version header | view_log + query_execution | v_query_log | Wired |
| preview suffix | X-Metabase-Embedded-Preview header | appended to embedding_client | same as above | Wired - tested |
| tenant_id | current-user tenant | view_log + query_execution | needs view col | Migration done, not yet populated |
| embedding_hostname | Origin header | view_log + query_execution | needs view col | Extraction fn done (pii-request-info), not wired to writes |
| embedding_path | Referer header (path only) | view_log + query_execution | needs view col | Extraction fn done, not wired to writes |
| sanitized_user_agent | User-Agent header | view_log + query_execution | needs view col | Extraction fn done, not wired to writes |
| ip_address | Request remote addr | view_log + query_execution | needs view col | Extraction fn done, not wired to writes |
| is_impersonated | Connection impersonation flag (Category B - QP result metadata) | query_execution only | needs view col | Migration done, not yet populated |
| is_db_routed | DB routing flag (Category B - QP result metadata) | query_execution only | needs view col | Migration done, not yet populated |
| parameters | Query filter values (JSON) | query_execution only | needs view col | Migration done, not yet populated |
Values dictated by multiple inputs:
| Value | Input Sources | DB Column | Status |
|---|---|---|---|
| auth_method | Session type (auth_identity.provider) + route (public, guest-embed) + API key header | view_log + query_execution | Dynamic var + include-sdk-info done, not yet bound in auth middleware |
Where auth method is already known: At session creation time via auth_identity.provider. The core_session table has auth_identity_id FK to auth_identity, which has a provider column.
Recommended binding approach: Derive at request time by LEFT JOINing auth_identity in session-with-id-query (session.clj:99-140). This adds provider to every session lookup with zero extra queries (indexed FK lookup on a single row).
- Modify
session-with-id-queryto add[:auth_identity.provider :auth-method]to SELECT - For API keys: hardcode
"api-key"incurrent-user-info-for-api-key - For public/embed routes: bind based on route prefix in
embedding-mwalongside*client* - Bind
*auth-method*inbind-current-userfrom the request map
view_log write path: Request -> event published -> view_log/events/view_log.clj:generate-view -> record-views! -> t2/insert! -> before-insert hook calls analytics/include-sdk-info
query_execution write path: process-userland-query-middleware -> save-execution-metadata! -> include-sdk-info called at line 56 (BEFORE async dispatch) -> async agent thread does t2/insert-returning-pk!
tenant_id: Available via (:tenant_id @api/*current-user*). No *current-tenant* var. Must capture before async boundary.
PII fields: Best bound in embedding-mw alongside *client* and *version*. No need for separate middleware - same request, same tables, same dynamic-var-to-DB-write pattern. PII columns are named embedding_* so they're embedding-scoped.
is_impersonated / is_db_routed: Category B - set inside QP pipeline on result metadata. Must follow is_sandboxed pattern. perms/impersonated-user? exists for impersonation detection. is_db_routed has no existing flag - needs to be surfaced from routing middleware.
View upgrade process: Create new version directory under resources/migrations/instance_analytics_views/<view_name>/v<N>/ with postgres/mysql/h2 SQL files. Add migration changeset to 060/ YAML. Rollback section points to previous version's SQL files.
Surface CASE expression: Goes directly in the SQL view definition (no separate derived view). Consistent with how views already do column renaming, type casting, and coalescing. No precedent for layered views.
PII in views: Include columns in views, gate collection at write time (Option A). SQL views can't conditionally include columns based on runtime settings. The pii-request-info function already handles the extraction; the GDPR setting gates whether it's called.
Two separate analytics systems:
internal_stats/query_executions.cljfeeds Snowplow/phone-home telemetry (aggregate counts sent to Metabase HQ)- Analytics views (
v_query_log,v_view_log) feed instance-level usage dashboards for admins
These should use consistent classification logic but serve different purposes. The query_executions.clj CASE expressions may also need updating for newer client values (public, guest-embed, metabot, agent-api).
| Auth Method | Test Infrastructure | Feasibility |
|---|---|---|
| password | mt/user-http-request - standard test helper |
Easy |
| API key | Tests in api_keys/api_test.clj, header-based |
Easy |
| LDAP | with-ldap-server! macro - in-memory LDAP server |
Feasible |
| SAML | with-saml-default-setup! macro, test cert at test_resources/sso/auth0-public-idp.cert |
Feasible - mature infra |
| JWT | with-jwt-default-setup! macro (random secret per run) |
Feasible - mature infra |
| Google OAuth test setup exists | Feasible | |
| OIDC | Enterprise OIDC test setup | Feasible |
| public | Existing view_log tests verify embedding_client "public" in DB |
Easy |
| guest-embed | Existing embed test helpers | Easy |
Testing approach: Use existing SSO test infra to perform login, make subsequent request with session, assert auth_method written to view_log/query_execution (or captured via test helper).
- auth_method for password + API key - bind
*auth-method*in session middleware, verify it lands in view_log/query_execution. Easiest to test, proves the full pipeline. - PII fields - wire
pii-request-infointo view_log/query_execution write path (EMB-1504/1505), GDPR-gated - tenant_id - bind from current user context, verify in both tables
-
Should
parametersbe GDPR-gated? Filter parameter values can contain user-identifiable information (e.g., filtering by email, customer name). Should theparameterscolumn on query_execution also be gated by thecollect-analytics-piisetting? Or are parameter values considered non-PII? -
Embedding client precedence.RESOLVED: The endpoint's server-side binding wins (unconditionalwith-client!). The endpoint determines which surface the request is actually served through, regardless of what client header was sent. -
Parameters column redundancy. The full query (including parameters) is already stored as
json_queryon query_execution. Storing parameters separately adds queryability but also doubles that data. Worth it? -
NULL auth_method semantics. When
auth_methodis NULL (old sessions before auth_identity migration, background jobs, pulse rendering), should analytics treat this as "unknown" or "internal"? Should we explicitly bind "internal" for background jobs? -
RESOLVED: Safe.as-adminescalation.as-admincallsdo-with-current-userwhich only rebinds the 7 vars it explicitly names (*current-user-id*,*is-superuser?*, etc.). A separately-bound*auth-method*(in middleware, outsidedo-with-current-user) is NOT touched byas-admin- Clojurebindingonly shadows explicitly named vars. The original auth method is correctly retained.
- Setting name: Use
embedding-analytics-collect-pii(notembedded-) to match existingembedding-*prefix convention inembedding/settings.clj.
OUTCOME For this: we actually should just call it collect-analytics-pii since it is not tied to embedding. see:
[3:16 PM]Bryan Question about setting name PII: Our product/tech docs have it called embedded-analytics-collect-pii.
1. embedding-analytics-collect-pii (not embedded-) better matches the existing embedding-* prefix convention in embedding/settings.clj
but the bigger thing is: is PII collection actually embedding specific? Should we capture that info for non-embedding requests too? if so we will change the name even more
[7:35 AM]roman I think this isn’t embedding-specific
-
Feature gate: AddOVERRIDDEN: No feature gate. Setting is not embedding-specific (per Roman), available to all instances.:feature :embeddingfrom the start. -
Auth method values - use raw provider strings: Don't map "password" -> "cookie". Use the raw
auth_identity.providervalue directly. This gives us "password", "google", "ldap", "oidc", "custom-oidc", "slack-connect", "support-access-grant" for free without a mapping table. The provider describes how the user proved their identity - the transport mechanism (cookie vs header) is orthogonal. -
embedding_client binding precedence: UseOVERRIDDEN: Use unconditional(or *client* "public")/(or *client* "guest-embed")instead of unconditional rebinding.with-client!binding. The endpoint's server-side value wins on conflicts - the endpoint determines which surface the request is served through. -
Guest/Public binding location: Create thin wrapper functions in
routes.clj(e.g.,with-analytics-bindings) rather than modifying security-oriented wrappers (+public-exceptions,+message-only-exceptions). Keeps analytics concerns separate from security concerns. -
Drop
active_user_countfrom v_tenants: It breaks the pattern of analytics views exposing raw data. The join can be done in the dashboard layer using thetenant_idbeing added tov_users. No other analytics view computes aggregates inline. -
Use
${text.type}nottext: The YAML should use${text.type}for the parameters column (maps to TEXT/LONGTEXT/CLOB per database). Same pattern aslens_params. -
View rollback correctness: v_view_log rollback is tricky - current version is v2 for postgres/h2 but v3 for mysql. Rollbacks must reference the correct per-database previous version.
- Session query JOIN performance: Adding LEFT JOIN to
auth_identityis negligible - PK lookup on one row, both sides indexed. - Impersonation detection:
(boolean (:impersonation/role query))is reliable - key only set when impersonation genuinely applied. Database routing detection:CORRECTED:(boolean destination-database-id)is reliable.is_db_routedhas the same timing issue asis_impersonated- must use result-metadata pattern. See Critical Issue #3 above.- Tenant table on OSS: Table exists on all instances (not EE-gated). v_tenants view will work everywhere (zero rows on OSS).
(:tenant_id @api/*current-user*)without defenterprise: Works fine - column is indefault-user-columns, naturally NULL on OSS.- Qualified ID format:
'tenant_' || tenant_idis consistent with all other views.tenant_collection_idis NOT NULL so no nullable concat concern. - v_tenants in audit-db-view-names: Safe to add even before view is synced - permission checks simply ignore unknown tables.
- Scheduled tasks (pulses):
*auth-method*will be nil/NULL for pulse queries. Combined withcontext = 'pulse', this is unambiguous and correct. - Auth method in analytics views: Low security risk - only visible to admins who already have access to login history and user details.
-
CRITICAL: Two thread boundaries lose dynamic vars for query_execution. The streaming response body runs on a different thread (streaming thread pool) from the Ring request thread, and
save-execution-metadata!then submits work to yet another thread (pooled executor). Dynamic vars like*current-user*,*auth-method*, and(current-request)are NOT available on either thread.Fix: All new column values must be captured as data in the query
:infomap BEFORE entering the streaming response - following the existingexecuted-bypattern. Do NOT read dynamic vars inquery-execution-info. Instead:- Best injection point:
qp/userland-queryinsrc/metabase/query_processor.clj(lines 101-108) - the single centralized function ALL userland query entry points call - Add a helper
embedding-analytics-infothat reads dynamic vars on the Ring thread and returns a map of:auth-method,:tenant_id, and GDPR fields - Merge this into
:infoinsideuserland-query - Read them from
(:info query)inquery-execution-info - Also add these keys to the
::lib.schema.info/infoMalli schema insrc/metabase/lib/schema/info.cljc
- Best injection point:
-
CRITICAL:
is_impersonatedwill always be false if set inquery-execution-info. The:impersonation/rolekey is set by EE impersonation middleware which runs INSIDE the QP pipeline, AFTERprocess-userland-query-middlewarehas already capturedquery-execution-info.Fix: Follow the
is_sandboxedpattern - have the impersonation middleware set the value on result metadata, then extract it inadd-and-save-execution-metadata-xform!alongsideis_sandboxed. -
CRITICAL:
is_db_routedhas the same timing issue asis_impersonated.:destination-database/idis set byattach-destination-db-middlewareduring preprocessing, which runs INSIDE the QP pipeline, AFTERquery-execution-infocaptures data. The original report incorrectly stated this was set before the pipeline.Fix: Same as
is_impersonated- set:is_db_routedon result metadata in theswap-destination-dbexecution middleware (database_routing/middleware.clj), then extract inadd-and-save-execution-metadata-xform!.
- GDPR data retention: Once PII is collected and the setting is later disabled, historical PII remains visible through the views. The setting only controls future writes.
OUTCOME: Inline columns, purge deferred. All PII columns are nullable, so future purge = batched UPDATE SET column = NULL. Costly on large tables (tens of millions of rows) but acceptable since PII collection is off by default this phase. Existing PII precedent: login_history stores ip_address and device_description inline with no cleanup. Separate FK'd PII tables (purge = TRUNCATE, sub-second) were considered but deferred. Slack consensus: setting stays off this phase; Roman says don't paint into a corner; Ben confirms nullable columns = no technical limitation to future deletion.
Slack thread for reference:
[11:11 AM]Bryan Something we should be aware of for GDPR: Once PII is collected and the setting is later disabled, historical PII will remain visible through the views. The setting only controls future writes. Is that something we should handle as part of this project?
[11:15 AM]sameer good point. I Think flipping the bit should be considered a pretty serious action.
For actual compliance, we'd need to re-write history and scrub PII[11:46 AM]roman How difficult is it to delete this data when toggling off? Straightforward, I guess, if so let's do it for the first iteration
[11:48 AM]sameer depends on how many views there are
[4:39 PM]Ben Grabow My understanding is that for this phase of the project we're building under the assumption that the setting will be turned off (by default off and remains off). At some later point we'll start another project/phase of project where we start turning the setting on, and we'll need to tackle the GDPR issues then. Not saying we shouldn't think about them but pragmatically speaking I don't think we need to address these concerns in the branch we're working on this week. (edited)
[7:04 AM]roman @Ben Grabow here it feels more important not to rush into building something, but to make sure we’re not doing anything that would make it hard to remove this data later. So if we have a plan and a clear understanding of what to do, that’s fine too.
So ideally we’d want to answer the hardest questions first, rather than getting stuck on them at the end[8:35 AM]Ben Grabow I agree, we shouldn't paint ourselves into a corner and we should make sure that historical PII can be removed. I just wanted to clarify the scope of this phase of the project.
When we get to the phase when we are building a way to remove historical PII, it seems that we know which columns will contain PII, we will make those columns nullable, so in principle there is no technical limitation to deleting the historical PII.
There may be some UX and performance questions to answer. If we put the PII columns in their own table FK'd to the main tables, then deleting the PII would be as simple as dropping the table. Do we have precedent for storing PII elsewhere in the system or is this the first time?
- Legacy sessions:
auth_identity_idcan be NULL oncore_sessionfor pre-migration sessions. LEFT JOIN handles this (returns NULL provider). Analytics should treat NULL auth_method from session requests as "unknown/legacy". - Index creation on large tables:
CREATE INDEXtakes a SHARE lock on Postgres which blocks writes. For very largeview_log/query_executiontables in production, considerCREATE INDEX CONCURRENTLY(requires raw SQL changeset for Postgres, standardcreateIndexfor MySQL/H2).