Design: Application-Level S3 Encryption for EY Environments — LOE & Technical Design

Generated by /office-hours on 2026-03-26 | Branch: feature/ENG-857-no-celery | Repo: Faction-V/gofigure_terraform | Status: APPROVED | Mode: Intrapreneurship

Problem Statement

EY's InfoSec team requires that files uploaded in EY environments be stored as ciphertext in S3, decryptable only by application code holding a key that lives outside AWS. The specific threat: a Capitol AI admin with AWS console/CLI access could bypass the application layer, download files with aws s3 cp, and, because they hold kms:Decrypt permission on the XKS-backed KMS key, read plaintext. EY rejected KMS key policy lockdown; they want encryption guarantees independent of AWS IAM entirely.

Capitol AI already has two security layers deployed:

  1. XKS encryption (S3 -> KMS -> XKS Proxy -> EY's Azure Key Vault) — EY controls the key, can revoke and audit
  2. Per-org IAM isolation (ENG-893) — email-domain-scoped S3 access prevents Capitol admins from accessing EY files through the application

App-level encryption would be the third layer, ensuring S3 stores only ciphertext even if the entire AWS permission model is bypassed.

Demand Evidence

EY is a paying enterprise customer with three dedicated workspaces (ey-eu-west-1, ey-ap-southeast-1, ey-us-east-2). Their InfoSec team explicitly requested this capability. Chester (Capitol CTO) confirmed the requirement on a call. KMS key policy lockdown was proposed and rejected — EY wants the ciphertext guarantee regardless of AWS permissions.

Status Quo

Today, EY files are encrypted via SSE-KMS with XKS (key material in Azure Key Vault). This is transparent encryption — any IAM principal with s3:GetObject + kms:Decrypt permissions reads plaintext. The per-org IAM isolation (ENG-893) blocks the application's default IRSA role from accessing EY prefixes, but does not block Capitol admin IAM users with direct AWS access.

Target User & Narrowest Wedge

Target: EY InfoSec team evaluating Capitol AI's data protection posture.

Narrowest wedge: Encrypt uploads in one EY workspace (e.g., ey-eu-west-1) with the envelope encryption library. Validate with EY InfoSec that downloading a raw S3 object via aws s3 cp returns ciphertext. Then expand to all three EY workspaces and all three bucket types.

Constraints

  • Three S3 buckets in scope: uploads (capitol-ai-ingestion-pipeline-*), workflow files (capai-agentic-files-*), outputs (capai-agentic-outputs-*)
  • Three services touch EY's S3 files: platform-api, qdrant-svc, agentic-backend (platform-ingestion-pipeline creates the bucket but file processing is Celery-based within these services, not Lambda)
  • Pre-signed URLs break (return ciphertext) — proxy download endpoints needed
  • Reducto.ai integration changes — qdrant-svc must decrypt in-memory before sending to Reducto
  • EY-only requirement — conditional per-org, not all customers
  • Existing XKS + IAM isolation remain as additional defense layers
  • Chester's constraint: "our ability to support and troubleshoot will be limited, so the onus will be on them to look at a lot of issues themselves"

Premises

  1. S3 objects must be ciphertext at rest, decryptable only by application code holding a key outside AWS (Azure Key Vault)
  2. Three services modified: platform-api, qdrant-svc, agentic-backend. All three S3 buckets in scope. File ingestion is Celery-based (no Lambda).
  3. Pre-signed URLs break — need proxy download endpoints (latency + memory impact for large files)
  4. Reducto.ai integration changes — qdrant-svc must decrypt in-memory and stream to Reducto
  5. EY-only requirement — conditional per-org/per-workspace
  6. Existing XKS + IAM isolation remain as additional defense layers

Cross-Model Perspective

Codex independently reviewed the problem and provided these insights:

Steelman: "An org-scoped data protection layer where EY's files are always encrypted with keys that never live in AWS, so even a privileged Capitol admin cannot decrypt raw S3 objects. It preserves existing product functionality by moving decryption into controlled application paths and streams."

Key insight: The quote "Pre-signed URLs break — they return ciphertext. Need proxy download endpoint" reveals the real build: a new data access plane (decrypting proxy/streaming service) that replaces direct S3 access for every read path. This is the core product change and cost driver.

Challenged premise: Codex questioned whether only 3 services touch EY's S3 files. Investigation confirmed a 4th: platform-ingestion-pipeline creates and manages the uploads bucket for EY workspaces.

48-hour prototype: Minimal FastAPI endpoints for upload (stream-encrypt with AES-256-GCM, wrap per-object DEK via Azure Key Vault, store wrapped key + IV in S3 metadata) and download (stream-decrypt to client). Gate by workspace/env var; one bucket only. Skip browser preview UX, key rotation, multipart upload optimization, bulk re-encryption.

Approaches Considered

Approach A: Envelope Encryption Library (Shared Python Package) — CHOSEN

Build a shared Python library (capitol-crypto) that wraps S3 read/write with AES-256-GCM envelope encryption. Each file gets a random Data Encryption Key (DEK), which is wrapped by Azure Key Vault. Wrapped DEK + IV stored in S3 object metadata. All 3 services import the library. Pre-signed URLs replaced with proxy download endpoints.

  • Effort: L (human: 4-6 weeks / CC+gstack: 3-5 days)
  • Risk: High
  • Pros: Cleanest architecture, S3 stores pure ciphertext, key never in AWS, per-object keys
  • Cons: Breaks pre-signed URLs, changes Reducto integration, key rotation requires re-encryption

Approach B: Encryption Sidecar / Proxy Service

Deploy a standalone "crypto-proxy" Kubernetes service between all services and S3. Each service points its S3 endpoint configuration at the proxy, which applies the same envelope encryption centrally.

  • Effort: XL (human: 6-8 weeks / CC+gstack: 5-7 days)
  • Risk: Very High — new single point of failure, double network hop
  • Pros: Minimal per-service code changes, centralized audit logging
  • Cons: New service to maintain, latency for every S3 op, still breaks pre-signed URLs

Approach C: Scoped KMS Key Policy + S3 Bucket Policy (Rejected by EY)

Lock down XKS KMS key policy to only allow service roles to call kms:Decrypt. Zero app code changes.

  • Effort: S (human: 2-3 days / CC+gstack: 2-4 hours)
  • Risk: Low
  • Pros: Zero app changes, pre-signed URLs work, no latency impact
  • Cons: EY already rejected — still depends on AWS IAM, doesn't satisfy "ciphertext independent of AWS" requirement

Recommended Approach

Approach A: Envelope Encryption Library

Technical Design

Encryption Scheme

WRITE PATH:
1. Generate random 256-bit DEK (Data Encryption Key)
2. Encrypt file content with AES-256-GCM using DEK
3. Call Azure Key Vault to wrap DEK with org's master key (KEK)
4. Store to S3:
   - Body: ciphertext
   - Metadata: wrapped_dek, iv, auth_tag, kek_version, encryption_version

READ PATH:
1. Read S3 object (ciphertext + metadata)
2. Call Azure Key Vault to unwrap DEK using org's KEK
3. Decrypt ciphertext with AES-256-GCM using DEK + IV + auth_tag
4. Return plaintext to caller

Envelope Encryption Rationale

  • Per-object DEKs: Each file has a unique key. Compromising one DEK only exposes one file.
  • KEK in Azure Key Vault: The Key Encryption Key (master key) never leaves Azure. EY controls it.
  • Wrapped DEK in S3 metadata: The DEK is stored encrypted alongside the object. Useless without Azure Key Vault access.
  • AES-256-GCM with chunked streaming: Standard AES-GCM requires the full ciphertext for auth tag verification. For files >50MB, use a chunked scheme: split the file into fixed-size chunks (e.g., 1MB), each encrypted with its own GCM nonce derived from the DEK + chunk index. Each chunk has its own auth tag stored in a manifest. This follows the STREAM construction pattern (see AWS S3 CSE v3 spec for reference implementation). Small files (<50MB) can use single-shot GCM. Exact chunk format details (manifest storage, nonce derivation function) will be finalized during the library build phase; an illustrative sketch follows this list.

DEK Caching

To avoid an Azure Key Vault API call on every S3 read, cache unwrapped DEKs in-memory with a short TTL (5 minutes). The cache is keyed by (bucket, s3_key, wrapped_dek_hash). Cache invalidation happens on TTL expiry or process restart. This is acceptable because the DEK is already stored (wrapped) in S3 metadata — caching the unwrapped version for a short window reduces AKV latency from ~100-500ms per read to near-zero for repeated reads of the same file.

Library Interface (capitol-crypto)

# capitol_crypto/s3.py

from typing import AsyncIterator

class EncryptedS3Client:
    """Drop-in replacement for S3 operations with envelope encryption."""

    def __init__(self, s3_client, azure_kv_client, kek_name: str):
        ...

    async def put_object(self, bucket: str, key: str, body: bytes, **kwargs) -> dict:
        """Encrypt body with fresh DEK, wrap DEK via Azure KV, upload to S3."""
        ...

    async def get_object(self, bucket: str, key: str, **kwargs) -> bytes:
        """Download from S3, unwrap DEK via Azure KV, decrypt body."""
        ...

    async def get_object_stream(self, bucket: str, key: str) -> AsyncIterator[bytes]:
        """Streaming decrypt for large files (avoids loading entire file in memory)."""
        ...

    async def generate_download_url(self, bucket: str, key: str, expires_in: int) -> str:
        """Generate a signed proxy URL (NOT S3 pre-signed URL).

        Returns a URL like: /api/v1/files/download?token=<JWT>
        The JWT contains: bucket, key, exp (expiry), iat (issued at),
        signed with a service-level HMAC secret from SSM Parameter Store.
        The download endpoint validates the JWT signature and expiry
        before streaming decrypted content.
        """
        ...

Encryption Decision Flow

For every S3 operation, the service resolves whether to use the encrypted client:

S3 operation requested
  │
  ├─ Resolve org_id from request context
  │   (platform-api: from auth token → user → org membership)
  │   (agentic-backend: from workflow record → org_id)
  │   (qdrant-svc: from collection record → org_id)
  │   (ingestion-pipeline: from S3 event key prefix → org_id)
  │
  ├─ Look up org record in DynamoDB table:
  │   <workspace>-<client>-organizations_v1
  │
  ├─ Check org.encryption_enabled
  │   ├─ false (or absent) → Use standard S3 client (no change)
  │   └─ true → Use EncryptedS3Client with org.encryption_kek_name
  │             and org.encryption_kek_vault_url
  │
  └─ Proceed with S3 operation

Per-Service Changes

platform-api (uploads/downloads):

  • Replace s3_client.put_object() with encrypted_s3.put_object() for EY orgs
  • Replace pre-signed URL generation with proxy download endpoint
  • New endpoint: GET /api/v1/files/{file_id}/download — streams decrypted content
  • Conditional: check org.encryption_enabled flag before using encrypted client

qdrant-svc (parsing pipeline):

  • Before sending to Reducto: decrypt S3 object in-memory
  • Stream decrypted bytes to Reducto API (HTTP upload, not S3 path)
  • May need to switch Reducto integration from "S3 path" mode to "file upload" mode

agentic-backend (workflow files + outputs):

  • Replace S3 read/write with encrypted client for EY orgs
  • Replace pre-signed URLs for file preview with proxy download
  • Worker pods need Azure Key Vault credentials
  • Also calls Reducto.ai for document parsing — needs same decrypt-before-upload treatment as qdrant-svc

platform-ingestion-pipeline — NO CHANGES NEEDED:

  • File ingestion is now 100% Celery-based (runs within the existing EKS services), not Lambda
  • The platform-ingestion-pipeline Terraform module creates the S3 bucket but does not process files
  • All file processing happens in platform-api and qdrant-svc Celery workers, which are already covered above

Infrastructure Changes (Terraform)

  • Azure Key Vault: New key per EY org (or reuse XKS wrapping key)
  • IAM: Service roles need permission to call Azure Key Vault API (network path)
  • Kubernetes: Azure KV credentials as Kubernetes secrets or via Workload Identity
  • DynamoDB: New field encryption_enabled on organization record
  • No Lambda changes needed (ingestion is Celery-based)

Data Model

New fields on the organizations DynamoDB table (DynamoDB is schemaless for non-key attributes, so no Terraform table definition changes needed — fields are added at the application layer):

| Field | Type | Description |
| --- | --- | --- |
| encryption_enabled | Boolean | Enable app-level encryption for this org |
| encryption_kek_name | String | Azure Key Vault key name for this org's KEK |
| encryption_kek_vault_url | String | Azure Key Vault URL |

Migration Strategy

New files: Encrypted on write. No migration needed.

Existing files: Must be re-encrypted in a batch job:

  1. Read each object (decrypted via XKS transparently)
  2. Re-encrypt with envelope encryption
  3. Overwrite in S3 with ciphertext + metadata
  4. Record migrated objects in a checkpoint DynamoDB table for idempotency

Estimated scope: depends on EY's current data volume. Can be a one-time batch job.

Rollout Sequence (three EY workspaces, sequential):

  1. ey-eu-west-1 — first, as the narrowest wedge. Validate with EY InfoSec.
  2. ey-ap-southeast-1 — second, after EU validation confirms the approach works.
  3. ey-us-east-2 — last.

Each workspace has its own DynamoDB org table (<workspace>-ey-organizations_v1), S3 buckets, and service deployments. Migration batch job runs per-workspace. Enable encryption_enabled per-org after migration completes for that workspace.

Rollback Plan

If app-level encryption causes production issues after deployment:

  1. Immediate: Disable encryption_enabled flag on the org's DynamoDB record. Services fall back to standard S3 client (reads will fail for already-encrypted objects).
  2. Data rollback: Run reverse batch job — read encrypted objects via EncryptedS3Client, re-upload as plaintext (SSE-KMS with XKS takes over). The migration checkpoint table tracks which objects were encrypted, enabling targeted rollback.
  3. Code rollback: Revert service deployments via kubectl rollout undo. The library is a dependency — removing it requires a code revert + redeploy.

The migration batch job is designed to be idempotent: re-running it skips already-migrated objects (checked via S3 metadata encryption_version field).

Failure Modes

| Failure | Impact | Mitigation |
| --- | --- | --- |
| Azure Key Vault unreachable | All encrypted file reads/writes fail for that org | Circuit breaker with 3 retries + exponential backoff. Return HTTP 503 "Encryption service temporarily unavailable" to user. Alert via Sentry. |
| Partial write (file uploaded, DEK wrap fails) | Orphaned plaintext object in S3 | Wrap DEK BEFORE uploading ciphertext. If wrap fails, abort before S3 write. Never store plaintext. |
| EY revokes KEK in Azure Key Vault | All EY files become permanently unreadable | Expected behavior — this is EY's kill switch. Document this clearly for EY. Alert Capitol ops team. |
| DEK cache stale after key rotation | Reads succeed with old cached DEK (still valid for existing files) | KEK rotation only affects new DEKs. Old wrapped DEKs still unwrap with old KEK version if AKV retains it. Cache TTL ensures refresh within 5 minutes. |

Pre-signed URL Replacement

Current flow:

Browser → pre-signed S3 URL → S3 (returns plaintext via KMS)

New flow:

Browser → platform-api proxy endpoint → decrypt in pod → stream to browser

Impact:

  • Latency increase: ~100-500ms per file (Azure KV call + decrypt)
  • Memory: must stream, not buffer (important for files >50MB)
  • Pod resource limits may need increase for download-heavy workloads

Open Questions

  1. Azure Key Vault access from EKS: Do the service pods have network access to Azure Key Vault? The XKS proxy EC2 instance does, but EKS pods may need a new network path.
  2. Reducto.ai integration mode: Does qdrant-svc currently pass S3 paths to Reducto, or upload file bytes? If S3 paths, this is a bigger change.
  3. Lambda cold start impact: Not applicable — ingestion is Celery-based, no Lambda.
  4. Key rotation: When EY rotates their KEK in Azure Key Vault, do we re-encrypt all DEKs? Or version the KEK and handle multi-version unwrap?
  5. File size limits: What's the largest file EY uploads? Streaming encrypt/decrypt has different complexity at 10MB vs 1GB.
  6. Existing data volume: How many objects exist in EY buckets today? Determines migration batch job duration.
  7. EY acceptance criteria: Will EY InfoSec want to verify by running aws s3 cp themselves and seeing ciphertext? Need to coordinate testing.
  8. Azure Key Vault region for APAC: Which AKV region will EY use? If the vault is in Europe/US, cross-Pacific latency from ey-ap-southeast-1 could exceed 500ms per unique file read. EY may need a regional AKV instance, or the DEK cache TTL should be extended for APAC.

Success Criteria

  1. aws s3 cp on any EY org file returns ciphertext (not readable content)
  2. Downloading the same file through the Capitol AI platform returns readable content (for authorized users with matching email domain)
  3. File upload, download, preview, and Reducto parsing all work correctly for EY orgs
  4. Non-EY workspaces are completely unaffected
  5. EY InfoSec signs off on the implementation

Distribution Plan

  • capitol-crypto library: Published to Capitol AI's private CodeArtifact repository
  • Consumed by: platform-api, qdrant-svc, agentic-backend as a pip dependency
  • Infrastructure: Terraform modules for Azure KV keys, IAM policies, Kubernetes secrets

Dependencies

  • Azure Key Vault API accessible from EKS pods (network path)
  • Reducto.ai supports file upload mode (not just S3 paths)
  • EY provides/approves a KEK in their Azure Key Vault
  • No Lambda dependencies (ingestion is Celery-based in EKS)

LOE Summary

| Component | CC+gstack Estimate | Human Estimate | Notes |
| --- | --- | --- | --- |
| capitol-crypto library | 4-6 hours | 1 week | Core encrypt/decrypt + streaming + tests |
| platform-api changes | 4-6 hours | 1 week | Proxy download endpoint, upload encryption |
| qdrant-svc changes | 3-4 hours | 1 week | Decrypt before Reducto, integration mode change |
| agentic-backend changes | 3-4 hours | 3-4 days | Replace S3 ops + pre-signed URLs |
| Terraform (infra) | 2-3 hours | 2-3 days | Azure KV keys, IAM, K8s secrets, DynamoDB fields |
| Migration batch job | 2-3 hours | 2-3 days | Re-encrypt existing EY files |
| Testing & validation | 4-6 hours | 1 week | End-to-end across all 3 services |
| TOTAL | ~2-4 days | ~3-5 weeks | |

The Assignment

Before committing to building this, re-present Approach C (KMS key policy lockdown) to EY one more time with this specific framing:

"We can configure the encryption key policy so that Capitol admin IAM users cannot decrypt your files — even downloading via the AWS console returns encrypted bytes. Your Azure Key Vault logs every single decryption call. The only principals that can decrypt are the application service roles that serve your authorized users. This is enforced at the AWS KMS level, not just application code."

If EY still rejects it after hearing this framing, then proceed with Approach A. The design doc above gives you the full implementation plan and LOE.
