Design: Application-Level S3 Encryption for EY Environments — LOE & Technical Design

Generated by /office-hours on 2026-03-26 | Branch: feature/ENG-857-no-celery | Repo: Faction-V/gofigure_terraform | Status: APPROVED | Mode: Intrapreneurship

Problem Statement

EY's InfoSec team requires that files uploaded in EY environments be stored as ciphertext in S3, decryptable only by application code holding a key that lives outside AWS. The specific threat: a Capitol AI admin with AWS console/CLI access could bypass the application layer, download files with aws s3 cp, and, because they hold kms:Decrypt permission on the XKS-backed KMS key, read plaintext. EY rejected KMS key policy lockdown; they want encryption guarantees independent of AWS IAM entirely.

Capitol AI already has two security layers deployed:

  1. XKS encryption (S3 -> KMS -> XKS Proxy -> EY's Azure Key Vault) — EY controls the key, can revoke and audit
  2. Per-org IAM isolation (ENG-893) — email-domain-scoped S3 access prevents Capitol admins from accessing EY files through the application

App-level encryption would be the third layer, ensuring S3 stores only ciphertext even if the entire AWS permission model is bypassed.

Demand Evidence

EY is a paying enterprise customer with three dedicated workspaces (ey-eu-west-1, ey-ap-southeast-1, ey-us-east-2). Their InfoSec team explicitly requested this capability. Chester (Capitol CTO) confirmed the requirement on a call. KMS key policy lockdown was proposed and rejected — EY wants the ciphertext guarantee regardless of AWS permissions.

Status Quo

Today, EY files are encrypted via SSE-KMS with XKS (key material in Azure Key Vault). This is transparent encryption — any IAM principal with s3:GetObject + kms:Decrypt permissions reads plaintext. The per-org IAM isolation (ENG-893) blocks the application's default IRSA role from accessing EY prefixes, but does not block Capitol admin IAM users with direct AWS access.

Target User & Narrowest Wedge

Target: EY InfoSec team evaluating Capitol AI's data protection posture.

Narrowest wedge: Encrypt uploads in one EY workspace (e.g., ey-eu-west-1) with the envelope encryption library. Validate with EY InfoSec that downloading a raw S3 object via aws s3 cp returns ciphertext. Then expand to all three EY workspaces and all three bucket types.

Constraints

  • Three S3 buckets in scope: uploads (capitol-ai-ingestion-pipeline-*), workflow files (capai-agentic-files-*), outputs (capai-agentic-outputs-*)
  • Three services touch EY's S3 files: platform-api, qdrant-svc, agentic-backend (platform-ingestion-pipeline creates the bucket but file processing is Celery-based within these services, not Lambda)
  • Pre-signed URLs break (return ciphertext) — proxy download endpoints needed
  • Reducto.ai integration changes — qdrant-svc must decrypt in-memory before sending to Reducto
  • EY-only requirement — conditional per-org, not all customers
  • Existing XKS + IAM isolation remain as additional defense layers
  • Chester's constraint: "our ability to support and troubleshoot will be limited, so the onus will be on them to look at a lot of issues themselves"

Premises

  1. S3 objects must be ciphertext at rest, decryptable only by application code holding a key outside AWS (Azure Key Vault)
  2. Three services modified: platform-api, qdrant-svc, agentic-backend. All three S3 buckets in scope. File ingestion is Celery-based (no Lambda).
  3. Pre-signed URLs break — need proxy download endpoints (latency + memory impact for large files)
  4. Reducto.ai integration changes — qdrant-svc must decrypt in-memory and stream to Reducto
  5. EY-only requirement — conditional per-org/per-workspace
  6. Existing XKS + IAM isolation remain as additional defense layers

Cross-Model Perspective

Codex independently reviewed the problem and provided these insights:

Steelman: "An org-scoped data protection layer where EY's files are always encrypted with keys that never live in AWS, so even a privileged Capitol admin cannot decrypt raw S3 objects. It preserves existing product functionality by moving decryption into controlled application paths and streams."

Key insight: The quote "Pre-signed URLs break — they return ciphertext. Need proxy download endpoint" reveals the real build: a new data access plane (decrypting proxy/streaming service) that replaces direct S3 access for every read path. This is the core product change and cost driver.

Challenged premise: Codex questioned whether only 3 services touch EY's S3 files. Investigation confirmed a 4th: platform-ingestion-pipeline creates and manages the uploads bucket for EY workspaces.

48-hour prototype: Minimal FastAPI endpoints for upload (stream-encrypt with AES-256-GCM, wrap per-object DEK via Azure Key Vault, store wrapped key + IV in S3 metadata) and download (stream-decrypt to client). Gate by workspace/env var; one bucket only. Skip browser preview UX, key rotation, multipart upload optimization, bulk re-encryption.

Approaches Considered

Approach A: Envelope Encryption Library (Shared Python Package) — CHOSEN

Build a shared Python library (capitol-crypto) that wraps S3 read/write with AES-256-GCM envelope encryption. Each file gets a random Data Encryption Key (DEK), which is wrapped by Azure Key Vault. Wrapped DEK + IV stored in S3 object metadata. All 3 services import the library. Pre-signed URLs replaced with proxy download endpoints.

  • Effort: L (human: 4-6 weeks / CC+gstack: 3-5 days)
  • Risk: High
  • Pros: Cleanest architecture, S3 stores pure ciphertext, key never in AWS, per-object keys
  • Cons: Breaks pre-signed URLs, changes Reducto integration, key rotation requires re-encryption

Approach B: Encryption Sidecar / Proxy Service

Deploy a standalone "crypto-proxy" Kubernetes service between all services and S3. Each service points its S3 endpoint configuration at the proxy, which applies the same envelope encryption centrally.

  • Effort: XL (human: 6-8 weeks / CC+gstack: 5-7 days)
  • Risk: Very High — new single point of failure, double network hop
  • Pros: Minimal per-service code changes, centralized audit logging
  • Cons: New service to maintain, latency for every S3 op, still breaks pre-signed URLs

Approach C: Scoped KMS Key Policy + S3 Bucket Policy (Rejected by EY)

Lock down XKS KMS key policy to only allow service roles to call kms:Decrypt. Zero app code changes.

  • Effort: S (human: 2-3 days / CC+gstack: 2-4 hours)
  • Risk: Low
  • Pros: Zero app changes, pre-signed URLs work, no latency impact
  • Cons: EY already rejected — still depends on AWS IAM, doesn't satisfy "ciphertext independent of AWS" requirement

Recommended Approach

Approach A: Envelope Encryption Library

Technical Design

Encryption Scheme

WRITE PATH:
1. Generate random 256-bit DEK (Data Encryption Key)
2. Encrypt file content with AES-256-GCM using DEK
3. Call Azure Key Vault to wrap DEK with org's master key (KEK)
4. Store to S3:
   - Body: ciphertext
   - Metadata: wrapped_dek, iv, auth_tag, kek_version, encryption_version

READ PATH:
1. Read S3 object (ciphertext + metadata)
2. Call Azure Key Vault to unwrap DEK using org's KEK
3. Decrypt ciphertext with AES-256-GCM using DEK + IV + auth_tag
4. Return plaintext to caller

Envelope Encryption Rationale

  • Per-object DEKs: Each file has a unique key. Compromising one DEK only exposes one file.
  • KEK in Azure Key Vault: The Key Encryption Key (master key) never leaves Azure. EY controls it.
  • Wrapped DEK in S3 metadata: The DEK is stored encrypted alongside the object. Useless without Azure Key Vault access.
  • AES-256-GCM with chunked streaming: Standard AES-GCM requires the full ciphertext for auth tag verification. For files >50MB, use a chunked scheme: split the file into fixed-size chunks (e.g., 1MB), each encrypted with its own GCM nonce derived from the DEK + chunk index. Each chunk has its own auth tag stored in a manifest. This follows the STREAM construction pattern (see AWS S3 CSE v3 spec for reference implementation). Small files (<50MB) can use single-shot GCM. Exact chunk format details (manifest storage, nonce derivation function) will be finalized during the library build phase; an illustrative sketch follows this list.

DEK Caching

To avoid an Azure Key Vault API call on every S3 read, cache unwrapped DEKs in-memory with a short TTL (5 minutes). The cache is keyed by (bucket, s3_key, wrapped_dek_hash). Cache invalidation happens on TTL expiry or process restart. This is acceptable because the DEK is already stored (wrapped) in S3 metadata — caching the unwrapped version for a short window reduces AKV latency from ~100-500ms per read to near-zero for repeated reads of the same file.

Library Interface (capitol-crypto)

# capitol_crypto/s3.py

from typing import AsyncIterator

class EncryptedS3Client:
    """Drop-in replacement for S3 operations with envelope encryption."""

    def __init__(self, s3_client, azure_kv_client, kek_name: str):
        ...

    async def put_object(self, bucket: str, key: str, body: bytes, **kwargs) -> dict:
        """Encrypt body with fresh DEK, wrap DEK via Azure KV, upload to S3."""
        ...

    async def get_object(self, bucket: str, key: str, **kwargs) -> bytes:
        """Download from S3, unwrap DEK via Azure KV, decrypt body."""
        ...

    async def get_object_stream(self, bucket: str, key: str) -> AsyncIterator[bytes]:
        """Streaming decrypt for large files (avoids loading entire file in memory)."""
        ...

    async def generate_download_url(self, bucket: str, key: str, expires_in: int) -> str:
        """Generate a signed proxy URL (NOT S3 pre-signed URL).

        Returns a URL like: /api/v1/files/download?token=<JWT>
        The JWT contains: bucket, key, exp (expiry), iat (issued at),
        signed with a service-level HMAC secret from SSM Parameter Store.
        The download endpoint validates the JWT signature and expiry
        before streaming decrypted content.
        """
        ...

Encryption Decision Flow

For every S3 operation, the service resolves whether to use the encrypted client:

S3 operation requested
  │
  ├─ Resolve org_id from request context
  │   (platform-api: from auth token → user → org membership)
  │   (agentic-backend: from workflow record → org_id)
  │   (qdrant-svc: from collection record → org_id)
  │   (ingestion-pipeline: from S3 event key prefix → org_id)
  │
  ├─ Look up org record in DynamoDB table:
  │   <workspace>-<client>-organizations_v1
  │
  ├─ Check org.encryption_enabled
  │   ├─ false (or absent) → Use standard S3 client (no change)
  │   └─ true → Use EncryptedS3Client with org.encryption_kek_name
  │             and org.encryption_kek_vault_url
  │
  └─ Proceed with S3 operation

Per-Service Changes

platform-api (uploads/downloads):

  • Replace s3_client.put_object() with encrypted_s3.put_object() for EY orgs
  • Replace pre-signed URL generation with proxy download endpoint
  • New endpoint: GET /api/v1/files/{file_id}/download — streams decrypted content
  • Conditional: check org.encryption_enabled flag before using encrypted client

qdrant-svc (parsing pipeline):

  • Before sending to Reducto: decrypt S3 object in-memory
  • Stream decrypted bytes to Reducto API (HTTP upload, not S3 path)
  • May need to switch Reducto integration from "S3 path" mode to "file upload" mode

agentic-backend (workflow files + outputs):

  • Replace S3 read/write with encrypted client for EY orgs
  • Replace pre-signed URLs for file preview with proxy download
  • Worker pods need Azure Key Vault credentials
  • Also calls Reducto.ai for document parsing — needs same decrypt-before-upload treatment as qdrant-svc

platform-ingestion-pipeline — NO CHANGES NEEDED:

  • File ingestion is now 100% Celery-based (runs within the existing EKS services), not Lambda
  • The platform-ingestion-pipeline Terraform module creates the S3 bucket but does not process files
  • All file processing happens in platform-api and qdrant-svc Celery workers, which are already covered above

Infrastructure Changes (Terraform)

  • Azure Key Vault: New key per EY org (or reuse XKS wrapping key)
  • IAM: Service roles need permission to call Azure Key Vault API (network path)
  • Kubernetes: Azure KV credentials as Kubernetes secrets or via Workload Identity
  • DynamoDB: New field encryption_enabled on organization record
  • No Lambda changes needed (ingestion is Celery-based)

Data Model

New fields on the organizations DynamoDB table (DynamoDB is schemaless for non-key attributes, so no Terraform table definition changes needed — fields are added at the application layer):

| Field | Type | Description |
| --- | --- | --- |
| encryption_enabled | Boolean | Enable app-level encryption for this org |
| encryption_kek_name | String | Azure Key Vault key name for this org's KEK |
| encryption_kek_vault_url | String | Azure Key Vault URL |

Migration Strategy

New files: Encrypted on write. No migration needed.

Existing files: Must be re-encrypted in a batch job:

  1. Read each object (decrypted via XKS transparently)
  2. Re-encrypt with envelope encryption
  3. Overwrite in S3 with ciphertext + metadata
  4. Record migrated objects in a checkpoint DynamoDB table for idempotency

Estimated scope: depends on EY's current data volume. Can be a one-time batch job.

Rollout Sequence (three EY workspaces, sequential):

  1. ey-eu-west-1 — first, as the narrowest wedge. Validate with EY InfoSec.
  2. ey-ap-southeast-1 — second, after EU validation confirms the approach works.
  3. ey-us-east-2 — last.

Each workspace has its own DynamoDB org table (<workspace>-ey-organizations_v1), S3 buckets, and service deployments. Migration batch job runs per-workspace. Enable encryption_enabled per-org after migration completes for that workspace.

Rollback Plan

If app-level encryption causes production issues after deployment:

  1. Immediate: Disable encryption_enabled flag on the org's DynamoDB record. Services fall back to standard S3 client (reads will fail for already-encrypted objects).
  2. Data rollback: Run reverse batch job — read encrypted objects via EncryptedS3Client, re-upload as plaintext (SSE-KMS with XKS takes over). The migration checkpoint table tracks which objects were encrypted, enabling targeted rollback.
  3. Code rollback: Revert service deployments via kubectl rollout undo. The library is a dependency — removing it requires a code revert + redeploy.

The migration batch job is designed to be idempotent: re-running it skips already-migrated objects (checked via S3 metadata encryption_version field).

Failure Modes

| Failure | Impact | Mitigation |
| --- | --- | --- |
| Azure Key Vault unreachable | All encrypted file reads/writes fail for that org | Circuit breaker with 3 retries + exponential backoff. Return HTTP 503 "Encryption service temporarily unavailable" to user. Alert via Sentry. |
| Partial write (file uploaded, DEK wrap fails) | Orphaned plaintext object in S3 | Wrap DEK BEFORE uploading ciphertext. If wrap fails, abort before S3 write. Never store plaintext. |
| EY revokes KEK in Azure Key Vault | All EY files become permanently unreadable | Expected behavior — this is EY's kill switch. Document this clearly for EY. Alert Capitol ops team. |
| DEK cache stale after key rotation | Reads succeed with old cached DEK (still valid for existing files) | KEK rotation only affects new DEKs. Old wrapped DEKs still unwrap with old KEK version if AKV retains it. Cache TTL ensures refresh within 5 minutes. |

Pre-signed URL Replacement

Current flow:

Browser → pre-signed S3 URL → S3 (returns plaintext via KMS)

New flow:

Browser → platform-api proxy endpoint → decrypt in pod → stream to browser

Impact:

  • Latency increase: ~100-500ms per file (Azure KV call + decrypt)
  • Memory: must stream, not buffer (important for files >50MB)
  • Pod resource limits may need increase for download-heavy workloads

Open Questions

  1. Azure Key Vault access from EKS: Do the service pods have network access to Azure Key Vault? The XKS proxy EC2 instance does, but EKS pods may need a new network path.
  2. Reducto.ai integration mode: Does qdrant-svc currently pass S3 paths to Reducto, or upload file bytes? If S3 paths, this is a bigger change.
  3. Lambda cold start impact: Not applicable — ingestion is Celery-based, no Lambda.
  4. Key rotation: When EY rotates their KEK in Azure Key Vault, do we re-encrypt all DEKs? Or version the KEK and handle multi-version unwrap?
  5. File size limits: What's the largest file EY uploads? Streaming encrypt/decrypt has different complexity at 10MB vs 1GB.
  6. Existing data volume: How many objects exist in EY buckets today? Determines migration batch job duration.
  7. EY acceptance criteria: Will EY InfoSec want to verify by running aws s3 cp themselves and seeing ciphertext? Need to coordinate testing.
  8. Azure Key Vault region for APAC: Which AKV region will EY use? If the vault is in Europe/US, cross-Pacific latency from ey-ap-southeast-1 could exceed 500ms per unique file read. EY may need a regional AKV instance, or the DEK cache TTL should be extended for APAC.

Success Criteria

  1. aws s3 cp on any EY org file returns ciphertext (not readable content)
  2. Downloading the same file through the Capitol AI platform returns readable content (for authorized users with matching email domain)
  3. File upload, download, preview, and Reducto parsing all work correctly for EY orgs
  4. Non-EY workspaces are completely unaffected
  5. EY InfoSec signs off on the implementation

Distribution Plan

  • capitol-crypto library: Published to Capitol AI's private CodeArtifact repository
  • Consumed by: platform-api, qdrant-svc, agentic-backend as a pip dependency
  • Infrastructure: Terraform modules for Azure KV keys, IAM policies, Kubernetes secrets

Dependencies

  • Azure Key Vault API accessible from EKS pods (network path)
  • Reducto.ai supports file upload mode (not just S3 paths)
  • EY provides/approves a KEK in their Azure Key Vault
  • No Lambda dependencies (ingestion is Celery-based in EKS)

LOE Summary

| Component | CC+gstack Estimate | Human Estimate | Notes |
| --- | --- | --- | --- |
| capitol-crypto library | 4-6 hours | 1 week | Core encrypt/decrypt + streaming + tests |
| platform-api changes | 4-6 hours | 1 week | Proxy download endpoint, upload encryption |
| qdrant-svc changes | 3-4 hours | 1 week | Decrypt before Reducto, integration mode change |
| agentic-backend changes | 3-4 hours | 3-4 days | Replace S3 ops + pre-signed URLs |
| Terraform (infra) | 2-3 hours | 2-3 days | Azure KV keys, IAM, K8s secrets, DynamoDB fields |
| Migration batch job | 2-3 hours | 2-3 days | Re-encrypt existing EY files |
| Testing & validation | 4-6 hours | 1 week | End-to-end across all 3 services |
| TOTAL | ~2-4 days | ~3-5 weeks | |

The Assignment

Before committing to building this, re-present Approach C (KMS key policy lockdown) to EY one more time with this specific framing:

"We can configure the encryption key policy so that Capitol admin IAM users cannot decrypt your files — even downloading via the AWS console returns encrypted bytes. Your Azure Key Vault logs every single decryption call. The only principals that can decrypt are the application service roles that serve your authorized users. This is enforced at the AWS KMS level, not just application code."

If EY still rejects it after hearing this framing, then proceed with Approach A. The design doc above gives you the full implementation plan and LOE.
