
@vishnu2kmohan
Last active February 26, 2026 23:31
Project Plan Options

Customer Deployment Project Plans

Context

This gist contains two project plan templates (as Mermaid Gantt charts) for deploying the product at customer sites. The models differ in who manages the deployment lifecycle:

  1. Vendor-Managed — The vendor provisions, deploys, and operates the platform; the customer approves changes
  2. Customer-Managed — Customer IT handles provisioning and operations; the vendor trains and advises

Both models deploy Staging and Production environments (each on its own GKE cluster). The customer always provides a GCP project, VPC, and on-prem connectivity as prerequisites.

Vendor access (vendor-managed model): Workload Identity Federation (WIF) — no long-lived service account keys, full Cloud Audit Logs trail, instant revocation via WIF pool deletion. Integrates natively with GitHub Actions OIDC for automated Terraform runs.
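The WIF access pattern above can be sketched in Terraform. This is a hedged illustration only: the pool and provider IDs and the GitHub org/repo are placeholders, not values from this plan.

```hcl
# Sketch of the vendor WIF setup: GitHub Actions OIDC federates into a pool,
# and pool identities impersonate a scoped deployment SA. Deleting the pool
# resource is the "instant revocation" path described above.
resource "google_iam_workload_identity_pool" "vendor" {
  workload_identity_pool_id = "vendor-deploy-pool" # hypothetical ID
  display_name              = "Vendor deployment pool"
}

resource "google_iam_workload_identity_pool_provider" "github" {
  workload_identity_pool_id          = google_iam_workload_identity_pool.vendor.workload_identity_pool_id
  workload_identity_pool_provider_id = "github-oidc"
  attribute_mapping = {
    "google.subject"       = "assertion.sub"
    "attribute.repository" = "assertion.repository"
  }
  # Trust only the vendor's deployment repo (placeholder org/repo).
  attribute_condition = "assertion.repository == \"vendor-org/deploy-repo\""
  oidc {
    issuer_uri = "https://token.actions.githubusercontent.com"
  }
}

# Grant workloadIdentityUser on the deployment SA to the pool, matching the
# Phase 0 "Grant workloadIdentityUser on both SAs to WIF pool" task.
resource "google_service_account_iam_member" "wif_user" {
  service_account_id = google_service_account.deploy_stg.name # assumed SA resource
  role               = "roles/iam.workloadIdentityUser"
  member             = "principalSet://iam.googleapis.com/${google_iam_workload_identity_pool.vendor.name}/attribute.repository/vendor-org/deploy-repo"
}
```

No long-lived keys exist in this setup; every impersonated call lands in Cloud Audit Logs under the deployment SA.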


Plan 1: Vendor-Managed Deployment (~9 Weeks)

The vendor executes all infrastructure provisioning, cluster add-on deployment, ArgoCD configuration, and application deployment. Customer responsibilities are limited to prerequisites, approvals, and UAT.

gantt
    title Vendor-Managed Deployment
    dateFormat  YYYY-MM-DD
    axisFormat  %b %d

    section Phase 0 - Prerequisites [Customer]
    Provision GCP project and set billing budget alerts      :cust_gcp, 2026-03-02, 3d
    Create VPC and grant subnet access                       :cust_vpc, after cust_gcp, 2d
    Establish on-prem connectivity via VPN                   :cust_vpn, after cust_gcp, 5d
    Configure WIF pool and OIDC provider                     :cust_wif, after cust_vpc, 3d
    Create staging deployment SA and assign scoped IAM roles  :cust_sa_stg, after cust_wif, 1d
    Create production deployment SA and assign scoped IAM roles :cust_sa_prod, after cust_sa_stg, 1d
    Grant workloadIdentityUser on both SAs to WIF pool       :cust_iam, after cust_sa_prod, 1d
    Vendor validates WIF and SA impersonation for both SAs   :cust_wifval, after cust_iam, 1d
    Provision bastion host or validate VPN kubectl access path :cust_access, after cust_vpn, 2d
    Prerequisites complete                                   :milestone, prereq_done, after cust_wifval cust_access, 0d

    section Phase 1 - Planning and Design [Joint]
    Kickoff and requirements gathering                       :kickoff, after prereq_done, 2d
    Architecture review - regional GKE, HA, SLOs, RTO-RPO   :arch, after kickoff, 2d
    Network design - VPC, subnets, NAT, firewall, peering   :netdesign, after kickoff, 2d
    Environment config - terraform.tfvars for stg and prod   :envconfig, after arch, 1d
    Customer sign-off on design                              :milestone, design_approved, after envconfig netdesign, 0d

    section Phase 2 - Staging Infrastructure [Vendor]
    Create Terraform state bucket for staging                :stg_state, after design_approved, 1d
    Create KMS key and grant service agent roles             :stg_kms, after stg_state, 1d
    Terraform - GCP APIs, VPC peering, Cloud NAT             :stg_net, after stg_kms, 1d
    Terraform - regional GKE cluster with monitoring_config and database_encryption :stg_gke, after stg_net, 2d
    Terraform - Cloud SQL PostgreSQL 15 private IP           :stg_sql, after stg_net, 1d
    Terraform - GCS buckets                                  :stg_gcs, after stg_net, 1d
    Terraform - Workload Identity SAs and K8s bindings       :stg_wi, after stg_gke, 1d
    Validate GMP metrics collection and Secrets encryption   :stg_gmpval, after stg_gke, 1d
    Cloud SQL post-deploy - extensions, users, secrets       :stg_sqlpost, after stg_sql stg_gke, 1d
    Security hardening - PSS, NetworkPolicies, Shielded Nodes :stg_security, after stg_gke, 1d
    IaC validation - terraform validate, tflint, tfsec       :stg_iactest, after stg_security, 1d
    IaC gate - zero critical tfsec findings                  :milestone, stg_iac_gate, after stg_iactest, 0d

    section Phase 3 - Staging Add-ons [Vendor]
    Deploy nginx-ingress and Cloud Armor WAF policy          :stg_nginx, after stg_wi, 1d
    Deploy cert-manager v1.14.0 and ClusterIssuers           :stg_cert, after stg_nginx, 1d
    Deploy external-secrets v0.9.11                          :stg_es, after stg_wi, 1d
    Deploy external-dns v1.14.0                              :stg_edns, after stg_cert, 1d
    Deploy ArgoCD v5.51.6                                    :stg_argo, after stg_cert stg_nginx, 1d
    Deploy Image Updater with Git write-back and token CronJob :stg_imgupd, after stg_argo, 1d
    Enable GKE Managed Prometheus and Cloud Monitoring        :stg_monitor, after stg_argo, 1d
    Configure Cloud Logging sinks, retention, and alerts     :stg_audit, after stg_monitor, 1d
    Deploy Binary Authorization or OPA Gatekeeper policies   :stg_binauth, after stg_argo, 1d
    Create K8s namespaces and service accounts               :stg_ns, after stg_imgupd, 1d
    DNS delegation and NS record verification                :stg_dns, after stg_edns, 1d

    section Phase 4 - Staging App Deployment [Vendor]
    Configure ArgoCD repo access for product                  :stg_repo, after stg_imgupd, 1d
    Apply ArgoCD Project and Application with Image Updater annotations :stg_app, after stg_repo stg_ns, 1d
    Initial sync and health validation                       :stg_sync, after stg_app, 1d
    Keycloak config - realm, clients, IdP, RBAC, MFA        :stg_kc, after stg_sync, 2d
    ArgoCD RBAC hardening and admin account controls         :stg_argohard, after stg_sync, 1d
    Populate GCP Secret Manager secrets with CMEK            :stg_secrets, after stg_sync, 1d
    Secrets audit - no secrets in Git, least-privilege       :stg_secaudit, after stg_secrets, 1d
    SSL certificate validation                               :stg_ssl, after stg_dns stg_sync, 1d
    Image scanning and chart provenance verification         :stg_imgscan, after stg_sync, 1d
    End-to-end smoke test all services                       :stg_smoke, after stg_kc stg_ssl stg_secaudit stg_imgscan stg_argohard, 2d
    Smoke test gate - all health checks pass                 :milestone, stg_smoke_gate, after stg_smoke, 0d
    Load testing against defined SLOs and capacity targets   :stg_loadtest, after stg_smoke_gate, 2d
    Load test gate - meets p95 latency and error thresholds  :milestone, stg_load_gate, after stg_loadtest, 0d
    Backup restore and Cloud SQL failover drill              :stg_backup, after stg_smoke_gate, 1d
    Staging deployment complete                              :milestone, stg_done, after stg_backup stg_load_gate, 0d

    section Phase 5 - Staging UAT [Customer and Vendor]
    Customer UAT on staging                                  :stg_uat, after stg_done, 5d
    Bug fixes and configuration adjustments                  :stg_fix, after stg_uat, 2d
    Customer staging sign-off                                :milestone, stg_signoff, after stg_fix, 0d

    section Phase 6 - Production Infrastructure [Vendor]
    Create Terraform state bucket for prod                   :prod_state, after stg_signoff, 1d
    Create KMS key for prod and grant service agent roles    :prod_kms, after prod_state, 1d
    Terraform - GCP APIs, VPC peering, Cloud NAT             :prod_net, after prod_kms, 1d
    Terraform - regional GKE cluster with monitoring_config and database_encryption :prod_gke, after prod_net, 2d
    Terraform - Cloud SQL PostgreSQL 15 Regional HA          :prod_sql, after prod_net, 1d
    Terraform - GCS buckets and Workload Identity            :prod_misc, after prod_gke, 1d
    Validate GMP metrics collection and Secrets encryption   :prod_gmpval, after prod_gke, 1d
    Cloud SQL post-deploy                                    :prod_sqlpost, after prod_sql prod_gke, 1d
    Security hardening - PSS, NetworkPolicies, Shielded Nodes :prod_security, after prod_gke, 1d
    IaC validation - terraform validate, tflint, tfsec       :prod_iactest, after prod_security, 1d
    IaC gate - zero critical tfsec findings                  :milestone, prod_iac_gate, after prod_iactest, 0d

    section Phase 7 - Production Add-ons and Apps [Vendor]
    Deploy nginx-ingress with Cloud Armor WAF policy          :prod_addons1, after prod_misc, 2d
    Deploy cert-manager, external-secrets, external-dns      :prod_addons1b, after prod_addons1, 1d
    Deploy ArgoCD                                            :prod_addons2, after prod_addons1b, 1d
    Deploy Image Updater with Git write-back and token CronJob :prod_imgupd, after prod_addons2, 1d
    Enable GKE Managed Prometheus and Cloud Monitoring        :prod_monitor_deploy, after prod_addons2, 1d
    Configure Cloud Logging sinks, retention, and alerts     :prod_audit, after prod_monitor_deploy, 1d
    Deploy Binary Authorization or OPA Gatekeeper policies   :prod_binauth, after prod_addons2, 1d
    Create K8s namespaces and service accounts               :prod_ns, after prod_imgupd, 1d
    DNS delegation for prod                                  :prod_dns, after prod_imgupd, 1d
    ArgoCD repo, project, app with Image Updater annotations :prod_argo, after prod_ns, 1d
    Initial sync and health validation                       :prod_sync, after prod_argo, 1d
    Secrets setup with CMEK and rotation                     :prod_secrets, after prod_sync, 1d
    Image scanning and chart provenance verification         :prod_imgscan, after prod_sync, 1d
    Keycloak config - realm, clients, IdP, RBAC, MFA        :prod_kc, after prod_secrets prod_dns, 2d
    ArgoCD RBAC hardening and admin account controls         :prod_argohard, after prod_sync, 1d
    End-to-end smoke test                                    :prod_smoke, after prod_kc prod_argohard, 2d
    Smoke test gate - all health checks pass                 :milestone, prod_smoke_gate, after prod_smoke, 0d
    Backup restore and Cloud SQL HA failover drill           :prod_backup, after prod_smoke_gate, 1d
    Load testing against defined SLOs and capacity targets   :prod_loadtest, after prod_smoke_gate, 2d
    Load test gate - meets p95 latency and error thresholds  :milestone, prod_load_gate, after prod_loadtest, 0d
    Production deployment complete                           :milestone, prod_done, after prod_backup prod_load_gate, 0d

    section Phase 8 - Go-Live [Joint]
    Customer UAT on production                               :prod_uat, after prod_done, 3d
    SLO dashboards and alert routing validation              :prod_monitor, after prod_done, 2d
    Incident runbooks and escalation procedures              :prod_runbooks, after prod_done, 3d
    Security assessment and vulnerability scan               :prod_secassess, after prod_done, 2d
    Security gate - no critical or high vulnerabilities      :milestone, prod_sec_gate, after prod_secassess, 0d
    Rollback drill - ArgoCD, Terraform state, DB migration   :prod_rollback, after prod_done, 2d
    Cross-region DR failover drill (if DR selected)          :prod_dr_drill, after prod_rollback, 2d
    GitOps repository failover test (if DR selected)         :prod_git_dr, after prod_dr_drill, 1d
    Go-live approval                                         :milestone, golive, after prod_uat prod_sec_gate prod_monitor prod_runbooks prod_git_dr, 0d
    Go-live cutover - DNS and final alerting                 :cutover, after golive, 1d
    Go-live complete                                         :milestone, live, after cutover, 0d

    section Phase 9 - Post Go-Live [Vendor]
    Hypercare - vendor-monitored, on-call                    :hypercare, after live, 10d
    Handover to steady-state support                         :milestone, steady, after hypercare, 0d

    section Ongoing - App Image Updates [Staging Auto, Prod Gated]
    New image pushed to container registry                   :upd_img_push, after steady, 1d
    Image Updater detects new tag in staging                 :upd_img_detect, after upd_img_push, 0d
    Image Updater writes tag back to Git (staging branch)    :upd_img_write, after upd_img_detect, 0d
    ArgoCD syncs new image to staging cluster                :upd_img_sync_stg, after upd_img_write, 0d
    Vendor validates staging deployment health               :upd_img_validate_stg, after upd_img_sync_stg, 1d
    Vendor opens promotion PR for production                 :upd_img_promote_pr, after upd_img_validate_stg, 1d
    Customer approves production promotion PR                :upd_img_approve, after upd_img_promote_pr, 1d
    ArgoCD syncs approved image to production                :upd_img_sync_prod, after upd_img_approve, 0d
    Vendor validates production deployment health            :upd_img_validate_prod, after upd_img_sync_prod, 1d
    Image update complete                                    :milestone, upd_img_done, after upd_img_validate_prod, 0d

    section Ongoing - Infra and Config Changes [Vendor-Managed]
    Vendor proposes change via protected branch PR           :upd_propose, after upd_img_done, 2d
    Image scanning and chart provenance verification         :upd_scan, after upd_propose, 1d
    Customer reviews and approves with required reviewers    :upd_approve, after upd_scan, 2d
    Vendor merges within ArgoCD sync window                  :upd_deploy, after upd_approve, 1d
    Vendor validates deployment health                       :upd_validate, after upd_deploy, 1d
    Update cycle complete                                    :milestone, upd_done, after upd_validate, 0d
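The recurring "regional GKE cluster with monitoring_config and database_encryption" Terraform tasks above can be sketched as follows; the cluster name, region, CIDR, and network references are illustrative assumptions, not values from the plan.

```hcl
resource "google_container_cluster" "staging" {
  name       = "staging-gke" # hypothetical name
  location   = "us-central1" # a region, so the control plane and nodes span zones
  network    = var.vpc_id    # assumed variables from terraform.tfvars
  subnetwork = var.subnet_id

  remove_default_node_pool = true
  initial_node_count       = 1

  # GKE Managed Prometheus, checked by the "Validate GMP metrics collection" task.
  monitoring_config {
    enable_components = ["SYSTEM_COMPONENTS"]
    managed_prometheus {
      enabled = true
    }
  }

  # Application-layer Secrets encryption with a customer-managed key (CMEK).
  database_encryption {
    state    = "ENCRYPTED"
    key_name = google_kms_crypto_key.gke.id # assumed KMS key resource
  }

  # Workload Identity pool, used by the SA/K8s binding tasks.
  workload_identity_config {
    workload_pool = "${var.project_id}.svc.id.goog"
  }

  # Private cluster; the enable_private_endpoint decision is an architecture
  # review outcome per the private endpoint policy in the key characteristics.
  private_cluster_config {
    enable_private_nodes    = true
    enable_private_endpoint = false
    master_ipv4_cidr_block  = "172.16.0.0/28" # illustrative CIDR
  }
}
```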

Vendor-Managed: Key Characteristics

  • Vendor owns execution of all Terraform and application deployment tasks
  • Two-layer deployment: Terraform deploys cluster add-ons via helm_release resources (nginx, cert-manager, external-secrets, external-dns, ArgoCD); ArgoCD then manages only the application workloads via GitOps
  • Customer touchpoints: prerequisites (incl. billing budget alerts), design sign-off, UAT on staging and production, go-live approval
  • GKE topology: regional cluster, multi-zone node pools, HPA/VPA autoscaling, PodDisruptionBudgets
  • Cloud SQL: PostgreSQL 15, private IP via VPC peering, Regional HA for production, failover drill before go-live
  • Security hardening: Cloud Armor WAF with pre-configured OWASP Top 10 rulesets (SQL injection, XSS, RCE, LFI) and rate-limiting policies applied to the nginx-ingress backend service; Pod Security Standards (Restricted profile); NetworkPolicies (default-deny with explicit allow rules); Keycloak RBAC with MFA; ArgoCD admin controls; image admission policies; GKE Application-layer Secrets Encryption with CMEK
  • Observability (all GCP-native): GKE Managed Prometheus for metrics collection (PromQL-compatible, stored in Cloud Monitoring); Cloud Monitoring for dashboards, SLOs, and alerting; Cloud Logging for centralized log aggregation (enabled by default on GKE); Cloud Audit Logs for security and compliance — no self-managed observability stack to operate
  • App image updates: staging is automatic via ArgoCD Image Updater with Git write-back for auditability; production requires a promotion PR approved by the customer before ArgoCD syncs — new images are validated in staging first, then promoted via a gated workflow
  • Infra and config changes: vendor submits PR → IaC validation (tflint, tfsec) → image scan + provenance check → customer reviews diff → vendor merges within ArgoCD sync window
  • Rollback: validated drill covering ArgoCD rollback, Terraform state recovery, and DB migration backout using Flyway/Liquibase with backward-compatible migrations
  • WIF access: tightly scoped custom IAM roles via SA impersonation, full audit trail with log sinks and security alerting, no long-lived credentials
  • Supply chain security: Binary Authorization / OPA Gatekeeper enforced as mandatory pre-production milestone with exception management
  • GKE clusters: private clusters with master authorized networks, Shielded Nodes (requires shielded_instance_config { enable_secure_boot = true, enable_integrity_monitoring = true } in the GKE node pool Terraform config)
  • Private endpoint policy: The architecture review must decide whether to fully disable the public control plane endpoint (enable_private_endpoint = true) or restrict access via master authorized networks only. The existing Terraform module defaults to enable_private_endpoint = false; update to true if the customer's security policy requires it and a validated bastion/VPN access path is confirmed.
  • Maintenance lifecycle: defined GKE upgrade cadence, add-on versioning, and patching policy for post-go-live
  • Deliverables: architecture diagram, runbooks, access matrix, SLO definitions, maintenance schedule
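The Shielded Nodes requirement spelled out above maps directly onto the node pool resource. A sketch, where the pool name, cluster reference, machine type, and autoscaling bounds are illustrative:

```hcl
resource "google_container_node_pool" "primary" {
  name     = "primary-pool" # hypothetical
  cluster  = "staging-gke"  # hypothetical cluster name
  location = "us-central1"

  node_config {
    machine_type = "e2-standard-4" # illustrative size

    # Shielded Nodes, exactly as required by the hardening tasks.
    shielded_instance_config {
      enable_secure_boot          = true
      enable_integrity_monitoring = true
    }
  }

  # Multi-zone regional pool with autoscaling (counts are per zone).
  autoscaling {
    min_node_count = 1
    max_node_count = 3
  }
}
```

The `tfsec`/`tflint` IaC gates in Phases 2 and 6 would flag a pool missing this block before it reaches the cluster.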

Plan 2: Customer-Managed Deployment (~13 Weeks)

Customer IT performs all provisioning and operations. The vendor provides training, documentation, review checkpoints, and validation at key milestones.

gantt
    title Customer-Managed Deployment
    dateFormat  YYYY-MM-DD
    axisFormat  %b %d

    section Phase 0 - Prerequisites [Customer]
    Provision GCP project and set billing budget alerts      :cust_gcp, 2026-03-02, 3d
    Create VPC and subnets                                   :cust_vpc, after cust_gcp, 2d
    Establish on-prem connectivity via VPN                   :cust_vpn, after cust_gcp, 5d
    Set up IAM roles for customer IT team                    :cust_iam, after cust_vpc, 2d
    Provision bastion host or validate VPN kubectl access path :cust_access, after cust_vpn, 2d
    Prerequisites complete                                   :milestone, prereq_done, after cust_iam cust_access, 0d

    section Phase 1 - Training GCP and Terraform [Vendor-led]
    GCP fundamentals - IAM, VPC, GKE                        :train_gcp, after prereq_done, 2d
    Terraform modules walkthrough                            :train_tf, after train_gcp, 2d
    Terraform state management and best practices            :train_state, after train_tf, 1d
    Hands-on lab - deploy dev environment guided             :train_lab1, after train_state, 2d
    Training checkpoint                                      :milestone, train1_done, after train_lab1, 0d

    section Phase 2 - Training Kubernetes and Add-ons [Vendor-led]
    GKE cluster operations and node management               :train_gke, after train1_done, 1d
    Helm add-ons walkthrough                                 :train_helm, after train_gke, 2d
    Workload Identity and service account bindings           :train_wi, after train_helm, 1d
    Cloud SQL operations and post-deployment steps           :train_sql, after train_gke, 1d
    Training checkpoint                                      :milestone, train2_done, after train_wi train_sql, 0d

    section Phase 3 - Training ArgoCD and GitOps [Vendor-led]
    ArgoCD concepts, architecture, RBAC                      :train_argo, after train2_done, 1d
    ArgoCD applications, projects, sync policies             :train_argoapp, after train_argo, 1d
    GitOps workflow - CICD and ArgoCD integration            :train_gitops, after train_argoapp, 1d
    Monitoring, alerting, troubleshooting runbooks           :train_ops, after train_gitops, 1d
    Hands-on lab - full stack deployment guided              :train_lab2, after train_ops, 2d
    All training complete                                    :milestone, train_done, after train_lab2, 0d

    section Phase 4 - Planning and Design [Joint]
    Architecture review - regional GKE, HA, SLOs, RTO-RPO   :arch, after train_done, 2d
    Network design - VPC, subnets, NAT, firewall, peering   :netdesign, after arch, 1d
    Customer prepares terraform.tfvars                       :envconfig, after arch, 2d
    Vendor reviews customer config                           :vendorreview, after envconfig, 1d
    Select vendor validation model (access, evidence, or screen-share) :valmodel, after vendorreview, 1d
    Design sign-off                                          :milestone, design_approved, after valmodel netdesign, 0d

    section Phase 5 - Staging Infrastructure [Customer IT]
    Create Terraform state bucket for staging                :stg_state, after design_approved, 1d
    Create KMS key and grant service agent roles             :stg_kms, after stg_state, 1d
    Terraform plan review with vendor                        :stg_planrev, after stg_kms, 1d
    Terraform - GCP APIs, VPC peering, Cloud NAT             :stg_net, after stg_planrev, 2d
    Terraform - regional GKE cluster with monitoring_config and database_encryption :stg_gke, after stg_net, 2d
    Terraform - Cloud SQL PostgreSQL 15 private IP           :stg_sql, after stg_net, 2d
    Terraform - GCS buckets                                  :stg_gcs, after stg_net, 1d
    Terraform - Workload Identity                            :stg_wi, after stg_gke, 1d
    Validate GMP metrics collection and Secrets encryption   :stg_gmpval, after stg_gke, 1d
    Cloud SQL post-deploy - extensions and users             :stg_sqlpost, after stg_sql stg_gke, 1d
    Security hardening - PSS, NetworkPolicies, Shielded Nodes :stg_security, after stg_gke, 1d
    IaC validation - terraform validate, tflint, tfsec       :stg_iactest, after stg_security, 1d
    IaC gate - zero critical tfsec findings                  :milestone, stg_iac_gate, after stg_iactest, 0d
    Vendor validates staging infra                           :stg_validate, after stg_wi stg_sqlpost stg_gcs stg_iac_gate stg_gmpval, 1d

    section Phase 6 - Staging Add-ons and Apps [Customer IT]
    Deploy nginx-ingress with Cloud Armor WAF policy          :stg_addons1, after stg_validate, 2d
    Deploy external-secrets and external-dns                 :stg_addons2, after stg_addons1, 1d
    Deploy ArgoCD                                            :stg_argo, after stg_addons1, 1d
    Deploy Image Updater with Git write-back and token CronJob :stg_imgupd, after stg_argo, 1d
    Enable GKE Managed Prometheus and Cloud Monitoring        :stg_monitor, after stg_argo, 1d
    Configure Cloud Logging sinks, retention, and alerts     :stg_audit, after stg_monitor, 1d
    Deploy Binary Authorization or OPA Gatekeeper policies   :stg_binauth, after stg_argo, 1d
    Create K8s namespaces and service accounts               :stg_ns, after stg_imgupd, 1d
    DNS delegation and NS record verification                :stg_dns, after stg_addons2, 2d
    Configure ArgoCD - repo, project, app with Image Updater annotations :stg_argoconf, after stg_imgupd stg_ns, 2d
    Initial sync and health validation                       :stg_sync, after stg_argoconf, 1d
    Secrets setup with CMEK and rotation                     :stg_secrets, after stg_sync, 1d
    Image scanning and chart provenance verification         :stg_imgscan, after stg_sync, 1d
    Keycloak config - realm, clients, IdP, RBAC, MFA        :stg_kc, after stg_secrets, 2d
    ArgoCD RBAC hardening and admin account controls         :stg_argohard, after stg_sync, 1d
    Secrets audit - no secrets in Git, least-privilege       :stg_secaudit, after stg_secrets, 1d
    SSL certificate validation                               :stg_ssl, after stg_dns stg_sync, 1d
    End-to-end smoke test                                    :stg_smoke, after stg_kc stg_ssl stg_secaudit stg_imgscan stg_argohard, 2d
    Smoke test gate - all health checks pass                 :milestone, stg_smoke_gate, after stg_smoke, 0d
    Load testing against defined SLOs and capacity targets   :stg_loadtest, after stg_smoke_gate, 2d
    Load test gate - meets p95 latency and error thresholds  :milestone, stg_load_gate, after stg_loadtest, 0d
    Backup restore and Cloud SQL failover drill              :stg_backup, after stg_smoke_gate, 1d
    Vendor validates staging deployment                      :stg_vendorval, after stg_backup stg_load_gate, 1d
    Staging deployment complete                              :milestone, stg_done, after stg_vendorval, 0d

    section Phase 7 - Staging UAT [Customer]
    Customer UAT on staging                                  :stg_uat, after stg_done, 5d
    Bug fixes and config adjustments                         :stg_fix, after stg_uat, 3d
    Customer staging sign-off                                :milestone, stg_signoff, after stg_fix, 0d

    section Phase 8 - Production Infrastructure [Customer IT]
    Create Terraform state bucket for prod                   :prod_state, after stg_signoff, 1d
    Create KMS key for prod and grant service agent roles    :prod_kms, after prod_state, 1d
    Terraform - GCP APIs, VPC peering, Cloud NAT             :prod_net, after prod_kms, 1d
    Terraform - regional GKE cluster with monitoring_config and database_encryption :prod_gke, after prod_net, 2d
    Terraform - Cloud SQL Regional HA and GCS buckets        :prod_sql, after prod_net, 2d
    Terraform - Workload Identity                            :prod_wi, after prod_gke, 1d
    Validate GMP metrics collection and Secrets encryption   :prod_gmpval, after prod_gke, 1d
    Cloud SQL post-deploy                                    :prod_sqlpost, after prod_sql prod_gke, 1d
    Security hardening - PSS, NetworkPolicies, Shielded Nodes :prod_security, after prod_gke, 1d
    IaC validation - terraform validate, tflint, tfsec       :prod_iactest, after prod_security, 1d
    IaC gate - zero critical tfsec findings                  :milestone, prod_iac_gate, after prod_iactest, 0d

    section Phase 9 - Production Add-ons and Apps [Customer IT]
    Deploy nginx-ingress with Cloud Armor WAF, cert-manager   :prod_addons, after prod_wi, 2d
    Deploy ArgoCD, external-secrets, external-dns            :prod_addons2, after prod_addons, 1d
    Deploy Image Updater with Git write-back and token CronJob :prod_imgupd, after prod_addons2, 1d
    Enable GKE Managed Prometheus and Cloud Monitoring        :prod_monitor_deploy, after prod_addons2, 1d
    Configure Cloud Logging sinks, retention, and alerts     :prod_audit, after prod_monitor_deploy, 1d
    Deploy Binary Authorization or OPA Gatekeeper policies   :prod_binauth, after prod_addons2, 1d
    ArgoCD config - repo, project, app with Image Updater annotations :prod_argo, after prod_imgupd, 2d
    Initial sync and health validation                       :prod_sync, after prod_argo, 1d
    Secrets setup with CMEK and rotation                     :prod_secrets, after prod_sync, 1d
    Image scanning and chart provenance verification         :prod_imgscan, after prod_sync, 1d
    Keycloak config - realm, clients, IdP, RBAC, MFA        :prod_kc, after prod_secrets, 2d
    ArgoCD RBAC hardening and admin account controls         :prod_argohard, after prod_sync, 1d
    SSL certificate and DNS validation                       :prod_ssl, after prod_addons prod_sync, 2d
    End-to-end smoke test                                    :prod_smoke, after prod_kc prod_ssl prod_argohard, 2d
    Smoke test gate - all health checks pass                 :milestone, prod_smoke_gate, after prod_smoke, 0d
    Backup restore and Cloud SQL HA failover drill           :prod_backup, after prod_smoke_gate, 1d
    Load testing against defined SLOs and capacity targets   :prod_loadtest, after prod_smoke_gate, 2d
    Load test gate - meets p95 latency and error thresholds  :milestone, prod_load_gate, after prod_loadtest, 0d
    Vendor validates production deployment                   :prod_vendorval, after prod_backup prod_load_gate, 1d
    Production deployment complete                           :milestone, prod_done, after prod_vendorval, 0d

    section Phase 10 - Go-Live [Customer]
    Customer UAT on production                               :prod_uat, after prod_done, 3d
    SLO dashboards and alert routing validation              :prod_monitor, after prod_done, 2d
    Incident runbooks and escalation procedures              :prod_runbooks, after prod_done, 3d
    Security assessment and vulnerability scan               :prod_secassess, after prod_done, 2d
    Security gate - no critical or high vulnerabilities      :milestone, prod_sec_gate, after prod_secassess, 0d
    Rollback drill - ArgoCD, Terraform state, DB migration   :prod_rollback, after prod_done, 2d
    Cross-region DR failover drill (if DR selected)          :prod_dr_drill, after prod_rollback, 2d
    GitOps repository failover test (if DR selected)         :prod_git_dr, after prod_dr_drill, 1d
    Go-live approval                                         :milestone, golive, after prod_uat prod_sec_gate prod_monitor prod_runbooks prod_git_dr, 0d
    Go-live cutover - DNS and final alerting                 :cutover, after golive, 1d
    Go-live complete                                         :milestone, live, after cutover, 0d

    section Phase 11 - Post Go-Live
    Hypercare - vendor tier-2 and tier-3, customer monitors  :hypercare, after live, 10d
    Incident response drills                                 :postgo_drills, after live, 2d
    Knowledge transfer - advanced runbooks                   :kt_runbooks, after live, 3d
    Finalize operational readiness deliverables              :kt_deliverables, after live, 5d
    On-call rotation setup and escalation procedures         :kt_oncall, after kt_runbooks, 2d
    Handover to customer-managed steady-state                :milestone, steady, after hypercare kt_runbooks kt_deliverables kt_oncall, 0d

    section Ongoing - App Image Updates [Automatic]
    New image pushed to container registry                   :upd_img_push, after steady, 1d
    Image Updater detects tag, writes back to Git            :upd_img_stg, after upd_img_push, 0d
    ArgoCD syncs new image to staging                        :upd_img_stgsync, after upd_img_stg, 0d
    Customer validates staging                               :upd_img_stgval, after upd_img_stgsync, 1d
    Customer promotes to production via sync window          :upd_img_prod, after upd_img_stgval, 1d
    Image update complete                                    :milestone, upd_img_done, after upd_img_prod, 0d

    section Ongoing - Infra and Config Changes [Customer-Managed]
    Vendor publishes release notes and Terraform changes     :upd_release, after upd_img_done, 1d
    Image scanning and chart provenance verification         :upd_scan, after upd_release, 1d
    Customer IT evaluates changes                            :upd_eval, after upd_scan, 3d
    Customer applies Terraform to staging                    :upd_stg, after upd_eval, 2d
    Customer validates staging                               :upd_stgval, after upd_stg, 2d
    Customer applies to production via sync window           :upd_prod, after upd_stgval, 1d
    Customer validates production                            :upd_prodval, after upd_prod, 1d
    Update cycle complete                                    :milestone, upd_done, after upd_prodval, 0d
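The "Deploy ArgoCD" add-on tasks in both plans run through Terraform's helm_release, per the two-layer model in the key characteristics. A hedged sketch: only the chart version comes from the plan; the namespace and values override are assumptions.

```hcl
# Terraform-managed ArgoCD install (layer one); ArgoCD then manages only the
# application workloads via GitOps (layer two).
resource "helm_release" "argocd" {
  name             = "argocd"
  repository       = "https://argoproj.github.io/argo-helm"
  chart            = "argo-cd"
  version          = "5.51.6" # "ArgoCD v5.51.6" in the plan is this chart version
  namespace        = "argocd"
  create_namespace = true

  # Illustrative override: run the API server without its own TLS, assuming
  # TLS terminates at nginx-ingress in front of ArgoCD.
  set {
    name  = "configs.params.server\\.insecure"
    value = "true"
  }
}
```

The RBAC hardening and admin-account control tasks would land as further values overrides on this same release.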

Customer-Managed: Key Characteristics

  • ~3 weeks of training before execution begins (GCP, Terraform, GKE, Helm, ArgoCD, GitOps)
  • Customer IT owns execution with vendor review checkpoints at each milestone
  • Two-layer deployment: Terraform deploys cluster add-ons via helm_release resources (nginx, cert-manager, external-secrets, external-dns, ArgoCD); ArgoCD then manages only the application workloads via GitOps
  • GKE topology: regional cluster, multi-zone node pools, HPA/VPA autoscaling, PodDisruptionBudgets
  • Cloud SQL: PostgreSQL 15, private IP via VPC peering, Regional HA for production, failover drill before go-live
  • Security hardening: Cloud Armor WAF with pre-configured OWASP Top 10 rulesets (SQL injection, XSS, RCE, LFI) and rate-limiting policies; Pod Security Standards (Restricted profile); NetworkPolicies (default-deny with explicit allow rules); Keycloak RBAC with MFA; ArgoCD admin controls; image admission policies; GKE Application-layer Secrets Encryption with CMEK
  • Observability (all GCP-native): GKE Managed Prometheus for metrics collection (PromQL-compatible, stored in Cloud Monitoring); Cloud Monitoring for dashboards, SLOs, and alerting; Cloud Logging for centralized log aggregation (enabled by default on GKE); Cloud Audit Logs for security and compliance — no self-managed observability stack to operate
  • Vendor validates staging infra, staging deployment, and production deployment before progression (via defined validation model -- see Vendor Validation Model section)
  • App image updates: automatic via ArgoCD Image Updater with Git write-back — new images detected, tag committed to Git via least-privilege token for auditability; customer validates staging then promotes to production
  • Infra and config changes: vendor publishes release notes → IaC validation (tflint, tfsec) → customer evaluates, tests on staging, promotes to production within sync window
  • Rollback: validated drill covering ArgoCD rollback, Terraform state recovery, and DB migration backout using Flyway/Liquibase with backward-compatible migrations
  • No vendor access to customer GCP — customer uses their own IAM credentials; vendor validation occurs via defined model (time-bound access, evidence, or screen-share)
  • Supply chain security: Binary Authorization / OPA Gatekeeper enforced as mandatory pre-production milestone
  • GKE clusters: private clusters with master authorized networks, Shielded Nodes
  • Private endpoint policy: Same as vendor-managed -- architecture review decides enable_private_endpoint setting based on customer security requirements and confirmed bastion/VPN access path.
  • Maintenance lifecycle: defined GKE upgrade cadence, add-on versioning, and patching policy for post-go-live
  • Deliverables: architecture diagram, runbooks (incident response, routine operations, troubleshooting), access matrix, SLO definitions with dashboard templates, incident response and escalation procedures, maintenance schedule, DR procedures, on-call rotation template

Comparison

| Dimension | Vendor-Managed | Customer-Managed |
|---|---|---|
| Go-live timeline | ~9 weeks | ~13 weeks |
| Training overhead | None | ~3 weeks |
| Customer IT effort | Minimal (approvals + UAT) | Heavy (all execution) |
| Deployment risk | Low (vendor expertise) | Medium (learning curve) |
| Time to staging | ~3 weeks after prereqs | ~6 weeks after prereqs |
| Ongoing update effort (customer) | Review & approve (~hours) | Evaluate, test, apply (~days) |
| Vendor dependency (ongoing) | High | Low |
| Operational self-sufficiency | Low | High |
| Access model | WIF (scoped, auditable, revocable) | Customer-only IAM |
| Deployment project cost | Lower (faster) | Higher (training + longer timeline) |
| Ongoing operations cost | Higher (vendor management fee) | Lower (internal team) |
| Security posture | Good (WIF, audit trail) | Better (no external access) |

Service Architecture: GCP-Native vs. Cluster Add-ons

The deployment maximizes GCP-managed services to reduce operational overhead. Cluster add-ons are used only where GCP lacks an equivalent or where the existing Terraform modules already configure them.

Deployment mechanism (two-layer architecture):

  • Layer 1 — Terraform (cluster-addons.tf): Deploys all cluster add-ons via helm_release resources as part of terraform apply. This includes nginx-ingress, cert-manager, external-secrets, external-dns, and ArgoCD itself. Terraform also creates namespaces, service accounts, and ClusterIssuers.
  • Layer 2 — ArgoCD: Manages only the application workloads. ArgoCD watches the application repository (separate from the infra repo) and syncs Helm releases to the em-semi-app, em-semi-workflow, and em-semi-keycloak namespaces. ArgoCD does not manage its own add-ons or other cluster infrastructure.
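
As an illustrative sketch of Layer 1, the ArgoCD install itself can be expressed as a Terraform helm_release (the repository URL is the public argo-helm repo; the values file path and the depends_on target are assumptions about the module layout, not taken from the actual repo):

```hcl
# Layer 1: Terraform installs ArgoCD; ArgoCD then manages only application workloads.
resource "helm_release" "argocd" {
  name             = "argocd"
  namespace        = "argocd"
  create_namespace = true

  repository = "https://argoproj.github.io/argo-helm"
  chart      = "argo-cd"
  version    = "5.51.6" # pinned chart version, matching the add-on table

  values = [file("${path.module}/values/argocd.yaml")] # illustrative path

  depends_on = [google_container_node_pool.primary] # hypothetical node pool resource
}
```

The other add-ons (nginx-ingress, cert-manager, external-secrets, external-dns) follow the same pattern with their respective charts and pinned versions.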
| Category | Service | Type | Rationale |
|---|---|---|---|
| Compute | GKE (regional, private) | GCP-native | Primary workload platform |
| Database | Cloud SQL (PostgreSQL 15) | GCP-native | Managed HA, automated backups, PITR |
| Object Storage | Cloud Storage (GCS) | GCP-native | Data, cache, workflows, logs buckets |
| DNS | Cloud DNS | GCP-native | Managed zones, DNSSEC |
| DNS Automation | external-dns (v1.14.0) | Cluster add-on | K8s-native DNS record lifecycle; uses Cloud DNS as provider |
| NAT | Cloud NAT | GCP-native | Outbound internet for private nodes |
| IAM | Cloud IAM + Workload Identity | GCP-native | Pod-level identity, no key files |
| Secrets Backend | Secret Manager | GCP-native | CMEK encryption, audit trail, rotation |
| Secrets Sync | external-secrets (v0.9.11) | Cluster add-on | Bridges Secret Manager to K8s Secrets; standard GitOps pattern |
| Ingress | nginx-ingress (v4.9.0) | Cluster add-on | Already configured in repo with LoadBalancer service type |
| TLS Certificates | cert-manager (v1.14.0) | Cluster add-on | Let's Encrypt + DNS-01 via Cloud DNS; already configured in repo |
| GitOps | ArgoCD (v5.51.6) | Cluster add-on | No GCP equivalent; core deployment mechanism |
| Image Updates | ArgoCD Image Updater | Cluster add-on | Git write-back for auditability; no GCP equivalent |
| Metrics | GKE Managed Prometheus | GCP-native | Managed collection pipeline, PromQL-compatible, Cloud Monitoring storage |
| Dashboards | Cloud Monitoring | GCP-native | SLOs, alerting policies, custom dashboards |
| Logging | Cloud Logging | GCP-native | Enabled by default on GKE; Log Router for export |
| Audit | Cloud Audit Logs | GCP-native | Admin Activity, Data Access, System Event logs |
| Container Registry | Artifact Registry | GCP-native | Docker image storage, vulnerability scanning |
| Build | Cloud Build | GCP-native | CI/CD (API enabled) |
| Identity | Keycloak | Cluster add-on | Application-level IdP; GCP equivalent (Identity Platform) doesn't cover all use cases |
| Query Analytics | Cloud SQL Query Insights | GCP-native | Built-in query performance analysis; enabled by default in Terraform module |
| Notifications | ArgoCD Notifications | Cluster add-on | Slack, webhook, email notifications for deployment events; configured via annotations |
| Admission Control | Binary Authorization or OPA Gatekeeper | GCP-native or add-on | Binary Authorization is GCP-native; OPA Gatekeeper is an alternative |

Self-managed components eliminated by using GCP-native services:

  • Prometheus server → GKE Managed Prometheus (managed collection, Cloud Monitoring backend)
  • Grafana → Cloud Monitoring dashboards (optional: deploy Grafana with Cloud Monitoring data source for advanced visualization)
  • Loki + Promtail → Cloud Logging (enabled by default, zero deployment)
  • Alertmanager → Cloud Monitoring alerting policies with notification channels

Actions by Role and Automation Level

Vendor-Managed Model

| Phase | Action | Actor | Automation | Tooling |
|---|---|---|---|---|
| Prerequisites | Provision GCP project, billing alerts | Customer | Manual | GCP Console, gcloud |
| | Create VPC, subnets, firewall rules | Customer | Manual | GCP Console or Terraform |
| | Establish on-prem VPN connectivity | Customer | Manual | Cloud VPN, network appliance |
| | Configure WIF pool and OIDC provider | Customer | Manual | gcloud, Terraform |
| | Create deployment SAs and assign IAM roles | Customer | Manual | gcloud, Terraform |
| | Validate WIF and SA impersonation | Vendor | Manual | gcloud auth, CI test job |
| Planning | Kickoff, architecture review, network design | Joint | Manual | Meetings, documentation |
| | Environment config (terraform.tfvars) | Vendor | Manual | Text editor |
| | Customer design sign-off | Customer | Manual | Approval gate |
| Infrastructure | Terraform plan and apply (GKE, Cloud SQL, GCS, WI) | Vendor | Semi-automated | Terraform CLI or CI/CD pipeline |
| | IaC validation (tflint, tfsec, checkov) | Vendor | Automated | CI pipeline on PR |
| | Security hardening (PSS, NetworkPolicies) | Vendor | Manual | kubectl, Terraform |
| | Cloud SQL post-deploy (extensions, users) | Vendor | Manual | psql, gcloud |
| Add-ons | Deploy cluster add-ons via Terraform helm_release (nginx, cert-manager, external-secrets, external-dns, ArgoCD) | Vendor | Semi-automated | Terraform (helm_release resources) |
| | Enable GKE Managed Prometheus, create Cloud Monitoring dashboards | Vendor | Semi-automated | Terraform, gcloud |
| | Configure Cloud Logging sinks and alerts | Vendor | Manual | gcloud, Terraform |
| | Deploy Binary Authorization or OPA Gatekeeper | Vendor | Semi-automated | Terraform, gcloud |
| | DNS delegation and NS record setup | Vendor | Manual | Cloud DNS, customer DNS registrar |
| App Deployment | Apply ArgoCD Project and Application manifests | Vendor | Manual | kubectl apply (ArgoCD CRDs) |
| | Initial sync and health validation | Vendor | Automated | ArgoCD auto-sync |
| | Keycloak config (realm, clients, IdP, RBAC, MFA) | Vendor | Manual | Keycloak admin console |
| | Populate secrets in GCP Secret Manager | Vendor | Manual | gcloud |
| | SSL certificate validation | Vendor | Manual | curl, openssl |
| | End-to-end smoke test | Vendor | Semi-automated | Test scripts, manual verification |
| | Load testing | Vendor | Semi-automated | k6, Locust, or similar |
| | Backup restore and failover drill | Vendor | Manual | gcloud, psql |
| UAT | User acceptance testing | Customer | Manual | Application UI |
| | Bug fixes and config adjustments | Vendor | Manual | Code changes, Terraform |
| | Customer sign-off | Customer | Manual | Approval gate |
| Go-Live | Customer UAT on production | Customer | Manual | Application UI |
| | SLO dashboards and alert routing validation | Joint | Manual | Cloud Monitoring |
| | Incident runbooks and escalation procedures | Vendor | Manual | Documentation |
| | Security assessment and vulnerability scan | Joint | Semi-automated | Scanner tools |
| | Rollback drill | Vendor | Manual | ArgoCD, Terraform, psql |
| | DNS cutover and final alerting | Vendor | Manual | Cloud DNS, Cloud Monitoring |
| | Go-live approval | Customer | Manual | Approval gate |
| Post Go-Live | Hypercare monitoring and on-call | Vendor | Manual + automated alerts | Cloud Monitoring, PagerDuty |
| Ongoing - Images | New image pushed to registry | Vendor CI | Automated | GitHub Actions, Artifact Registry |
| | Image Updater detects new tag (staging) | Automatic | Automated | ArgoCD Image Updater (polling) |
| | Tag written back to Git (staging branch) | Automatic | Automated | Image Updater Git write-back |
| | ArgoCD syncs new image to staging | Automatic | Automated | ArgoCD auto-sync |
| | Vendor validates staging deployment | Vendor | Manual | kubectl, dashboards |
| | Vendor opens promotion PR for production | Vendor | Manual | Git, GitHub |
| | Customer approves production promotion PR | Customer | Manual | GitHub PR review |
| | ArgoCD syncs approved image to production | Automatic | Automated | ArgoCD auto-sync |
| | Vendor validates production deployment | Vendor | Manual | kubectl, dashboards |
| Ongoing - Infra | Vendor proposes change via PR | Vendor | Manual | Git, GitHub |
| | IaC validation and image scanning on PR | Automatic | Automated | CI pipeline |
| | Customer reviews and approves PR | Customer | Manual | GitHub PR review |
| | Vendor merges and runs terraform apply | Vendor | Semi-automated | Git merge, Terraform CLI or CI/CD |
| | ArgoCD syncs app-layer changes (if any) | Automatic | Automated | ArgoCD auto-sync |
| | Deployment health validation | Vendor | Manual | kubectl, Cloud Monitoring |

Customer-Managed Model

| Phase | Action | Actor | Automation | Tooling |
|---|---|---|---|---|
| Prerequisites | Provision GCP project, billing alerts | Customer | Manual | GCP Console, gcloud |
| | Create VPC, subnets, IAM roles | Customer | Manual | GCP Console or Terraform |
| | Establish on-prem VPN connectivity | Customer | Manual | Cloud VPN, network appliance |
| Training | GCP fundamentals, Terraform, GKE, Helm, ArgoCD | Vendor-led | Manual | Workshops, hands-on labs |
| | Hands-on labs (dev environment deployment) | Customer | Manual (guided) | Terraform, kubectl, Helm |
| Planning | Architecture review, network design | Joint | Manual | Meetings, documentation |
| | Customer prepares terraform.tfvars | Customer | Manual | Text editor |
| | Vendor reviews customer config | Vendor | Manual | Code review |
| | Select vendor validation model | Joint | Manual | Decision gate |
| | Design sign-off | Customer | Manual | Approval gate |
| Infrastructure | Terraform plan review with vendor | Joint | Manual | Terraform plan output |
| | Terraform plan and apply (GKE, Cloud SQL, GCS, WI) | Customer IT | Semi-automated | Terraform CLI or CI/CD pipeline |
| | IaC validation (tflint, tfsec, checkov) | Customer IT | Automated | CI pipeline on PR |
| | Security hardening (PSS, NetworkPolicies) | Customer IT | Manual | kubectl, Terraform |
| | Cloud SQL post-deploy (extensions, users) | Customer IT | Manual | psql, gcloud |
| | Vendor validates staging infra | Vendor | Manual | Per validation model (A/B/C) |
| Add-ons | Deploy cluster add-ons via Terraform helm_release (nginx, cert-manager, external-secrets, external-dns, ArgoCD) | Customer IT | Semi-automated | Terraform (helm_release resources) |
| | Enable GKE Managed Prometheus, create Cloud Monitoring dashboards | Customer IT | Semi-automated | Terraform, gcloud |
| | Configure Cloud Logging sinks and alerts | Customer IT | Manual | gcloud, Terraform |
| | Deploy Binary Authorization or OPA Gatekeeper | Customer IT | Semi-automated | Terraform, gcloud |
| | DNS delegation and NS record setup | Customer IT | Manual | Cloud DNS |
| App Deployment | Apply ArgoCD Project and Application manifests | Customer IT | Manual | kubectl apply (ArgoCD CRDs) |
| | Initial sync and health validation | Customer IT | Automated | ArgoCD auto-sync |
| | Keycloak config (realm, clients, IdP, RBAC, MFA) | Customer IT | Manual | Keycloak admin console |
| | Secrets setup in GCP Secret Manager | Customer IT | Manual | gcloud |
| | End-to-end smoke test | Customer IT | Semi-automated | Test scripts, manual verification |
| | Load testing | Customer IT | Semi-automated | k6, Locust, or similar |
| | Backup restore and failover drill | Customer IT | Manual | gcloud, psql |
| | Vendor validates staging deployment | Vendor | Manual | Per validation model (A/B/C) |
| UAT | User acceptance testing | Customer | Manual | Application UI |
| | Bug fixes and config adjustments | Customer IT | Manual | Code changes, Terraform |
| | Customer sign-off | Customer | Manual | Approval gate |
| Go-Live | Customer UAT on production | Customer | Manual | Application UI |
| | SLO dashboards and alert routing validation | Customer | Manual | Cloud Monitoring |
| | Incident runbooks and escalation procedures | Customer | Manual | Documentation |
| | Security assessment and vulnerability scan | Customer | Semi-automated | Scanner tools |
| | Rollback drill | Customer IT | Manual | ArgoCD, Terraform, psql |
| | DNS cutover and final alerting | Customer IT | Manual | Cloud DNS, Cloud Monitoring |
| | Go-live approval | Customer | Manual | Approval gate |
| Post Go-Live | Hypercare monitoring (customer primary, vendor tier-2/3) | Joint | Manual + automated alerts | Cloud Monitoring, PagerDuty |
| | Knowledge transfer (advanced runbooks) | Vendor | Manual | Documentation, workshops |
| Ongoing - Images | New image pushed to registry | Vendor CI | Automated | GitHub Actions, Artifact Registry |
| | Image Updater detects new tag | Automatic | Automated | ArgoCD Image Updater (polling) |
| | Tag written back to Git | Automatic | Automated | Image Updater Git write-back |
| | ArgoCD syncs new image to staging | Automatic | Automated | ArgoCD auto-sync |
| | Customer validates staging | Customer | Manual | Application UI, dashboards |
| | Customer promotes to production via sync window | Customer | Manual | ArgoCD sync or Git merge |
| Ongoing - Infra | Vendor publishes release notes and Terraform changes | Vendor | Manual | Git, documentation |
| | IaC validation and image scanning | Automatic | Automated | CI pipeline |
| | Customer IT evaluates changes | Customer IT | Manual | Code review, documentation |
| | Customer applies Terraform to staging | Customer IT | Semi-automated | Terraform CLI or CI/CD |
| | Customer validates staging | Customer | Manual | Application UI, dashboards |
| | Customer applies to production via sync window | Customer IT | Semi-automated | Terraform CLI or CI/CD |
| | Customer validates production | Customer | Manual | Application UI, dashboards |

Automation Level Definitions

| Level | Definition | Examples |
|---|---|---|
| Automated | Runs without human intervention; triggers on events | ArgoCD auto-sync, Image Updater polling, CI pipeline on PR |
| Semi-automated | Human initiates; tooling executes | terraform apply, helm install, load test run |
| Manual | Human performs directly; requires judgment | Architecture review, Keycloak config, UAT, approval gates |
| Manual + automated alerts | Human monitors; system generates alerts | Hypercare period with Cloud Monitoring alerting |

Recommendation

Vendor-managed is recommended for initial deployments:

  1. Faster time to value — staging delivered ~3 weeks sooner
  2. Lower deployment risk — vendor knows the Terraform module dependency chain, Cloud SQL post-deploy steps, Workload Identity binding topology, and ArgoCD sync policy nuances
  3. WIF eliminates the security concern — scoped IAM roles, Cloud Audit Logs, instant revocation via pool deletion; no long-lived credentials
  4. Lightweight ongoing model for customer — review a diff, approve, ArgoCD auto-syncs; ~hours not days per update cycle
  5. Transition path exists — customer can move to self-managed later with condensed training against a working system (more effective than training before deployment)

Choose customer-managed when: the customer has a strong platform engineering team, regulatory constraints prohibit any external infrastructure access, or building deep GCP/K8s competency is a strategic goal.


Vendor Access: Workload Identity Federation (WIF)

For the vendor-managed model, WIF is recommended over traditional service account keys:

| Aspect | Service Account Key | Workload Identity Federation |
|---|---|---|
| Credential type | Long-lived JSON key file | Short-lived OIDC tokens |
| Key rotation | Manual, error-prone | Automatic (token-based) |
| Revocation | Delete key, redeploy | Disable WIF pool (instant) |
| Audit trail | Cloud Audit Logs | Cloud Audit Logs + OIDC claims |
| CI/CD integration | Store key as secret | Native GitHub Actions OIDC |
| Blast radius | Key leak = full access until rotated | Token expires in minutes |

Setup (customer responsibility):

  1. Create a Workload Identity Pool in their GCP project
  2. Add an OIDC provider (vendor's GitHub Actions or Google Workspace) with attribute constraints (repo, environment, branch)
  3. Create separate deployment Service Accounts for staging and production with tightly scoped custom IAM roles (never use primitive roles like Editor or broad admin roles). Use custom roles with minimal permissions per phase:
    • Infrastructure provisioning: roles/container.clusterAdmin (not container.admin), roles/cloudsql.editor (not cloudsql.admin), roles/storage.objectAdmin on specific buckets
    • Workload Identity: roles/iam.serviceAccountUser scoped to specific SAs via IAM Conditions
    • DNS: roles/dns.admin scoped to specific managed zones
    • Secrets: roles/secretmanager.secretVersionManager (not secretmanager.admin)
    • Networking: roles/compute.networkAdmin for VPC peering and Cloud NAT creation, roles/compute.routerAdmin for Cloud Router management; downgrade to roles/compute.networkUser on specific subnets for runtime workloads after infrastructure provisioning is complete
    • Prefer custom roles with only the exact permissions required; use IAM Conditions to restrict by resource name, environment label, or time window where possible
  4. Grant roles/iam.workloadIdentityUser on the SA to the WIF pool (SA impersonation pattern)
  5. Share the WIF pool ID and project number with vendor
  6. Schedule periodic access reviews (quarterly recommended)
  7. Permission audits: Conduct quarterly reviews of SA permissions using IAM Recommender to identify and remove unused roles
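
Steps 1, 2, and 4 can be sketched in Terraform as follows (pool, provider, repository, and service account names are hypothetical placeholders the customer would replace with their own):

```hcl
# Step 1: Workload Identity Pool in the customer's project
resource "google_iam_workload_identity_pool" "vendor" {
  workload_identity_pool_id = "vendor-deploy-pool"
}

# Step 2: OIDC provider for the vendor's GitHub Actions, constrained by repo
resource "google_iam_workload_identity_pool_provider" "github" {
  workload_identity_pool_id          = google_iam_workload_identity_pool.vendor.workload_identity_pool_id
  workload_identity_pool_provider_id = "vendor-github-oidc"

  attribute_mapping = {
    "google.subject"       = "assertion.sub"
    "attribute.repository" = "assertion.repository"
  }
  # Only tokens minted for the vendor's deployment repo are accepted
  attribute_condition = "assertion.repository == 'vendor-org/deploy-repo'"

  oidc {
    issuer_uri = "https://token.actions.githubusercontent.com"
  }
}

# Step 4: allow pool identities to impersonate the staging deploy SA
resource "google_service_account_iam_member" "wif_staging" {
  service_account_id = google_service_account.deploy_staging.name
  role               = "roles/iam.workloadIdentityUser"
  member             = "principalSet://iam.googleapis.com/${google_iam_workload_identity_pool.vendor.name}/attribute.repository/vendor-org/deploy-repo"
}
```

A matching binding on a separate production SA follows the same pattern, keeping staging and production impersonation independently revocable.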

Deployment Notes

VPC and Network Configuration

The Terraform modules reference a VPC by name. The existing repo defaults to the default VPC and subnet with auto-allocated secondary IP ranges for GKE pods/services. For customer deployments:

  • VPC: The customer's VPC name and subnet CIDR ranges must be specified in terraform.tfvars. The customer may use an existing shared VPC or a dedicated VPC — this must be resolved during the architecture review (Phase 1)
  • Subnets: Define secondary IP ranges for GKE pods and services explicitly (the repo currently uses auto-allocation, which should be replaced with planned CIDR ranges for production)
  • Cloud SQL: Connects via private IP over VPC peering, which requires the VPC to have Private Services Access configured (the vpc-peering module handles this)
  • Cloud DNS: The repo uses a shared managed zone pattern (data.google_dns_managed_zone.shared) referencing a pre-existing zone, not creating one per environment. For customer deployments, determine whether the customer provides an existing zone or a new one is created via the cloud-dns module
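
A minimal sketch of a subnet with explicit secondary ranges, replacing auto-allocation (names and CIDRs are illustrative; the actual ranges come out of the Phase 1 architecture review):

```hcl
resource "google_compute_subnetwork" "gke" {
  name          = "gke-staging"
  network       = google_compute_network.customer_vpc.id # hypothetical VPC resource
  region        = "us-central1"
  ip_cidr_range = "10.10.0.0/20" # node range

  secondary_ip_range {
    range_name    = "pods"
    ip_cidr_range = "10.20.0.0/16" # planned, not auto-allocated
  }
  secondary_ip_range {
    range_name    = "services"
    ip_cidr_range = "10.30.0.0/20"
  }
}
```

The range names ("pods", "services") are then referenced from the GKE cluster's IP allocation policy, so pod and service CIDRs stay deliberate and documented.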

GKE Cluster Topology

Production clusters should be regional (not zonal) with multi-zone node pools for HA. The architecture review must define:

  • Regional cluster with control plane replicated across 3 zones
  • Node pool autoscaling ranges (min/max nodes per pool)
  • PodDisruptionBudgets for all critical workloads
  • HPA (Horizontal Pod Autoscaler) targets for application services
  • VPA (Vertical Pod Autoscaler) recommendations for right-sizing
  • Cluster Autoscaler configuration for node-level scaling
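
A hedged sketch of the regional-cluster and node-autoscaling pieces in Terraform (resource names, region, zones, and bounds are illustrative, not the module's actual values):

```hcl
# Passing a region (not a zone) as location makes the cluster regional:
# the control plane is replicated across three zones automatically.
resource "google_container_cluster" "prod" {
  name     = "prod"
  location = "us-central1"
  # networking, private-cluster, and security settings elided
}

resource "google_container_node_pool" "primary" {
  name           = "primary"
  cluster        = google_container_cluster.prod.name
  location       = "us-central1"
  node_locations = ["us-central1-a", "us-central1-b", "us-central1-c"]

  autoscaling {
    min_node_count = 1 # per zone; bounds come from the capacity plan
    max_node_count = 5
  }
}
```

HPA, VPA, and PodDisruptionBudgets are Kubernetes-level objects layered on top of this; they are defined in the application Helm charts rather than in Terraform.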

Disaster Recovery (Cross-Region)

The architecture review must clarify cross-region DR requirements based on RTO/RPO targets:

  • Single-region HA (default): Multi-zone regional GKE + Cloud SQL Regional HA covers zone-level failures; sufficient for most deployments
  • Cross-region DR (if required by RTO/RPO): Plan for cross-region Cloud SQL read replica promotion, GCS cross-region replication, and a passive standby GKE cluster in a secondary region
  • Decision criteria: If RTO < 1 hour and RPO < 5 minutes for regional outages, cross-region DR is recommended; otherwise, single-region HA with backup/restore is sufficient
  • DR drill: If cross-region is implemented, include a DR failover drill in Phase 8 (Go-Live) tasks
  • GitOps repository DR: The Git repository is the single source of truth for all infrastructure and application configuration. Ensure the Git hosting provider (e.g., GitHub) has redundancy. Additionally, configure a mirror repository (e.g., GitHub -> Cloud Source Repositories or a self-hosted GitLab) that can serve as a fallback if the primary Git provider experiences an outage. Include Git repository recovery in the DR drill.
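
If cross-region DR is required, the Cloud SQL piece can be sketched as a cross-region read replica (instance names, regions, and tier are illustrative):

```hcl
# Passive DR: a read replica in a secondary region, promotable on regional failure.
resource "google_sql_database_instance" "dr_replica" {
  name                 = "app-db-dr"
  region               = "us-east1" # secondary region
  database_version     = "POSTGRES_15"
  master_instance_name = google_sql_database_instance.primary.name # hypothetical primary

  settings {
    tier              = "db-custom-4-16384" # match or size below the primary
    availability_type = "ZONAL"             # replicas are typically zonal
  }
}
```

Promotion (gcloud sql instances promote-replica) is a one-way operation that detaches the replica from the primary, so it belongs in the rehearsed DR drill, not in day-to-day runbooks.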

Observability Stack (GCP-Native)

The observability stack uses GCP-managed services, eliminating self-managed Prometheus/Grafana/Loki deployments:

Metrics — GKE Managed Prometheus (GMP):

  • GMP is a built-in GKE feature: enable it in the GKE Terraform module via a monitoring_config block with enable_components = ["SYSTEM_COMPONENTS"] and managed_prometheus { enabled = true }; no Helm charts or PVCs to manage
  • Implementation note: The existing GKE Terraform module uses the legacy monitoring_service attribute. This must be replaced with the monitoring_config block to enable GMP. The module update should be included as a prerequisite task in Phase 2 (Staging Infrastructure).
  • Runs a managed collection pipeline on each node; scrapes Prometheus-format metrics and writes to Cloud Monitoring
  • Fully PromQL-compatible — existing dashboards and alerting rules work without modification
  • Retention handled by Cloud Monitoring (free tier: 24 months for GCP metrics, custom metrics billed per sample ingested)
  • Resource overhead: GMP collection pods run as a DaemonSet with low resource footprint (~50-100MB per node); significantly less than self-managed Prometheus
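
The monitoring_config change described above, shown in context on the cluster resource (cluster name is illustrative):

```hcl
resource "google_container_cluster" "staging" {
  # ... other cluster settings elided ...

  # Replaces the legacy monitoring_service attribute
  monitoring_config {
    enable_components = ["SYSTEM_COMPONENTS"]
    managed_prometheus {
      enabled = true
    }
  }
}
```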

Dashboards and Alerting — Cloud Monitoring:

  • Native SLO monitoring with burn-rate alerting
  • Custom dashboards via Cloud Monitoring console or Terraform (google_monitoring_dashboard)
  • Alerting policies with notification channels (PagerDuty, Slack, email, Pub/Sub)
  • Uptime checks for external endpoint monitoring
  • No Grafana deployment needed (optional: deploy Grafana with Cloud Monitoring as a data source if advanced visualization is required)

Logging — Cloud Logging:

  • Enabled by default on GKE; no agents to deploy
  • Cost management: configure log exclusion filters for high-volume debug/trace logs; route retained logs to GCS or BigQuery at lower cost
  • See "Centralized Logging (Cloud Logging)" section for full details

Resource planning:

  • Keycloak: resource-intensive; define CPU/memory requests/limits explicitly
  • Dedicated node pools: consider for Keycloak in production to isolate from application pods
  • Cloud Monitoring costs: estimate custom metrics volume (GMP ingestion) during load testing; use metrics exclusion filters for high-cardinality labels
  • Performance validation: validate GMP collection overhead during load testing

Cloud SQL HA

The Terraform module defaults to ZONAL availability. For production, terraform.tfvars must explicitly set cloudsql_availability_type = "REGIONAL" to enable automatic failover. The failover drill before go-live validates connection retry behavior, PgBouncer reconnection (if applicable), and application recovery within the defined RTO.
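
In terraform.tfvars this is a one-line change (the variable name follows the document; confirm it against the actual module):

```hcl
# production terraform.tfvars — staging may remain ZONAL to save cost
cloudsql_availability_type = "REGIONAL" # enables automatic cross-zone failover
```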

Capacity Planning

A capacity planning task should be completed during the architecture review (Phase 1) and validated during load testing:

  • Cloud SQL sizing: Define vCPU/memory tier based on expected concurrent connections and query volume; the module defaults to max_connections = 100 which may be insufficient for production — adjust via database flags in terraform.tfvars
  • Connection pooling: Deploy PgBouncer as a sidecar or standalone deployment in GKE (Cloud SQL does not provide a native connection pooler). Alternatively, use Cloud SQL Auth Proxy with --max-connections flag. Define max pool size per service based on expected concurrency.
  • GKE node autoscaler bounds: Set min_node_count and max_node_count based on pod resource requests and expected workload; validate that GCP project quotas (vCPUs, IP addresses, persistent disks) can accommodate the maximum node count
  • GCP quota checks: Verify regional quotas for Compute Engine, Cloud SQL, GCS, and networking before deployment; request quota increases proactively if needed
  • Validation: Load testing results must confirm that the capacity plan supports 2x expected peak load with headroom
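
The max_connections adjustment can be sketched as a database flag inside the module's google_sql_database_instance settings block (tier and flag value are illustrative and must be validated against instance memory during load testing, since each Postgres connection consumes memory):

```hcl
settings {
  tier = "db-custom-8-32768" # 8 vCPU / 32 GB, sized from the capacity plan

  database_flags {
    name  = "max_connections"
    value = "500" # raised from the module default of 100
  }
}
```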

Load Testing Criteria

Load tests must have defined pass/fail criteria agreed during the architecture review:

  • Target throughput (RPS) and latency percentiles (p50, p95, p99)
  • Cloud SQL connection pool limits and saturation behavior
  • GKE node autoscaling response under load
  • Error rate thresholds (e.g., < 0.1% 5xx during steady state)
  • Capacity headroom validation (sustain 2x expected peak)

IaC Validation Pipeline

Before any terraform apply, run automated validation:

  • terraform validate — syntax and internal consistency
  • tflint — Terraform linting with GCP ruleset
  • tfsec / checkov — security policy scanning (no public IPs, encryption at rest, etc.)
  • These checks should be integrated into the CI pipeline for ongoing Terraform PRs

Supply Chain and Runtime Security

  • Image scanning: all container images scanned for CVEs before deployment (Artifact Registry vulnerability scanning or equivalent)
  • Chart provenance: Helm chart signatures verified against known publishers
  • Admission control: GKE Binary Authorization or OPA Gatekeeper policies to enforce only signed/scanned images are deployed -- this is a mandatory pre-production milestone and must be enabled and validated in staging before production deployment begins
  • Policy enforcement scheduling: Binary Authorization / OPA Gatekeeper must be deployed and tested as an explicit task in Phase 3 (staging add-ons) and Phase 7 (production add-ons), with validation that non-compliant images are blocked
  • Exception management: Define an exceptions process for policy violations (e.g., break-glass procedure for emergency deployments with post-hoc review)
  • SAST/DAST: application security testing is the responsibility of the application CI pipeline (separate from infrastructure deployment)

ArgoCD Image Updater Auditability

The Image Updater is configured with Git write-back mode — when it detects a new image tag in the container registry, it commits the updated tag back to the Git repository before ArgoCD syncs. This preserves the GitOps audit trail: every deployed image version has a corresponding Git commit. The ArgoCD Application manifests must include Image Updater annotations specifying which images to watch, the update strategy (semver, latest, digest), and the write-back target branch.

Environment separation for staging-first updates: The existing repo has all ArgoCD Applications tracking the same main branch with identical auto-sync policies. For customer deployments with staging-first gating, the ArgoCD Applications must be configured differently per environment:

  • Staging Application: Include Image Updater annotations (argocd-image-updater.argoproj.io/image-list, argocd-image-updater.argoproj.io/write-back-method: git) so new images are automatically detected and synced
  • Production Application: Do NOT include Image Updater annotations — production image updates require a promotion PR that updates the image tag in the production Helm values file, reviewed and approved before ArgoCD syncs
  • This configuration is applied during the "Apply ArgoCD Project and Application with Image Updater annotations" task in Phase 4 (vendor-managed) or Phase 6 (customer-managed)

Centralized Logging (Cloud Logging)

GKE clusters have Cloud Logging enabled by default — system and workload logs are collected automatically via the GKE logging agent (Fluent-bit-based) with no additional deployment required. This eliminates the need to deploy and manage a self-hosted logging stack (e.g., Loki + Promtail):

  • Zero deployment overhead: Cloud Logging is a managed service; no Helm charts, persistent volumes, or capacity planning for log storage
  • Native GCP integration: Logs Explorer, Log Analytics (BigQuery-backed), Error Reporting, Cloud Trace — all integrated with IAM
  • Log Router: Route logs to BigQuery (for analytics), GCS (for long-term archival), or Pub/Sub (for SIEM integration) via configurable sinks
  • Log retention: Default 30 days in Cloud Logging; extend via Log Router sinks to GCS (cold) or BigQuery (queryable) for compliance requirements
  • Log-based metrics and alerting: Create custom metrics from log entries and configure alerting policies in Cloud Monitoring — no separate alerting stack needed
  • Access control: IAM-based permissions (roles/logging.viewer, roles/logging.privateLogViewer) scoped by project or log bucket
  • Cost management: Configure log exclusion filters to drop high-volume debug/trace logs before ingestion; use log buckets with different retention periods for cost optimization
  • When to consider Loki instead: Only if multi-cloud portability is a hard requirement (i.e., the same deployment must run on AWS/Azure without GCP services)
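
The exclusion-filter and archival-sink patterns can be sketched in Terraform (names, filters, and the referenced bucket are illustrative):

```hcl
# Drop high-volume debug logs before ingestion (Cloud Logging bills at ingestion)
resource "google_logging_project_exclusion" "debug" {
  name   = "exclude-debug"
  filter = "resource.type=\"k8s_container\" AND severity<=DEBUG"
}

# Archive retained logs to GCS for long-term, low-cost compliance storage
resource "google_logging_project_sink" "archive" {
  name                   = "archive-to-gcs"
  destination            = "storage.googleapis.com/${google_storage_bucket.log_archive.name}" # hypothetical bucket
  filter                 = "severity>=INFO"
  unique_writer_identity = true
}
# The sink's writer_identity must be granted roles/storage.objectCreator on the bucket.
```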

Terraform State Protection

Terraform state buckets must be configured with the following protections:

  • Object versioning: Enabled on the GCS bucket to allow state recovery from accidental corruption or deletion
  • State locking: Use GCS-native state locking (default with gcs backend) to prevent concurrent modifications
  • Bucket-level access: Uniform bucket-level access with IAM-only permissions (no ACLs)
  • Encryption: Customer-managed encryption keys (CMEK) for state files containing infrastructure details
  • Backup: Cross-region replication for disaster recovery of state files (optional, based on data residency and compliance requirements; if disabled due to regulatory constraints, require versioning plus periodic encrypted backups to a secondary bucket)
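
A sketch of the backend and bucket configuration under these assumptions (bucket and key names are hypothetical; in practice the state bucket is bootstrapped outside the main Terraform run to avoid a chicken-and-egg problem):

```hcl
terraform {
  backend "gcs" {
    bucket = "customer-tfstate" # hypothetical bucket name
    prefix = "env/staging"      # separate prefixes per environment
  }
}

resource "google_storage_bucket" "tfstate" {
  name                        = "customer-tfstate"
  location                    = "US"
  uniform_bucket_level_access = true # IAM-only, no ACLs

  versioning {
    enabled = true # allows state recovery after corruption or deletion
  }

  encryption {
    default_kms_key_name = google_kms_crypto_key.tfstate.id # CMEK, hypothetical key
  }
}
```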

GKE Private Cluster Configuration

All GKE clusters (staging and production) must be configured as private clusters:

  • Private nodes: Disable public IP addresses on nodes (enable_private_nodes = true)
  • Private endpoint: Disable the public control plane endpoint (enable_private_endpoint = true) or restrict via master authorized networks
  • Master authorized networks: Limit control plane access to VPN/bastion CIDR ranges only
  • Cloud NAT: Required for outbound internet access from private nodes (already included in Terraform)
  • Access path: Document the authorized access path (e.g., VPN -> bastion -> kubectl, or Cloud Shell with Private Google Access)
  • Prerequisite task: A bastion host or VPN-based kubectl access path must be provisioned and validated as a Phase 0 prerequisite (task cust_access) before any cluster operations can begin. This task validates kubectl get nodes succeeds through the authorized path and confirms master authorized networks are correctly configured.
  • This must be validated during the architecture review (Phase 1) and enforced via tfsec rules
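
These settings map onto the cluster resource roughly as follows (CIDRs are illustrative; choose the enable_private_endpoint value per the architecture review):

```hcl
resource "google_container_cluster" "prod" {
  # ... other cluster settings elided ...

  private_cluster_config {
    enable_private_nodes    = true          # no public IPs on nodes
    enable_private_endpoint = true          # or false, restricted by the block below
    master_ipv4_cidr_block  = "172.16.0.0/28"
  }

  master_authorized_networks_config {
    cidr_blocks {
      cidr_block   = "10.50.0.0/24" # VPN/bastion range, illustrative
      display_name = "corp-vpn"
    }
  }
}
```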

Vendor Validation Model (Customer-Managed)

To resolve the contradiction between "no vendor access to customer GCP" and vendor validation requirements, define one of the following validation models during the architecture review:

  • Option A: Time-bound read-only access -- Customer grants temporary Viewer IAM role with expiration (e.g., 4-hour session) for vendor validation, revoked immediately after
  • Option B: Customer-provided evidence -- Customer provides validation artifacts (screenshots, CLI output, health check reports) using a standardized checklist
  • Option C: Supervised screen-share -- Vendor observes customer-run validation commands via video call, providing real-time guidance
  • The chosen model must be documented in the architecture review deliverables and referenced in the Gantt chart as a prerequisite for each vendor validation task
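Option A can be implemented without manual revocation by attaching an IAM Condition that expires the grant. A sketch, assuming a hypothetical vendor validator group and an expiry timestamp agreed per validation window:

```hcl
# Option A sketch: temporary read-only grant that auto-expires.
# The member and timestamp below are placeholders set per validation window.
resource "google_project_iam_member" "vendor_readonly" {
  project = var.project_id
  role    = "roles/viewer"
  member  = "group:vendor-validators@example.com"

  condition {
    title       = "expires-after-validation-window"
    description = "Auto-expires; revoke earlier by deleting this binding"
    expression  = "request.time < timestamp(\"2026-05-01T16:00:00Z\")"
  }
}
```

Removing the resource (or waiting for the condition to lapse) ends access, and all vendor reads remain visible in Cloud Audit Logs.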

GKE Application-Layer Secrets Encryption

Kubernetes Secrets stored in etcd must be encrypted at the application layer using CMEK:

  • Enable Application-layer Secrets Encryption on the GKE cluster using a Cloud KMS key (database_encryption { state = "ENCRYPTED", key_name = "projects/.../cryptoKeys/..." } in Terraform)
  • Alternative: Use the Secret Manager CSI Driver to mount secrets directly from GCP Secret Manager as volumes, bypassing etcd storage entirely. This eliminates the risk of plaintext secrets in etcd backups.
  • Namespace-level RBAC: Restrict get/list/watch on Secrets resources to only the service accounts that need them (default ClusterRole grants are too broad)
  • Implementation note: The GKE Terraform module must be extended with a database_encryption block. The KMS key must be created in the customer's project and granted roles/cloudkms.cryptoKeyEncrypterDecrypter to the GKE service agent.
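The implementation note above can be sketched as the following Terraform fragment; key ring, key name, and rotation period are illustrative choices, and the cluster block shows only the addition to the existing module:

```hcl
# Placeholder key ring/key names; rotation period is an example policy choice.
resource "google_kms_crypto_key" "gke_secrets" {
  name            = "gke-secrets"
  key_ring        = google_kms_key_ring.primary.id
  rotation_period = "7776000s"  # 90 days
}

# The GKE service agent must be able to use the key for envelope encryption.
resource "google_kms_crypto_key_iam_member" "gke_sa" {
  crypto_key_id = google_kms_crypto_key.gke_secrets.id
  role          = "roles/cloudkms.cryptoKeyEncrypterDecrypter"
  member        = "serviceAccount:service-${data.google_project.this.number}@container-engine-robot.iam.gserviceaccount.com"
}

resource "google_container_cluster" "primary" {
  # ...existing cluster configuration...

  database_encryption {
    state    = "ENCRYPTED"
    key_name = google_kms_crypto_key.gke_secrets.id
  }
}
```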

ArgoCD Image Updater Token Security

The Git write-back token used by ArgoCD Image Updater must follow these security controls:

  • Least privilege: Use a GitHub App or fine-grained personal access token (PAT) scoped to the specific repository with contents: write only
  • Secure storage: Store the token in GCP Secret Manager (not as a Kubernetes Secret) and sync via External Secrets Operator
  • Rotation: Enforce token rotation via a CronJob that refreshes the token from Secret Manager at a defined cadence (e.g., every 30 days)
  • Audit: Log all Git write-back commits and monitor for unexpected commit patterns
  • Signed commits: Consider enforcing GPG-signed commits for Image Updater write-backs to prevent commit spoofing
  • CronJob monitoring: Configure Cloud Monitoring alerts for token rotation CronJob failures (kube_job_status_failed metric) and ArgoCD sync state degradation (argocd_app_sync_status{sync_status="OutOfSync"} for prolonged periods). Alert on: CronJob not completing within expected window, consecutive CronJob failures, and ArgoCD unable to push write-back commits.
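The CronJob-failure alert can be expressed as a Cloud Monitoring alert policy. A sketch, assuming kube-state-metrics is scraped by Google Managed Prometheus (so kube_job_status_failed surfaces as a prometheus.googleapis.com metric) and that the rotation jobs are named with a token-rotation prefix; both are assumptions to confirm against the actual deployment:

```hcl
resource "google_monitoring_alert_policy" "token_rotation_failed" {
  display_name = "ArgoCD Image Updater token rotation failing"
  combiner     = "OR"

  conditions {
    display_name = "kube_job_status_failed > 0"
    condition_threshold {
      # Metric path and job_name label assume a Managed Prometheus scrape
      # of kube-state-metrics; adjust to the customer's monitoring stack.
      filter          = "metric.type = \"prometheus.googleapis.com/kube_job_status_failed/gauge\" AND metric.labels.job_name = monitoring.regex.full_match(\"token-rotation.*\")"
      comparison      = "COMPARISON_GT"
      threshold_value = 0
      duration        = "300s"
      aggregations {
        alignment_period   = "300s"
        per_series_aligner = "ALIGN_MAX"
      }
    }
  }

  notification_channels = [var.oncall_channel_id]  # placeholder channel
}
```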

ArgoCD Notifications

The existing repo includes ArgoCD notification annotations per environment (Slack channels for deployment events, health degradation, and sync failures). For customer deployments:

  • Adapt notification channels: Replace vendor Slack channels with the customer's notification targets (Slack, PagerDuty, email, or webhook)
  • Events to configure: on-deployed, on-health-degraded, on-sync-failed, on-sync-running (production should include all four)
  • ArgoCD metrics: Enable controller and repo-server metrics in staging and production (disabled by default in dev for cost optimization)
  • Implementation: Notification annotations are set on the ArgoCD Application manifests; notification templates and triggers are configured in the ArgoCD ConfigMap

Cloud SQL Query Insights

The existing Terraform module enables Cloud SQL Query Insights by default:

  • query_insights_enabled = true
  • query_plans_per_minute = 5
  • query_string_length = 1024
  • record_application_tags = true

This provides built-in query performance analysis in the GCP Console without additional tooling. For customer deployments, validate that Query Insights is enabled and review the query plan sample rate during load testing.
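The defaults listed above correspond to this insights_config block inside the instance settings; the instance name, tier, and engine version are placeholders:

```hcl
resource "google_sql_database_instance" "app" {
  name             = "app-db"
  database_version = "POSTGRES_16"
  region           = "us-central1"

  settings {
    tier = "db-custom-4-16384"

    insights_config {
      query_insights_enabled  = true
      query_plans_per_minute  = 5     # plan sample rate; review under load test
      query_string_length     = 1024
      record_application_tags = true
    }
  }
}
```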

Centralized Audit Logging and Security Alerting

Both deployment models must include centralized audit log management:

  • Cloud Audit Logs: Ensure Admin Activity, Data Access, and System Event logs are enabled for all GCP services
  • Log sinks: Configure a Cloud Logging sink to export audit logs to a dedicated logging project or BigQuery for long-term retention
  • Retention: Minimum 1 year retention for compliance; 90 days in Cloud Logging, remainder in cold storage (GCS or BigQuery)
  • Security alerts: Configure alerting rules for critical events:
    • IAM policy changes and role grants
    • ArgoCD admin login attempts and RBAC modifications
    • Secret Manager access patterns and unusual secret reads
    • GKE control plane access from unexpected IPs
  • SIEM integration: If the customer uses a SIEM, configure log forwarding via Pub/Sub or direct integration
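The audit-log requirements can be sketched in Terraform. Admin Activity and System Event logs are always on; the fragment below enables Data Access logs project-wide and exports all audit logs to a BigQuery dataset in a dedicated logging project (project and dataset names are placeholders):

```hcl
resource "google_project_iam_audit_config" "all" {
  project = var.project_id
  service = "allServices"

  audit_log_config { log_type = "ADMIN_READ" }
  audit_log_config { log_type = "DATA_READ" }
  audit_log_config { log_type = "DATA_WRITE" }
}

resource "google_logging_project_sink" "audit_to_bq" {
  name        = "audit-logs-to-bigquery"
  destination = "bigquery.googleapis.com/projects/logging-project/datasets/audit_logs"
  filter      = "logName:\"cloudaudit.googleapis.com\""

  # Creates a dedicated service account; grant it BigQuery Data Editor on the dataset.
  unique_writer_identity = true
}
```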

GKE and Add-on Upgrade Lifecycle

Define a maintenance and patching policy for post-go-live operations:

  • GKE version upgrades: Select an upgrade channel (rapid/regular/stable) based on the customer's risk profile and change-management requirements during the architecture review; upgrade staging at least 1 week before production
  • Node OS image updates: Enable auto-upgrade for node pools with maintenance windows configured for off-peak hours
  • Add-on versioning: Track Helm chart versions for all add-ons (ArgoCD, cert-manager, external-secrets, etc.) and schedule quarterly update reviews
  • Compatibility testing: Validate add-on compatibility in staging before production upgrades
  • Rollback criteria: Define rollback triggers (e.g., pod crash rate > 5%, health check failures, API errors) and procedures
  • Communication: Establish a maintenance notification process with the customer (minimum 48-hour advance notice for production changes)
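The channel and maintenance-window decisions above reduce to two blocks on the cluster resource. A sketch with example values (channel and window are illustrative and should be fixed during the architecture review); note that enrolling in a release channel implies node auto-upgrade:

```hcl
resource "google_container_cluster" "primary" {
  # ...existing cluster configuration...

  release_channel {
    channel = "STABLE"  # example; chosen per customer risk profile
  }

  maintenance_policy {
    recurring_window {
      start_time = "2026-03-07T03:00:00Z"  # example off-peak window
      end_time   = "2026-03-07T07:00:00Z"
      recurrence = "FREQ=WEEKLY;BYDAY=SA"  # RFC 5545 recurrence rule
    }
  }
}
```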

Database Migration Strategy

Define migration tooling and procedures for safe schema changes and rollback:

  • Migration tooling: Use Flyway or Liquibase for versioned, repeatable database migrations with a clear migration history
  • Backward compatibility: Enforce backward-compatible schema changes (additive only) to enable safe rollback; breaking changes require a multi-phase migration plan
  • Pre-migration validation: Run migration dry-runs in staging with production-like data volumes before applying to production
  • Post-migration checks: Automated verification of schema state, data integrity, and application health after each migration
  • PITR alignment: Validate that Cloud SQL point-in-time recovery (PITR) can restore to a state before migration within the defined RTO/RPO
  • Rollback rehearsal: Include migration rollback as part of the pre-go-live rollback drill, testing both Flyway/Liquibase undo and PITR recovery paths

Phase-Level Backout Plans

Each deployment phase should have a documented backout procedure:

  • Infrastructure phases: terraform destroy for the specific module, or terraform apply of the last known-good configuration
  • Add-on phases: helm uninstall for the specific release
  • Application phases: ArgoCD sync to previous Git commit, or argocd app rollback
  • Database phases: point-in-time recovery from automated Cloud SQL backups
  • Go/no-go criteria: defined at each phase boundary to prevent proceeding with a broken foundation. Specific gates:
    • After IaC validation: all tfsec/tflint checks pass with zero critical findings
    • After smoke test: all services respond to health checks, end-to-end user flow completes
    • After load test: meets defined p95 latency, throughput, and error rate thresholds
    • After security assessment: no critical or high vulnerabilities unmitigated
    • These gates should be represented as milestones in the Gantt chart with explicit pass/fail criteria documented in the architecture review deliverables