
@vishnu2kmohan
Last active February 26, 2026 23:31
Project Plan Options

Customer Deployment Project Plans

Context

This gist contains two project plan templates (as Mermaid Gantt charts) for deploying the product at customer sites. The models differ in who manages the deployment lifecycle:

  1. Vendor-Managed — The vendor provisions, deploys, and operates the platform; the customer approves changes
  2. Customer-Managed — Customer IT handles provisioning and operations; the vendor trains and advises

Both models deploy Staging and Production environments (each on its own GKE cluster). The customer always provides a GCP project, VPC, and on-prem connectivity as prerequisites.

Vendor access (vendor-managed model): Workload Identity Federation (WIF) — no long-lived service account keys, full Cloud Audit Logs trail, instant revocation via WIF pool deletion. Integrates natively with GitHub Actions OIDC for automated Terraform runs.
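The WIF access pattern above can be sketched in Terraform. This is a hedged illustration only: the pool and provider IDs and the GitHub org/repo are placeholders, not values from this plan.

```hcl
# Sketch of the vendor WIF setup: GitHub Actions OIDC federates into a pool,
# and pool identities impersonate a scoped deployment SA. Deleting the pool
# resource is the "instant revocation" path described above.
resource "google_iam_workload_identity_pool" "vendor" {
  workload_identity_pool_id = "vendor-deploy-pool" # hypothetical ID
  display_name              = "Vendor deployment pool"
}

resource "google_iam_workload_identity_pool_provider" "github" {
  workload_identity_pool_id          = google_iam_workload_identity_pool.vendor.workload_identity_pool_id
  workload_identity_pool_provider_id = "github-oidc"
  attribute_mapping = {
    "google.subject"       = "assertion.sub"
    "attribute.repository" = "assertion.repository"
  }
  # Trust only the vendor's deployment repo (placeholder org/repo).
  attribute_condition = "assertion.repository == \"vendor-org/deploy-repo\""
  oidc {
    issuer_uri = "https://token.actions.githubusercontent.com"
  }
}

# Grant workloadIdentityUser on the deployment SA to the pool, matching the
# Phase 0 "Grant workloadIdentityUser on both SAs to WIF pool" task.
resource "google_service_account_iam_member" "wif_user" {
  service_account_id = google_service_account.deploy_stg.name # assumed SA resource
  role               = "roles/iam.workloadIdentityUser"
  member             = "principalSet://iam.googleapis.com/${google_iam_workload_identity_pool.vendor.name}/attribute.repository/vendor-org/deploy-repo"
}
```

No long-lived keys exist in this setup; every impersonated call lands in Cloud Audit Logs under the deployment SA.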


Plan 1: Vendor-Managed Deployment (~9 Weeks)

The vendor executes all infrastructure provisioning, cluster add-on deployment, ArgoCD configuration, and application deployment. Customer responsibilities are limited to prerequisites, approvals, and UAT.

gantt
    title Vendor-Managed Deployment
    dateFormat  YYYY-MM-DD
    axisFormat  %b %d

    section Phase 0 - Prerequisites [Customer]
    Provision GCP project and set billing budget alerts      :cust_gcp, 2026-03-02, 3d
    Create VPC and grant subnet access                       :cust_vpc, after cust_gcp, 2d
    Establish on-prem connectivity via VPN                   :cust_vpn, after cust_gcp, 5d
    Configure WIF pool and OIDC provider                     :cust_wif, after cust_vpc, 3d
    Create staging deployment SA and assign scoped IAM roles  :cust_sa_stg, after cust_wif, 1d
    Create production deployment SA and assign scoped IAM roles :cust_sa_prod, after cust_sa_stg, 1d
    Grant workloadIdentityUser on both SAs to WIF pool       :cust_iam, after cust_sa_prod, 1d
    Vendor validates WIF and SA impersonation for both SAs   :cust_wifval, after cust_iam, 1d
    Provision bastion host or validate VPN kubectl access path :cust_access, after cust_vpn, 2d
    Prerequisites complete                                   :milestone, prereq_done, after cust_wifval cust_access, 0d

    section Phase 1 - Planning and Design [Joint]
    Kickoff and requirements gathering                       :kickoff, after prereq_done, 2d
    Architecture review - regional GKE, HA, SLOs, RTO-RPO   :arch, after kickoff, 2d
    Network design - VPC, subnets, NAT, firewall, peering   :netdesign, after kickoff, 2d
    Environment config - terraform.tfvars for stg and prod   :envconfig, after arch, 1d
    Customer sign-off on design                              :milestone, design_approved, after envconfig netdesign, 0d

    section Phase 2 - Staging Infrastructure [Vendor]
    Create Terraform state bucket for staging                :stg_state, after design_approved, 1d
    Create KMS key and grant service agent roles             :stg_kms, after stg_state, 1d
    Terraform - GCP APIs, VPC peering, Cloud NAT             :stg_net, after stg_kms, 1d
    Terraform - regional GKE cluster with monitoring_config and database_encryption :stg_gke, after stg_net, 2d
    Terraform - Cloud SQL PostgreSQL 15 private IP           :stg_sql, after stg_net, 1d
    Terraform - GCS buckets                                  :stg_gcs, after stg_net, 1d
    Terraform - Workload Identity SAs and K8s bindings       :stg_wi, after stg_gke, 1d
    Validate GMP metrics collection and Secrets encryption   :stg_gmpval, after stg_gke, 1d
    Cloud SQL post-deploy - extensions, users, secrets       :stg_sqlpost, after stg_sql stg_gke, 1d
    Security hardening - PSS, NetworkPolicies, Shielded Nodes :stg_security, after stg_gke, 1d
    IaC validation - terraform validate, tflint, tfsec       :stg_iactest, after stg_security, 1d
    IaC gate - zero critical tfsec findings                  :milestone, stg_iac_gate, after stg_iactest, 0d

    section Phase 3 - Staging Add-ons [Vendor]
    Deploy nginx-ingress and Cloud Armor WAF policy          :stg_nginx, after stg_wi, 1d
    Deploy cert-manager v1.14.0 and ClusterIssuers           :stg_cert, after stg_nginx, 1d
    Deploy external-secrets v0.9.11                          :stg_es, after stg_wi, 1d
    Deploy external-dns v1.14.0                              :stg_edns, after stg_cert, 1d
    Deploy ArgoCD v5.51.6                                    :stg_argo, after stg_cert stg_nginx, 1d
    Deploy Image Updater with Git write-back and token CronJob :stg_imgupd, after stg_argo, 1d
    Enable GKE Managed Prometheus and Cloud Monitoring        :stg_monitor, after stg_argo, 1d
    Configure Cloud Logging sinks, retention, and alerts     :stg_audit, after stg_monitor, 1d
    Deploy Binary Authorization or OPA Gatekeeper policies   :stg_binauth, after stg_argo, 1d
    Create K8s namespaces and service accounts               :stg_ns, after stg_imgupd, 1d
    DNS delegation and NS record verification                :stg_dns, after stg_edns, 1d

    section Phase 4 - Staging App Deployment [Vendor]
    Configure ArgoCD repo access for product                  :stg_repo, after stg_imgupd, 1d
    Apply ArgoCD Project and Application with Image Updater annotations :stg_app, after stg_repo stg_ns, 1d
    Initial sync and health validation                       :stg_sync, after stg_app, 1d
    Keycloak config - realm, clients, IdP, RBAC, MFA        :stg_kc, after stg_sync, 2d
    ArgoCD RBAC hardening and admin account controls         :stg_argohard, after stg_sync, 1d
    Populate GCP Secret Manager secrets with CMEK            :stg_secrets, after stg_sync, 1d
    Secrets audit - no secrets in Git, least-privilege       :stg_secaudit, after stg_secrets, 1d
    SSL certificate validation                               :stg_ssl, after stg_dns stg_sync, 1d
    Image scanning and chart provenance verification         :stg_imgscan, after stg_sync, 1d
    End-to-end smoke test all services                       :stg_smoke, after stg_kc stg_ssl stg_secaudit stg_imgscan stg_argohard, 2d
    Smoke test gate - all health checks pass                 :milestone, stg_smoke_gate, after stg_smoke, 0d
    Load testing against defined SLOs and capacity targets   :stg_loadtest, after stg_smoke_gate, 2d
    Load test gate - meets p95 latency and error thresholds  :milestone, stg_load_gate, after stg_loadtest, 0d
    Backup restore and Cloud SQL failover drill              :stg_backup, after stg_smoke_gate, 1d
    Staging deployment complete                              :milestone, stg_done, after stg_backup stg_load_gate, 0d

    section Phase 5 - Staging UAT [Customer and Vendor]
    Customer UAT on staging                                  :stg_uat, after stg_done, 5d
    Bug fixes and configuration adjustments                  :stg_fix, after stg_uat, 2d
    Customer staging sign-off                                :milestone, stg_signoff, after stg_fix, 0d

    section Phase 6 - Production Infrastructure [Vendor]
    Create Terraform state bucket for prod                   :prod_state, after stg_signoff, 1d
    Create KMS key for prod and grant service agent roles    :prod_kms, after prod_state, 1d
    Terraform - GCP APIs, VPC peering, Cloud NAT             :prod_net, after prod_kms, 1d
    Terraform - regional GKE cluster with monitoring_config and database_encryption :prod_gke, after prod_net, 2d
    Terraform - Cloud SQL PostgreSQL 15 Regional HA          :prod_sql, after prod_net, 1d
    Terraform - GCS buckets and Workload Identity            :prod_misc, after prod_gke, 1d
    Validate GMP metrics collection and Secrets encryption   :prod_gmpval, after prod_gke, 1d
    Cloud SQL post-deploy                                    :prod_sqlpost, after prod_sql prod_gke, 1d
    Security hardening - PSS, NetworkPolicies, Shielded Nodes :prod_security, after prod_gke, 1d
    IaC validation - terraform validate, tflint, tfsec       :prod_iactest, after prod_security, 1d
    IaC gate - zero critical tfsec findings                  :milestone, prod_iac_gate, after prod_iactest, 0d

    section Phase 7 - Production Add-ons and Apps [Vendor]
    Deploy nginx-ingress with Cloud Armor WAF policy          :prod_addons1, after prod_misc, 2d
    Deploy cert-manager, external-secrets, external-dns      :prod_addons1b, after prod_addons1, 1d
    Deploy ArgoCD                                            :prod_addons2, after prod_addons1b, 1d
    Deploy Image Updater with Git write-back and token CronJob :prod_imgupd, after prod_addons2, 1d
    Enable GKE Managed Prometheus and Cloud Monitoring        :prod_monitor_deploy, after prod_addons2, 1d
    Configure Cloud Logging sinks, retention, and alerts     :prod_audit, after prod_monitor_deploy, 1d
    Deploy Binary Authorization or OPA Gatekeeper policies   :prod_binauth, after prod_addons2, 1d
    Create K8s namespaces and service accounts               :prod_ns, after prod_imgupd, 1d
    DNS delegation for prod                                  :prod_dns, after prod_imgupd, 1d
    ArgoCD repo, project, app with Image Updater annotations :prod_argo, after prod_ns, 1d
    Initial sync and health validation                       :prod_sync, after prod_argo, 1d
    Secrets setup with CMEK and rotation                     :prod_secrets, after prod_sync, 1d
    Image scanning and chart provenance verification         :prod_imgscan, after prod_sync, 1d
    Keycloak config - realm, clients, IdP, RBAC, MFA        :prod_kc, after prod_secrets prod_dns, 2d
    ArgoCD RBAC hardening and admin account controls         :prod_argohard, after prod_sync, 1d
    End-to-end smoke test                                    :prod_smoke, after prod_kc prod_argohard, 2d
    Smoke test gate - all health checks pass                 :milestone, prod_smoke_gate, after prod_smoke, 0d
    Backup restore and Cloud SQL HA failover drill           :prod_backup, after prod_smoke_gate, 1d
    Load testing against defined SLOs and capacity targets   :prod_loadtest, after prod_smoke_gate, 2d
    Load test gate - meets p95 latency and error thresholds  :milestone, prod_load_gate, after prod_loadtest, 0d
    Production deployment complete                           :milestone, prod_done, after prod_backup prod_load_gate, 0d

    section Phase 8 - Go-Live [Joint]
    Customer UAT on production                               :prod_uat, after prod_done, 3d
    SLO dashboards and alert routing validation              :prod_monitor, after prod_done, 2d
    Incident runbooks and escalation procedures              :prod_runbooks, after prod_done, 3d
    Security assessment and vulnerability scan               :prod_secassess, after prod_done, 2d
    Security gate - no critical or high vulnerabilities      :milestone, prod_sec_gate, after prod_secassess, 0d
    Rollback drill - ArgoCD, Terraform state, DB migration   :prod_rollback, after prod_done, 2d
    Cross-region DR failover drill (if DR selected)          :prod_dr_drill, after prod_rollback, 2d
    GitOps repository failover test (if DR selected)         :prod_git_dr, after prod_dr_drill, 1d
    Go-live approval                                         :milestone, golive, after prod_uat prod_sec_gate prod_monitor prod_runbooks prod_git_dr, 0d
    Go-live cutover - DNS and final alerting                 :cutover, after golive, 1d
    Go-live complete                                         :milestone, live, after cutover, 0d

    section Phase 9 - Post Go-Live [Vendor]
    Hypercare - vendor-monitored, on-call                    :hypercare, after live, 10d
    Handover to steady-state support                         :milestone, steady, after hypercare, 0d

    section Ongoing - App Image Updates [Staging Auto, Prod Gated]
    New image pushed to container registry                   :upd_img_push, after steady, 1d
    Image Updater detects new tag in staging                 :upd_img_detect, after upd_img_push, 0d
    Image Updater writes tag back to Git (staging branch)    :upd_img_write, after upd_img_detect, 0d
    ArgoCD syncs new image to staging cluster                :upd_img_sync_stg, after upd_img_write, 0d
    Vendor validates staging deployment health               :upd_img_validate_stg, after upd_img_sync_stg, 1d
    Vendor opens promotion PR for production                 :upd_img_promote_pr, after upd_img_validate_stg, 1d
    Customer approves production promotion PR                :upd_img_approve, after upd_img_promote_pr, 1d
    ArgoCD syncs approved image to production                :upd_img_sync_prod, after upd_img_approve, 0d
    Vendor validates production deployment health            :upd_img_validate_prod, after upd_img_sync_prod, 1d
    Image update complete                                    :milestone, upd_img_done, after upd_img_validate_prod, 0d

    section Ongoing - Infra and Config Changes [Vendor-Managed]
    Vendor proposes change via protected branch PR           :upd_propose, after upd_img_done, 2d
    Image scanning and chart provenance verification         :upd_scan, after upd_propose, 1d
    Customer reviews and approves with required reviewers    :upd_approve, after upd_scan, 2d
    Vendor merges within ArgoCD sync window                  :upd_deploy, after upd_approve, 1d
    Vendor validates deployment health                       :upd_validate, after upd_deploy, 1d
    Update cycle complete                                    :milestone, upd_done, after upd_validate, 0d
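The recurring "regional GKE cluster with monitoring_config and database_encryption" Terraform tasks above can be sketched as follows; the cluster name, region, CIDR, and network references are illustrative assumptions, not values from the plan.

```hcl
resource "google_container_cluster" "staging" {
  name       = "staging-gke" # hypothetical name
  location   = "us-central1" # a region, so the control plane and nodes span zones
  network    = var.vpc_id    # assumed variables from terraform.tfvars
  subnetwork = var.subnet_id

  remove_default_node_pool = true
  initial_node_count       = 1

  # GKE Managed Prometheus, checked by the "Validate GMP metrics collection" task.
  monitoring_config {
    enable_components = ["SYSTEM_COMPONENTS"]
    managed_prometheus {
      enabled = true
    }
  }

  # Application-layer Secrets encryption with a customer-managed key (CMEK).
  database_encryption {
    state    = "ENCRYPTED"
    key_name = google_kms_crypto_key.gke.id # assumed KMS key resource
  }

  # Workload Identity pool, used by the SA/K8s binding tasks.
  workload_identity_config {
    workload_pool = "${var.project_id}.svc.id.goog"
  }

  # Private cluster; the enable_private_endpoint decision is an architecture
  # review outcome per the private endpoint policy in the key characteristics.
  private_cluster_config {
    enable_private_nodes    = true
    enable_private_endpoint = false
    master_ipv4_cidr_block  = "172.16.0.0/28" # illustrative CIDR
  }
}
```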

Vendor-Managed: Key Characteristics

  • Vendor owns execution of all Terraform and application deployment tasks
  • Two-layer deployment: Terraform deploys cluster add-ons via helm_release resources (nginx, cert-manager, external-secrets, external-dns, ArgoCD); ArgoCD then manages only the application workloads via GitOps
  • Customer touchpoints: prerequisites (incl. billing budget alerts), design sign-off, UAT on staging and production, go-live approval
  • GKE topology: regional cluster, multi-zone node pools, HPA/VPA autoscaling, PodDisruptionBudgets
  • Cloud SQL: PostgreSQL 15, private IP via VPC peering, Regional HA for production, failover drill before go-live
  • Security hardening: Cloud Armor WAF with pre-configured OWASP Top 10 rulesets (SQL injection, XSS, RCE, LFI) and rate-limiting policies applied to the nginx-ingress backend service; Pod Security Standards (Restricted profile); NetworkPolicies (default-deny with explicit allow rules); Keycloak RBAC with MFA; ArgoCD admin controls; image admission policies; GKE Application-layer Secrets Encryption with CMEK
  • Observability (all GCP-native): GKE Managed Prometheus for metrics collection (PromQL-compatible, stored in Cloud Monitoring); Cloud Monitoring for dashboards, SLOs, and alerting; Cloud Logging for centralized log aggregation (enabled by default on GKE); Cloud Audit Logs for security and compliance — no self-managed observability stack to operate
  • App image updates: staging is automatic via ArgoCD Image Updater with Git write-back for auditability; production requires a promotion PR approved by the customer before ArgoCD syncs — new images are validated in staging first, then promoted via a gated workflow
  • Infra and config changes: vendor submits PR → IaC validation (tflint, tfsec) → image scan + provenance check → customer reviews diff → vendor merges within ArgoCD sync window
  • Rollback: validated drill covering ArgoCD rollback, Terraform state recovery, and DB migration backout using Flyway/Liquibase with backward-compatible migrations
  • WIF access: tightly scoped custom IAM roles via SA impersonation, full audit trail with log sinks and security alerting, no long-lived credentials
  • Supply chain security: Binary Authorization / OPA Gatekeeper enforced as mandatory pre-production milestone with exception management
  • GKE clusters: private clusters with master authorized networks, Shielded Nodes (requires shielded_instance_config { enable_secure_boot = true, enable_integrity_monitoring = true } in the GKE node pool Terraform config)
  • Private endpoint policy: The architecture review must decide whether to fully disable the public control plane endpoint (enable_private_endpoint = true) or restrict access via master authorized networks only. The existing Terraform module defaults to enable_private_endpoint = false; update to true if the customer's security policy requires it and a validated bastion/VPN access path is confirmed.
  • Maintenance lifecycle: defined GKE upgrade cadence, add-on versioning, and patching policy for post-go-live
  • Deliverables: architecture diagram, runbooks, access matrix, SLO definitions, maintenance schedule
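The Shielded Nodes requirement spelled out above maps directly onto the node pool resource. A sketch, where the pool name, cluster reference, machine type, and autoscaling bounds are illustrative:

```hcl
resource "google_container_node_pool" "primary" {
  name     = "primary-pool" # hypothetical
  cluster  = "staging-gke"  # hypothetical cluster name
  location = "us-central1"

  node_config {
    machine_type = "e2-standard-4" # illustrative size

    # Shielded Nodes, exactly as required by the hardening tasks.
    shielded_instance_config {
      enable_secure_boot          = true
      enable_integrity_monitoring = true
    }
  }

  # Multi-zone regional pool with autoscaling (counts are per zone).
  autoscaling {
    min_node_count = 1
    max_node_count = 3
  }
}
```

The `tfsec`/`tflint` IaC gates in Phases 2 and 6 would flag a pool missing this block before it reaches the cluster.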

Plan 2: Customer-Managed Deployment (~13 Weeks)

Customer IT performs all provisioning and operations. The vendor provides training, documentation, review checkpoints, and validation at key milestones.

gantt
    title Customer-Managed Deployment
    dateFormat  YYYY-MM-DD
    axisFormat  %b %d

    section Phase 0 - Prerequisites [Customer]
    Provision GCP project and set billing budget alerts      :cust_gcp, 2026-03-02, 3d
    Create VPC and subnets                                   :cust_vpc, after cust_gcp, 2d
    Establish on-prem connectivity via VPN                   :cust_vpn, after cust_gcp, 5d
    Set up IAM roles for customer IT team                    :cust_iam, after cust_vpc, 2d
    Provision bastion host or validate VPN kubectl access path :cust_access, after cust_vpn, 2d
    Prerequisites complete                                   :milestone, prereq_done, after cust_iam cust_access, 0d

    section Phase 1 - Training GCP and Terraform [Vendor-led]
    GCP fundamentals - IAM, VPC, GKE                        :train_gcp, after prereq_done, 2d
    Terraform modules walkthrough                            :train_tf, after train_gcp, 2d
    Terraform state management and best practices            :train_state, after train_tf, 1d
    Hands-on lab - deploy dev environment guided             :train_lab1, after train_state, 2d
    Training checkpoint                                      :milestone, train1_done, after train_lab1, 0d

    section Phase 2 - Training Kubernetes and Add-ons [Vendor-led]
    GKE cluster operations and node management               :train_gke, after train1_done, 1d
    Helm add-ons walkthrough                                 :train_helm, after train_gke, 2d
    Workload Identity and service account bindings           :train_wi, after train_helm, 1d
    Cloud SQL operations and post-deployment steps           :train_sql, after train_gke, 1d
    Training checkpoint                                      :milestone, train2_done, after train_wi train_sql, 0d

    section Phase 3 - Training ArgoCD and GitOps [Vendor-led]
    ArgoCD concepts, architecture, RBAC                      :train_argo, after train2_done, 1d
    ArgoCD applications, projects, sync policies             :train_argoapp, after train_argo, 1d
    GitOps workflow - CICD and ArgoCD integration            :train_gitops, after train_argoapp, 1d
    Monitoring, alerting, troubleshooting runbooks           :train_ops, after train_gitops, 1d
    Hands-on lab - full stack deployment guided              :train_lab2, after train_ops, 2d
    All training complete                                    :milestone, train_done, after train_lab2, 0d

    section Phase 4 - Planning and Design [Joint]
    Architecture review - regional GKE, HA, SLOs, RTO-RPO   :arch, after train_done, 2d
    Network design - VPC, subnets, NAT, firewall, peering   :netdesign, after arch, 1d
    Customer prepares terraform.tfvars                       :envconfig, after arch, 2d
    Vendor reviews customer config                           :vendorreview, after envconfig, 1d
    Select vendor validation model (access, evidence, or screen-share) :valmodel, after vendorreview, 1d
    Design sign-off                                          :milestone, design_approved, after valmodel netdesign, 0d

    section Phase 5 - Staging Infrastructure [Customer IT]
    Create Terraform state bucket for staging                :stg_state, after design_approved, 1d
    Create KMS key and grant service agent roles             :stg_kms, after stg_state, 1d
    Terraform plan review with vendor                        :stg_planrev, after stg_kms, 1d
    Terraform - GCP APIs, VPC peering, Cloud NAT             :stg_net, after stg_planrev, 2d
    Terraform - regional GKE cluster with monitoring_config and database_encryption :stg_gke, after stg_net, 2d
    Terraform - Cloud SQL PostgreSQL 15 private IP           :stg_sql, after stg_net, 2d
    Terraform - GCS buckets                                  :stg_gcs, after stg_net, 1d
    Terraform - Workload Identity                            :stg_wi, after stg_gke, 1d
    Validate GMP metrics collection and Secrets encryption   :stg_gmpval, after stg_gke, 1d
    Cloud SQL post-deploy - extensions and users             :stg_sqlpost, after stg_sql stg_gke, 1d
    Security hardening - PSS, NetworkPolicies, Shielded Nodes :stg_security, after stg_gke, 1d
    IaC validation - terraform validate, tflint, tfsec       :stg_iactest, after stg_security, 1d
    IaC gate - zero critical tfsec findings                  :milestone, stg_iac_gate, after stg_iactest, 0d
    Vendor validates staging infra                           :stg_validate, after stg_wi stg_sqlpost stg_gcs stg_iac_gate stg_gmpval, 1d

    section Phase 6 - Staging Add-ons and Apps [Customer IT]
    Deploy nginx-ingress with Cloud Armor WAF policy          :stg_addons1, after stg_validate, 2d
    Deploy external-secrets and external-dns                 :stg_addons2, after stg_addons1, 1d
    Deploy ArgoCD                                            :stg_argo, after stg_addons1, 1d
    Deploy Image Updater with Git write-back and token CronJob :stg_imgupd, after stg_argo, 1d
    Enable GKE Managed Prometheus and Cloud Monitoring        :stg_monitor, after stg_argo, 1d
    Configure Cloud Logging sinks, retention, and alerts     :stg_audit, after stg_monitor, 1d
    Deploy Binary Authorization or OPA Gatekeeper policies   :stg_binauth, after stg_argo, 1d
    Create K8s namespaces and service accounts               :stg_ns, after stg_imgupd, 1d
    DNS delegation and NS record verification                :stg_dns, after stg_addons2, 2d
    Configure ArgoCD - repo, project, app with Image Updater annotations :stg_argoconf, after stg_imgupd stg_ns, 2d
    Initial sync and health validation                       :stg_sync, after stg_argoconf, 1d
    Secrets setup with CMEK and rotation                     :stg_secrets, after stg_sync, 1d
    Image scanning and chart provenance verification         :stg_imgscan, after stg_sync, 1d
    Keycloak config - realm, clients, IdP, RBAC, MFA        :stg_kc, after stg_secrets, 2d
    ArgoCD RBAC hardening and admin account controls         :stg_argohard, after stg_sync, 1d
    Secrets audit - no secrets in Git, least-privilege       :stg_secaudit, after stg_secrets, 1d
    SSL certificate validation                               :stg_ssl, after stg_dns stg_sync, 1d
    End-to-end smoke test                                    :stg_smoke, after stg_kc stg_ssl stg_secaudit stg_imgscan stg_argohard, 2d
    Smoke test gate - all health checks pass                 :milestone, stg_smoke_gate, after stg_smoke, 0d
    Load testing against defined SLOs and capacity targets   :stg_loadtest, after stg_smoke_gate, 2d
    Load test gate - meets p95 latency and error thresholds  :milestone, stg_load_gate, after stg_loadtest, 0d
    Backup restore and Cloud SQL failover drill              :stg_backup, after stg_smoke_gate, 1d
    Vendor validates staging deployment                      :stg_vendorval, after stg_backup stg_load_gate, 1d
    Staging deployment complete                              :milestone, stg_done, after stg_vendorval, 0d

    section Phase 7 - Staging UAT [Customer]
    Customer UAT on staging                                  :stg_uat, after stg_done, 5d
    Bug fixes and config adjustments                         :stg_fix, after stg_uat, 3d
    Customer staging sign-off                                :milestone, stg_signoff, after stg_fix, 0d

    section Phase 8 - Production Infrastructure [Customer IT]
    Create Terraform state bucket for prod                   :prod_state, after stg_signoff, 1d
    Create KMS key for prod and grant service agent roles    :prod_kms, after prod_state, 1d
    Terraform - GCP APIs, VPC peering, Cloud NAT             :prod_net, after prod_kms, 1d
    Terraform - regional GKE cluster with monitoring_config and database_encryption :prod_gke, after prod_net, 2d
    Terraform - Cloud SQL Regional HA and GCS buckets        :prod_sql, after prod_net, 2d
    Terraform - Workload Identity                            :prod_wi, after prod_gke, 1d
    Validate GMP metrics collection and Secrets encryption   :prod_gmpval, after prod_gke, 1d
    Cloud SQL post-deploy                                    :prod_sqlpost, after prod_sql prod_gke, 1d
    Security hardening - PSS, NetworkPolicies, Shielded Nodes :prod_security, after prod_gke, 1d
    IaC validation - terraform validate, tflint, tfsec       :prod_iactest, after prod_security, 1d
    IaC gate - zero critical tfsec findings                  :milestone, prod_iac_gate, after prod_iactest, 0d

    section Phase 9 - Production Add-ons and Apps [Customer IT]
    Deploy nginx-ingress with Cloud Armor WAF, cert-manager   :prod_addons, after prod_wi, 2d
    Deploy ArgoCD, external-secrets, external-dns            :prod_addons2, after prod_addons, 1d
    Deploy Image Updater with Git write-back and token CronJob :prod_imgupd, after prod_addons2, 1d
    Enable GKE Managed Prometheus and Cloud Monitoring        :prod_monitor_deploy, after prod_addons2, 1d
    Configure Cloud Logging sinks, retention, and alerts     :prod_audit, after prod_monitor_deploy, 1d
    Deploy Binary Authorization or OPA Gatekeeper policies   :prod_binauth, after prod_addons2, 1d
    ArgoCD config - repo, project, app with Image Updater annotations :prod_argo, after prod_imgupd, 2d
    Initial sync and health validation                       :prod_sync, after prod_argo, 1d
    Secrets setup with CMEK and rotation                     :prod_secrets, after prod_sync, 1d
    Image scanning and chart provenance verification         :prod_imgscan, after prod_sync, 1d
    Keycloak config - realm, clients, IdP, RBAC, MFA        :prod_kc, after prod_secrets, 2d
    ArgoCD RBAC hardening and admin account controls         :prod_argohard, after prod_sync, 1d
    SSL certificate and DNS validation                       :prod_ssl, after prod_addons prod_sync, 2d
    End-to-end smoke test                                    :prod_smoke, after prod_kc prod_ssl prod_argohard, 2d
    Smoke test gate - all health checks pass                 :milestone, prod_smoke_gate, after prod_smoke, 0d
    Backup restore and Cloud SQL HA failover drill           :prod_backup, after prod_smoke_gate, 1d
    Load testing against defined SLOs and capacity targets   :prod_loadtest, after prod_smoke_gate, 2d
    Load test gate - meets p95 latency and error thresholds  :milestone, prod_load_gate, after prod_loadtest, 0d
    Vendor validates production deployment                   :prod_vendorval, after prod_backup prod_load_gate, 1d
    Production deployment complete                           :milestone, prod_done, after prod_vendorval, 0d

    section Phase 10 - Go-Live [Customer]
    Customer UAT on production                               :prod_uat, after prod_done, 3d
    SLO dashboards and alert routing validation              :prod_monitor, after prod_done, 2d
    Incident runbooks and escalation procedures              :prod_runbooks, after prod_done, 3d
    Security assessment and vulnerability scan               :prod_secassess, after prod_done, 2d
    Security gate - no critical or high vulnerabilities      :milestone, prod_sec_gate, after prod_secassess, 0d
    Rollback drill - ArgoCD, Terraform state, DB migration   :prod_rollback, after prod_done, 2d
    Cross-region DR failover drill (if DR selected)          :prod_dr_drill, after prod_rollback, 2d
    GitOps repository failover test (if DR selected)         :prod_git_dr, after prod_dr_drill, 1d
    Go-live approval                                         :milestone, golive, after prod_uat prod_sec_gate prod_monitor prod_runbooks prod_git_dr, 0d
    Go-live cutover - DNS and final alerting                 :cutover, after golive, 1d
    Go-live complete                                         :milestone, live, after cutover, 0d

    section Phase 11 - Post Go-Live
    Hypercare - vendor tier-2 and tier-3, customer monitors  :hypercare, after live, 10d
    Incident response drills                                 :postgo_drills, after live, 2d
    Knowledge transfer - advanced runbooks                   :kt_runbooks, after live, 3d
    Finalize operational readiness deliverables              :kt_deliverables, after live, 5d
    On-call rotation setup and escalation procedures         :kt_oncall, after kt_runbooks, 2d
    Handover to customer-managed steady-state                :milestone, steady, after hypercare kt_runbooks kt_deliverables kt_oncall, 0d

    section Ongoing - App Image Updates [Automatic]
    New image pushed to container registry                   :upd_img_push, after steady, 1d
    Image Updater detects tag, writes back to Git            :upd_img_stg, after upd_img_push, 0d
    ArgoCD syncs new image to staging                        :upd_img_stgsync, after upd_img_stg, 0d
    Customer validates staging                               :upd_img_stgval, after upd_img_stgsync, 1d
    Customer promotes to production via sync window          :upd_img_prod, after upd_img_stgval, 1d
    Image update complete                                    :milestone, upd_img_done, after upd_img_prod, 0d

    section Ongoing - Infra and Config Changes [Customer-Managed]
    Vendor publishes release notes and Terraform changes     :upd_release, after upd_img_done, 1d
    Image scanning and chart provenance verification         :upd_scan, after upd_release, 1d
    Customer IT evaluates changes                            :upd_eval, after upd_scan, 3d
    Customer applies Terraform to staging                    :upd_stg, after upd_eval, 2d
    Customer validates staging                               :upd_stgval, after upd_stg, 2d
    Customer applies to production via sync window           :upd_prod, after upd_stgval, 1d
    Customer validates production                            :upd_prodval, after upd_prod, 1d
    Update cycle complete                                    :milestone, upd_done, after upd_prodval, 0d
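The "Deploy ArgoCD" add-on tasks in both plans run through Terraform's helm_release, per the two-layer model in the key characteristics. A hedged sketch: only the chart version comes from the plan; the namespace and values override are assumptions.

```hcl
# Terraform-managed ArgoCD install (layer one); ArgoCD then manages only the
# application workloads via GitOps (layer two).
resource "helm_release" "argocd" {
  name             = "argocd"
  repository       = "https://argoproj.github.io/argo-helm"
  chart            = "argo-cd"
  version          = "5.51.6" # "ArgoCD v5.51.6" in the plan is this chart version
  namespace        = "argocd"
  create_namespace = true

  # Illustrative override: run the API server without its own TLS, assuming
  # TLS terminates at nginx-ingress in front of ArgoCD.
  set {
    name  = "configs.params.server\\.insecure"
    value = "true"
  }
}
```

The RBAC hardening and admin-account control tasks would land as further values overrides on this same release.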

Customer-Managed: Key Characteristics

  • ~3 weeks of training before execution begins (GCP, Terraform, GKE, Helm, ArgoCD, GitOps)
  • Customer IT owns execution with vendor review checkpoints at each milestone
  • Two-layer deployment: Terraform deploys cluster add-ons via helm_release resources (nginx, cert-manager, external-secrets, external-dns, ArgoCD); ArgoCD then manages only the application workloads via GitOps
  • GKE topology: regional cluster, multi-zone node pools, HPA/VPA autoscaling, PodDisruptionBudgets
  • Cloud SQL: PostgreSQL 15, private IP via VPC peering, Regional HA for production, failover drill before go-live
  • Security hardening: Cloud Armor WAF with pre-configured OWASP Top 10 rulesets (SQL injection, XSS, RCE, LFI) and rate-limiting policies; Pod Security Standards (Restricted profile); NetworkPolicies (default-deny with explicit allow rules); Keycloak RBAC with MFA; ArgoCD admin controls; image admission policies; GKE Application-layer Secrets Encryption with CMEK
  • Observability (all GCP-native): GKE Managed Prometheus for metrics collection (PromQL-compatible, stored in Cloud Monitoring); Cloud Monitoring for dashboards, SLOs, and alerting; Cloud Logging for centralized log aggregation (enabled by default on GKE); Cloud Audit Logs for security and compliance — no self-managed observability stack to operate
  • Vendor validates staging infra, staging deployment, and production deployment before progression (via defined validation model -- see Vendor Validation Model section)
  • App image updates: automatic via ArgoCD Image Updater with Git write-back — new images detected, tag committed to Git via least-privilege token for auditability; customer validates staging then promotes to production
  • Infra and config changes: vendor publishes release notes → IaC validation (tflint, tfsec) → customer evaluates, tests on staging, promotes to production within sync window
  • Rollback: validated drill covering ArgoCD rollback, Terraform state recovery, and DB migration backout using Flyway/Liquibase with backward-compatible migrations
  • No vendor access to customer GCP — customer uses their own IAM credentials; vendor validation occurs via defined model (time-bound access, evidence, or screen-share)
  • Supply chain security: Binary Authorization / OPA Gatekeeper enforced as mandatory pre-production milestone
  • GKE clusters: private clusters with master authorized networks, Shielded Nodes
  • Private endpoint policy: Same as vendor-managed -- architecture review decides enable_private_endpoint setting based on customer security requirements and confirmed bastion/VPN access path.
  • Maintenance lifecycle: defined GKE upgrade cadence, add-on versioning, and patching policy for post-go-live
  • Deliverables: architecture diagram, runbooks (incident response, routine operations, troubleshooting), access matrix, SLO definitions with dashboard templates, incident response and escalation procedures, maintenance schedule, DR procedures, on-call rotation template

Comparison

| Dimension | Vendor-Managed | Customer-Managed |
|---|---|---|
| Go-live timeline | ~9 weeks | ~13 weeks |
| Training overhead | None | ~3 weeks |
| Customer IT effort | Minimal (approvals + UAT) | Heavy (all execution) |
| Deployment risk | Low (vendor expertise) | Medium (learning curve) |
| Time to staging | ~3 weeks after prereqs | ~6 weeks after prereqs |
| Ongoing update effort (customer) | Review & approve (~hours) | Evaluate, test, apply (~days) |
| Vendor dependency (ongoing) | High | Low |
| Operational self-sufficiency | Low | High |
| Access model | WIF (scoped, auditable, revocable) | Customer-only IAM |
| Deployment project cost | Lower (faster) | Higher (training + longer timeline) |
| Ongoing operations cost | Higher (vendor management fee) | Lower (internal team) |
| Security posture | Good (WIF, audit trail) | Better (no external access) |

Service Architecture: GCP-Native vs. Cluster Add-ons

The deployment maximizes GCP-managed services to reduce operational overhead. Cluster add-ons are used only where GCP lacks an equivalent or where the existing Terraform modules already configure them.

Deployment mechanism (two-layer architecture):

  • Layer 1 — Terraform (cluster-addons.tf): Deploys all cluster add-ons via helm_release resources as part of terraform apply. This includes nginx-ingress, cert-manager, external-secrets, external-dns, and ArgoCD itself. Terraform also creates namespaces, service accounts, and ClusterIssuers.
  • Layer 2 — ArgoCD: Manages only the application workloads. ArgoCD watches the application repository (separate from the infra repo) and syncs Helm releases to the em-semi-app, em-semi-workflow, and em-semi-keycloak namespaces. ArgoCD does not manage its own add-ons or other cluster infrastructure.
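
As an illustrative sketch of Layer 1, the ArgoCD install itself can be expressed as a Terraform helm_release (the repository URL is the public argo-helm repo; the values file path and the depends_on target are assumptions about the module layout, not taken from the actual repo):

```hcl
# Layer 1: Terraform installs ArgoCD; ArgoCD then manages only application workloads.
resource "helm_release" "argocd" {
  name             = "argocd"
  namespace        = "argocd"
  create_namespace = true

  repository = "https://argoproj.github.io/argo-helm"
  chart      = "argo-cd"
  version    = "5.51.6" # pinned chart version, matching the add-on table

  values = [file("${path.module}/values/argocd.yaml")] # illustrative path

  depends_on = [google_container_node_pool.primary] # hypothetical node pool resource
}
```

The other add-ons (nginx-ingress, cert-manager, external-secrets, external-dns) follow the same pattern with their respective charts and pinned versions.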
| Category | Service | Type | Rationale |
|---|---|---|---|
| Compute | GKE (regional, private) | GCP-native | Primary workload platform |
| Database | Cloud SQL (PostgreSQL 15) | GCP-native | Managed HA, automated backups, PITR |
| Object Storage | Cloud Storage (GCS) | GCP-native | Data, cache, workflows, logs buckets |
| DNS | Cloud DNS | GCP-native | Managed zones, DNSSEC |
| DNS Automation | external-dns (v1.14.0) | Cluster add-on | K8s-native DNS record lifecycle; uses Cloud DNS as provider |
| NAT | Cloud NAT | GCP-native | Outbound internet for private nodes |
| IAM | Cloud IAM + Workload Identity | GCP-native | Pod-level identity, no key files |
| Secrets Backend | Secret Manager | GCP-native | CMEK encryption, audit trail, rotation |
| Secrets Sync | external-secrets (v0.9.11) | Cluster add-on | Bridges Secret Manager to K8s Secrets; standard GitOps pattern |
| Ingress | nginx-ingress (v4.9.0) | Cluster add-on | Already configured in repo with LoadBalancer service type |
| TLS Certificates | cert-manager (v1.14.0) | Cluster add-on | Let's Encrypt + DNS-01 via Cloud DNS; already configured in repo |
| GitOps | ArgoCD (v5.51.6) | Cluster add-on | No GCP equivalent; core deployment mechanism |
| Image Updates | ArgoCD Image Updater | Cluster add-on | Git write-back for auditability; no GCP equivalent |
| Metrics | GKE Managed Prometheus | GCP-native | Managed collection pipeline, PromQL-compatible, Cloud Monitoring storage |
| Dashboards | Cloud Monitoring | GCP-native | SLOs, alerting policies, custom dashboards |
| Logging | Cloud Logging | GCP-native | Enabled by default on GKE; Log Router for export |
| Audit | Cloud Audit Logs | GCP-native | Admin Activity, Data Access, System Event logs |
| Container Registry | Artifact Registry | GCP-native | Docker image storage, vulnerability scanning |
| Build | Cloud Build | GCP-native | CI/CD (API enabled) |
| Identity | Keycloak | Cluster add-on | Application-level IdP; GCP equivalent (Identity Platform) doesn't cover all use cases |
| Query Analytics | Cloud SQL Query Insights | GCP-native | Built-in query performance analysis; enabled by default in Terraform module |
| Notifications | ArgoCD Notifications | Cluster add-on | Slack, webhook, email notifications for deployment events; configured via annotations |
| Admission Control | Binary Authorization or OPA Gatekeeper | GCP-native or add-on | Binary Authorization is GCP-native; OPA Gatekeeper is an alternative |

Self-managed components eliminated by using GCP-native services:

  • Prometheus server → GKE Managed Prometheus (managed collection, Cloud Monitoring backend)
  • Grafana → Cloud Monitoring dashboards (optional: deploy Grafana with Cloud Monitoring data source for advanced visualization)
  • Loki + Promtail → Cloud Logging (enabled by default, zero deployment)
  • Alertmanager → Cloud Monitoring alerting policies with notification channels

Actions by Role and Automation Level

Vendor-Managed Model

| Phase | Action | Actor | Automation | Tooling |
|---|---|---|---|---|
| Prerequisites | Provision GCP project, billing alerts | Customer | Manual | GCP Console, gcloud |
| | Create VPC, subnets, firewall rules | Customer | Manual | GCP Console or Terraform |
| | Establish on-prem VPN connectivity | Customer | Manual | Cloud VPN, network appliance |
| | Configure WIF pool and OIDC provider | Customer | Manual | gcloud, Terraform |
| | Create deployment SAs and assign IAM roles | Customer | Manual | gcloud, Terraform |
| | Validate WIF and SA impersonation | Vendor | Manual | gcloud auth, CI test job |
| Planning | Kickoff, architecture review, network design | Joint | Manual | Meetings, documentation |
| | Environment config (terraform.tfvars) | Vendor | Manual | Text editor |
| | Customer design sign-off | Customer | Manual | Approval gate |
| Infrastructure | Terraform plan and apply (GKE, Cloud SQL, GCS, WI) | Vendor | Semi-automated | Terraform CLI or CI/CD pipeline |
| | IaC validation (tflint, tfsec, checkov) | Vendor | Automated | CI pipeline on PR |
| | Security hardening (PSS, NetworkPolicies) | Vendor | Manual | kubectl, Terraform |
| | Cloud SQL post-deploy (extensions, users) | Vendor | Manual | psql, gcloud |
| Add-ons | Deploy cluster add-ons via Terraform helm_release (nginx, cert-manager, external-secrets, external-dns, ArgoCD) | Vendor | Semi-automated | Terraform (helm_release resources) |
| | Enable GKE Managed Prometheus, create Cloud Monitoring dashboards | Vendor | Semi-automated | Terraform, gcloud |
| | Configure Cloud Logging sinks and alerts | Vendor | Manual | gcloud, Terraform |
| | Deploy Binary Authorization or OPA Gatekeeper | Vendor | Semi-automated | Terraform, gcloud |
| | DNS delegation and NS record setup | Vendor | Manual | Cloud DNS, customer DNS registrar |
| App Deployment | Apply ArgoCD Project and Application manifests | Vendor | Manual | kubectl apply (ArgoCD CRDs) |
| | Initial sync and health validation | Vendor | Automated | ArgoCD auto-sync |
| | Keycloak config (realm, clients, IdP, RBAC, MFA) | Vendor | Manual | Keycloak admin console |
| | Populate secrets in GCP Secret Manager | Vendor | Manual | gcloud |
| | SSL certificate validation | Vendor | Manual | curl, openssl |
| | End-to-end smoke test | Vendor | Semi-automated | Test scripts, manual verification |
| | Load testing | Vendor | Semi-automated | k6, Locust, or similar |
| | Backup restore and failover drill | Vendor | Manual | gcloud, psql |
| UAT | User acceptance testing | Customer | Manual | Application UI |
| | Bug fixes and config adjustments | Vendor | Manual | Code changes, Terraform |
| | Customer sign-off | Customer | Manual | Approval gate |
| Go-Live | Customer UAT on production | Customer | Manual | Application UI |
| | SLO dashboards and alert routing validation | Joint | Manual | Cloud Monitoring |
| | Incident runbooks and escalation procedures | Vendor | Manual | Documentation |
| | Security assessment and vulnerability scan | Joint | Semi-automated | Scanner tools |
| | Rollback drill | Vendor | Manual | ArgoCD, Terraform, psql |
| | DNS cutover and final alerting | Vendor | Manual | Cloud DNS, Cloud Monitoring |
| | Go-live approval | Customer | Manual | Approval gate |
| Post Go-Live | Hypercare monitoring and on-call | Vendor | Manual + automated alerts | Cloud Monitoring, PagerDuty |
| Ongoing - Images | New image pushed to registry | Vendor CI | Automated | GitHub Actions, Artifact Registry |
| | Image Updater detects new tag (staging) | Automatic | Automated | ArgoCD Image Updater (polling) |
| | Tag written back to Git (staging branch) | Automatic | Automated | Image Updater Git write-back |
| | ArgoCD syncs new image to staging | Automatic | Automated | ArgoCD auto-sync |
| | Vendor validates staging deployment | Vendor | Manual | kubectl, dashboards |
| | Vendor opens promotion PR for production | Vendor | Manual | Git, GitHub |
| | Customer approves production promotion PR | Customer | Manual | GitHub PR review |
| | ArgoCD syncs approved image to production | Automatic | Automated | ArgoCD auto-sync |
| | Vendor validates production deployment | Vendor | Manual | kubectl, dashboards |
| Ongoing - Infra | Vendor proposes change via PR | Vendor | Manual | Git, GitHub |
| | IaC validation and image scanning on PR | Automatic | Automated | CI pipeline |
| | Customer reviews and approves PR | Customer | Manual | GitHub PR review |
| | Vendor merges and runs terraform apply | Vendor | Semi-automated | Git merge, Terraform CLI or CI/CD |
| | ArgoCD syncs app-layer changes (if any) | Automatic | Automated | ArgoCD auto-sync |
| | Deployment health validation | Vendor | Manual | kubectl, Cloud Monitoring |

Customer-Managed Model

| Phase | Action | Actor | Automation | Tooling |
|---|---|---|---|---|
| Prerequisites | Provision GCP project, billing alerts | Customer | Manual | GCP Console, gcloud |
| | Create VPC, subnets, IAM roles | Customer | Manual | GCP Console or Terraform |
| | Establish on-prem VPN connectivity | Customer | Manual | Cloud VPN, network appliance |
| Training | GCP fundamentals, Terraform, GKE, Helm, ArgoCD | Vendor-led | Manual | Workshops, hands-on labs |
| | Hands-on labs (dev environment deployment) | Customer | Manual (guided) | Terraform, kubectl, Helm |
| Planning | Architecture review, network design | Joint | Manual | Meetings, documentation |
| | Customer prepares terraform.tfvars | Customer | Manual | Text editor |
| | Vendor reviews customer config | Vendor | Manual | Code review |
| | Select vendor validation model | Joint | Manual | Decision gate |
| | Design sign-off | Customer | Manual | Approval gate |
| Infrastructure | Terraform plan review with vendor | Joint | Manual | Terraform plan output |
| | Terraform plan and apply (GKE, Cloud SQL, GCS, WI) | Customer IT | Semi-automated | Terraform CLI or CI/CD pipeline |
| | IaC validation (tflint, tfsec, checkov) | Customer IT | Automated | CI pipeline on PR |
| | Security hardening (PSS, NetworkPolicies) | Customer IT | Manual | kubectl, Terraform |
| | Cloud SQL post-deploy (extensions, users) | Customer IT | Manual | psql, gcloud |
| | Vendor validates staging infra | Vendor | Manual | Per validation model (A/B/C) |
| Add-ons | Deploy cluster add-ons via Terraform helm_release (nginx, cert-manager, external-secrets, external-dns, ArgoCD) | Customer IT | Semi-automated | Terraform (helm_release resources) |
| | Enable GKE Managed Prometheus, create Cloud Monitoring dashboards | Customer IT | Semi-automated | Terraform, gcloud |
| | Configure Cloud Logging sinks and alerts | Customer IT | Manual | gcloud, Terraform |
| | Deploy Binary Authorization or OPA Gatekeeper | Customer IT | Semi-automated | Terraform, gcloud |
| | DNS delegation and NS record setup | Customer IT | Manual | Cloud DNS |
| App Deployment | Apply ArgoCD Project and Application manifests | Customer IT | Manual | kubectl apply (ArgoCD CRDs) |
| | Initial sync and health validation | Customer IT | Automated | ArgoCD auto-sync |
| | Keycloak config (realm, clients, IdP, RBAC, MFA) | Customer IT | Manual | Keycloak admin console |
| | Secrets setup in GCP Secret Manager | Customer IT | Manual | gcloud |
| | End-to-end smoke test | Customer IT | Semi-automated | Test scripts, manual verification |
| | Load testing | Customer IT | Semi-automated | k6, Locust, or similar |
| | Backup restore and failover drill | Customer IT | Manual | gcloud, psql |
| | Vendor validates staging deployment | Vendor | Manual | Per validation model (A/B/C) |
| UAT | User acceptance testing | Customer | Manual | Application UI |
| | Bug fixes and config adjustments | Customer IT | Manual | Code changes, Terraform |
| | Customer sign-off | Customer | Manual | Approval gate |
| Go-Live | Customer UAT on production | Customer | Manual | Application UI |
| | SLO dashboards and alert routing validation | Customer | Manual | Cloud Monitoring |
| | Incident runbooks and escalation procedures | Customer | Manual | Documentation |
| | Security assessment and vulnerability scan | Customer | Semi-automated | Scanner tools |
| | Rollback drill | Customer IT | Manual | ArgoCD, Terraform, psql |
| | DNS cutover and final alerting | Customer IT | Manual | Cloud DNS, Cloud Monitoring |
| | Go-live approval | Customer | Manual | Approval gate |
| Post Go-Live | Hypercare monitoring (customer primary, vendor tier-2/3) | Joint | Manual + automated alerts | Cloud Monitoring, PagerDuty |
| | Knowledge transfer (advanced runbooks) | Vendor | Manual | Documentation, workshops |
| Ongoing - Images | New image pushed to registry | Vendor CI | Automated | GitHub Actions, Artifact Registry |
| | Image Updater detects new tag | Automatic | Automated | ArgoCD Image Updater (polling) |
| | Tag written back to Git | Automatic | Automated | Image Updater Git write-back |
| | ArgoCD syncs new image to staging | Automatic | Automated | ArgoCD auto-sync |
| | Customer validates staging | Customer | Manual | Application UI, dashboards |
| | Customer promotes to production via sync window | Customer | Manual | ArgoCD sync or Git merge |
| Ongoing - Infra | Vendor publishes release notes and Terraform changes | Vendor | Manual | Git, documentation |
| | IaC validation and image scanning | Automatic | Automated | CI pipeline |
| | Customer IT evaluates changes | Customer IT | Manual | Code review, documentation |
| | Customer applies Terraform to staging | Customer IT | Semi-automated | Terraform CLI or CI/CD |
| | Customer validates staging | Customer | Manual | Application UI, dashboards |
| | Customer applies to production via sync window | Customer IT | Semi-automated | Terraform CLI or CI/CD |
| | Customer validates production | Customer | Manual | Application UI, dashboards |

Automation Level Definitions

| Level | Definition | Examples |
|---|---|---|
| Automated | Runs without human intervention; triggers on events | ArgoCD auto-sync, Image Updater polling, CI pipeline on PR |
| Semi-automated | Human initiates; tooling executes | terraform apply, helm install, load test run |
| Manual | Human performs directly; requires judgment | Architecture review, Keycloak config, UAT, approval gates |
| Manual + automated alerts | Human monitors; system generates alerts | Hypercare period with Cloud Monitoring alerting |

Recommendation

Vendor-managed is recommended for initial deployments:

  1. Faster time to value — staging delivered ~3 weeks sooner
  2. Lower deployment risk — vendor knows the Terraform module dependency chain, Cloud SQL post-deploy steps, Workload Identity binding topology, and ArgoCD sync policy nuances
  3. WIF eliminates the security concern — scoped IAM roles, Cloud Audit Logs, instant revocation via pool deletion; no long-lived credentials
  4. Lightweight ongoing model for customer — review a diff, approve, ArgoCD auto-syncs; ~hours not days per update cycle
  5. Transition path exists — customer can move to self-managed later with condensed training against a working system (more effective than training before deployment)

Choose customer-managed when: the customer has a strong platform engineering team, regulatory constraints prohibit any external infrastructure access, or building deep GCP/K8s competency is a strategic goal.


Vendor Access: Workload Identity Federation (WIF)

For the vendor-managed model, WIF is recommended over traditional service account keys:

| Aspect | Service Account Key | Workload Identity Federation |
|---|---|---|
| Credential type | Long-lived JSON key file | Short-lived OIDC tokens |
| Key rotation | Manual, error-prone | Automatic (token-based) |
| Revocation | Delete key, redeploy | Disable WIF pool (instant) |
| Audit trail | Cloud Audit Logs | Cloud Audit Logs + OIDC claims |
| CI/CD integration | Store key as secret | Native GitHub Actions OIDC |
| Blast radius | Key leak = full access until rotated | Token expires in minutes |

Setup (customer responsibility):

  1. Create a Workload Identity Pool in their GCP project
  2. Add an OIDC provider (vendor's GitHub Actions or Google Workspace) with attribute constraints (repo, environment, branch)
  3. Create separate deployment Service Accounts for staging and production with tightly scoped custom IAM roles (never use primitive roles like Editor or broad admin roles). Use custom roles with minimal permissions per phase:
    • Infrastructure provisioning: roles/container.clusterAdmin (not container.admin), roles/cloudsql.editor (not cloudsql.admin), roles/storage.objectAdmin on specific buckets
    • Workload Identity: roles/iam.serviceAccountUser scoped to specific SAs via IAM Conditions
    • DNS: roles/dns.admin scoped to specific managed zones
    • Secrets: roles/secretmanager.secretVersionManager (not secretmanager.admin)
    • Networking: roles/compute.networkAdmin for VPC peering and Cloud NAT creation, roles/compute.routerAdmin for Cloud Router management; downgrade to roles/compute.networkUser on specific subnets for runtime workloads after infrastructure provisioning is complete
    • Prefer custom roles with only the exact permissions required; use IAM Conditions to restrict by resource name, environment label, or time window where possible
  4. Grant roles/iam.workloadIdentityUser on the SA to the WIF pool (SA impersonation pattern)
  5. Share the WIF pool ID and project number with vendor
  6. Schedule periodic access reviews (quarterly recommended)
  7. Permission audits: Conduct quarterly reviews of SA permissions using IAM Recommender to identify and remove unused roles
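
Steps 1, 2, and 4 can be sketched in Terraform as follows (pool, provider, repository, and service account names are hypothetical placeholders the customer would replace with their own):

```hcl
# Step 1: Workload Identity Pool in the customer's project
resource "google_iam_workload_identity_pool" "vendor" {
  workload_identity_pool_id = "vendor-deploy-pool"
}

# Step 2: OIDC provider for the vendor's GitHub Actions, constrained by repo
resource "google_iam_workload_identity_pool_provider" "github" {
  workload_identity_pool_id          = google_iam_workload_identity_pool.vendor.workload_identity_pool_id
  workload_identity_pool_provider_id = "vendor-github-oidc"

  attribute_mapping = {
    "google.subject"       = "assertion.sub"
    "attribute.repository" = "assertion.repository"
  }
  # Only tokens minted for the vendor's deployment repo are accepted
  attribute_condition = "assertion.repository == 'vendor-org/deploy-repo'"

  oidc {
    issuer_uri = "https://token.actions.githubusercontent.com"
  }
}

# Step 4: allow pool identities to impersonate the staging deploy SA
resource "google_service_account_iam_member" "wif_staging" {
  service_account_id = google_service_account.deploy_staging.name
  role               = "roles/iam.workloadIdentityUser"
  member             = "principalSet://iam.googleapis.com/${google_iam_workload_identity_pool.vendor.name}/attribute.repository/vendor-org/deploy-repo"
}
```

A matching binding on a separate production SA follows the same pattern, keeping staging and production impersonation independently revocable.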

Deployment Notes

VPC and Network Configuration

The Terraform modules reference a VPC by name. The existing repo defaults to the default VPC and subnet with auto-allocated secondary IP ranges for GKE pods/services. For customer deployments:

  • VPC: The customer's VPC name and subnet CIDR ranges must be specified in terraform.tfvars. The customer may use an existing shared VPC or a dedicated VPC — this must be resolved during the architecture review (Phase 1)
  • Subnets: Define secondary IP ranges for GKE pods and services explicitly (the repo currently uses auto-allocation, which should be replaced with planned CIDR ranges for production)
  • Cloud SQL: Connects via private IP over VPC peering, which requires the VPC to have Private Services Access configured (the vpc-peering module handles this)
  • Cloud DNS: The repo uses a shared managed zone pattern (data.google_dns_managed_zone.shared) referencing a pre-existing zone, not creating one per environment. For customer deployments, determine whether the customer provides an existing zone or a new one is created via the cloud-dns module
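
A minimal sketch of a subnet with explicit secondary ranges, replacing auto-allocation (names and CIDRs are illustrative; the actual ranges come out of the Phase 1 architecture review):

```hcl
resource "google_compute_subnetwork" "gke" {
  name          = "gke-staging"
  network       = google_compute_network.customer_vpc.id # hypothetical VPC resource
  region        = "us-central1"
  ip_cidr_range = "10.10.0.0/20" # node range

  secondary_ip_range {
    range_name    = "pods"
    ip_cidr_range = "10.20.0.0/16" # planned, not auto-allocated
  }
  secondary_ip_range {
    range_name    = "services"
    ip_cidr_range = "10.30.0.0/20"
  }
}
```

The range names ("pods", "services") are then referenced from the GKE cluster's IP allocation policy, so pod and service CIDRs stay deliberate and documented.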

GKE Cluster Topology

Production clusters should be regional (not zonal) with multi-zone node pools for HA. The architecture review must define:

  • Regional cluster with control plane replicated across 3 zones
  • Node pool autoscaling ranges (min/max nodes per pool)
  • PodDisruptionBudgets for all critical workloads
  • HPA (Horizontal Pod Autoscaler) targets for application services
  • VPA (Vertical Pod Autoscaler) recommendations for right-sizing
  • Cluster Autoscaler configuration for node-level scaling
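
A hedged sketch of the regional-cluster and node-autoscaling pieces in Terraform (resource names, region, zones, and bounds are illustrative, not the module's actual values):

```hcl
# Passing a region (not a zone) as location makes the cluster regional:
# the control plane is replicated across three zones automatically.
resource "google_container_cluster" "prod" {
  name     = "prod"
  location = "us-central1"
  # networking, private-cluster, and security settings elided
}

resource "google_container_node_pool" "primary" {
  name           = "primary"
  cluster        = google_container_cluster.prod.name
  location       = "us-central1"
  node_locations = ["us-central1-a", "us-central1-b", "us-central1-c"]

  autoscaling {
    min_node_count = 1 # per zone; bounds come from the capacity plan
    max_node_count = 5
  }
}
```

HPA, VPA, and PodDisruptionBudgets are Kubernetes-level objects layered on top of this; they are defined in the application Helm charts rather than in Terraform.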

Disaster Recovery (Cross-Region)

The architecture review must clarify cross-region DR requirements based on RTO/RPO targets:

  • Single-region HA (default): Multi-zone regional GKE + Cloud SQL Regional HA covers zone-level failures; sufficient for most deployments
  • Cross-region DR (if required by RTO/RPO): Plan for cross-region Cloud SQL read replica promotion, GCS cross-region replication, and a passive standby GKE cluster in a secondary region
  • Decision criteria: If RTO < 1 hour and RPO < 5 minutes for regional outages, cross-region DR is recommended; otherwise, single-region HA with backup/restore is sufficient
  • DR drill: If cross-region is implemented, include a DR failover drill in Phase 8 (Go-Live) tasks
  • GitOps repository DR: The Git repository is the single source of truth for all infrastructure and application configuration. Ensure the Git hosting provider (e.g., GitHub) has redundancy. Additionally, configure a mirror repository (e.g., GitHub -> Cloud Source Repositories or a self-hosted GitLab) that can serve as a fallback if the primary Git provider experiences an outage. Include Git repository recovery in the DR drill.
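
If cross-region DR is required, the Cloud SQL piece can be sketched as a cross-region read replica (instance names, regions, and tier are illustrative):

```hcl
# Passive DR: a read replica in a secondary region, promotable on regional failure.
resource "google_sql_database_instance" "dr_replica" {
  name                 = "app-db-dr"
  region               = "us-east1" # secondary region
  database_version     = "POSTGRES_15"
  master_instance_name = google_sql_database_instance.primary.name # hypothetical primary

  settings {
    tier              = "db-custom-4-16384" # match or size below the primary
    availability_type = "ZONAL"             # replicas are typically zonal
  }
}
```

Promotion (gcloud sql instances promote-replica) is a one-way operation that detaches the replica from the primary, so it belongs in the rehearsed DR drill, not in day-to-day runbooks.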

Observability Stack (GCP-Native)

The observability stack uses GCP-managed services, eliminating self-managed Prometheus/Grafana/Loki deployments:

Metrics — GKE Managed Prometheus (GMP):

  • GMP is a built-in GKE feature: enable it in the GKE Terraform module via a monitoring_config block with enable_components = ["SYSTEM_COMPONENTS"] and managed_prometheus { enabled = true }; no Helm charts or PVCs to manage
  • Implementation note: The existing GKE Terraform module uses the legacy monitoring_service attribute. This must be replaced with the monitoring_config block to enable GMP. The module update should be included as a prerequisite task in Phase 2 (Staging Infrastructure).
  • Runs a managed collection pipeline on each node; scrapes Prometheus-format metrics and writes to Cloud Monitoring
  • Fully PromQL-compatible — existing dashboards and alerting rules work without modification
  • Retention handled by Cloud Monitoring (free tier: 24 months for GCP metrics, custom metrics billed per sample ingested)
  • Resource overhead: GMP collection pods run as a DaemonSet with low resource footprint (~50-100MB per node); significantly less than self-managed Prometheus
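
The monitoring_config change described above, shown in context on the cluster resource (cluster name is illustrative):

```hcl
resource "google_container_cluster" "staging" {
  # ... other cluster settings elided ...

  # Replaces the legacy monitoring_service attribute
  monitoring_config {
    enable_components = ["SYSTEM_COMPONENTS"]
    managed_prometheus {
      enabled = true
    }
  }
}
```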

Dashboards and Alerting — Cloud Monitoring:

  • Native SLO monitoring with burn-rate alerting
  • Custom dashboards via Cloud Monitoring console or Terraform (google_monitoring_dashboard)
  • Alerting policies with notification channels (PagerDuty, Slack, email, Pub/Sub)
  • Uptime checks for external endpoint monitoring
  • No Grafana deployment needed (optional: deploy Grafana with Cloud Monitoring as a data source if advanced visualization is required)

Logging — Cloud Logging:

  • Enabled by default on GKE; no agents to deploy
  • Cost management: configure log exclusion filters for high-volume debug/trace logs; route retained logs to GCS or BigQuery at lower cost
  • See "Centralized Logging (Cloud Logging)" section for full details

Resource planning:

  • Keycloak: resource-intensive; define CPU/memory requests/limits explicitly
  • Dedicated node pools: consider for Keycloak in production to isolate from application pods
  • Cloud Monitoring costs: estimate custom metrics volume (GMP ingestion) during load testing; use metrics exclusion filters for high-cardinality labels
  • Performance validation: validate GMP collection overhead during load testing

Cloud SQL HA

The Terraform module defaults to ZONAL availability. For production, terraform.tfvars must explicitly set cloudsql_availability_type = "REGIONAL" to enable automatic failover. The failover drill before go-live validates connection retry behavior, PgBouncer reconnection (if applicable), and application recovery within the defined RTO.
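
In terraform.tfvars this is a one-line change (the variable name follows the document; confirm it against the actual module):

```hcl
# production terraform.tfvars — staging may remain ZONAL to save cost
cloudsql_availability_type = "REGIONAL" # enables automatic cross-zone failover
```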

Capacity Planning

A capacity planning task should be completed during the architecture review (Phase 1) and validated during load testing:

  • Cloud SQL sizing: Define vCPU/memory tier based on expected concurrent connections and query volume; the module defaults to max_connections = 100 which may be insufficient for production — adjust via database flags in terraform.tfvars
  • Connection pooling: Deploy PgBouncer as a sidecar or standalone deployment in GKE (Cloud SQL does not provide a native connection pooler). Alternatively, use Cloud SQL Auth Proxy with --max-connections flag. Define max pool size per service based on expected concurrency.
  • GKE node autoscaler bounds: Set min_node_count and max_node_count based on pod resource requests and expected workload; validate that GCP project quotas (vCPUs, IP addresses, persistent disks) can accommodate the maximum node count
  • GCP quota checks: Verify regional quotas for Compute Engine, Cloud SQL, GCS, and networking before deployment; request quota increases proactively if needed
  • Validation: Load testing results must confirm that the capacity plan supports 2x expected peak load with headroom
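
The max_connections adjustment can be sketched as a database flag inside the module's google_sql_database_instance settings block (tier and flag value are illustrative and must be validated against instance memory during load testing, since each Postgres connection consumes memory):

```hcl
settings {
  tier = "db-custom-8-32768" # 8 vCPU / 32 GB, sized from the capacity plan

  database_flags {
    name  = "max_connections"
    value = "500" # raised from the module default of 100
  }
}
```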

Load Testing Criteria

Load tests must have defined pass/fail criteria agreed during the architecture review:

  • Target throughput (RPS) and latency percentiles (p50, p95, p99)
  • Cloud SQL connection pool limits and saturation behavior
  • GKE node autoscaling response under load
  • Error rate thresholds (e.g., < 0.1% 5xx during steady state)
  • Capacity headroom validation (sustain 2x expected peak)

IaC Validation Pipeline

Before any terraform apply, run automated validation:

  • terraform validate — syntax and internal consistency
  • tflint — Terraform linting with GCP ruleset
  • tfsec / checkov — security policy scanning (no public IPs, encryption at rest, etc.)
  • These checks should be integrated into the CI pipeline for ongoing Terraform PRs

Supply Chain and Runtime Security

  • Image scanning: all container images scanned for CVEs before deployment (Artifact Registry vulnerability scanning or equivalent)
  • Chart provenance: Helm chart signatures verified against known publishers
  • Admission control: GKE Binary Authorization or OPA Gatekeeper policies to enforce only signed/scanned images are deployed -- this is a mandatory pre-production milestone and must be enabled and validated in staging before production deployment begins
  • Policy enforcement scheduling: Binary Authorization / OPA Gatekeeper must be deployed and tested as an explicit task in Phase 3 (staging add-ons) and Phase 7 (production add-ons), with validation that non-compliant images are blocked
  • Exception management: Define an exceptions process for policy violations (e.g., break-glass procedure for emergency deployments with post-hoc review)
  • SAST/DAST: application security testing is the responsibility of the application CI pipeline (separate from infrastructure deployment)

ArgoCD Image Updater Auditability

The Image Updater is configured with Git write-back mode — when it detects a new image tag in the container registry, it commits the updated tag back to the Git repository before ArgoCD syncs. This preserves the GitOps audit trail: every deployed image version has a corresponding Git commit. The ArgoCD Application manifests must include Image Updater annotations specifying which images to watch, the update strategy (semver, latest, digest), and the write-back target branch.

Environment separation for staging-first updates: The existing repo has all ArgoCD Applications tracking the same main branch with identical auto-sync policies. For customer deployments with staging-first gating, the ArgoCD Applications must be configured differently per environment:

  • Staging Application: Include Image Updater annotations (argocd-image-updater.argoproj.io/image-list, argocd-image-updater.argoproj.io/write-back-method: git) so new images are automatically detected and synced
  • Production Application: Do NOT include Image Updater annotations — production image updates require a promotion PR that updates the image tag in the production Helm values file, reviewed and approved before ArgoCD syncs
  • This configuration is applied during the "Apply ArgoCD Project and Application with Image Updater annotations" task in Phase 4 (vendor-managed) or Phase 6 (customer-managed)

Centralized Logging (Cloud Logging)

GKE clusters have Cloud Logging enabled by default — system and workload logs are collected automatically via the GKE logging agent (Fluent-bit-based) with no additional deployment required. This eliminates the need to deploy and manage a self-hosted logging stack (e.g., Loki + Promtail):

  • Zero deployment overhead: Cloud Logging is a managed service; no Helm charts, persistent volumes, or capacity planning for log storage
  • Native GCP integration: Logs Explorer, Log Analytics (BigQuery-backed), Error Reporting, Cloud Trace — all integrated with IAM
  • Log Router: Route logs to BigQuery (for analytics), GCS (for long-term archival), or Pub/Sub (for SIEM integration) via configurable sinks
  • Log retention: Default 30 days in Cloud Logging; extend via Log Router sinks to GCS (cold) or BigQuery (queryable) for compliance requirements
  • Log-based metrics and alerting: Create custom metrics from log entries and configure alerting policies in Cloud Monitoring — no separate alerting stack needed
  • Access control: IAM-based permissions (roles/logging.viewer, roles/logging.privateLogViewer) scoped by project or log bucket
  • Cost management: Configure log exclusion filters to drop high-volume debug/trace logs before ingestion; use log buckets with different retention periods for cost optimization
  • When to consider Loki instead: Only if multi-cloud portability is a hard requirement (i.e., the same deployment must run on AWS/Azure without GCP services)
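
The exclusion-filter and archival-sink patterns can be sketched in Terraform (names, filters, and the referenced bucket are illustrative):

```hcl
# Drop high-volume debug logs before ingestion (Cloud Logging bills at ingestion)
resource "google_logging_project_exclusion" "debug" {
  name   = "exclude-debug"
  filter = "resource.type=\"k8s_container\" AND severity<=DEBUG"
}

# Archive retained logs to GCS for long-term, low-cost compliance storage
resource "google_logging_project_sink" "archive" {
  name                   = "archive-to-gcs"
  destination            = "storage.googleapis.com/${google_storage_bucket.log_archive.name}" # hypothetical bucket
  filter                 = "severity>=INFO"
  unique_writer_identity = true
}
# The sink's writer_identity must be granted roles/storage.objectCreator on the bucket.
```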

Terraform State Protection

Terraform state buckets must be configured with the following protections:

  • Object versioning: Enabled on the GCS bucket to allow state recovery from accidental corruption or deletion
  • State locking: Use GCS-native state locking (default with gcs backend) to prevent concurrent modifications
  • Bucket-level access: Uniform bucket-level access with IAM-only permissions (no ACLs)
  • Encryption: Customer-managed encryption keys (CMEK) for state files containing infrastructure details
  • Backup: Cross-region replication for disaster recovery of state files (optional, based on data residency and compliance requirements; if disabled due to regulatory constraints, require versioning plus periodic encrypted backups to a secondary bucket)
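
A sketch of the backend and bucket configuration under these assumptions (bucket and key names are hypothetical; in practice the state bucket is bootstrapped outside the main Terraform run to avoid a chicken-and-egg problem):

```hcl
terraform {
  backend "gcs" {
    bucket = "customer-tfstate" # hypothetical bucket name
    prefix = "env/staging"      # separate prefixes per environment
  }
}

resource "google_storage_bucket" "tfstate" {
  name                        = "customer-tfstate"
  location                    = "US"
  uniform_bucket_level_access = true # IAM-only, no ACLs

  versioning {
    enabled = true # allows state recovery after corruption or deletion
  }

  encryption {
    default_kms_key_name = google_kms_crypto_key.tfstate.id # CMEK, hypothetical key
  }
}
```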

GKE Private Cluster Configuration

All GKE clusters (staging and production) must be configured as private clusters:

  • Private nodes: Disable public IP addresses on nodes (enable_private_nodes = true)
  • Private endpoint: Disable the public control plane endpoint (enable_private_endpoint = true) or restrict via master authorized networks
  • Master authorized networks: Limit control plane access to VPN/bastion CIDR ranges only
  • Cloud NAT: Required for outbound internet access from private nodes (already included in Terraform)
  • Access path: Document the authorized access path (e.g., VPN -> bastion -> kubectl, or Cloud Shell with Private Google Access)
  • Prerequisite task: A bastion host or VPN-based kubectl access path must be provisioned and validated as a Phase 0 prerequisite (task cust_access) before any cluster operations can begin. This task validates kubectl get nodes succeeds through the authorized path and confirms master authorized networks are correctly configured.
  • This must be validated during the architecture review (Phase 1) and enforced via tfsec rules
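
These settings map onto the cluster resource roughly as follows (CIDRs are illustrative; choose the enable_private_endpoint value per the architecture review):

```hcl
resource "google_container_cluster" "prod" {
  # ... other cluster settings elided ...

  private_cluster_config {
    enable_private_nodes    = true          # no public IPs on nodes
    enable_private_endpoint = true          # or false, restricted by the block below
    master_ipv4_cidr_block  = "172.16.0.0/28"
  }

  master_authorized_networks_config {
    cidr_blocks {
      cidr_block   = "10.50.0.0/24" # VPN/bastion range, illustrative
      display_name = "corp-vpn"
    }
  }
}
```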

Vendor Validation Model (Customer-Managed)

To resolve the contradiction between "no vendor access to customer GCP" and vendor validation requirements, define one of the following validation models during the architecture review:

  • Option A: Time-bound read-only access -- Customer grants temporary Viewer IAM role with expiration (e.g., 4-hour session) for vendor validation, revoked immediately after
  • Option B: Customer-provided evidence -- Customer provides validation artifacts (screenshots, CLI output, health check reports) using a standardized checklist
  • Option C: Supervised screen-share -- Vendor observes customer-run validation commands via video call, providing real-time guidance
  • The chosen model must be documented in the architecture review deliverables and referenced in the Gantt chart as a prerequisite for each vendor validation task
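Option A can be implemented without manual revocation by attaching an IAM Condition that expires the grant. A sketch, assuming a hypothetical vendor validator group and an expiry timestamp agreed per validation window:

```hcl
# Option A sketch: temporary read-only grant that auto-expires.
# The member and timestamp below are placeholders set per validation window.
resource "google_project_iam_member" "vendor_readonly" {
  project = var.project_id
  role    = "roles/viewer"
  member  = "group:vendor-validators@example.com"

  condition {
    title       = "expires-after-validation-window"
    description = "Auto-expires; revoke earlier by deleting this binding"
    expression  = "request.time < timestamp(\"2026-05-01T16:00:00Z\")"
  }
}
```

Removing the resource (or waiting for the condition to lapse) ends access, and all vendor reads remain visible in Cloud Audit Logs.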

GKE Application-Layer Secrets Encryption

Kubernetes Secrets stored in etcd must be encrypted at the application layer using CMEK:

  • Enable Application-layer Secrets Encryption on the GKE cluster using a Cloud KMS key (database_encryption { state = "ENCRYPTED", key_name = "projects/.../cryptoKeys/..." } in Terraform)
  • Alternative: Use the Secret Manager CSI Driver to mount secrets directly from GCP Secret Manager as volumes, bypassing etcd storage entirely. This eliminates the risk of plaintext secrets in etcd backups.
  • Namespace-level RBAC: Restrict get/list/watch on Secrets resources to only the service accounts that need them (default ClusterRole grants are too broad)
  • Implementation note: The GKE Terraform module must be extended with a database_encryption block. The KMS key must be created in the customer's project and granted roles/cloudkms.cryptoKeyEncrypterDecrypter to the GKE service agent.
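The implementation note above can be sketched as the following Terraform fragment; key ring, key name, and rotation period are illustrative choices, and the cluster block shows only the addition to the existing module:

```hcl
# Placeholder key ring/key names; rotation period is an example policy choice.
resource "google_kms_crypto_key" "gke_secrets" {
  name            = "gke-secrets"
  key_ring        = google_kms_key_ring.primary.id
  rotation_period = "7776000s"  # 90 days
}

# The GKE service agent must be able to use the key for envelope encryption.
resource "google_kms_crypto_key_iam_member" "gke_sa" {
  crypto_key_id = google_kms_crypto_key.gke_secrets.id
  role          = "roles/cloudkms.cryptoKeyEncrypterDecrypter"
  member        = "serviceAccount:service-${data.google_project.this.number}@container-engine-robot.iam.gserviceaccount.com"
}

resource "google_container_cluster" "primary" {
  # ...existing cluster configuration...

  database_encryption {
    state    = "ENCRYPTED"
    key_name = google_kms_crypto_key.gke_secrets.id
  }
}
```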

ArgoCD Image Updater Token Security

The Git write-back token used by ArgoCD Image Updater must follow these security controls:

  • Least privilege: Use a GitHub App or fine-grained personal access token (PAT) scoped to the specific repository with contents: write only
  • Secure storage: Store the token in GCP Secret Manager (not as a Kubernetes Secret) and sync via External Secrets Operator
  • Rotation: Enforce token rotation via a CronJob that refreshes the token from Secret Manager at a defined cadence (e.g., every 30 days)
  • Audit: Log all Git write-back commits and monitor for unexpected commit patterns
  • Signed commits: Consider enforcing GPG-signed commits for Image Updater write-backs to prevent commit spoofing
  • CronJob monitoring: Configure Cloud Monitoring alerts for token rotation CronJob failures (kube_job_status_failed metric) and ArgoCD sync state degradation (argocd_app_sync_status{sync_status="OutOfSync"} for prolonged periods). Alert on: CronJob not completing within expected window, consecutive CronJob failures, and ArgoCD unable to push write-back commits.
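The CronJob-failure alert can be expressed as a Cloud Monitoring alert policy. A sketch, assuming kube-state-metrics is scraped by Google Managed Prometheus (so kube_job_status_failed surfaces as a prometheus.googleapis.com metric) and that the rotation jobs are named with a token-rotation prefix; both are assumptions to confirm against the actual deployment:

```hcl
resource "google_monitoring_alert_policy" "token_rotation_failed" {
  display_name = "ArgoCD Image Updater token rotation failing"
  combiner     = "OR"

  conditions {
    display_name = "kube_job_status_failed > 0"
    condition_threshold {
      # Metric path and job_name label assume a Managed Prometheus scrape
      # of kube-state-metrics; adjust to the customer's monitoring stack.
      filter          = "metric.type = \"prometheus.googleapis.com/kube_job_status_failed/gauge\" AND metric.labels.job_name = monitoring.regex.full_match(\"token-rotation.*\")"
      comparison      = "COMPARISON_GT"
      threshold_value = 0
      duration        = "300s"
      aggregations {
        alignment_period   = "300s"
        per_series_aligner = "ALIGN_MAX"
      }
    }
  }

  notification_channels = [var.oncall_channel_id]  # placeholder channel
}
```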

ArgoCD Notifications

The existing repo includes ArgoCD notification annotations per environment (Slack channels for deployment events, health degradation, and sync failures). For customer deployments:

  • Adapt notification channels: Replace vendor Slack channels with the customer's notification targets (Slack, PagerDuty, email, or webhook)
  • Events to configure: on-deployed, on-health-degraded, on-sync-failed, on-sync-running (production should include all four)
  • ArgoCD metrics: Enable controller and repo-server metrics in staging and production (disabled by default in dev for cost optimization)
  • Implementation: Notification annotations are set on the ArgoCD Application manifests; notification templates and triggers are configured in the ArgoCD ConfigMap

Cloud SQL Query Insights

The existing Terraform module enables Cloud SQL Query Insights by default:

  • query_insights_enabled = true
  • query_plans_per_minute = 5
  • query_string_length = 1024
  • record_application_tags = true

This provides built-in query performance analysis in the GCP Console without additional tooling. For customer deployments, validate that Query Insights is enabled and review the query plan sample rate during load testing.
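The defaults listed above correspond to this insights_config block inside the instance settings; the instance name, tier, and engine version are placeholders:

```hcl
resource "google_sql_database_instance" "app" {
  name             = "app-db"
  database_version = "POSTGRES_16"
  region           = "us-central1"

  settings {
    tier = "db-custom-4-16384"

    insights_config {
      query_insights_enabled  = true
      query_plans_per_minute  = 5     # plan sample rate; review under load test
      query_string_length     = 1024
      record_application_tags = true
    }
  }
}
```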

Centralized Audit Logging and Security Alerting

Both deployment models must include centralized audit log management:

  • Cloud Audit Logs: Ensure Admin Activity, Data Access, and System Event logs are enabled for all GCP services
  • Log sinks: Configure a Cloud Logging sink to export audit logs to a dedicated logging project or BigQuery for long-term retention
  • Retention: Minimum 1 year retention for compliance; 90 days in Cloud Logging, remainder in cold storage (GCS or BigQuery)
  • Security alerts: Configure alerting rules for critical events:
    • IAM policy changes and role grants
    • ArgoCD admin login attempts and RBAC modifications
    • Secret Manager access patterns and unusual secret reads
    • GKE control plane access from unexpected IPs
  • SIEM integration: If the customer uses a SIEM, configure log forwarding via Pub/Sub or direct integration
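The audit-log requirements can be sketched in Terraform. Admin Activity and System Event logs are always on; the fragment below enables Data Access logs project-wide and exports all audit logs to a BigQuery dataset in a dedicated logging project (project and dataset names are placeholders):

```hcl
resource "google_project_iam_audit_config" "all" {
  project = var.project_id
  service = "allServices"

  audit_log_config { log_type = "ADMIN_READ" }
  audit_log_config { log_type = "DATA_READ" }
  audit_log_config { log_type = "DATA_WRITE" }
}

resource "google_logging_project_sink" "audit_to_bq" {
  name        = "audit-logs-to-bigquery"
  destination = "bigquery.googleapis.com/projects/logging-project/datasets/audit_logs"
  filter      = "logName:\"cloudaudit.googleapis.com\""

  # Creates a dedicated service account; grant it BigQuery Data Editor on the dataset.
  unique_writer_identity = true
}
```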

GKE and Add-on Upgrade Lifecycle

Define a maintenance and patching policy for post-go-live operations:

  • GKE version upgrades: Select an upgrade channel (rapid/regular/stable) based on the customer's risk profile and change-management requirements during the architecture review; upgrade staging at least 1 week before production
  • Node OS image updates: Enable auto-upgrade for node pools with maintenance windows configured for off-peak hours
  • Add-on versioning: Track Helm chart versions for all add-ons (ArgoCD, cert-manager, external-secrets, etc.) and schedule quarterly update reviews
  • Compatibility testing: Validate add-on compatibility in staging before production upgrades
  • Rollback criteria: Define rollback triggers (e.g., pod crash rate > 5%, health check failures, API errors) and procedures
  • Communication: Establish a maintenance notification process with the customer (minimum 48-hour advance notice for production changes)
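The channel and maintenance-window decisions above reduce to two blocks on the cluster resource. A sketch with example values (channel and window are illustrative and should be fixed during the architecture review); note that enrolling in a release channel implies node auto-upgrade:

```hcl
resource "google_container_cluster" "primary" {
  # ...existing cluster configuration...

  release_channel {
    channel = "STABLE"  # example; chosen per customer risk profile
  }

  maintenance_policy {
    recurring_window {
      start_time = "2026-03-07T03:00:00Z"  # example off-peak window
      end_time   = "2026-03-07T07:00:00Z"
      recurrence = "FREQ=WEEKLY;BYDAY=SA"  # RFC 5545 recurrence rule
    }
  }
}
```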

Database Migration Strategy

Define migration tooling and procedures for safe schema changes and rollback:

  • Migration tooling: Use Flyway or Liquibase for versioned, repeatable database migrations with a clear migration history
  • Backward compatibility: Enforce backward-compatible schema changes (additive only) to enable safe rollback; breaking changes require a multi-phase migration plan
  • Pre-migration validation: Run migration dry-runs in staging with production-like data volumes before applying to production
  • Post-migration checks: Automated verification of schema state, data integrity, and application health after each migration
  • PITR alignment: Validate that Cloud SQL point-in-time recovery (PITR) can restore to a state before migration within the defined RTO/RPO
  • Rollback rehearsal: Include migration rollback as part of the pre-go-live rollback drill, testing both Flyway/Liquibase undo and PITR recovery paths

Phase-Level Backout Plans

Each deployment phase should have a documented backout procedure:

  • Infrastructure phases: terraform destroy for the specific module, or terraform apply of the last known-good configuration
  • Add-on phases: helm uninstall for the specific release
  • Application phases: ArgoCD sync to previous Git commit, or argocd app rollback
  • Database phases: point-in-time recovery from automated Cloud SQL backups
  • Go/no-go criteria: defined at each phase boundary to prevent proceeding with a broken foundation. Specific gates:
    • After IaC validation: all tfsec/tflint checks pass with zero critical findings
    • After smoke test: all services respond to health checks, end-to-end user flow completes
    • After load test: meets defined p95 latency, throughput, and error rate thresholds
    • After security assessment: no critical or high vulnerabilities unmitigated
    • These gates should be represented as milestones in the Gantt chart with explicit pass/fail criteria documented in the architecture review deliverables