Two project plan templates (as Mermaid Gantt charts) for deploying the product at customer sites. The two models differ in who manages the deployment lifecycle:
- Vendor-Managed — The vendor provisions, deploys, and operates the platform; the customer approves changes
- Customer-Managed — Customer IT handles provisioning and operations; the vendor trains and advises
Both models deploy Staging + Production environments (separate GKE clusters each). The customer always provides a GCP project, VPC, and on-prem connectivity as prerequisites.
Vendor access (vendor-managed model): Workload Identity Federation (WIF) — no long-lived service account keys, full Cloud Audit Logs trail, instant revocation via WIF pool deletion. Integrates natively with GitHub Actions OIDC for automated Terraform runs.
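The Phase 0 WIF tasks (pool, OIDC provider, `workloadIdentityUser` grant) can be sketched in Terraform. This is an illustrative outline assuming the GitHub Actions OIDC integration mentioned above; pool IDs, the repo filter, and the referenced deployment SA are hypothetical.

```hcl
# Sketch only: IDs and the vendor-org/deploy-repo filter are placeholders.
resource "google_iam_workload_identity_pool" "vendor" {
  workload_identity_pool_id = "vendor-deploy-pool" # hypothetical ID
}

resource "google_iam_workload_identity_pool_provider" "github" {
  workload_identity_pool_id          = google_iam_workload_identity_pool.vendor.workload_identity_pool_id
  workload_identity_pool_provider_id = "github-oidc"
  attribute_mapping = {
    "google.subject"       = "assertion.sub"
    "attribute.repository" = "assertion.repository"
  }
  oidc {
    issuer_uri = "https://token.actions.githubusercontent.com"
  }
}

# Grant workloadIdentityUser so vendor CI tokens can impersonate the scoped
# deployment SA (google_service_account.deploy_stg is assumed to exist).
resource "google_service_account_iam_member" "wif_user" {
  service_account_id = google_service_account.deploy_stg.name
  role               = "roles/iam.workloadIdentityUser"
  member             = "principalSet://iam.googleapis.com/${google_iam_workload_identity_pool.vendor.name}/attribute.repository/vendor-org/deploy-repo"
}
```

Revoking access is then a single operation: deleting the pool invalidates every federated credential at once.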
The vendor executes all infrastructure provisioning, cluster add-on deployment, ArgoCD configuration, and application deployment. Customer responsibilities are limited to prerequisites, approvals, and UAT.
gantt
title Vendor-Managed Deployment
dateFormat YYYY-MM-DD
axisFormat %b %d
section Phase 0 - Prerequisites [Customer]
Provision GCP project and set billing budget alerts :cust_gcp, 2026-03-02, 3d
Create VPC and grant subnet access :cust_vpc, after cust_gcp, 2d
Establish on-prem connectivity via VPN :cust_vpn, after cust_gcp, 5d
Configure WIF pool and OIDC provider :cust_wif, after cust_vpc, 3d
Create staging deployment SA and assign scoped IAM roles :cust_sa_stg, after cust_wif, 1d
Create production deployment SA and assign scoped IAM roles :cust_sa_prod, after cust_sa_stg, 1d
Grant workloadIdentityUser on both SAs to WIF pool :cust_iam, after cust_sa_prod, 1d
Vendor validates WIF and SA impersonation for both SAs :cust_wifval, after cust_iam, 1d
Provision bastion host or validate VPN kubectl access path :cust_access, after cust_vpn, 2d
Prerequisites complete :milestone, prereq_done, after cust_wifval cust_access, 0d
section Phase 1 - Planning and Design [Joint]
Kickoff and requirements gathering :kickoff, after prereq_done, 2d
Architecture review - regional GKE, HA, SLOs, RTO-RPO :arch, after kickoff, 2d
Network design - VPC, subnets, NAT, firewall, peering :netdesign, after kickoff, 2d
Environment config - terraform.tfvars for stg and prod :envconfig, after arch, 1d
Customer sign-off on design :milestone, design_approved, after envconfig netdesign, 0d
section Phase 2 - Staging Infrastructure [Vendor]
Create Terraform state bucket for staging :stg_state, after design_approved, 1d
Create KMS key and grant service agent roles :stg_kms, after stg_state, 1d
Terraform - GCP APIs, VPC peering, Cloud NAT :stg_net, after stg_kms, 1d
Terraform - regional GKE cluster with monitoring_config and database_encryption :stg_gke, after stg_net, 2d
Terraform - Cloud SQL PostgreSQL 15 private IP :stg_sql, after stg_net, 1d
Terraform - GCS buckets :stg_gcs, after stg_net, 1d
Terraform - Workload Identity SAs and K8s bindings :stg_wi, after stg_gke, 1d
Validate GMP metrics collection and Secrets encryption :stg_gmpval, after stg_gke, 1d
Cloud SQL post-deploy - extensions, users, secrets :stg_sqlpost, after stg_sql stg_gke, 1d
Security hardening - PSS, NetworkPolicies, Shielded Nodes :stg_security, after stg_gke, 1d
IaC validation - terraform validate, tflint, tfsec :stg_iactest, after stg_security, 1d
IaC gate - zero critical tfsec findings :milestone, stg_iac_gate, after stg_iactest, 0d
section Phase 3 - Staging Add-ons [Vendor]
Deploy nginx-ingress and Cloud Armor WAF policy :stg_nginx, after stg_wi, 1d
Deploy cert-manager v1.14.0 and ClusterIssuers :stg_cert, after stg_nginx, 1d
Deploy external-secrets v0.9.11 :stg_es, after stg_wi, 1d
Deploy external-dns v1.14.0 :stg_edns, after stg_cert, 1d
Deploy ArgoCD v5.51.6 :stg_argo, after stg_cert stg_nginx, 1d
Deploy Image Updater with Git write-back and token CronJob :stg_imgupd, after stg_argo, 1d
Enable GKE Managed Prometheus and Cloud Monitoring :stg_monitor, after stg_argo, 1d
Configure Cloud Logging sinks, retention, and alerts :stg_audit, after stg_monitor, 1d
Deploy Binary Authorization or OPA Gatekeeper policies :stg_binauth, after stg_argo, 1d
Create K8s namespaces and service accounts :stg_ns, after stg_imgupd, 1d
DNS delegation and NS record verification :stg_dns, after stg_edns, 1d
section Phase 4 - Staging App Deployment [Vendor]
Configure ArgoCD repo access for product :stg_repo, after stg_imgupd, 1d
Apply ArgoCD Project and Application with Image Updater annotations :stg_app, after stg_repo stg_ns, 1d
Initial sync and health validation :stg_sync, after stg_app, 1d
Keycloak config - realm, clients, IdP, RBAC, MFA :stg_kc, after stg_sync, 2d
ArgoCD RBAC hardening and admin account controls :stg_argohard, after stg_sync, 1d
Populate GCP Secret Manager secrets with CMEK :stg_secrets, after stg_sync, 1d
Secrets audit - no secrets in Git, least-privilege :stg_secaudit, after stg_secrets, 1d
SSL certificate validation :stg_ssl, after stg_dns stg_sync, 1d
Image scanning and chart provenance verification :stg_imgscan, after stg_sync, 1d
End-to-end smoke test all services :stg_smoke, after stg_kc stg_ssl stg_secaudit stg_imgscan stg_argohard, 2d
Smoke test gate - all health checks pass :milestone, stg_smoke_gate, after stg_smoke, 0d
Load testing against defined SLOs and capacity targets :stg_loadtest, after stg_smoke_gate, 2d
Load test gate - meets p95 latency and error thresholds :milestone, stg_load_gate, after stg_loadtest, 0d
Backup restore and Cloud SQL failover drill :stg_backup, after stg_smoke_gate, 1d
Staging deployment complete :milestone, stg_done, after stg_backup stg_load_gate, 0d
section Phase 5 - Staging UAT [Customer and Vendor]
Customer UAT on staging :stg_uat, after stg_done, 5d
Bug fixes and configuration adjustments :stg_fix, after stg_uat, 2d
Customer staging sign-off :milestone, stg_signoff, after stg_fix, 0d
section Phase 6 - Production Infrastructure [Vendor]
Create Terraform state bucket for prod :prod_state, after stg_signoff, 1d
Create KMS key for prod and grant service agent roles :prod_kms, after prod_state, 1d
Terraform - GCP APIs, VPC peering, Cloud NAT :prod_net, after prod_kms, 1d
Terraform - regional GKE cluster with monitoring_config and database_encryption :prod_gke, after prod_net, 2d
Terraform - Cloud SQL PostgreSQL 15 Regional HA :prod_sql, after prod_net, 1d
Terraform - GCS buckets and Workload Identity :prod_misc, after prod_gke, 1d
Validate GMP metrics collection and Secrets encryption :prod_gmpval, after prod_gke, 1d
Cloud SQL post-deploy :prod_sqlpost, after prod_sql prod_gke, 1d
Security hardening - PSS, NetworkPolicies, Shielded Nodes :prod_security, after prod_gke, 1d
IaC validation - terraform validate, tflint, tfsec :prod_iactest, after prod_security, 1d
IaC gate - zero critical tfsec findings :milestone, prod_iac_gate, after prod_iactest, 0d
section Phase 7 - Production Add-ons and Apps [Vendor]
Deploy nginx-ingress with Cloud Armor WAF policy :prod_addons1, after prod_misc, 2d
Deploy cert-manager, external-secrets, external-dns :prod_addons1b, after prod_addons1, 1d
Deploy ArgoCD :prod_addons2, after prod_addons1b, 1d
Deploy Image Updater with Git write-back and token CronJob :prod_imgupd, after prod_addons2, 1d
Enable GKE Managed Prometheus and Cloud Monitoring :prod_monitor_deploy, after prod_addons2, 1d
Configure Cloud Logging sinks, retention, and alerts :prod_audit, after prod_monitor_deploy, 1d
Deploy Binary Authorization or OPA Gatekeeper policies :prod_binauth, after prod_addons2, 1d
Create K8s namespaces and service accounts :prod_ns, after prod_imgupd, 1d
DNS delegation for prod :prod_dns, after prod_imgupd, 1d
ArgoCD repo, project, app with Image Updater annotations :prod_argo, after prod_ns, 1d
Initial sync and health validation :prod_sync, after prod_argo, 1d
Secrets setup with CMEK and rotation :prod_secrets, after prod_sync, 1d
Image scanning and chart provenance verification :prod_imgscan, after prod_sync, 1d
Keycloak config - realm, clients, IdP, RBAC, MFA :prod_kc, after prod_secrets prod_dns, 2d
ArgoCD RBAC hardening and admin account controls :prod_argohard, after prod_sync, 1d
End-to-end smoke test :prod_smoke, after prod_kc prod_argohard, 2d
Smoke test gate - all health checks pass :milestone, prod_smoke_gate, after prod_smoke, 0d
Backup restore and Cloud SQL HA failover drill :prod_backup, after prod_smoke_gate, 1d
Load testing against defined SLOs and capacity targets :prod_loadtest, after prod_smoke_gate, 2d
Load test gate - meets p95 latency and error thresholds :milestone, prod_load_gate, after prod_loadtest, 0d
Production deployment complete :milestone, prod_done, after prod_backup prod_load_gate, 0d
section Phase 8 - Go-Live [Joint]
Customer UAT on production :prod_uat, after prod_done, 3d
SLO dashboards and alert routing validation :prod_monitor, after prod_done, 2d
Incident runbooks and escalation procedures :prod_runbooks, after prod_done, 3d
Security assessment and vulnerability scan :prod_secassess, after prod_done, 2d
Security gate - no critical or high vulnerabilities :milestone, prod_sec_gate, after prod_secassess, 0d
Rollback drill - ArgoCD, Terraform state, DB migration :prod_rollback, after prod_done, 2d
Cross-region DR failover drill (if DR selected) :prod_dr_drill, after prod_rollback, 2d
GitOps repository failover test (if DR selected) :prod_git_dr, after prod_dr_drill, 1d
Go-live approval :milestone, golive, after prod_uat prod_sec_gate prod_monitor prod_runbooks prod_git_dr, 0d
Go-live cutover - DNS and final alerting :cutover, after golive, 1d
Go-live complete :milestone, live, after cutover, 0d
section Phase 9 - Post Go-Live [Vendor]
Hypercare - vendor-monitored, on-call :hypercare, after live, 10d
Handover to steady-state support :milestone, steady, after hypercare, 0d
section Ongoing - App Image Updates [Staging Auto, Prod Gated]
New image pushed to container registry :upd_img_push, after steady, 1d
Image Updater detects new tag in staging :upd_img_detect, after upd_img_push, 0d
Image Updater writes tag back to Git (staging branch) :upd_img_write, after upd_img_detect, 0d
ArgoCD syncs new image to staging cluster :upd_img_sync_stg, after upd_img_write, 0d
Vendor validates staging deployment health :upd_img_validate_stg, after upd_img_sync_stg, 1d
Vendor opens promotion PR for production :upd_img_promote_pr, after upd_img_validate_stg, 1d
Customer approves production promotion PR :upd_img_approve, after upd_img_promote_pr, 1d
ArgoCD syncs approved image to production :upd_img_sync_prod, after upd_img_approve, 0d
Vendor validates production deployment health :upd_img_validate_prod, after upd_img_sync_prod, 1d
Image update complete :milestone, upd_img_done, after upd_img_validate_prod, 0d
section Ongoing - Infra and Config Changes [Vendor-Managed]
Vendor proposes change via protected branch PR :upd_propose, after upd_img_done, 2d
Image scanning and chart provenance verification :upd_scan, after upd_propose, 1d
Customer reviews and approves with required reviewers :upd_approve, after upd_scan, 2d
Vendor merges within ArgoCD sync window :upd_deploy, after upd_approve, 1d
Vendor validates deployment health :upd_validate, after upd_deploy, 1d
Update cycle complete :milestone, upd_done, after upd_validate, 0d
- Vendor owns execution of all Terraform and application deployment tasks
- Two-layer deployment: Terraform deploys cluster add-ons via `helm_release` resources (nginx, cert-manager, external-secrets, external-dns, ArgoCD); ArgoCD then manages only the application workloads via GitOps
- Customer touchpoints: prerequisites (incl. billing budget alerts), design sign-off, UAT (x2), go-live approval
- GKE topology: regional cluster, multi-zone node pools, HPA/VPA autoscaling, PodDisruptionBudgets
- Cloud SQL: PostgreSQL 15, private IP via VPC peering, Regional HA for production, failover drill before go-live
- Security hardening: Cloud Armor WAF with pre-configured OWASP Top 10 rulesets (SQL injection, XSS, RCE, LFI) and rate-limiting policies applied to the nginx-ingress backend service; Pod Security Standards (Restricted profile); NetworkPolicies (default-deny with explicit allow rules); Keycloak RBAC with MFA; ArgoCD admin controls; image admission policies; GKE Application-layer Secrets Encryption with CMEK
- Observability (all GCP-native): GKE Managed Prometheus for metrics collection (PromQL-compatible, stored in Cloud Monitoring); Cloud Monitoring for dashboards, SLOs, and alerting; Cloud Logging for centralized log aggregation (enabled by default on GKE); Cloud Audit Logs for security and compliance — no self-managed observability stack to operate
- App image updates: staging is automatic via ArgoCD Image Updater with Git write-back for auditability; production requires a promotion PR approved by the customer before ArgoCD syncs — new images are validated in staging first, then promoted via a gated workflow
- Infra and config changes: vendor submits PR → IaC validation (tflint, tfsec) → image scan + provenance check → customer reviews diff → vendor merges within ArgoCD sync window
- Rollback: validated drill covering ArgoCD rollback, Terraform state recovery, and DB migration backout using Flyway/Liquibase with backward-compatible migrations
- WIF access: tightly scoped custom IAM roles via SA impersonation, full audit trail with log sinks and security alerting, no long-lived credentials
- Supply chain security: Binary Authorization / OPA Gatekeeper enforced as mandatory pre-production milestone with exception management
- GKE clusters: private clusters with master authorized networks, Shielded Nodes (requires `shielded_instance_config { enable_secure_boot = true, enable_integrity_monitoring = true }` in the GKE node pool Terraform config)
- Private endpoint policy: The architecture review must decide whether to fully disable the public control plane endpoint (`enable_private_endpoint = true`) or restrict access via master authorized networks only. The existing Terraform module defaults to `enable_private_endpoint = false`; update to `true` if the customer's security policy requires it and a validated bastion/VPN access path is confirmed.
- Maintenance lifecycle: defined GKE upgrade cadence, add-on versioning, and patching policy for post-go-live
- Deliverables: architecture diagram, runbooks, access matrix, SLO definitions, maintenance schedule
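The hardening settings called out above can be sketched in Terraform. This is a minimal illustration, not the vendor's actual module: cluster name, region, and CIDR ranges are placeholders, and unrelated required fields are elided.

```hcl
# Sketch only: names, region, and CIDRs are illustrative placeholders.
resource "google_container_cluster" "primary" {
  name     = "product-stg"  # hypothetical
  location = "europe-west1" # regional cluster; region is an assumption

  private_cluster_config {
    enable_private_nodes    = true
    enable_private_endpoint = false # flip to true only with a validated bastion/VPN path
    master_ipv4_cidr_block  = "172.16.0.0/28"
  }

  master_authorized_networks_config {
    cidr_blocks {
      cidr_block   = "10.0.0.0/8" # example on-prem/VPN range
      display_name = "corp"
    }
  }
}

resource "google_container_node_pool" "default" {
  cluster = google_container_cluster.primary.name

  node_config {
    # Shielded Nodes settings referenced in the hardening bullet above
    shielded_instance_config {
      enable_secure_boot          = true
      enable_integrity_monitoring = true
    }
  }
}
```

The `enable_private_endpoint` default shown here matches the module default noted above; the architecture review decides whether to change it.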
Customer IT performs all provisioning and operations. The vendor provides training, documentation, review checkpoints, and validation at key milestones.
gantt
title Customer-Managed Deployment
dateFormat YYYY-MM-DD
axisFormat %b %d
section Phase 0 - Prerequisites [Customer]
Provision GCP project and set billing budget alerts :cust_gcp, 2026-03-02, 3d
Create VPC and subnets :cust_vpc, after cust_gcp, 2d
Establish on-prem connectivity via VPN :cust_vpn, after cust_gcp, 5d
Set up IAM roles for customer IT team :cust_iam, after cust_vpc, 2d
Provision bastion host or validate VPN kubectl access path :cust_access, after cust_vpn, 2d
Prerequisites complete :milestone, prereq_done, after cust_iam cust_access, 0d
section Phase 1 - Training GCP and Terraform [Vendor-led]
GCP fundamentals - IAM, VPC, GKE :train_gcp, after prereq_done, 2d
Terraform modules walkthrough :train_tf, after train_gcp, 2d
Terraform state management and best practices :train_state, after train_tf, 1d
Hands-on lab - deploy dev environment guided :train_lab1, after train_state, 2d
Training checkpoint :milestone, train1_done, after train_lab1, 0d
section Phase 2 - Training Kubernetes and Add-ons [Vendor-led]
GKE cluster operations and node management :train_gke, after train1_done, 1d
Helm add-ons walkthrough :train_helm, after train_gke, 2d
Workload Identity and service account bindings :train_wi, after train_helm, 1d
Cloud SQL operations and post-deployment steps :train_sql, after train_gke, 1d
Training checkpoint :milestone, train2_done, after train_wi train_sql, 0d
section Phase 3 - Training ArgoCD and GitOps [Vendor-led]
ArgoCD concepts, architecture, RBAC :train_argo, after train2_done, 1d
ArgoCD applications, projects, sync policies :train_argoapp, after train_argo, 1d
GitOps workflow - CICD and ArgoCD integration :train_gitops, after train_argoapp, 1d
Monitoring, alerting, troubleshooting runbooks :train_ops, after train_gitops, 1d
Hands-on lab - full stack deployment guided :train_lab2, after train_ops, 2d
All training complete :milestone, train_done, after train_lab2, 0d
section Phase 4 - Planning and Design [Joint]
Architecture review - regional GKE, HA, SLOs, RTO-RPO :arch, after train_done, 2d
Network design - VPC, subnets, NAT, firewall, peering :netdesign, after arch, 1d
Customer prepares terraform.tfvars :envconfig, after arch, 2d
Vendor reviews customer config :vendorreview, after envconfig, 1d
Select vendor validation model (access, evidence, or screen-share) :valmodel, after vendorreview, 1d
Design sign-off :milestone, design_approved, after valmodel netdesign, 0d
section Phase 5 - Staging Infrastructure [Customer IT]
Create Terraform state bucket for staging :stg_state, after design_approved, 1d
Create KMS key and grant service agent roles :stg_kms, after stg_state, 1d
Terraform plan review with vendor :stg_planrev, after stg_kms, 1d
Terraform - GCP APIs, VPC peering, Cloud NAT :stg_net, after stg_planrev, 2d
Terraform - regional GKE cluster with monitoring_config and database_encryption :stg_gke, after stg_net, 2d
Terraform - Cloud SQL PostgreSQL 15 private IP :stg_sql, after stg_net, 2d
Terraform - GCS buckets :stg_gcs, after stg_net, 1d
Terraform - Workload Identity :stg_wi, after stg_gke, 1d
Validate GMP metrics collection and Secrets encryption :stg_gmpval, after stg_gke, 1d
Cloud SQL post-deploy - extensions and users :stg_sqlpost, after stg_sql stg_gke, 1d
Security hardening - PSS, NetworkPolicies, Shielded Nodes :stg_security, after stg_gke, 1d
IaC validation - terraform validate, tflint, tfsec :stg_iactest, after stg_security, 1d
IaC gate - zero critical tfsec findings :milestone, stg_iac_gate, after stg_iactest, 0d
Vendor validates staging infra :stg_validate, after stg_wi stg_sqlpost stg_gcs stg_iac_gate stg_gmpval, 1d
section Phase 6 - Staging Add-ons and Apps [Customer IT]
Deploy nginx-ingress with Cloud Armor WAF policy :stg_addons1, after stg_validate, 2d
Deploy external-secrets and external-dns :stg_addons2, after stg_addons1, 1d
Deploy ArgoCD :stg_argo, after stg_addons1, 1d
Deploy Image Updater with Git write-back and token CronJob :stg_imgupd, after stg_argo, 1d
Enable GKE Managed Prometheus and Cloud Monitoring :stg_monitor, after stg_argo, 1d
Configure Cloud Logging sinks, retention, and alerts :stg_audit, after stg_monitor, 1d
Deploy Binary Authorization or OPA Gatekeeper policies :stg_binauth, after stg_argo, 1d
Create K8s namespaces and service accounts :stg_ns, after stg_imgupd, 1d
DNS delegation and NS record verification :stg_dns, after stg_addons2, 2d
Configure ArgoCD - repo, project, app with Image Updater annotations :stg_argoconf, after stg_imgupd stg_ns, 2d
Initial sync and health validation :stg_sync, after stg_argoconf, 1d
Secrets setup with CMEK and rotation :stg_secrets, after stg_sync, 1d
Image scanning and chart provenance verification :stg_imgscan, after stg_sync, 1d
Keycloak config - realm, clients, IdP, RBAC, MFA :stg_kc, after stg_secrets, 2d
ArgoCD RBAC hardening and admin account controls :stg_argohard, after stg_sync, 1d
Secrets audit - no secrets in Git, least-privilege :stg_secaudit, after stg_secrets, 1d
SSL certificate validation :stg_ssl, after stg_dns stg_sync, 1d
End-to-end smoke test :stg_smoke, after stg_kc stg_ssl stg_secaudit stg_imgscan stg_argohard, 2d
Smoke test gate - all health checks pass :milestone, stg_smoke_gate, after stg_smoke, 0d
Load testing against defined SLOs and capacity targets :stg_loadtest, after stg_smoke_gate, 2d
Load test gate - meets p95 latency and error thresholds :milestone, stg_load_gate, after stg_loadtest, 0d
Backup restore and Cloud SQL failover drill :stg_backup, after stg_smoke_gate, 1d
Vendor validates staging deployment :stg_vendorval, after stg_backup stg_load_gate, 1d
Staging deployment complete :milestone, stg_done, after stg_vendorval, 0d
section Phase 7 - Staging UAT [Customer]
Customer UAT on staging :stg_uat, after stg_done, 5d
Bug fixes and config adjustments :stg_fix, after stg_uat, 3d
Customer staging sign-off :milestone, stg_signoff, after stg_fix, 0d
section Phase 8 - Production Infrastructure [Customer IT]
Create Terraform state bucket for prod :prod_state, after stg_signoff, 1d
Create KMS key for prod and grant service agent roles :prod_kms, after prod_state, 1d
Terraform - GCP APIs, VPC peering, Cloud NAT :prod_net, after prod_kms, 1d
Terraform - regional GKE cluster with monitoring_config and database_encryption :prod_gke, after prod_net, 2d
Terraform - Cloud SQL Regional HA and GCS buckets :prod_sql, after prod_net, 2d
Terraform - Workload Identity :prod_wi, after prod_gke, 1d
Validate GMP metrics collection and Secrets encryption :prod_gmpval, after prod_gke, 1d
Cloud SQL post-deploy :prod_sqlpost, after prod_sql prod_gke, 1d
Security hardening - PSS, NetworkPolicies, Shielded Nodes :prod_security, after prod_gke, 1d
IaC validation - terraform validate, tflint, tfsec :prod_iactest, after prod_security, 1d
IaC gate - zero critical tfsec findings :milestone, prod_iac_gate, after prod_iactest, 0d
section Phase 9 - Production Add-ons and Apps [Customer IT]
Deploy nginx-ingress with Cloud Armor WAF, cert-manager :prod_addons, after prod_wi, 2d
Deploy ArgoCD, external-secrets, external-dns :prod_addons2, after prod_addons, 1d
Deploy Image Updater with Git write-back and token CronJob :prod_imgupd, after prod_addons2, 1d
Enable GKE Managed Prometheus and Cloud Monitoring :prod_monitor_deploy, after prod_addons2, 1d
Configure Cloud Logging sinks, retention, and alerts :prod_audit, after prod_monitor_deploy, 1d
Deploy Binary Authorization or OPA Gatekeeper policies :prod_binauth, after prod_addons2, 1d
ArgoCD config - repo, project, app with Image Updater annotations :prod_argo, after prod_imgupd, 2d
Initial sync and health validation :prod_sync, after prod_argo, 1d
Secrets setup with CMEK and rotation :prod_secrets, after prod_sync, 1d
Image scanning and chart provenance verification :prod_imgscan, after prod_sync, 1d
Keycloak config - realm, clients, IdP, RBAC, MFA :prod_kc, after prod_secrets, 2d
ArgoCD RBAC hardening and admin account controls :prod_argohard, after prod_sync, 1d
SSL certificate and DNS validation :prod_ssl, after prod_addons prod_sync, 2d
End-to-end smoke test :prod_smoke, after prod_kc prod_ssl prod_argohard, 2d
Smoke test gate - all health checks pass :milestone, prod_smoke_gate, after prod_smoke, 0d
Backup restore and Cloud SQL HA failover drill :prod_backup, after prod_smoke_gate, 1d
Load testing against defined SLOs and capacity targets :prod_loadtest, after prod_smoke_gate, 2d
Load test gate - meets p95 latency and error thresholds :milestone, prod_load_gate, after prod_loadtest, 0d
Vendor validates production deployment :prod_vendorval, after prod_backup prod_load_gate, 1d
Production deployment complete :milestone, prod_done, after prod_vendorval, 0d
section Phase 10 - Go-Live [Customer]
Customer UAT on production :prod_uat, after prod_done, 3d
SLO dashboards and alert routing validation :prod_monitor, after prod_done, 2d
Incident runbooks and escalation procedures :prod_runbooks, after prod_done, 3d
Security assessment and vulnerability scan :prod_secassess, after prod_done, 2d
Security gate - no critical or high vulnerabilities :milestone, prod_sec_gate, after prod_secassess, 0d
Rollback drill - ArgoCD, Terraform state, DB migration :prod_rollback, after prod_done, 2d
Cross-region DR failover drill (if DR selected) :prod_dr_drill, after prod_rollback, 2d
GitOps repository failover test (if DR selected) :prod_git_dr, after prod_dr_drill, 1d
Go-live approval :milestone, golive, after prod_uat prod_sec_gate prod_monitor prod_runbooks prod_git_dr, 0d
Go-live cutover - DNS and final alerting :cutover, after golive, 1d
Go-live complete :milestone, live, after cutover, 0d
section Phase 11 - Post Go-Live
Hypercare - vendor tier-2 tier-3, customer monitors :hypercare, after live, 10d
Incident response drills :postgo_drills, after live, 2d
Knowledge transfer - advanced runbooks :kt_runbooks, after live, 3d
Finalize operational readiness deliverables :kt_deliverables, after live, 5d
On-call rotation setup and escalation procedures :kt_oncall, after kt_runbooks, 2d
Handover to customer-managed steady-state :milestone, steady, after hypercare kt_runbooks kt_deliverables kt_oncall, 0d
section Ongoing - App Image Updates [Automatic]
New image pushed to container registry :upd_img_push, after steady, 1d
Image Updater detects tag, writes back to Git :upd_img_stg, after upd_img_push, 0d
ArgoCD syncs new image to staging :upd_img_stgsync, after upd_img_stg, 0d
Customer validates staging :upd_img_stgval, after upd_img_stgsync, 1d
Customer promotes to production via sync window :upd_img_prod, after upd_img_stgval, 1d
Image update complete :milestone, upd_img_done, after upd_img_prod, 0d
section Ongoing - Infra and Config Changes [Customer-Managed]
Vendor publishes release notes and Terraform changes :upd_release, after upd_img_done, 1d
Image scanning and chart provenance verification :upd_scan, after upd_release, 1d
Customer IT evaluates changes :upd_eval, after upd_scan, 3d
Customer applies Terraform to staging :upd_stg, after upd_eval, 2d
Customer validates staging :upd_stgval, after upd_stg, 2d
Customer applies to production via sync window :upd_prod, after upd_stgval, 1d
Customer validates production :upd_prodval, after upd_prod, 1d
Update cycle complete :milestone, upd_done, after upd_prodval, 0d
- ~3 weeks of training before execution begins (GCP, Terraform, GKE, Helm, ArgoCD, GitOps)
- Customer IT owns execution with vendor review checkpoints at each milestone
- Two-layer deployment: Terraform deploys cluster add-ons via `helm_release` resources (nginx, cert-manager, external-secrets, external-dns, ArgoCD); ArgoCD then manages only the application workloads via GitOps
- GKE topology: regional cluster, multi-zone node pools, HPA/VPA autoscaling, PodDisruptionBudgets
- Cloud SQL: PostgreSQL 15, private IP via VPC peering, Regional HA for production, failover drill before go-live
- Security hardening: Cloud Armor WAF with pre-configured OWASP Top 10 rulesets (SQL injection, XSS, RCE, LFI) and rate-limiting policies; Pod Security Standards (Restricted profile); NetworkPolicies (default-deny with explicit allow rules); Keycloak RBAC with MFA; ArgoCD admin controls; image admission policies; GKE Application-layer Secrets Encryption with CMEK
- Observability (all GCP-native): GKE Managed Prometheus for metrics collection (PromQL-compatible, stored in Cloud Monitoring); Cloud Monitoring for dashboards, SLOs, and alerting; Cloud Logging for centralized log aggregation (enabled by default on GKE); Cloud Audit Logs for security and compliance — no self-managed observability stack to operate
- Vendor validates staging infra, staging deployment, and production deployment before progression (via defined validation model -- see Vendor Validation Model section)
- App image updates: automatic via ArgoCD Image Updater with Git write-back — new images detected, tag committed to Git via least-privilege token for auditability; customer validates staging then promotes to production
- Infra and config changes: vendor publishes release notes → IaC validation (tflint, tfsec) → customer evaluates, tests on staging, promotes to production within sync window
- Rollback: validated drill covering ArgoCD rollback, Terraform state recovery, and DB migration backout using Flyway/Liquibase with backward-compatible migrations
- No vendor access to customer GCP — customer uses their own IAM credentials; vendor validation occurs via defined model (time-bound access, evidence, or screen-share)
- Supply chain security: Binary Authorization / OPA Gatekeeper enforced as mandatory pre-production milestone
- GKE clusters: private clusters with master authorized networks, Shielded Nodes
- Private endpoint policy: Same as vendor-managed -- architecture review decides the `enable_private_endpoint` setting based on customer security requirements and a confirmed bastion/VPN access path.
- Maintenance lifecycle: defined GKE upgrade cadence, add-on versioning, and patching policy for post-go-live
- Deliverables: architecture diagram, runbooks (incident response, routine operations, troubleshooting), access matrix, SLO definitions with dashboard templates, incident response and escalation procedures, maintenance schedule, DR procedures, on-call rotation template
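The production Cloud SQL configuration described above (PostgreSQL 15, private IP via VPC peering, Regional HA) maps to a handful of Terraform settings. A minimal sketch, assuming illustrative instance, tier, and network names:

```hcl
# Sketch only: instance name, region, tier, and network reference are placeholders.
resource "google_sql_database_instance" "prod" {
  name             = "product-prod" # hypothetical
  database_version = "POSTGRES_15"
  region           = "europe-west1" # assumption

  settings {
    tier              = "db-custom-4-16384" # machine size is an assumption
    availability_type = "REGIONAL"          # Regional HA with automatic failover

    ip_configuration {
      ipv4_enabled    = false                         # no public IP
      private_network = google_compute_network.vpc.id # requires servicenetworking VPC peering
    }

    backup_configuration {
      enabled                        = true
      point_in_time_recovery_enabled = true
    }
  }
}
```

`availability_type = "REGIONAL"` is what the pre-go-live failover drill exercises: the standby in a second zone is promoted automatically.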
| Dimension | Vendor-Managed | Customer-Managed |
|---|---|---|
| Go-live timeline | ~9 weeks | ~13 weeks |
| Training overhead | None | ~3 weeks |
| Customer IT effort | Minimal (approvals + UAT) | Heavy (all execution) |
| Deployment risk | Low (vendor expertise) | Medium (learning curve) |
| Time to staging | ~3 weeks after prereqs | ~6 weeks after prereqs |
| Ongoing update effort (customer) | Review & approve (~hours) | Evaluate, test, apply (~days) |
| Vendor dependency (ongoing) | High | Low |
| Operational self-sufficiency | Low | High |
| Access model | WIF (scoped, auditable, revocable) | Customer-only IAM |
| Deployment project cost | Lower (faster) | Higher (training + longer timeline) |
| Ongoing operations cost | Higher (vendor management fee) | Lower (internal team) |
| Security posture | Good (WIF, audit trail) | Better (no external access) |
The deployment maximizes GCP-managed services to reduce operational overhead. Cluster add-ons are used only where GCP lacks an equivalent or where the existing Terraform modules already configure them.
Deployment mechanism (two-layer architecture):
- Layer 1 — Terraform (`cluster-addons.tf`): Deploys all cluster add-ons via `helm_release` resources as part of `terraform apply`. This includes nginx-ingress, cert-manager, external-secrets, external-dns, and ArgoCD itself. Terraform also creates namespaces, service accounts, and ClusterIssuers.
- Layer 2 — ArgoCD: Manages only the application workloads. ArgoCD watches the application repository (separate from the infra repo) and syncs Helm releases to the `em-semi-app`, `em-semi-workflow`, and `em-semi-keycloak` namespaces. ArgoCD does not manage its own add-ons or other cluster infrastructure.
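The Layer-1 pattern can be illustrated with the ArgoCD install itself: a `helm_release` resource applied by Terraform. The chart version matches the one cited in the plan; the release name and the single `set` override are illustrative.

```hcl
# Sketch of one Layer-1 helm_release (the others follow the same shape).
resource "helm_release" "argocd" {
  name             = "argocd"
  repository       = "https://argoproj.github.io/argo-helm"
  chart            = "argo-cd"
  version          = "5.51.6" # chart version cited in the plan
  namespace        = "argocd"
  create_namespace = true

  # Illustrative override: keep the API server internal; expose via
  # nginx-ingress (also Terraform-managed) rather than a public LB.
  set {
    name  = "server.service.type"
    value = "ClusterIP"
  }
}
```

Because ArgoCD is installed this way, upgrading ArgoCD itself is an infra change (PR plus `terraform apply`), not something ArgoCD self-manages.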
| Category | Service | Type | Rationale |
|---|---|---|---|
| Compute | GKE (regional, private) | GCP-native | Primary workload platform |
| Database | Cloud SQL (PostgreSQL 15) | GCP-native | Managed HA, automated backups, PITR |
| Object Storage | Cloud Storage (GCS) | GCP-native | Data, cache, workflows, logs buckets |
| DNS | Cloud DNS | GCP-native | Managed zones, DNSSEC |
| DNS Automation | external-dns (v1.14.0) | Cluster add-on | K8s-native DNS record lifecycle; uses Cloud DNS as provider |
| NAT | Cloud NAT | GCP-native | Outbound internet for private nodes |
| IAM | Cloud IAM + Workload Identity | GCP-native | Pod-level identity, no key files |
| Secrets Backend | Secret Manager | GCP-native | CMEK encryption, audit trail, rotation |
| Secrets Sync | external-secrets (v0.9.11) | Cluster add-on | Bridges Secret Manager to K8s Secrets; standard GitOps pattern |
| Ingress | nginx-ingress (v4.9.0) | Cluster add-on | Already configured in repo with LoadBalancer service type |
| TLS Certificates | cert-manager (v1.14.0) | Cluster add-on | Let's Encrypt + DNS-01 via Cloud DNS; already configured in repo |
| GitOps | ArgoCD (v5.51.6) | Cluster add-on | No GCP equivalent; core deployment mechanism |
| Image Updates | ArgoCD Image Updater | Cluster add-on | Git write-back for auditability; no GCP equivalent |
| Metrics | GKE Managed Prometheus | GCP-native | Managed collection pipeline, PromQL-compatible, Cloud Monitoring storage |
| Dashboards | Cloud Monitoring | GCP-native | SLOs, alerting policies, custom dashboards |
| Logging | Cloud Logging | GCP-native | Enabled by default on GKE; Log Router for export |
| Audit | Cloud Audit Logs | GCP-native | Admin Activity, Data Access, System Event logs |
| Container Registry | Artifact Registry | GCP-native | Docker image storage, vulnerability scanning |
| Build | Cloud Build | GCP-native | CI/CD (API enabled) |
| Identity | Keycloak | Cluster add-on | Application-level IdP; GCP equivalent (Identity Platform) doesn't cover all use cases |
| Query Analytics | Cloud SQL Query Insights | GCP-native | Built-in query performance analysis; enabled by default in Terraform module |
| Notifications | ArgoCD Notifications | Cluster add-on | Slack, webhook, email notifications for deployment events; configured via annotations |
| Admission Control | Binary Authorization or OPA Gatekeeper | GCP-native or add-on | Binary Authorization is GCP-native; OPA Gatekeeper is an alternative |
Self-managed components eliminated by using GCP-native services:
- Prometheus server → GKE Managed Prometheus (managed collection, Cloud Monitoring backend)
- Grafana → Cloud Monitoring dashboards (optional: deploy Grafana with a Cloud Monitoring data source for advanced visualization)
- Loki + Promtail → Cloud Logging (enabled by default, zero deployment)
- Alertmanager → Cloud Monitoring alerting policies with notification channels
| Phase | Action | Actor | Automation | Tooling |
|---|---|---|---|---|
| Prerequisites | Provision GCP project, billing alerts | Customer | Manual | GCP Console, gcloud |
| | Create VPC, subnets, firewall rules | Customer | Manual | GCP Console or Terraform |
| | Establish on-prem VPN connectivity | Customer | Manual | Cloud VPN, network appliance |
| | Configure WIF pool and OIDC provider | Customer | Manual | gcloud, Terraform |
| | Create deployment SAs and assign IAM roles | Customer | Manual | gcloud, Terraform |
| | Validate WIF and SA impersonation | Vendor | Manual | gcloud auth, CI test job |
| Planning | Kickoff, architecture review, network design | Joint | Manual | Meetings, documentation |
| | Environment config (`terraform.tfvars`) | Vendor | Manual | Text editor |
| | Customer design sign-off | Customer | Manual | Approval gate |
| Infrastructure | Terraform plan and apply (GKE, Cloud SQL, GCS, WI) | Vendor | Semi-automated | Terraform CLI or CI/CD pipeline |
| | IaC validation (tflint, tfsec, checkov) | Vendor | Automated | CI pipeline on PR |
| | Security hardening (PSS, NetworkPolicies) | Vendor | Manual | kubectl, Terraform |
| | Cloud SQL post-deploy (extensions, users) | Vendor | Manual | psql, gcloud |
| Add-ons | Deploy cluster add-ons via Terraform `helm_release` (nginx, cert-manager, external-secrets, external-dns, ArgoCD) | Vendor | Semi-automated | Terraform (`helm_release` resources) |
| | Enable GKE Managed Prometheus, create Cloud Monitoring dashboards | Vendor | Semi-automated | Terraform, gcloud |
| | Configure Cloud Logging sinks and alerts | Vendor | Manual | gcloud, Terraform |
| | Deploy Binary Authorization or OPA Gatekeeper | Vendor | Semi-automated | Terraform, gcloud |
| | DNS delegation and NS record setup | Vendor | Manual | Cloud DNS, customer DNS registrar |
| App Deployment | Apply ArgoCD Project and Application manifests | Vendor | Manual | kubectl apply (ArgoCD CRDs) |
| | Initial sync and health validation | Vendor | Automated | ArgoCD auto-sync |
| | Keycloak config (realm, clients, IdP, RBAC, MFA) | Vendor | Manual | Keycloak admin console |
| | Populate secrets in GCP Secret Manager | Vendor | Manual | gcloud |
| | SSL certificate validation | Vendor | Manual | curl, openssl |
| | End-to-end smoke test | Vendor | Semi-automated | Test scripts, manual verification |
| | Load testing | Vendor | Semi-automated | k6, Locust, or similar |
| | Backup restore and failover drill | Vendor | Manual | gcloud, psql |
| UAT | User acceptance testing | Customer | Manual | Application UI |
| | Bug fixes and config adjustments | Vendor | Manual | Code changes, Terraform |
| | Customer sign-off | Customer | Manual | Approval gate |
| Go-Live | Customer UAT on production | Customer | Manual | Application UI |
| | SLO dashboards and alert routing validation | Joint | Manual | Cloud Monitoring |
| | Incident runbooks and escalation procedures | Vendor | Manual | Documentation |
| | Security assessment and vulnerability scan | Joint | Semi-automated | Scanner tools |
| | Rollback drill | Vendor | Manual | ArgoCD, Terraform, psql |
| | DNS cutover and final alerting | Vendor | Manual | Cloud DNS, Cloud Monitoring |
| | Go-live approval | Customer | Manual | Approval gate |
| Post Go-Live | Hypercare monitoring and on-call | Vendor | Manual + automated alerts | Cloud Monitoring, PagerDuty |
| Ongoing - Images | New image pushed to registry | Vendor CI | Automated | GitHub Actions, Artifact Registry |
| | Image Updater detects new tag (staging) | Automatic | Automated | ArgoCD Image Updater (polling) |
| | Tag written back to Git (staging branch) | Automatic | Automated | Image Updater Git write-back |
| | ArgoCD syncs new image to staging | Automatic | Automated | ArgoCD auto-sync |
| | Vendor validates staging deployment | Vendor | Manual | kubectl, dashboards |
| | Vendor opens promotion PR for production | Vendor | Manual | Git, GitHub |
| | Customer approves production promotion PR | Customer | Manual | GitHub PR review |
| | ArgoCD syncs approved image to production | Automatic | Automated | ArgoCD auto-sync |
| | Vendor validates production deployment | Vendor | Manual | kubectl, dashboards |
| Ongoing - Infra | Vendor proposes change via PR | Vendor | Manual | Git, GitHub |
| | IaC validation and image scanning on PR | Automatic | Automated | CI pipeline |
| | Customer reviews and approves PR | Customer | Manual | GitHub PR review |
| | Vendor merges and runs `terraform apply` | Vendor | Semi-automated | Git merge, Terraform CLI or CI/CD |
| | ArgoCD syncs app-layer changes (if any) | Automatic | Automated | ArgoCD auto-sync |
| | Deployment health validation | Vendor | Manual | kubectl, Cloud Monitoring |
| Phase | Action | Actor | Automation | Tooling |
|---|---|---|---|---|
| Prerequisites | Provision GCP project, billing alerts | Customer | Manual | GCP Console, gcloud |
| | Create VPC, subnets, IAM roles | Customer | Manual | GCP Console or Terraform |
| | Establish on-prem VPN connectivity | Customer | Manual | Cloud VPN, network appliance |
| Training | GCP fundamentals, Terraform, GKE, Helm, ArgoCD | Vendor-led | Manual | Workshops, hands-on labs |
| | Hands-on labs (dev environment deployment) | Customer | Manual (guided) | Terraform, kubectl, Helm |
| Planning | Architecture review, network design | Joint | Manual | Meetings, documentation |
| | Customer prepares `terraform.tfvars` | Customer | Manual | Text editor |
| | Vendor reviews customer config | Vendor | Manual | Code review |
| | Select vendor validation model | Joint | Manual | Decision gate |
| | Design sign-off | Customer | Manual | Approval gate |
| Infrastructure | Terraform plan review with vendor | Joint | Manual | Terraform plan output |
| | Terraform plan and apply (GKE, Cloud SQL, GCS, WI) | Customer IT | Semi-automated | Terraform CLI or CI/CD pipeline |
| | IaC validation (tflint, tfsec, checkov) | Customer IT | Automated | CI pipeline on PR |
| | Security hardening (PSS, NetworkPolicies) | Customer IT | Manual | kubectl, Terraform |
| | Cloud SQL post-deploy (extensions, users) | Customer IT | Manual | psql, gcloud |
| | Vendor validates staging infra | Vendor | Manual | Per validation model (A/B/C) |
| Add-ons | Deploy cluster add-ons via Terraform `helm_release` (nginx, cert-manager, external-secrets, external-dns, ArgoCD) | Customer IT | Semi-automated | Terraform (`helm_release` resources) |
| | Enable GKE Managed Prometheus, create Cloud Monitoring dashboards | Customer IT | Semi-automated | Terraform, gcloud |
| | Configure Cloud Logging sinks and alerts | Customer IT | Manual | gcloud, Terraform |
| | Deploy Binary Authorization or OPA Gatekeeper | Customer IT | Semi-automated | Terraform, gcloud |
| | DNS delegation and NS record setup | Customer IT | Manual | Cloud DNS |
| App Deployment | Apply ArgoCD Project and Application manifests | Customer IT | Manual | kubectl apply (ArgoCD CRDs) |
| | Initial sync and health validation | Customer IT | Automated | ArgoCD auto-sync |
| | Keycloak config (realm, clients, IdP, RBAC, MFA) | Customer IT | Manual | Keycloak admin console |
| | Secrets setup in GCP Secret Manager | Customer IT | Manual | gcloud |
| | End-to-end smoke test | Customer IT | Semi-automated | Test scripts, manual verification |
| | Load testing | Customer IT | Semi-automated | k6, Locust, or similar |
| | Backup restore and failover drill | Customer IT | Manual | gcloud, psql |
| | Vendor validates staging deployment | Vendor | Manual | Per validation model (A/B/C) |
| UAT | User acceptance testing | Customer | Manual | Application UI |
| | Bug fixes and config adjustments | Customer IT | Manual | Code changes, Terraform |
| | Customer sign-off | Customer | Manual | Approval gate |
| Go-Live | Customer UAT on production | Customer | Manual | Application UI |
| | SLO dashboards and alert routing validation | Customer | Manual | Cloud Monitoring |
| | Incident runbooks and escalation procedures | Customer | Manual | Documentation |
| | Security assessment and vulnerability scan | Customer | Semi-automated | Scanner tools |
| | Rollback drill | Customer IT | Manual | ArgoCD, Terraform, psql |
| | DNS cutover and final alerting | Customer IT | Manual | Cloud DNS, Cloud Monitoring |
| | Go-live approval | Customer | Manual | Approval gate |
| Post Go-Live | Hypercare monitoring (customer primary, vendor tier-2/3) | Joint | Manual + automated alerts | Cloud Monitoring, PagerDuty |
| | Knowledge transfer (advanced runbooks) | Vendor | Manual | Documentation, workshops |
| Ongoing - Images | New image pushed to registry | Vendor CI | Automated | GitHub Actions, Artifact Registry |
| | Image Updater detects new tag | Automatic | Automated | ArgoCD Image Updater (polling) |
| | Tag written back to Git | Automatic | Automated | Image Updater Git write-back |
| | ArgoCD syncs new image to staging | Automatic | Automated | ArgoCD auto-sync |
| | Customer validates staging | Customer | Manual | Application UI, dashboards |
| | Customer promotes to production via sync window | Customer | Manual | ArgoCD sync or Git merge |
| Ongoing - Infra | Vendor publishes release notes and Terraform changes | Vendor | Manual | Git, documentation |
| | IaC validation and image scanning | Automatic | Automated | CI pipeline |
| | Customer IT evaluates changes | Customer IT | Manual | Code review, documentation |
| | Customer applies Terraform to staging | Customer IT | Semi-automated | Terraform CLI or CI/CD |
| | Customer validates staging | Customer | Manual | Application UI, dashboards |
| | Customer applies to production via sync window | Customer IT | Semi-automated | Terraform CLI or CI/CD |
| | Customer validates production | Customer | Manual | Application UI, dashboards |
| Level | Definition | Examples |
|---|---|---|
| Automated | Runs without human intervention; triggers on events | ArgoCD auto-sync, Image Updater polling, CI pipeline on PR |
| Semi-automated | Human initiates; tooling executes | terraform apply, helm install, load test run |
| Manual | Human performs directly; requires judgment | Architecture review, Keycloak config, UAT, approval gates |
| Manual + automated alerts | Human monitors; system generates alerts | Hypercare period with Cloud Monitoring alerting |
Vendor-managed is recommended for initial deployments:
- Faster time to value — staging delivered ~3 weeks sooner
- Lower deployment risk — vendor knows the Terraform module dependency chain, Cloud SQL post-deploy steps, Workload Identity binding topology, and ArgoCD sync policy nuances
- WIF eliminates the security concern — scoped IAM roles, Cloud Audit Logs, instant revocation via pool deletion; no long-lived credentials
- Lightweight ongoing model for customer — review a diff, approve, ArgoCD auto-syncs; ~hours not days per update cycle
- Transition path exists — customer can move to self-managed later with condensed training against a working system (more effective than training before deployment)
Choose customer-managed when: the customer has a strong platform engineering team, regulatory constraints prohibit any external infrastructure access, or building deep GCP/K8s competency is a strategic goal.
For the vendor-managed model, WIF is recommended over traditional service account keys:
| Aspect | Service Account Key | Workload Identity Federation |
|---|---|---|
| Credential type | Long-lived JSON key file | Short-lived OIDC tokens |
| Key rotation | Manual, error-prone | Automatic (token-based) |
| Revocation | Delete key, redeploy | Disable WIF pool (instant) |
| Audit trail | Cloud Audit Logs | Cloud Audit Logs + OIDC claims |
| CI/CD integration | Store key as secret | Native GitHub Actions OIDC |
| Blast radius | Key leak = full access until rotated | Token expires in minutes |
Setup (customer responsibility):
- Create a Workload Identity Pool in their GCP project
- Add an OIDC provider (vendor's GitHub Actions or Google Workspace) with attribute constraints (repo, environment, branch)
- Create separate deployment Service Accounts for staging and production with tightly scoped custom IAM roles (never use primitive roles like Editor or broad admin roles). Use custom roles with minimal permissions per phase:
  - Infrastructure provisioning: `roles/container.clusterAdmin` (not `container.admin`), `roles/cloudsql.editor` (not `cloudsql.admin`), `roles/storage.objectAdmin` on specific buckets
  - Workload Identity: `roles/iam.serviceAccountUser` scoped to specific SAs via IAM Conditions
  - DNS: `roles/dns.admin` scoped to specific managed zones
  - Secrets: `roles/secretmanager.secretVersionManager` (not `secretmanager.admin`)
  - Networking: `roles/compute.networkAdmin` for VPC peering and Cloud NAT creation, `roles/compute.routerAdmin` for Cloud Router management; downgrade to `roles/compute.networkUser` on specific subnets for runtime workloads after infrastructure provisioning is complete
  - Prefer custom roles with only the exact permissions required; use IAM Conditions to restrict by resource name, environment label, or time window where possible
- Grant `roles/iam.workloadIdentityUser` on the SA to the WIF pool (SA impersonation pattern)
- Share the WIF pool ID and project number with the vendor
- Schedule periodic access reviews (quarterly recommended)
- Permission audits: Conduct quarterly reviews of SA permissions using IAM Recommender to identify and remove unused roles
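The pool, provider, and SA binding steps above can be sketched in Terraform (a sketch only; the pool ID, SA name, and `vendor-org/infra-repo` repository are placeholders, not actual customer values):

```hcl
resource "google_service_account" "deploy_staging" {
  account_id   = "deploy-staging"                    # placeholder SA name
  display_name = "Staging deployment SA"
}

resource "google_iam_workload_identity_pool" "vendor" {
  workload_identity_pool_id = "vendor-deploy-pool"   # placeholder pool ID
}

resource "google_iam_workload_identity_pool_provider" "github" {
  workload_identity_pool_id          = google_iam_workload_identity_pool.vendor.workload_identity_pool_id
  workload_identity_pool_provider_id = "github-oidc"
  attribute_mapping = {
    "google.subject"       = "assertion.sub"
    "attribute.repository" = "assertion.repository"
  }
  # Constrain tokens to the vendor's infra repository (placeholder name)
  attribute_condition = "assertion.repository == \"vendor-org/infra-repo\""
  oidc {
    issuer_uri = "https://token.actions.githubusercontent.com"
  }
}

# SA impersonation: principals from the WIF pool may impersonate the staging SA
resource "google_service_account_iam_member" "wif_staging" {
  service_account_id = google_service_account.deploy_staging.name
  role               = "roles/iam.workloadIdentityUser"
  member             = "principalSet://iam.googleapis.com/${google_iam_workload_identity_pool.vendor.name}/attribute.repository/vendor-org/infra-repo"
}
```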
The Terraform modules reference a VPC by name. The existing repo defaults to the default VPC and subnet with auto-allocated secondary IP ranges for GKE pods/services. For customer deployments:
- VPC: The customer's VPC name and subnet CIDR ranges must be specified in `terraform.tfvars`. The customer may use an existing shared VPC or a dedicated VPC — this must be resolved during the architecture review (Phase 1)
- Subnets: Define secondary IP ranges for GKE pods and services explicitly (the repo currently uses auto-allocation, which should be replaced with planned CIDR ranges for production)
- Cloud SQL: Connects via private IP over VPC peering, which requires the VPC to have Private Services Access configured (the `vpc-peering` module handles this)
- Cloud DNS: The repo uses a shared managed zone pattern (`data.google_dns_managed_zone.shared`) referencing a pre-existing zone, not creating one per environment. For customer deployments, determine whether the customer provides an existing zone or a new one is created via the `cloud-dns` module
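Explicit secondary ranges can be declared on the subnet; a minimal sketch (all names and CIDRs are placeholders to be agreed during the architecture review):

```hcl
resource "google_compute_subnetwork" "gke" {
  name          = "gke-subnet"       # placeholder; real names come from terraform.tfvars
  network       = "customer-vpc"     # placeholder VPC name
  region        = "europe-west1"     # placeholder region
  ip_cidr_range = "10.10.0.0/20"     # primary range (nodes)

  secondary_ip_range {
    range_name    = "gke-pods"
    ip_cidr_range = "10.20.0.0/16"   # planned pod range, replacing auto-allocation
  }
  secondary_ip_range {
    range_name    = "gke-services"
    ip_cidr_range = "10.30.0.0/20"   # planned service range
  }
}
```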
Production clusters should be regional (not zonal) with multi-zone node pools for HA. The architecture review must define:
- Regional cluster with control plane replicated across 3 zones
- Node pool autoscaling ranges (min/max nodes per pool)
- PodDisruptionBudgets for all critical workloads
- HPA (Horizontal Pod Autoscaler) targets for application services
- VPA (Vertical Pod Autoscaler) recommendations for right-sizing
- Cluster Autoscaler configuration for node-level scaling
The architecture review must clarify cross-region DR requirements based on RTO/RPO targets:
- Single-region HA (default): Multi-zone regional GKE + Cloud SQL Regional HA covers zone-level failures; sufficient for most deployments
- Cross-region DR (if required by RTO/RPO): Plan for cross-region Cloud SQL read replica promotion, GCS cross-region replication, and a passive standby GKE cluster in a secondary region
- Decision criteria: If RTO < 1 hour and RPO < 5 minutes for regional outages, cross-region DR is recommended; otherwise, single-region HA with backup/restore is sufficient
- DR drill: If cross-region is implemented, include a DR failover drill in Phase 8 (Go-Live) tasks
- GitOps repository DR: The Git repository is the single source of truth for all infrastructure and application configuration. Ensure the Git hosting provider (e.g., GitHub) has redundancy. Additionally, configure a mirror repository (e.g., GitHub -> Cloud Source Repositories or a self-hosted GitLab) that can serve as a fallback if the primary Git provider experiences an outage. Include Git repository recovery in the DR drill.
The observability stack uses GCP-managed services, eliminating self-managed Prometheus/Grafana/Loki deployments:
Metrics — GKE Managed Prometheus (GMP):
- GMP is a built-in GKE feature (enable `managed_prometheus` on the cluster via `monitoring_config { enable_components = ["SYSTEM_COMPONENTS"] managed_prometheus { enabled = true } }` in the GKE Terraform module); no Helm charts or PVCs to manage
- Implementation note: The existing GKE Terraform module uses the legacy `monitoring_service` attribute. This must be replaced with the `monitoring_config` block to enable GMP. The module update should be included as a prerequisite task in Phase 2 (Staging Infrastructure).
- Runs a managed collection pipeline on each node; scrapes Prometheus-format metrics and writes to Cloud Monitoring
- Fully PromQL-compatible — existing dashboards and alerting rules work without modification
- Retention handled by Cloud Monitoring (free tier: 24 months for GCP metrics, custom metrics billed per sample ingested)
- Resource overhead: GMP collection pods run as a DaemonSet with low resource footprint (~50-100MB per node); significantly less than self-managed Prometheus
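In resource context, the `monitoring_config` replacement described above looks like the following sketch (cluster name and location are placeholders):

```hcl
resource "google_container_cluster" "staging" {
  name     = "staging"        # placeholder cluster name
  location = "europe-west1"   # placeholder region (regional cluster)

  # Replaces the legacy monitoring_service attribute
  monitoring_config {
    enable_components = ["SYSTEM_COMPONENTS"]
    managed_prometheus {
      enabled = true
    }
  }
}
```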
Dashboards and Alerting — Cloud Monitoring:
- Native SLO monitoring with burn-rate alerting
- Custom dashboards via the Cloud Monitoring console or Terraform (`google_monitoring_dashboard`)
- Alerting policies with notification channels (PagerDuty, Slack, email, Pub/Sub)
- Uptime checks for external endpoint monitoring
- No Grafana deployment needed (optional: deploy Grafana with Cloud Monitoring as a data source if advanced visualization is required)
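An alerting policy and its notification channel can be managed in Terraform alongside the cluster; a hedged sketch (the metric, threshold, and email address are illustrative, not agreed SLO values):

```hcl
resource "google_monitoring_notification_channel" "oncall" {
  display_name = "On-call email"          # placeholder channel
  type         = "email"
  labels = {
    email_address = "oncall@example.com"  # placeholder address
  }
}

resource "google_monitoring_alert_policy" "pod_restarts" {
  display_name          = "High container restart rate"  # illustrative policy
  combiner              = "OR"
  notification_channels = [google_monitoring_notification_channel.oncall.id]

  conditions {
    display_name = "Restart rate above threshold"
    condition_threshold {
      filter          = "metric.type=\"kubernetes.io/container/restart_count\" AND resource.type=\"k8s_container\""
      comparison      = "COMPARISON_GT"
      threshold_value = 5        # placeholder threshold
      duration        = "300s"
      aggregations {
        alignment_period   = "60s"
        per_series_aligner = "ALIGN_RATE"
      }
    }
  }
}
```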
Logging — Cloud Logging:
- Enabled by default on GKE; no agents to deploy
- Cost management: configure log exclusion filters for high-volume debug/trace logs; route retained logs to GCS or BigQuery at lower cost
- See "Centralized Logging (Cloud Logging)" section for full details
Resource planning:
- Keycloak: resource-intensive; define CPU/memory requests/limits explicitly
- Dedicated node pools: consider for Keycloak in production to isolate from application pods
- Cloud Monitoring costs: estimate custom metrics volume (GMP ingestion) during load testing; use metrics exclusion filters for high-cardinality labels
- Performance validation: validate GMP collection overhead during load testing
The Terraform module defaults to `ZONAL` availability. For production, `terraform.tfvars` must explicitly set `cloudsql_availability_type = "REGIONAL"` to enable automatic failover. The failover drill before go-live validates connection retry behavior, PgBouncer reconnection (if applicable), and application recovery within the defined RTO.
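At the provider level, that module variable ultimately maps to the instance's `availability_type`; a sketch of the target configuration (instance name, tier, and region are placeholders, and the module's actual variable wiring may differ):

```hcl
resource "google_sql_database_instance" "main" {
  name             = "app-db"             # placeholder instance name
  database_version = "POSTGRES_15"
  region           = "europe-west1"       # placeholder region

  settings {
    tier              = "db-custom-4-16384"  # sizing decided during capacity planning
    availability_type = "REGIONAL"           # ZONAL is the default; REGIONAL enables automatic failover

    backup_configuration {
      enabled                        = true
      point_in_time_recovery_enabled = true  # PITR, per the DR and migration requirements
    }
  }
}
```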
A capacity planning task should be completed during the architecture review (Phase 1) and validated during load testing:
- Cloud SQL sizing: Define vCPU/memory tier based on expected concurrent connections and query volume; the module defaults to `max_connections = 100`, which may be insufficient for production — adjust via database flags in `terraform.tfvars`
- Connection pooling: Deploy PgBouncer as a sidecar or standalone deployment in GKE (Cloud SQL does not provide a native connection pooler). Alternatively, use Cloud SQL Auth Proxy with the `--max-connections` flag. Define max pool size per service based on expected concurrency.
- GKE node autoscaler bounds: Set `min_node_count` and `max_node_count` based on pod resource requests and expected workload; validate that GCP project quotas (vCPUs, IP addresses, persistent disks) can accommodate the maximum node count
- Validation: Load testing results must confirm that the capacity plan supports 2x expected peak load with headroom
Load tests must have defined pass/fail criteria agreed during the architecture review:
- Target throughput (RPS) and latency percentiles (p50, p95, p99)
- Cloud SQL connection pool limits and saturation behavior
- GKE node autoscaling response under load
- Error rate thresholds (e.g., < 0.1% 5xx during steady state)
- Capacity headroom validation (sustain 2x expected peak)
Before any `terraform apply`, run automated validation:
- `terraform validate` — syntax and internal consistency
- `tflint` — Terraform linting with GCP ruleset
- `tfsec` / `checkov` — security policy scanning (no public IPs, encryption at rest, etc.)
- These checks should be integrated into the CI pipeline for ongoing Terraform PRs
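A minimal CI job running these checks on every PR might look like the following GitHub Actions sketch (the workflow name, action versions, and choice of tfsec over checkov are assumptions, not the repo's actual pipeline):

```yaml
# .github/workflows/iac-validate.yml — illustrative only
name: iac-validate
on: [pull_request]

jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
      - run: terraform init -backend=false
      - run: terraform validate
      - uses: terraform-linters/setup-tflint@v4
      - run: tflint --recursive
      - uses: aquasecurity/tfsec-action@v1.0.0   # or run checkov instead
```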
- Image scanning: all container images scanned for CVEs before deployment (Artifact Registry vulnerability scanning or equivalent)
- Chart provenance: Helm chart signatures verified against known publishers
- Admission control: GKE Binary Authorization or OPA Gatekeeper policies to enforce only signed/scanned images are deployed -- this is a mandatory pre-production milestone and must be enabled and validated in staging before production deployment begins
- Policy enforcement scheduling: Binary Authorization / OPA Gatekeeper must be deployed and tested as an explicit task in Phase 3 (staging add-ons) and Phase 7 (production add-ons), with validation that non-compliant images are blocked
- Exception management: Define an exceptions process for policy violations (e.g., break-glass procedure for emergency deployments with post-hoc review)
- SAST/DAST: application security testing is the responsibility of the application CI pipeline (separate from infrastructure deployment)
The Image Updater is configured with Git write-back mode — when it detects a new image tag in the container registry, it commits the updated tag back to the Git repository before ArgoCD syncs. This preserves the GitOps audit trail: every deployed image version has a corresponding Git commit. The ArgoCD Application manifests must include Image Updater annotations specifying which images to watch, the update strategy (semver, latest, digest), and the write-back target branch.
Environment separation for staging-first updates:
The existing repo has all ArgoCD Applications tracking the same main branch with identical auto-sync policies. For customer deployments with staging-first gating, the ArgoCD Applications must be configured differently per environment:
- Staging Application: Include Image Updater annotations (`argocd-image-updater.argoproj.io/image-list`, `argocd-image-updater.argoproj.io/write-back-method: git`) so new images are automatically detected and synced
- Production Application: Do NOT include Image Updater annotations — production image updates require a promotion PR that updates the image tag in the production Helm values file, reviewed and approved before ArgoCD syncs
- This configuration is applied during the "Apply ArgoCD Project and Application with Image Updater annotations" task in Phase 4 (vendor-managed) or Phase 6 (customer-managed)
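A staging Application carrying these annotations might look like the following sketch (the Application name, registry path, repo URL, and branch are placeholders):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: app-staging                 # placeholder name
  namespace: argocd
  annotations:
    argocd-image-updater.argoproj.io/image-list: app=europe-docker.pkg.dev/my-project/images/app  # placeholder path
    argocd-image-updater.argoproj.io/app.update-strategy: semver
    argocd-image-updater.argoproj.io/write-back-method: git
    argocd-image-updater.argoproj.io/git-branch: staging   # placeholder write-back branch
spec:
  project: default
  source:
    repoURL: https://github.com/example/app-config.git     # placeholder repo
    targetRevision: staging
    path: charts/app
  destination:
    server: https://kubernetes.default.svc
    namespace: em-semi-app
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
```

The production Application would carry the same `spec` shape but no Image Updater annotations, so its image tag only changes through the promotion PR.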
GKE clusters have Cloud Logging enabled by default — system and workload logs are collected automatically via the GKE logging agent (Fluent-bit-based) with no additional deployment required. This eliminates the need to deploy and manage a self-hosted logging stack (e.g., Loki + Promtail):
- Zero deployment overhead: Cloud Logging is a managed service; no Helm charts, persistent volumes, or capacity planning for log storage
- Native GCP integration: Logs Explorer, Log Analytics (BigQuery-backed), Error Reporting, Cloud Trace — all integrated with IAM
- Log Router: Route logs to BigQuery (for analytics), GCS (for long-term archival), or Pub/Sub (for SIEM integration) via configurable sinks
- Log retention: Default 30 days in Cloud Logging; extend via Log Router sinks to GCS (cold) or BigQuery (queryable) for compliance requirements
- Log-based metrics and alerting: Create custom metrics from log entries and configure alerting policies in Cloud Monitoring — no separate alerting stack needed
- Access control: IAM-based permissions (`roles/logging.viewer`, `roles/logging.privateLogViewer`) scoped by project or log bucket
- Cost management: Configure log exclusion filters to drop high-volume debug/trace logs before ingestion; use log buckets with different retention periods for cost optimization
- When to consider Loki instead: Only if multi-cloud portability is a hard requirement (i.e., the same deployment must run on AWS/Azure without GCP services)
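A Log Router sink routing logs to GCS for archival can be sketched in Terraform (bucket name, location, and filter are placeholders):

```hcl
resource "google_storage_bucket" "log_archive" {
  name                        = "customer-log-archive"  # placeholder bucket name
  location                    = "EU"                    # placeholder location
  uniform_bucket_level_access = true
}

# Route WARNING-and-above logs to GCS for long-term archival
resource "google_logging_project_sink" "archive" {
  name                   = "logs-to-gcs"
  destination            = "storage.googleapis.com/${google_storage_bucket.log_archive.name}"
  filter                 = "severity >= WARNING"        # illustrative filter
  unique_writer_identity = true
}

# The sink's service identity needs write access to the bucket
resource "google_storage_bucket_iam_member" "sink_writer" {
  bucket = google_storage_bucket.log_archive.name
  role   = "roles/storage.objectCreator"
  member = google_logging_project_sink.archive.writer_identity
}
```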
Terraform state buckets must be configured with the following protections:
- Object versioning: Enabled on the GCS bucket to allow state recovery from accidental corruption or deletion
- State locking: Use GCS-native state locking (default with the `gcs` backend) to prevent concurrent modifications
- Bucket-level access: Uniform bucket-level access with IAM-only permissions (no ACLs)
- Encryption: Customer-managed encryption keys (CMEK) for state files containing infrastructure details
- Backup: Cross-region replication for disaster recovery of state files (optional, based on data residency and compliance requirements; if disabled due to regulatory constraints, require versioning plus periodic encrypted backups to a secondary bucket)
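The protections above can be sketched as follows (bucket name and location are placeholders; the KMS key is assumed to be provisioned separately, and the backend block would normally live in a different configuration than the bucket it points at):

```hcl
resource "google_storage_bucket" "tf_state" {
  name                        = "customer-tf-state"   # placeholder bucket name
  location                    = "EU"                  # placeholder location
  uniform_bucket_level_access = true                  # IAM-only, no ACLs

  versioning {
    enabled = true                                    # state recovery after corruption/deletion
  }

  encryption {
    default_kms_key_name = var.state_kms_key_id       # CMEK key, created elsewhere
  }
}

variable "state_kms_key_id" {
  type        = string
  description = "Fully qualified Cloud KMS key for state encryption"
}

# Consumer side: state locking is native to the gcs backend
terraform {
  backend "gcs" {
    bucket = "customer-tf-state"
    prefix = "env/staging"
  }
}
```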
All GKE clusters (staging and production) must be configured as private clusters:
- Private nodes: Disable public IP addresses on nodes (`enable_private_nodes = true`)
- Private endpoint: Disable the public control plane endpoint (`enable_private_endpoint = true`) or restrict via master authorized networks
- Cloud NAT: Required for outbound internet access from private nodes (already included in Terraform)
- Access path: Document the authorized access path (e.g., VPN -> bastion -> kubectl, or Cloud Shell with Private Google Access)
- Prerequisite task: A bastion host or VPN-based kubectl access path must be provisioned and validated as a Phase 0 prerequisite (task `cust_access`) before any cluster operations can begin. This task validates that `kubectl get nodes` succeeds through the authorized path and confirms master authorized networks are correctly configured.
- This must be validated during the architecture review (Phase 1) and enforced via `tfsec` rules
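The private-cluster settings above combine in the cluster resource roughly as follows (cluster name and all CIDR ranges are placeholders to be fixed during the architecture review):

```hcl
resource "google_container_cluster" "production" {
  name     = "production"       # placeholder name
  location = "europe-west1"     # regional: control plane replicated across zones

  private_cluster_config {
    enable_private_nodes    = true               # no public node IPs
    enable_private_endpoint = true               # no public control plane endpoint
    master_ipv4_cidr_block  = "172.16.0.0/28"    # placeholder control-plane range
  }

  # Only the VPN/bastion range may reach the control plane
  master_authorized_networks_config {
    cidr_blocks {
      cidr_block   = "10.50.0.0/24"              # placeholder VPN/bastion CIDR
      display_name = "vpn-bastion"
    }
  }
}
```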
To resolve the contradiction between "no vendor access to customer GCP" and vendor validation requirements, define one of the following validation models during the architecture review:
- Option A: Time-bound read-only access -- Customer grants temporary Viewer IAM role with expiration (e.g., 4-hour session) for vendor validation, revoked immediately after
- Option B: Customer-provided evidence -- Customer provides validation artifacts (screenshots, CLI output, health check reports) using a standardized checklist
- Option C: Supervised screen-share -- Vendor observes customer-run validation commands via video call, providing real-time guidance
- The chosen model must be documented in the architecture review deliverables and referenced in the Gantt chart as a prerequisite for each vendor validation task
Kubernetes Secrets stored in etcd must be encrypted at the application layer using CMEK:
- Enable Application-layer Secrets Encryption on the GKE cluster using a Cloud KMS key (`database_encryption { state = "ENCRYPTED", key_name = "projects/.../cryptoKeys/..." }` in Terraform)
- Alternative: Use the Secret Manager CSI Driver to mount secrets directly from GCP Secret Manager as volumes, bypassing etcd storage entirely. This eliminates the risk of plaintext secrets in etcd backups.
- Namespace-level RBAC: Restrict `get`/`list`/`watch` on Secrets resources to only the service accounts that need them (default `ClusterRole` grants are too broad)
- Implementation note: The GKE Terraform module must be extended with a `database_encryption` block. The KMS key must be created in the customer's project and granted `roles/cloudkms.cryptoKeyEncrypterDecrypter` to the GKE service agent.
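A sketch of that module extension, including the key and the service-agent grant (key ring/key names are placeholders; the cluster block shows only the added portion):

```hcl
variable "project_number" {
  type        = string
  description = "Numeric project number, used to address the GKE service agent"
}

# KMS key for etcd-level secrets encryption (names are placeholders)
resource "google_kms_key_ring" "gke" {
  name     = "gke-keys"
  location = "europe-west1"
}

resource "google_kms_crypto_key" "etcd" {
  name     = "etcd-secrets"
  key_ring = google_kms_key_ring.gke.id
}

# The GKE service agent must be able to use the key
resource "google_kms_crypto_key_iam_member" "gke_agent" {
  crypto_key_id = google_kms_crypto_key.etcd.id
  role          = "roles/cloudkms.cryptoKeyEncrypterDecrypter"
  member        = "serviceAccount:service-${var.project_number}@container-engine-robot.iam.gserviceaccount.com"
}

resource "google_container_cluster" "staging" {
  # ... existing cluster configuration ...
  database_encryption {
    state    = "ENCRYPTED"
    key_name = google_kms_crypto_key.etcd.id
  }
}
```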
The Git write-back token used by ArgoCD Image Updater must follow these security controls:
- Least privilege: Use a GitHub App or fine-grained personal access token (PAT) scoped to the specific repository with `contents: write` only
- Secure storage: Store the token in GCP Secret Manager (not as a Kubernetes Secret) and sync via External Secrets Operator
- Rotation: Enforce token rotation via a CronJob that refreshes the token from Secret Manager at a defined cadence (e.g., every 30 days)
- Audit: Log all Git write-back commits and monitor for unexpected commit patterns
- Signed commits: Consider enforcing GPG-signed commits for Image Updater write-backs to prevent commit spoofing
- CronJob monitoring: Configure Cloud Monitoring alerts for token rotation CronJob failures (`kube_job_status_failed` metric) and ArgoCD sync state degradation (`argocd_app_sync_status{sync_status="OutOfSync"}` for prolonged periods). Alert on: CronJob not completing within expected window, consecutive CronJob failures, and ArgoCD unable to push write-back commits.
The existing repo includes ArgoCD notification annotations per environment (Slack channels for deployment events, health degradation, and sync failures). For customer deployments:
- Adapt notification channels: Replace vendor Slack channels with the customer's notification targets (Slack, PagerDuty, email, or webhook)
- Events to configure: `on-deployed`, `on-health-degraded`, `on-sync-failed`, `on-sync-running` (production should include all four)
- Implementation: Notification annotations are set on the ArgoCD Application manifests; notification templates and triggers are configured in the ArgoCD ConfigMap
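The per-Application subscriptions follow ArgoCD's `notifications.argoproj.io/subscribe.<trigger>.<service>` annotation pattern; a sketch (the Slack channel names are placeholders for the customer's actual targets):

```yaml
metadata:
  annotations:
    notifications.argoproj.io/subscribe.on-deployed.slack: customer-deployments     # placeholder channel
    notifications.argoproj.io/subscribe.on-sync-running.slack: customer-deployments
    notifications.argoproj.io/subscribe.on-health-degraded.slack: customer-alerts
    notifications.argoproj.io/subscribe.on-sync-failed.slack: customer-alerts
```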
The existing Terraform module enables Cloud SQL Query Insights by default:
- `query_insights_enabled = true`
- `query_plans_per_minute = 5`
- `query_string_length = 1024`
- `record_application_tags = true`
This provides built-in query performance analysis in the GCP Console without additional tooling. For customer deployments, validate that Query Insights is enabled and review the query plan sample rate during load testing.
Both deployment models must include centralized audit log management:
- Cloud Audit Logs: Ensure Admin Activity, Data Access, and System Event logs are enabled for all GCP services
- Log sinks: Configure a Cloud Logging sink to export audit logs to a dedicated logging project or BigQuery for long-term retention
- Retention: Minimum 1 year retention for compliance; 90 days in Cloud Logging, remainder in cold storage (GCS or BigQuery)
- Security alerts: Configure alerting rules for critical events:
- IAM policy changes and role grants
- ArgoCD admin login attempts and RBAC modifications
- Secret Manager access patterns and unusual secret reads
- GKE control plane access from unexpected IPs
- SIEM integration: If the customer uses a SIEM, configure log forwarding via Pub/Sub or direct integration
Define a maintenance and patching policy for post-go-live operations:
- GKE version upgrades: Select an upgrade channel (rapid/regular/stable) based on the customer's risk profile and change-management requirements during the architecture review; use a staging-first approach with staging upgraded 1 week before production
- Node OS image updates: Enable auto-upgrade for node pools with maintenance windows configured for off-peak hours
- Add-on versioning: Track Helm chart versions for all add-ons (ArgoCD, cert-manager, external-secrets, etc.) and schedule quarterly update reviews
- Compatibility testing: Validate add-on compatibility in staging before production upgrades
- Rollback criteria: Define rollback triggers (e.g., pod crash rate > 5%, health check failures, API errors) and procedures
- Communication: Establish a maintenance notification process with the customer (minimum 48-hour advance notice for production changes)
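The channel selection and off-peak maintenance window can be pinned in the cluster's Terraform definition. The sketch below assumes the stable channel and a weekend window; cluster name, location, times, and recurrence are illustrative values to be replaced during the architecture review:

```hcl
# Sketch: release channel and off-peak maintenance window for production.
# All values here are placeholders.
resource "google_container_cluster" "prod" {
  name     = "prod-cluster"
  location = "europe-west1"

  release_channel {
    channel = "STABLE"
  }

  maintenance_policy {
    recurring_window {
      start_time = "2026-03-01T02:00:00Z"
      end_time   = "2026-03-01T06:00:00Z"
      recurrence = "FREQ=WEEKLY;BYDAY=SA,SU"   # RFC 5545 RRULE
    }
  }
}
```

Staging would use the same window one week earlier in the upgrade cycle, per the staging-first policy above.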
Define migration tooling and procedures for safe schema changes and rollback:
- Migration tooling: Use Flyway or Liquibase for versioned, repeatable database migrations with a clear migration history
- Backward compatibility: Enforce backward-compatible schema changes (additive only) to enable safe rollback; breaking changes require a multi-phase migration plan
- Pre-migration validation: Run migration dry-runs in staging with production-like data volumes before applying to production
- Post-migration checks: Automated verification of schema state, data integrity, and application health after each migration
- PITR alignment: Validate that Cloud SQL point-in-time recovery (PITR) can restore to a state before migration within the defined RTO/RPO
- Rollback rehearsal: Include migration rollback as part of the pre-go-live rollback drill, testing both Flyway/Liquibase undo and PITR recovery paths
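To make the "additive only" rule concrete, here is a sketch of a backward-compatible versioned Flyway migration (filename, table, and column are illustrative). The old application version keeps working because the new column is nullable and unreferenced by existing queries:

```sql
-- V7__add_invoice_due_date.sql  (Flyway versioned migration; name is illustrative)
-- Additive, backward-compatible change: a nullable column the previous
-- application version can safely ignore, enabling rollback without a schema undo.
ALTER TABLE invoices
  ADD COLUMN due_date DATE NULL;

-- For large tables, backfill in batches outside the migration to avoid
-- long locks; dropping or renaming columns would require a multi-phase plan.
```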
Each deployment phase should have a documented backout procedure:
- Infrastructure phases: `terraform destroy` for the specific module, or `terraform apply` with previous state
- Add-on phases: `helm uninstall` for the specific release
- Application phases: ArgoCD sync to previous Git commit, or `argocd app rollback`
- Database phases: point-in-time recovery from automated Cloud SQL backups
- Go/no-go criteria: defined at each phase boundary to prevent proceeding with a broken foundation. Specific gates:
- After IaC validation: all `tfsec`/`tflint` checks pass with zero critical findings
- After smoke test: all services respond to health checks, end-to-end user flow completes
- After load test: meets defined p95 latency, throughput, and error rate thresholds
- After security assessment: no critical or high vulnerabilities unmitigated
- These gates should be represented as milestones in the Gantt chart with explicit pass/fail criteria documented in the architecture review deliverables
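A minimal sketch of the IaC gate as a CI step follows; the flags shown exist in recent `tfsec`/`tflint` releases but should be verified against the pinned tool versions before relying on exit codes as the pass/fail signal:

```shell
# Sketch: IaC gate check run from the Terraform root module.
tflint --recursive
tfsec . --minimum-severity CRITICAL   # non-zero exit on critical findings
```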