
The Pattern
Modern cloud infrastructure often evolves through incremental additions.
A team starts with basic CI/CD, adds Terraform for IaC, integrates
security scanning, sets up monitoring—each piece works in isolation,
but the system as a whole becomes fragile.
Here's a failure pattern I've observed across multiple production
GCP environments: what appears to be "a few broken configs" is actually
a multi-layer architectural problem spanning Docker, Terraform, GitHub
Actions, and cloud-native security tooling.
Let's dissect it.
DISCLAIMER: All code examples, project names, domains, and configurations in this article are sanitized examples for educational purposes. No real client data or proprietary information is exposed. This analysis is based on publicly available documentation and common infrastructure patterns.
The Symptom List
In this pattern, teams typically surface a cluster of related failures:
Build & Container Issues: Docker builds failing because CI requests a build stage the Dockerfile never declares.
Infrastructure-as-Code Failures: outputs referencing modules that don't exist, variable names that don't match module contracts, region conflicts that look like typos.
CI/CD Architecture Gaps: tests and builds running in parallel with no gating, Terraform executing in the wrong directory, a single hardcoded feature branch as the only deployment path, production URLs defined nowhere.
Security Tooling Integration Conflicts: scanners that report but never block, runtime rules flooding teams with false positives, the same CVE surfacing from three different tools.
Tech stack representative of this pattern: GitHub Actions, GCP Cloud Run, Artifact Registry, Terraform, Firebase Hosting, containerized microservices with pnpm/npm monorepo structure.
Seems like a lot of small fixes, right? The reality is more complex.
What I Actually Found: The 3-Layer Problem
These aren't isolated bugs. They're symptoms of failures at three distinct levels.
Layer 1: The Obvious (Syntax & Configuration Errors)
These are the errors you see immediately when you run the tools:
Docker Target Mismatch:
# Dockerfile declares:
FROM node:20-alpine AS runner

# GitHub Action requests:
with:
  target: app  # ❌ Stage "app" doesn't exist
Terraform Module Reference:
# outputs.tf tries to reference:
output "api_url" {
  value = module.cloud_run_api.service_url  # ❌ Module doesn't exist
}

# main.tf actually has:
module "api_service" {  # Different name!
  source = "../../modules/cloud-run"
}
Variable Name Mismatch:
# envs/prod/main.tf sends:
module "api" {
  service_name = "api-prod"  # ❌ Module doesn't accept this
}

# modules/cloud-run/variables.tf expects:
variable "name" {  # Different variable!
  type = string
}
These are language and consistency errors. Terraform requires that any resource or module referenced in output files be explicitly declared in the active configuration. When you refactor and change module names in main.tf but forget to update outputs.tf, you get this.
The fix? Run terraform validate — it catches these immediately without even connecting to the cloud.
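A minimal pre-flight sketch, assuming the envs/staging and envs/prod layout used throughout this article:
# Validate every environment before pushing; no cloud credentials needed
for env in envs/staging envs/prod; do
  (cd "$env" && terraform init -backend=false -input=false && terraform validate) \
    || echo "❌ validation failed in $env"
done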
Layer 2: Platform Changes (Hidden Causes)
This is where it gets interesting. Some failures aren't in the code — they're in how GCP's platform has evolved.
GCP Service Account Permission Changes:
GCP recently changed how Cloud Build uses service accounts. What used to work automatically now fails because the build service account no longer has default permissions to write logs or read from Artifact Registry.
The missing piece: iam.serviceaccounts.actAs permission, required for one identity to assume the role of a runtime service account.
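A sketch of the typical repair, assuming hypothetical deployer and runtime service account names:
# Grant the deployer permission to act as the runtime service account.
# roles/iam.serviceAccountUser contains iam.serviceaccounts.actAs.
gcloud iam service-accounts add-iam-policy-binding \
  runtime@PROJECT.iam.gserviceaccount.com \
  --member="serviceAccount:deployer@PROJECT.iam.gserviceaccount.com" \
  --role="roles/iam.serviceAccountUser"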
Organization Policy Restrictions:
That "Firebase region conflict" isn't a typo in your Terraform. It's a collision with constraints/gcp.resourceLocations — an organization policy that blocks deployments to certain regions, even if your Terraform syntax is perfect.
VPC Service Controls:
If the project sits inside a VPC Service Controls perimeter, Cloud Run deployments can fail silently with confusing 403/404 errors. The perimeter blocks communication between Google services — like the Cloud Run agent trying to read images from Artifact Registry.
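To check whether a perimeter is in play (assuming you can read the organization's access policy):
# Find the org's access policy, then inspect its perimeters
gcloud access-context-manager policies list --organization=YOUR_ORG_ID
gcloud access-context-manager perimeters list --policy=POLICY_ID
# Look for your project number under "resources" in the perimeter spec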
Security Tooling Conflicts:
When security tools are added incrementally — each solving a specific
problem in isolation — they create overlapping responsibilities and
contradictory enforcement policies.
A typical pattern: Trivy scans images in CI but only reports, Falco watches runtime with generic rules, Security Command Center aggregates findings, and Container Analysis quietly rescans the same images.
Each tool works. The integration doesn't.
The failure cascade: vulnerable images pass CI because nothing fails the job, Falco floods the team with startup false positives, and the same CVE appears three times in Security Command Center with no clear owner.
Root cause: No centralized security policy. Each tool was added
without defining ownership, enforcement boundaries, or a single
source of truth for findings.
The hidden cost: Security tools that don't actually gate deployments give a false sense of protection. The pipeline feels secure. It isn't.
GCP Resource Name Limits:
GCP has a 63-character limit for resource names. If your Terraform generates names that exceed this (long prefixes like baseInstanceName), the system truncates them, causing duplicate name conflicts and deployment failures.
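A rough pre-apply guard, sketched as a heuristic that scans the plan for over-long names (assumes jq is available):
# Flag any planned resource name longer than GCP's 63-character limit
terraform plan -out=tfplan >/dev/null
terraform show -json tfplan \
  | jq -r '.resource_changes[]?.change.after.name? // empty' \
  | awk 'length($0) > 63 { print "❌ too long (" length($0) "): " $0 }'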
These aren't bugs in your code. They're platform governance and technical constraints that interact badly with naive configurations.
Layer 3: Architectural Debt (The Root Problem)
The deepest layer isn't about syntax or permissions — it's about missing architecture.
No CI/CD Gates:
The build and CI workflows are decoupled. Tests can fail, but images still get built and pushed. There's no needs: dependency chain enforcing that tests pass before builds run.
# What's happening:
jobs:
  test:
    runs-on: ubuntu-latest
  build:
    runs-on: ubuntu-latest  # ❌ Runs in parallel, doesn't wait for tests
Wrong Directory Context:
GitHub Actions runs terraform plan in the repository root instead of envs/staging/. Terraform is directory-dependent — without the right context, it validates an empty or incomplete configuration.
Hardcoded Feature Branch:
Only one deployment path works: a specific feature branch → staging. There's no development → staging automation, no main → production workflow. Everything else is manual.
Missing Environment Variables:
Production URLs and domains aren't defined anywhere in the automation. Cloud Run services deploy without knowing their actual domain mappings, leaving SSL certificates stuck in provisioning or external access failing with 404/502.
This is lifecycle orchestration failure. Someone built pieces that "worked" in isolation but never architected how they fit together.
Why Fixing Order Matters
You can't just "fix what's broken." Here's why sequence matters:
❌ Fix production Terraform first → Staging still broken, can't test changes
❌ Wire up CI gates first → Builds still fail, nothing to gate
❌ Add domain configs first → Deployments fail before they even reach the domain mapping step
✅ Fix build errors → then CI validation → then deployment automation → then configuration gaps
Think of it like renovating a house: you can't install the roof if the foundation is cracked. You can't paint the walls if the plumbing leaks.
The Remediation Strategy:
Day 1-2: Fix blocking issues (foundation)
Day 3-4: Wire up automation (plumbing)
Day 5: Clean up medium issues (finishing touches)
This bottom-up approach ensures each layer is stable before building on top of it.
How to Actually Fix This
Issue #1: Docker Target Mismatch
Quick diagnosis:
grep "AS " apps/api/Dockerfile # See what stage names actually exist
grep "target:" .github/workflows/*.yml # See what CI requests
The fix:
# Option A: Fix the composite action (recommended)
# .github/actions/build-push/action.yml
- name: Build and push
  uses: docker/build-push-action@v5
  with:
    target: runner  # ✅ Match Dockerfile stage name

# Option B: Fix the Dockerfile
FROM node:20-alpine AS app  # ✅ Match action target
Why it works: Docker multi-stage builds use FROM ... AS <name> to label stages. The --target flag tells Docker which stage to stop at. Mismatched names = build failure.
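You can reproduce both the failure and the fix locally before touching CI:
# List the stages the Dockerfile actually defines
grep -n "AS " apps/api/Dockerfile
# Fails with the same error CI sees: stage "app" doesn't exist
docker build --target app -f apps/api/Dockerfile -t api:test .
# Succeeds once the target matches a declared stage
docker build --target runner -f apps/api/Dockerfile -t api:test .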
Issue #2: Staging Terraform Undefined Module
Quick diagnosis:
cd envs/staging
grep -n "module\." outputs.tf # Find all module references
grep -n 'module "' main.tf # Find all module declarations
# Names must match exactly
The fix:
# outputs.tf (BEFORE)
output "api_url" {
  value = module.cloud_run_api.service_url  # ❌
}

# outputs.tf (AFTER)
output "api_url" {
  value = module.api_service.service_url  # ✅ Match actual module name
}
Validation:
terraform init
terraform validate # Must pass
terraform plan # Should show changes, not errors
Why it works: Terraform's output system requires module references to exist in the configuration. This is caught during the validation phase, which checks internal consistency without cloud access.
Issue #3: Production Variable Mismatch
Quick diagnosis:
# Check what the module expects
cat modules/cloud-run/variables.tf
# Check what production sends
grep -A 10 'module "api"' envs/prod/main.tf
The fix:
# envs/prod/main.tf (BEFORE)
module "api" {
  source         = "../../modules/cloud-run"
  service_name   = "api-prod"  # ❌ Module doesn't have this variable
  container_port = 8080        # ❌
}

# envs/prod/main.tf (AFTER)
module "api" {
  source = "../../modules/cloud-run"
  name   = "api-prod"  # ✅ Match the module's variables.tf
  port   = 8080        # ✅
}
Why it works: Terraform modules define a contract through variables.tf. The calling code must provide values that match these declared variables. Interface mismatches halt plan generation.
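A rough heuristic for spotting contract mismatches across the module boundary (assumes GNU grep, as on ubuntu-latest runners; meta-arguments like source will show up as noise):
# Variables the module declares vs. arguments the caller passes
grep -oP 'variable "\K[^"]+' modules/cloud-run/variables.tf | sort > /tmp/declared.txt
grep -oP '^\s*\K[a-z_]+(?=\s*=)' envs/prod/main.tf | sort -u > /tmp/passed.txt
# Arguments with no matching variable declaration: candidates for renaming
comm -13 /tmp/declared.txt /tmp/passed.txt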
Issue #4: Wrong Directory in CI
Quick diagnosis:
# Check if workflow sets working directory
grep -A 5 "defaults:" .github/workflows/terraform-ci.yml
The fix:
# .github/workflows/terraform-ci.yml (BEFORE)
jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - run: terraform init  # ❌ Runs in repo root

# .github/workflows/terraform-ci.yml (AFTER)
jobs:
  validate:
    runs-on: ubuntu-latest
    defaults:
      run:
        working-directory: ./envs/staging  # ✅ Set context
    steps:
      - run: terraform init  # Now runs in correct directory
Why it works: Terraform is context-dependent. Without explicit directory specification, commands run in $GITHUB_WORKSPACE (repo root), where no .tf files exist for the specific environment.
Issue #5-6: Missing Deployment Automation
Create: .github/workflows/deploy-staging.yml
name: Deploy to Staging

on:
  push:
    branches:
      - development
    paths:
      - 'apps/**'
      - 'packages/**'

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: pnpm/action-setup@v4  # pnpm must be on PATH before setup-node can cache it
      - name: Setup Node
        uses: actions/setup-node@v4
        with:
          node-version: '20'
          cache: 'pnpm'
      - run: pnpm install
      - run: pnpm test
      - run: pnpm lint

  build:
    needs: test  # ✅ Only runs if tests pass
    runs-on: ubuntu-latest
    permissions:
      contents: read
      id-token: write  # Required for Workload Identity Federation
    steps:
      - uses: actions/checkout@v4
      - name: Auth to GCP
        uses: google-github-actions/auth@v2
        with:
          workload_identity_provider: ${{ secrets.WIF_PROVIDER }}
          service_account: ${{ secrets.GCP_SA_EMAIL }}
      - name: Build API
        uses: ./.github/actions/build-push
        with:
          dockerfile: apps/api/Dockerfile
          image: us-central1-docker.pkg.dev/${{ secrets.GCP_PROJECT }}/images/api
          tag: staging-${{ github.sha }}
          build-target: runner  # ✅ Fix for issue #1

  deploy:
    needs: build  # ✅ Only runs if build succeeds
    runs-on: ubuntu-latest
    permissions:
      contents: read
      id-token: write
    defaults:
      run:
        working-directory: ./envs/staging
    steps:
      - uses: actions/checkout@v4
      - name: Auth to GCP
        uses: google-github-actions/auth@v2
        with:
          workload_identity_provider: ${{ secrets.WIF_PROVIDER }}
          service_account: ${{ secrets.GCP_SA_EMAIL }}
      - name: Setup Terraform
        uses: hashicorp/setup-terraform@v3
      - run: terraform init
      - run: terraform plan -var="image_tag=staging-${{ github.sha }}" -out=tfplan
      - run: terraform apply -auto-approve tfplan
Why it works: The needs: keyword creates job dependencies. GitHub Actions won't run build until test succeeds, won't run deploy until build succeeds. This is the "gating" that was missing.
Issue #7: CI Doesn't Gate Deployments
Already solved in Issue #5-6. The key is the needs: chain:
test → build → deploy
Each step must complete successfully before the next begins.
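Once the chain is wired, you can confirm the gate holds from the command line (assuming the GitHub CLI is installed):
# Recent runs of the staging workflow
gh run list --workflow deploy-staging.yml --limit 5
# The job graph should show build/deploy as "skipped" when tests fail
gh run view RUN_ID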
Issue #8: URL Configuration Gaps
Create centralized config:
# envs/staging/terraform.tfvars
project_id = "myproject-staging"
region     = "us-central1"
domains = {
  api = "api-staging.myapp.com"
  web = "staging.myapp.com"
}
Use in module:
# modules/cloud-run/main.tf
resource "google_cloud_run_service" "service" {
  name     = var.name
  location = var.region

  template {
    spec {
      containers {
        image = var.image
        env {
          name  = "API_URL"
          value = "https://${var.api_domain}"
        }
        env {
          name  = "WEB_URL"
          value = "https://${var.web_domain}"
        }
      }
    }
  }
}

resource "google_cloud_run_domain_mapping" "domain" {
  location = var.region
  name     = var.custom_domain

  metadata {
    namespace = var.project_id  # Required: domain mappings are namespaced by project ID
  }

  spec {
    route_name = google_cloud_run_service.service.name
  }
}
Update GitHub Secrets:
gh secret set STAGING_API_URL --body "https://api-staging.myapp.com"
gh secret set STAGING_WEB_URL --body "https://staging.myapp.com"
Why it works: Cloud Run requires domain validation and DNS configuration. Without these URLs in Terraform, the platform can't set up SSL certificates or route external traffic correctly.
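To see where a stuck mapping actually is (a sketch using the hypothetical staging domain above; domain mappings currently live under the beta gcloud surface):
# Check certificate provisioning and the DNS records GCP expects
gcloud beta run domain-mappings describe \
  --domain=api-staging.myapp.com \
  --region=us-central1 \
  --format="yaml(status.conditions, status.resourceRecords)"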
Issues #12-15: Security Tooling Integration Conflicts
Quick diagnosis:
# Check if Trivy actually fails the job on findings
grep -A 10 "trivy" .github/workflows/*.yml
# Look for: exit-code: '1' and severity threshold
# Check for duplicate scanning
grep -r "scan\|trivy\|falco\|vulnerability" .github/workflows/*.yml
# Check Falco rules for Cloud Run compatibility
cat falco-rules/custom-rules.yaml | grep -i "container\|syscall"
# Check if Container Analysis is enabled
gcloud services list --enabled | grep containeranalysis
The fix — Option A: GCP Native (simpler):
Consolidate on GCP's built-in security tooling and remove
redundant third-party tools:
# .github/workflows/deploy-staging.yml
jobs:
  security-scan:
    needs: build
    runs-on: ubuntu-latest
    steps:
      - name: Scan image with Trivy
        uses: aquasecurity/trivy-action@master  # Consider pinning to a release tag
        with:
          image-ref: ${{ env.IMAGE_TAG }}
          format: 'sarif'
          exit-code: '1'  # ✅ Actually fails the job
          severity: 'CRITICAL,HIGH'
          output: 'trivy-results.sarif'
      - name: Upload results to GitHub code scanning
        uses: github/codeql-action/upload-sarif@v3
        with:
          sarif_file: 'trivy-results.sarif'

  deploy:
    needs: security-scan  # ✅ Deploy only if scan passes
    runs-on: ubuntu-latest
    steps: [...]
# modules/cloud-run/main.tf
# Use GCP Binary Authorization instead of Falco for deploy-time enforcement
resource "google_binary_authorization_policy" "policy" {
  project = var.project_id

  default_admission_rule {
    evaluation_mode  = "REQUIRE_ATTESTATION"
    enforcement_mode = "ENFORCED_BLOCK_AND_AUDIT_LOG"
    require_attestations_by = [
      google_binary_authorization_attestor.trivy_passed.name
    ]
  }
}
The fix — Option B: Trivy + Falco (more control):
Keep both tools but define clear ownership boundaries:
# Trivy owns: pre-deploy image scanning (CI gate)
# Falco owns: runtime anomaly detection (post-deploy monitoring)
# Security Command Center owns: compliance reporting (audit trail)
# Container Analysis: disabled (redundant with Trivy)
# .github/workflows/deploy-staging.yml
jobs:
  scan:
    needs: build
    runs-on: ubuntu-latest
    steps:
      - name: Trivy scan
        uses: aquasecurity/trivy-action@master
        with:
          image-ref: ${{ env.IMAGE_TAG }}
          exit-code: '1'  # ✅ Hard gate
          severity: 'CRITICAL'
          ignore-unfixed: true  # Reduce noise

  deploy:
    needs: scan  # ✅ Trivy must pass
    runs-on: ubuntu-latest
    steps: [...]
# falco-rules/cloud-run-rules.yaml
# Tune Falco to ignore Cloud Run startup behavior
# Macros must be defined before the rules that reference them
- macro: cloud_run_allowed_processes
  condition: >
    proc.name in (node, python, java, nginx, sh, bash)
    and not proc.cmdline contains "curl metadata"  # Block SSRF attempts

- rule: Unexpected syscall in container
  desc: Detect anomalous syscalls at runtime
  condition: >
    spawned_process and container
    and not cloud_run_allowed_processes
    and not container.image.repository contains "gcr.io/cloudrun"
  output: "Unexpected process %proc.name in %container.name"
  priority: WARNING
Fix for Security Command Center duplicate findings:
# Disable Container Analysis if using Trivy (avoid duplicates)
gcloud services disable containeranalysis.googleapis.com
# OR: Configure SCC to deduplicate findings
gcloud scc settings update \
--organization=YOUR_ORG_ID \
--enable-asset-discovery
Why it works: Each security tool has a defined role with clear
enforcement boundaries. Trivy gates at build time. Falco monitors
at runtime. Security Command Center handles compliance reporting.
No overlaps, no gaps, no false sense of security.
The architectural principle:
Security tools should be additive in coverage, not redundant in scope.
Common Gotchas During Remediation
🚩 "I fixed the Dockerfile but CI still fails"
→ Check if the composite action caches the old target name. Clear workflow cache or update the action's default input.
🚩 "Terraform validate passes but plan fails"
→ You're probably in the wrong directory. Check pwd in your CI logs and verify working-directory is set.
🚩 "Images build but Cloud Run deployment fails"
→ Service account permissions (Layer 2). Run:
gcloud projects get-iam-policy YOUR_PROJECT \
--flatten="bindings[].members" \
--filter="bindings.members:serviceAccount:*@cloudbuild.gserviceaccount.com"
🚩 "Firebase deployment fails with region conflict"
→ Check org policy:
gcloud resource-manager org-policies describe \
constraints/gcp.resourceLocations \
--project=YOUR_PROJECT
🚩 "Variables are undefined in running container"
→ Don't put them in the Dockerfile. Inject via Terraform's env blocks in the Cloud Run service definition.
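To confirm what the live revision actually received (hypothetical service name api-staging):
# Inspect the env vars Terraform injected into the running service
gcloud run services describe api-staging \
  --region=us-central1 \
  --format="yaml(spec.template.spec.containers[0].env)"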
🚩 "Trivy scan passes but vulnerable images still get deployed"
→ Check exit-code configuration. Trivy reports findings by default
but doesn't fail the job unless exit-code: '1' is explicitly set
with a severity threshold.
🚩 "Falco generates hundreds of alerts on Cloud Run startup"
→ Cloud Run has a specific startup sequence that triggers generic
Falco rules. Add Cloud Run-specific macros to your custom rules
to filter legitimate startup behavior.
🚩 "Security Command Center shows the same CVE from 3 different sources"
→ You have overlapping scanners. Decide on a single source of truth
(Trivy OR Container Analysis, not both) and disable the redundant one.
🚩 "Binary Authorization blocks deployment after security scan passes"
→ The attestor isn't linked to your Trivy results. The attestation
step must explicitly create a Binary Authorization attestation after
a successful scan.
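The missing link is usually an explicit attestation step after the scan. A sketch assuming a hypothetical trivy-passed attestor and a KMS signing key:
# Create the attestation that Binary Authorization checks at deploy time
gcloud container binauthz attestations sign-and-create \
  --artifact-url="us-central1-docker.pkg.dev/PROJECT/images/api@sha256:DIGEST" \
  --attestor="trivy-passed" \
  --attestor-project="PROJECT" \
  --keyversion="KMS_KEY_VERSION_RESOURCE_NAME"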
What This Analysis Doesn't Cover
If this were real infrastructure, you would still need to verify details only the live system can provide: actual IAM bindings, Terraform state contents, DNS records, and runtime logs.
That said, for the issues described here, these are the documented root causes according to official Terraform, Docker, GitHub Actions, and GCP documentation.
Think of this as: symptoms → probable diagnosis. The real fix needs hands on the actual system.
Visual: The 3-Layer Problem
┌──────────────────────────────────────────────────────────────┐
│ Layer 3: Architectural Debt (root problem)                   │
│   missing CI gates, no orchestration, hardcoded branches     │
├──────────────────────────────────────────────────────────────┤
│ Layer 2: Platform Changes (hidden causes)                    │
│   IAM defaults, org policies, VPC-SC, name limits            │
├──────────────────────────────────────────────────────────────┤
│ Layer 1: Syntax & Config Errors (visible symptoms)           │
│   Docker targets, module names, variable mismatches          │
└──────────────────────────────────────────────────────────────┘
Fix bottom-up, not top-down.
Conclusion
Infrastructure failures rarely have a single cause. What looks like "broken Terraform" is usually a combination of surface-level syntax errors (Layer 1), platform governance you didn't know about (Layer 2), and architecture that was never designed in the first place (Layer 3).
The fix isn't just correcting syntax — it's understanding how these layers interact and building a system that's resilient to change.
Key takeaways:
- Run terraform validate early and often; it catches consistency errors without cloud access.
- Fix in order: build errors → CI validation → deployment automation → configuration gaps.
- Gate every stage with needs: so tests, builds, and deploys form a real dependency chain.
- Give each security tool one job and one enforcement boundary; disable the redundant ones.
- Centralize environment configuration (URLs, domains) in Terraform instead of scattering it across secrets.
The goal isn't just to fix what's broken today — it's to build a system that won't break the same way tomorrow.