Infrastructure Archaeology: Diagnosing Multi-Layer CI/CD Failures

#terraform #gcp #githubactions #docker

The Pattern

Modern cloud infrastructure often evolves through incremental additions.
A team starts with basic CI/CD, adds Terraform for IaC, integrates
security scanning, sets up monitoring—each piece works in isolation,
but the system as a whole becomes fragile.

Here's a failure pattern I've observed across multiple production
GCP environments: what appears to be "a few broken configs" is actually
a multi-layer architectural problem spanning Docker, Terraform, GitHub
Actions, and cloud-native security tooling.

Let's dissect it.

DISCLAIMER: All code examples, project names, domains, and configurations in this article are sanitized examples for educational purposes. No real client data or proprietary information is exposed. This analysis is based on publicly available documentation and common infrastructure patterns.


The Symptom List

In this pattern, teams typically surface a cluster of related failures:

Build & Container Issues:

  1. Docker multi-stage build misconfigurations — CI/CD pipelines reference non-existent stage names in Dockerfiles
  2. Duplicate or conflicting CMD instructions — containers exhibit unpredictable startup behavior
  3. Image scanning pipeline breaks — security tools block pushes but jobs still succeed

Infrastructure-as-Code Failures:

  1. Terraform module reference errors — output files reference modules that don't exist in the configuration
  2. Variable interface mismatches — calling code passes variables that modules don't accept
  3. Wrong execution context — CI runs IaC commands in incorrect directories
  4. Provider version drift — different environments use incompatible provider versions

CI/CD Architecture Gaps:

  1. Missing deployment automation — builds succeed but nothing triggers actual deployments
  2. No quality gates — tests and builds run in parallel; failures don't block progression
  3. Hardcoded deployment paths — only specific branches trigger deploys; others require manual intervention
  4. Configuration drift — production URLs and domains missing from automation config

Security Tooling Integration Conflicts:

  1. Overlapping vulnerability detection — Trivy, GCP Container Analysis, and Security Command Center all scan the same images
  2. Runtime security false positives — Falco rules trigger on legitimate Cloud Run startup syscalls
  3. Fragmented security reporting — findings appear in multiple systems with no single source of truth
  4. Policy enforcement gaps — security scans run but don't actually block deployments

Tech stack representative of this pattern: GitHub Actions, GCP Cloud Run, Artifact Registry, Terraform, Firebase Hosting, containerized microservices with pnpm/npm monorepo structure.

Seems like a lot of small fixes, right? The reality is more complex.


What I Actually Found: The 3-Layer Problem

These aren't isolated bugs. They're symptoms of failures at three distinct levels.

Layer 1: The Obvious (Syntax & Configuration Errors)

These are the errors you see immediately when you run the tools:

Docker Target Mismatch:

# Dockerfile declares:
FROM node:20-alpine AS runner

# GitHub Action requests:
with:
  target: app # ❌ Stage "app" doesn't exist

Terraform Module Reference:

# outputs.tf tries to reference:
output "api_url" {
  value = module.cloud_run_api.service_url # ❌ Module doesn't exist
}

# main.tf actually has:
module "api_service" { # Different name!
  source = "../../modules/cloud-run"
}

Variable Name Mismatch:

# envs/prod/main.tf sends:
module "api" {
  service_name = "api-prod" # ❌ Module doesn't accept this
}

# modules/cloud-run/variables.tf expects:
variable "name" { # Different variable!
  type = string
}

These are language and consistency errors. Terraform requires that any resource or module referenced in output files be explicitly declared in the active configuration. When you refactor and change module names in main.tf but forget to update outputs.tf, you get this.

The fix? Run terraform validate — it catches these immediately without even connecting to the cloud.
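A quick local check looks like this (assuming the per-environment directory layout used later in this article); -backend=false skips remote state so the validation stays fully offline:

cd envs/staging
terraform init -backend=false   # no remote state needed for a pure consistency check
terraform validate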

Layer 2: Platform Changes (Hidden Causes)

This is where it gets interesting. Some failures aren't in the code — they're in how GCP's platform has evolved.

GCP Service Account Permission Changes:

GCP recently changed how Cloud Build uses service accounts. What used to work automatically now fails because the build service account no longer has default permissions to write logs or read from Artifact Registry.

The missing piece: iam.serviceaccounts.actAs permission, required for one identity to assume the role of a runtime service account.
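A minimal sketch of the grant, with placeholder service-account names (roles/iam.serviceAccountUser bundles iam.serviceaccounts.actAs):

# Placeholder names: deploy-sa is the CI identity, runtime-sa is the Cloud Run service account
gcloud iam service-accounts add-iam-policy-binding \
  runtime-sa@myproject-staging.iam.gserviceaccount.com \
  --member="serviceAccount:deploy-sa@myproject-staging.iam.gserviceaccount.com" \
  --role="roles/iam.serviceAccountUser"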

Organization Policy Restrictions:

That "Firebase region conflict" isn't a typo in your Terraform. It's a collision with constraints/gcp.resourceLocations — an organization policy that blocks deployments to certain regions, even if your Terraform syntax is perfect.

VPC Service Controls:

If the project sits inside a VPC Service Controls perimeter, Cloud Run deployments can fail silently with confusing 403/404 errors. The perimeter blocks communication between Google services — like the Cloud Run agent trying to read images from Artifact Registry.
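If you suspect a perimeter is involved, listing and inspecting perimeters narrows it down quickly (YOUR_ORG_ID, POLICY_ID, and PERIMETER_NAME are placeholders):

# Find the access policy, then see which perimeters include your project
gcloud access-context-manager policies list --organization=YOUR_ORG_ID
gcloud access-context-manager perimeters list --policy=POLICY_ID
gcloud access-context-manager perimeters describe PERIMETER_NAME --policy=POLICY_ID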

Security Tooling Conflicts:

When security tools are added incrementally — each solving a specific
problem in isolation — they create overlapping responsibilities and
contradictory enforcement policies.

A typical pattern:

  • Trivy is added to CI to scan container images before push
  • Falco is added to monitor runtime behavior in Cloud Run
  • GCP Container Analysis API scans images automatically on push to Artifact Registry
  • Security Command Center aggregates findings across the project

Each tool works. The integration doesn't.

The failure cascade:

  1. Trivy finds a CVE and is configured to block the push
  2. The GitHub Action reports success anyway (exit code not wired correctly)
  3. Image gets pushed to Artifact Registry
  4. Container Analysis API scans the same image 10 minutes later
  5. Falco triggers alerts on normal Cloud Run startup syscalls (false positive)
  6. Security Command Center reports the same CVE 3 hours later
  7. Three different alerting systems fire
  8. No one knows which finding to trust or act on first

Root cause: No centralized security policy. Each tool was added
without defining ownership, enforcement boundaries, or a single
source of truth for findings.

The hidden cost: Security tools that don't actually gate deployments give a false sense of protection. The pipeline feels secure. It isn't.

GCP Resource Name Limits:

GCP has a 63-character limit for resource names. If your Terraform generates names that exceed this (long prefixes like baseInstanceName), the system truncates them, causing duplicate name conflicts and deployment failures.
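One defensive pattern is to truncate generated names yourself rather than letting the platform do it. A sketch with hypothetical variables (var.prefix and var.environment are illustrative):

# Hypothetical locals block
locals {
  # Keep generated names inside GCP's 63-character limit
  base_instance_name = substr("${var.prefix}-${var.environment}-api", 0, 63)
}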

These aren't bugs in your code. They're platform governance and technical constraints that interact badly with naive configurations.

Layer 3: Architectural Debt (The Root Problem)

The deepest layer isn't about syntax or permissions — it's about missing architecture.

No CI/CD Gates:

The build and CI workflows are decoupled. Tests can fail, but images still get built and pushed. There's no needs: dependency chain enforcing that tests pass before builds run.

# What's happening:
jobs:
  test:
    runs-on: ubuntu-latest
  build:
    runs-on: ubuntu-latest # ❌ Runs in parallel, doesn't wait for tests

Wrong Directory Context:

GitHub Actions runs terraform plan in the repository root instead of envs/staging/. Terraform is directory-dependent — without the right context, it validates an empty or incomplete configuration.

Hardcoded Feature Branch:

Only one deployment path works: a specific feature branch → staging. There's no development → staging automation, no main → production workflow. Everything else is manual.

Missing Environment Variables:

Production URLs and domains aren't defined anywhere in the automation. Cloud Run services deploy without knowing their actual domain mappings, leaving SSL certificates stuck in provisioning or external access failing with 404/502.

This is lifecycle orchestration failure. Someone built pieces that "worked" in isolation but never architected how they fit together.


Why Fixing Order Matters

You can't just "fix what's broken." Here's why sequence matters:

Fix production Terraform first → Staging still broken, can't test changes

Wire up CI gates first → Builds still fail, nothing to gate

Add domain configs first → Deployments fail before they even reach the domain mapping step

Fix build errors → then CI validation → then deployment automation → then configuration gaps

Think of it like renovating a house: you can't install the roof if the foundation is cracked. You can't paint the walls if the plumbing leaks.

The Remediation Strategy:

Day 1-2: Fix blocking issues (foundation)

Day 3-4: Wire up automation (plumbing)

Day 5: Clean up medium issues (finishing touches)

This bottom-up approach ensures each layer is stable before building on top of it.


How to Actually Fix This

Issue #1: Docker Target Mismatch

Quick diagnosis:

grep "AS " apps/api/Dockerfile # See what stage names actually exist
grep "target:" .github/workflows/*.yml # See what CI requests

The fix:

# Option A: Fix the composite action (recommended)
# .github/actions/build-push/action.yml
- name: Build and push
  uses: docker/build-push-action@v5
  with:
    target: runner # ✅ Match Dockerfile stage name
# Option B: Fix the Dockerfile
FROM node:20-alpine AS app # ✅ Match action target

Why it works: Docker multi-stage builds use FROM ... AS <name> to label stages. The --target flag tells Docker which stage to stop at. Mismatched names = build failure.
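You can reproduce the mismatch locally before touching CI; this assumes the example Dockerfile path used above:

# Builds the "runner" stage declared in the Dockerfile
docker build --target runner -f apps/api/Dockerfile -t api:local .

# Fails: no stage named "app" exists in the Dockerfile
docker build --target app -f apps/api/Dockerfile -t api:local .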


Issue #2: Staging Terraform Undefined Module

Quick diagnosis:

cd envs/staging
grep -n "module\." outputs.tf # Find all module references
grep -n 'module "' main.tf # Find all module declarations
# Names must match exactly

The fix:

# outputs.tf (BEFORE)
output "api_url" {
  value = module.cloud_run_api.service_url # ❌
}

# outputs.tf (AFTER)
output "api_url" {
  value = module.api_service.service_url # ✅ Match actual module name
}

Validation:

terraform init
terraform validate # Must pass
terraform plan # Should show changes, not errors

Why it works: Terraform's output system requires module references to exist in the configuration. This is caught during the validation phase, which checks internal consistency without cloud access.


Issue #3: Production Variable Mismatch

Quick diagnosis:

# Check what the module expects
cat modules/cloud-run/variables.tf

# Check what production sends
grep -A 10 'module "api"' envs/prod/main.tf

The fix:

# envs/prod/main.tf (BEFORE)
module "api" {
  source = "../../modules/cloud-run"
  service_name = "api-prod" # ❌ Module doesn't have this variable
  container_port = 8080 # ❌
}

# envs/prod/main.tf (AFTER)
module "api" {
  source = "../../modules/cloud-run"
  name = "api-prod" # ✅ Match module's variable.tf
  port = 8080 # ✅
}

Why it works: Terraform modules define a contract through variables.tf. The calling code must provide values that match these declared variables. Interface mismatches halt plan generation.
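For reference, a minimal variables.tf that would accept the corrected call above might look like this (a sketch, not the module's full interface):

# modules/cloud-run/variables.tf (minimal sketch)
variable "name" {
  description = "Cloud Run service name"
  type        = string
}

variable "port" {
  description = "Container port the service listens on"
  type        = number
  default     = 8080
}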


Issue #4: Wrong Directory in CI

Quick diagnosis:

# Check if workflow sets working directory
grep -A 5 "defaults:" .github/workflows/terraform-ci.yml

The fix:

# .github/workflows/terraform-ci.yml (BEFORE)
jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - run: terraform init # ❌ Runs in repo root

# .github/workflows/terraform-ci.yml (AFTER)
jobs:
  validate:
    runs-on: ubuntu-latest
    defaults:
      run:
        working-directory: ./envs/staging # ✅ Set context
    steps:
      - run: terraform init # Now runs in correct directory

Why it works: Terraform is context-dependent. Without explicit directory specification, commands run in $GITHUB_WORKSPACE (repo root), where no .tf files exist for the specific environment.


Issue #5-6: Missing Deployment Automation

Create: .github/workflows/deploy-staging.yml

name: Deploy to Staging

on:
  push:
    branches:
      - development
    paths:
      - 'apps/**'
      - 'packages/**'

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Setup pnpm
        uses: pnpm/action-setup@v4 # pnpm must be installed before setup-node can cache its store
      - name: Setup Node
        uses: actions/setup-node@v4
        with:
          node-version: '20'
          cache: 'pnpm'
      - run: pnpm install
      - run: pnpm test
      - run: pnpm lint

  build:
    needs: test # ✅ Only runs if tests pass
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Auth to GCP
        uses: google-github-actions/auth@v2
        with:
          workload_identity_provider: ${{ secrets.WIF_PROVIDER }}
          service_account: ${{ secrets.GCP_SA_EMAIL }}
      - name: Build API
        uses: ./.github/actions/build-push
        with:
          dockerfile: apps/api/Dockerfile
          image: us-central1-docker.pkg.dev/${{ secrets.GCP_PROJECT }}/images/api
          tag: staging-${{ github.sha }}
          build-target: runner # ✅ Fix for issue #1

  deploy:
    needs: build # ✅ Only runs if build succeeds
    runs-on: ubuntu-latest
    defaults:
      run:
        working-directory: ./envs/staging
    steps:
      - uses: actions/checkout@v4
      - name: Auth to GCP
        uses: google-github-actions/auth@v2
        with:
          workload_identity_provider: ${{ secrets.WIF_PROVIDER }}
          service_account: ${{ secrets.GCP_SA_EMAIL }}
      - name: Setup Terraform
        uses: hashicorp/setup-terraform@v3
      - run: terraform init
      - run: terraform plan -var="image_tag=staging-${{ github.sha }}" -out=tfplan
      - run: terraform apply -auto-approve tfplan

Why it works: The needs: keyword creates job dependencies. GitHub Actions won't run build until test succeeds, won't run deploy until build succeeds. This is the "gating" that was missing.


Issue #7: CI Doesn't Gate Deployments

Already solved in Issue #5-6. The key is the needs: chain:

test → build → deploy

Each step must complete successfully before the next begins.


Issue #8: URL Configuration Gaps

Create centralized config:

# envs/staging/terraform.tfvars
project_id = "myproject-staging"
region = "us-central1"

domains = {
  api = "api-staging.myapp.com"
  web = "staging.myapp.com"
}

Use in module:

# modules/cloud-run/main.tf
resource "google_cloud_run_service" "service" {
  name = var.name
  location = var.region

  template {
    spec {
      containers {
        image = var.image

        env {
          name = "API_URL"
          value = "https://${var.api_domain}"
        }
        env {
          name = "WEB_URL"
          value = "https://${var.web_domain}"
        }
      }
    }
  }
}

resource "google_cloud_run_domain_mapping" "domain" {
  location = var.region
  name = var.custom_domain

  spec {
    route_name = google_cloud_run_service.service.name
  }
}

Update GitHub Secrets:

gh secret set STAGING_API_URL --body "https://api-staging.myapp.com"
gh secret set STAGING_WEB_URL --body "https://staging.myapp.com"

Why it works: Cloud Run requires domain validation and DNS configuration. Without these URLs in Terraform, the platform can't set up SSL certificates or route external traffic correctly.
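The module above assumes domain variables are passed in; here's a sketch of wiring the domains map from terraform.tfvars into the module (variable and attribute names are illustrative, not the module's actual interface):

# envs/staging/variables.tf (sketch)
variable "domains" {
  description = "Public domains per service"
  type        = map(string)
}

# envs/staging/main.tf (sketch)
module "api" {
  source        = "../../modules/cloud-run"
  name          = "api-staging"
  region        = var.region
  image         = var.api_image
  api_domain    = var.domains["api"]
  web_domain    = var.domains["web"]
  custom_domain = var.domains["api"]
}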

Issues #12-15: Security Tooling Integration Conflicts

Quick diagnosis:

# Check if Trivy actually fails the job on findings
grep -A 10 "trivy" .github/workflows/*.yml
# Look for: exit-code: '1' and severity threshold

# Check for duplicate scanning
grep -r "scan\|trivy\|falco\|vulnerability" .github/workflows/*.yml

# Check Falco rules for Cloud Run compatibility
cat falco-rules/custom-rules.yaml | grep -i "container\|syscall"

# Check if Container Analysis is enabled
gcloud services list --enabled | grep containeranalysis

The fix — Option A: GCP Native (simpler):

Consolidate on GCP's built-in security tooling and remove
redundant third-party tools:

# .github/workflows/deploy-staging.yml
jobs:
  security-scan:
    needs: build
    runs-on: ubuntu-latest
    steps:
      - name: Scan image with Trivy
        uses: aquasecurity/trivy-action@master
        with:
          image-ref: ${{ env.IMAGE_TAG }}
          format: 'sarif'
          exit-code: '1'        # ✅ Actually fails the job
          severity: 'CRITICAL,HIGH'
          output: 'trivy-results.sarif'

      - name: Upload scan results (SARIF)
        # Note: upload-sarif sends results to GitHub code scanning, not to
        # Security Command Center; forwarding findings to SCC needs a separate step
        uses: github/codeql-action/upload-sarif@v2
        with:
          sarif_file: 'trivy-results.sarif'

  deploy:
    needs: security-scan  # ✅ Deploy only if scan passes
    runs-on: ubuntu-latest
    steps: [...]
# modules/cloud-run/main.tf
# Use GCP Binary Authorization instead of Falco for deploy-time enforcement
resource "google_binary_authorization_policy" "policy" {
  project = var.project_id

  default_admission_rule {
    evaluation_mode  = "REQUIRE_ATTESTATION"
    enforcement_mode = "ENFORCED_BLOCK_AND_AUDIT_LOG"

    require_attestations_by = [
      google_binary_authorization_attestor.trivy_passed.name
    ]
  }
}

The fix — Option B: Trivy + Falco (more control):

Keep both tools but define clear ownership boundaries:

# Trivy owns: pre-deploy image scanning (CI gate)
# Falco owns: runtime anomaly detection (post-deploy monitoring)
# Security Command Center owns: compliance reporting (audit trail)
# Container Analysis: disabled (redundant with Trivy)

# .github/workflows/deploy-staging.yml
jobs:
  scan:
    needs: build
    runs-on: ubuntu-latest
    steps:
      - name: Trivy scan
        uses: aquasecurity/trivy-action@master
        with:
          image-ref: ${{ env.IMAGE_TAG }}
          exit-code: '1'          # ✅ Hard gate
          severity: 'CRITICAL'
          ignore-unfixed: true    # Reduce noise

  deploy:
    needs: scan  # ✅ Trivy must pass
    runs-on: ubuntu-latest
    steps: [...]
# falco-rules/cloud-run-rules.yaml
# Tune Falco to ignore Cloud Run startup behavior
# Define what "expected" looks like first, then reference the macro from the rule
- macro: cloud_run_allowed_process
  condition: >
    proc.name in (node, python, java, nginx, sh, bash)
    and not proc.cmdline contains "curl metadata"  # still flag SSRF-style metadata probes

- rule: Unexpected process in Cloud Run container
  desc: Detect anomalous processes spawned at runtime
  condition: >
    spawned_process and container
    and not cloud_run_allowed_process
    and not container.image.repository contains "gcr.io/cloudrun"
  output: "Unexpected process %proc.name in container %container.name"
  priority: WARNING

Fix for Security Command Center duplicate findings:

# Disable Container Analysis if using Trivy (avoid duplicates)
gcloud services disable containeranalysis.googleapis.com

# OR: Configure SCC to deduplicate findings
gcloud scc settings update \
  --organization=YOUR_ORG_ID \
  --enable-asset-discovery

Why it works: Each security tool has a defined role with clear
enforcement boundaries. Trivy gates at build time. Falco monitors
at runtime. Security Command Center handles compliance reporting.
No overlaps, no gaps, no false sense of security.

The architectural principle:
Security tools should be additive in coverage, not redundant in scope.


Common Gotchas During Remediation

🚩 "I fixed the Dockerfile but CI still fails"

→ Check if the composite action caches the old target name. Clear workflow cache or update the action's default input.
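If you use the GitHub CLI, recent releases ship a cache subcommand that makes clearing Actions caches straightforward (repository context assumed):

# List and clear GitHub Actions caches (requires a recent gh release)
gh cache list
gh cache delete --all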

🚩 "Terraform validate passes but plan fails"

→ You're probably in the wrong directory. Check pwd in your CI logs and verify working-directory is set.

🚩 "Images build but Cloud Run deployment fails"

→ Service account permissions (Layer 2). Run:

gcloud projects get-iam-policy YOUR_PROJECT \
  --flatten="bindings[].members" \
  --filter="bindings.members:serviceAccount:*@cloudbuild.gserviceaccount.com"

🚩 "Firebase deployment fails with region conflict"

→ Check org policy:

gcloud resource-manager org-policies describe \
  constraints/gcp.resourceLocations \
  --project=YOUR_PROJECT

🚩 "Variables are undefined in running container"

→ Don't put them in the Dockerfile. Inject via Terraform's env blocks in the Cloud Run service definition.

🚩 "Trivy scan passes but vulnerable images still get deployed"
→ Check exit-code configuration. Trivy reports findings by default
but doesn't fail the job unless exit-code: '1' is explicitly set
with a severity threshold.

🚩 "Falco generates hundreds of alerts on Cloud Run startup"
→ Cloud Run has a specific startup sequence that triggers generic
Falco rules. Add Cloud Run-specific macros to your custom rules
to filter legitimate startup behavior.

🚩 "Security Command Center shows the same CVE from 3 different sources"
→ You have overlapping scanners. Decide on a single source of truth
(Trivy OR Container Analysis, not both) and disable the redundant one.

🚩 "Binary Authorization blocks deployment after security scan passes"
→ The attestor isn't linked to your Trivy results. The attestation
step must explicitly create a Binary Authorization attestation after
a successful scan.
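A CI step along these lines creates the attestation the policy checks for; the attestor name, key ring, and digest variables below are placeholders:

# Run only after the Trivy job succeeds; names are illustrative
gcloud container binauthz attestations sign-and-create \
  --artifact-url="us-central1-docker.pkg.dev/${PROJECT_ID}/images/api@${IMAGE_DIGEST}" \
  --attestor="trivy-passed" \
  --attestor-project="${PROJECT_ID}" \
  --keyversion="projects/${PROJECT_ID}/locations/us-central1/keyRings/binauthz/cryptoKeys/attestor/cryptoKeyVersions/1"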


What This Analysis Doesn't Cover

If this were real infrastructure, you would also need to check:

  • Terraform state drift (manual changes in GCP)
  • Networking/DNS configuration details
  • Secret management implementation
  • The full history of how the system reached this state

That said, for the issues listed above, these are the documented root causes according to official Terraform, Docker, GitHub Actions, and GCP documentation.

Think of this as: symptoms → probable diagnosis. The real fix needs hands on the actual system.


Visual: The 3-Layer Problem

Fix bottom-up, not top-down.


Conclusion

Infrastructure failures rarely have a single cause. What looks like "broken Terraform" is usually a combination of:

  • Configuration errors (Layer 1)
  • Platform evolution you didn't track (Layer 2)
  • Missing architectural decisions (Layer 3)

The fix isn't just correcting syntax — it's understanding how these layers interact and building a system that's resilient to change.

Key takeaways:

  1. Diagnose in layers. Don't stop at the obvious errors.
  2. Fix in order. Foundation before plumbing before paint.
  3. Build in gates. Make it impossible for broken code to reach production.
  4. Document decisions. Future you (or the next developer) needs context.
  5. Scope honestly. Complex infrastructure work takes time. Price accordingly.

The goal isn't just to fix what's broken today — it's to build a system that won't break the same way tomorrow.