DevOps, Cloud & Deployment Fundamentals

The Deployment Gap

Writing code is not shipping software. Between “it works on my laptop” and “it runs reliably in production for 1,000 users” lies a gap that kills many engineering software projects. DevOps practices and cloud infrastructure are how teams bridge that gap.

Containers at Scale: From Docker to Orchestration

You’ve run Docker containers. In production, you have dozens or hundreds of containers to manage: start them, stop them, restart crashed ones, load-balance traffic between them, and scale them up when demand increases.

Kubernetes (K8s) is the industry standard for container orchestration:

# Kubernetes Deployment: run 3 instances of the simulation API
apiVersion: apps/v1
kind: Deployment
metadata:
  name: simulation-api
spec:
  replicas: 3
  selector:
    matchLabels:
      app: simulation-api
  template:
    metadata:
      labels:
        app: simulation-api
    spec:
      containers:
      - name: api
        image: myrepo/simulation-api:v1.2.3
        ports:
        - containerPort: 8080
        resources:
          requests:
            memory: "512Mi"
            cpu: "250m"
          limits:
            memory: "2Gi"
            cpu: "1000m"
        env:
        - name: DATABASE_URL
          valueFrom:
            secretKeyRef:
              name: db-credentials
              key: url

Kubernetes handles: automatic restarts of failed containers, rolling deployments (updates without downtime), autoscaling (adjusting replicas based on CPU/memory load), service discovery, and load balancing.
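The mechanism behind all of these is a reconciliation loop: the orchestrator continuously compares the desired state you declared (replicas: 3) against the actual state of the cluster and acts on the difference. A minimal Python sketch of that logic (the function and data shapes here are illustrative, not the real Kubernetes API):

```python
# Minimal sketch of an orchestrator's reconciliation loop (illustrative only,
# not the real Kubernetes API): compare desired vs. actual state, then act.

def reconcile(desired_replicas, containers):
    """Return the actions needed to converge actual state to desired state."""
    running = [c for c in containers if c["status"] == "running"]
    actions = []

    # Replace anything that crashed.
    for c in containers:
        if c["status"] == "crashed":
            actions.append(("restart", c["id"]))

    # Scale up or down to match the declared replica count.
    diff = desired_replicas - len(running)
    if diff > 0:
        actions.extend(("start", f"new-{i}") for i in range(diff))
    elif diff < 0:
        actions.extend(("stop", c["id"]) for c in running[:(-diff)])

    return actions

containers = [
    {"id": "api-1", "status": "running"},
    {"id": "api-2", "status": "crashed"},
]
# One instance is down and only one is running, so the loop restarts
# api-2 and starts two fresh instances to reach replicas = 3.
print(reconcile(3, containers))
```

Kubernetes runs this comparison continuously, which is why you declare what you want (3 replicas) rather than scripting how to get there.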

Infrastructure as Code (IaC)

Infrastructure as Code means defining your cloud infrastructure in code files (version-controlled, reviewable, repeatable) rather than clicking through cloud consoles.

Tools: Terraform (provider-agnostic), AWS CloudFormation (AWS-native), Pulumi (real programming languages).

# Terraform: provision a simulation compute cluster on AWS
resource "aws_instance" "sim_worker" {
  count         = 4
  ami           = "ami-0abcdef1234567890"
  instance_type = "c6i.8xlarge"    # 32 vCPUs, 64GB RAM

  tags = {
    Name = "sim-worker-${count.index}"
    Team = "structural-analysis"
  }
}

resource "aws_s3_bucket" "simulation_results" {
  bucket = "my-firm-simulation-results-2026"
}

resource "aws_sqs_queue" "job_queue" {
  name                       = "simulation-jobs"
  message_retention_seconds  = 86400
  visibility_timeout_seconds = 3600
}

Why IaC matters: Your compute cluster, storage, and network are now version-controlled alongside your solver code. Identical environments for dev, test, and production. Disaster recovery is: terraform apply. Auditing is: git log.

CI/CD Pipelines in Full

A complete CI/CD pipeline for a simulation platform:

Developer pushes to feature branch
         │
         ▼
[CI: Lint + Type Check]        (seconds)
         │
         ▼
[CI: Unit Tests]               (1–5 min)
         │
         ▼
[CI: Integration Tests]        (5–30 min)
         │
         ▼
[CI: Build Docker Image]       (2–5 min)
         │
         ▼
[CD: Deploy to Staging]        (2 min)
         │
         ▼
[CD: Smoke Tests on Staging]   (5 min)
         │
         ▼ (manual approval gate)
[CD: Deploy to Production]     (rolling: 5 min, zero downtime)
         │
         ▼
[Monitoring: Check error rates, latency]
         │
         ▼ (automated rollback if metrics degrade)
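The final two stages above can be expressed as a simple gate: after a production deploy, compare key metrics against the pre-deploy baseline during a bake period, and roll back automatically if they degrade. A Python sketch of that decision (the metric names and thresholds are illustrative assumptions, not fixed values):

```python
# Sketch of an automated post-deployment gate (illustrative metric names and
# thresholds): compare current metrics against a baseline, decide keep/rollback.

def deployment_verdict(baseline, current,
                       max_error_rate=0.01, max_latency_factor=1.5):
    """Return 'rollback' if metrics degraded beyond thresholds, else 'keep'."""
    if current["error_rate"] > max_error_rate:
        return "rollback"
    if current["p95_latency_ms"] > baseline["p95_latency_ms"] * max_latency_factor:
        return "rollback"
    return "keep"

baseline = {"error_rate": 0.001, "p95_latency_ms": 120}

healthy  = {"error_rate": 0.002, "p95_latency_ms": 130}
degraded = {"error_rate": 0.08,  "p95_latency_ms": 400}

print(deployment_verdict(baseline, healthy))   # keep
print(deployment_verdict(baseline, degraded))  # rollback
```

The important design choice is that the gate compares against a baseline rather than absolute numbers: a service whose normal p95 latency is 120 ms should not be judged by the same fixed threshold as one whose normal p95 is 2 seconds.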

Deployment Strategies

Big-bang deployment: Take the old version down, put the new version up. Simple. Causes downtime. Only acceptable for maintenance windows.

Rolling deployment: Replace instances one at a time. At any moment, some run the old version, some run the new. No downtime, but old and new must be compatible simultaneously.

Blue/green deployment: Two identical environments (blue = current, green = new). Deploy to green, test it, then switch traffic. Rollback is instant (switch back to blue). Costs double the infrastructure.

Canary deployment: Route a small percentage of traffic (e.g., 5%) to the new version. Monitor. Gradually increase to 100% if metrics are healthy. Roll back if they degrade. Best for risk mitigation.
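The canary ramp above can be sketched as a loop that increases the traffic split in steps, checking health at each step and aborting on the first failure (the step percentages and the health-check callback are illustrative assumptions):

```python
# Sketch of a canary rollout (illustrative): ramp traffic to the new version
# in steps, aborting and rolling back the moment a health check fails.

def run_canary(is_healthy, steps=(5, 25, 50, 100)):
    """Ramp traffic through `steps` percentages; return the final outcome."""
    for percent in steps:
        # In a real system this step would update the load balancer's
        # traffic split, then wait for a bake period before checking.
        if not is_healthy(percent):
            return {"status": "rolled_back", "failed_at_percent": percent}
    return {"status": "promoted", "percent": 100}

# Healthy new version: ramps all the way to 100%.
print(run_canary(lambda pct: True))

# Version that degrades under load: caught at the 50% step, rolled back
# while 95% of users were still on the old version moments earlier.
print(run_canary(lambda pct: pct < 50))
```

This is why canary is the best risk-mitigation strategy: a bad release is detected while it affects only the small slice of traffic routed to it, and rollback means simply setting the split back to 0%.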

Exercise 15.1: CI/CD Pipeline Design

Design a CI/CD pipeline for a Python-based structural analysis API.

Constraints:

  • Zero-downtime deployments (engineers use the API constantly during business hours)
  • Simulation jobs in progress must not be interrupted by deployments
  • The pipeline should catch performance regressions before production
  • Rollback must be possible in < 5 minutes

Tasks:

  1. Draw the pipeline as a flowchart (stages, gates, parallel vs. sequential)
  2. Define what runs in CI vs. CD
  3. Choose a deployment strategy and justify it
  4. Define the monitoring metrics you would check after each deployment
  5. Describe your rollback procedure
  6. How do you handle in-progress simulation jobs during a deployment?

Quiz

After deploying a new version, error rates spike from 0.1% to 8%. What is the correct sequence of actions?

  • A) Investigate the cause, fix the code, and deploy again
  • B) Immediately roll back to the previous version, then investigate
  • C) Scale up the number of instances to handle the increased error load
  • D) Wait 30 minutes to see if the errors stabilize on their own
Answer

B) Immediately roll back to the previous version, then investigate.

In production, the priority is minimizing user impact. Rollback is fast (minutes) and definitive. Investigation and fix can happen safely on the previous stable version without continuing to impact users. The principle: restore service first, understand cause second, fix cause third.