The Deployment Gap
Writing code is not shipping software. Between “it works on my laptop” and “it runs reliably in production for 1,000 users” lies a gap that kills many engineering software projects. DevOps practices and cloud infrastructure are how you bridge that gap.
Containers at Scale: From Docker to Orchestration
You’ve run Docker containers. In production, you have dozens or hundreds of containers to manage: start them, stop them, restart crashed ones, load-balance traffic between them, and scale them up when demand increases.
Kubernetes (K8s) is the industry standard for container orchestration:
# Kubernetes Deployment: run 3 instances of the simulation API
apiVersion: apps/v1
kind: Deployment
metadata:
  name: simulation-api
spec:
  replicas: 3
  selector:
    matchLabels:
      app: simulation-api
  template:
    metadata:
      labels:
        app: simulation-api
    spec:
      containers:
      - name: api
        image: myrepo/simulation-api:v1.2.3
        ports:
        - containerPort: 8080
        resources:
          requests:
            memory: "512Mi"
            cpu: "250m"
          limits:
            memory: "2Gi"
            cpu: "1000m"
        env:
        - name: DATABASE_URL
          valueFrom:
            secretKeyRef:
              name: db-credentials
              key: url
Kubernetes handles automatic restarts of failed containers, rolling deployments (updates without downtime), scaling (adjusting replica counts based on CPU/memory load), service discovery, and load balancing.
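How does Kubernetes know a container has failed? It polls a health endpoint that your service exposes (a liveness/readiness probe) and restarts the container when the probe stops returning success. A minimal sketch of such an endpoint using only the Python standard library; the `/healthz` path and port are conventional choices, not fixed by Kubernetes:

```python
# Minimal health-check endpoint of the kind a Kubernetes liveness probe
# would poll. Path and port are illustrative, not mandated.
from http.server import BaseHTTPRequestHandler, HTTPServer

class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/healthz":
            # A real API would also check DB connectivity, queue depth, etc.
            self.send_response(200)
            self.end_headers()
            self.wfile.write(b"ok")
        else:
            self.send_response(404)
            self.end_headers()

    def log_message(self, *args):
        pass  # silence per-request logging

def serve(port=8080):
    HTTPServer(("", port), HealthHandler).serve_forever()
```

In the Deployment above, you would point a `livenessProbe` at this endpoint so that a hung or crashed API instance is replaced automatically rather than left serving errors.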
Infrastructure as Code (IaC)
Infrastructure as Code means defining your cloud infrastructure in code files (version-controlled, reviewable, repeatable) rather than clicking through cloud consoles.
Tools: Terraform (provider-agnostic), AWS CloudFormation (AWS-native), Pulumi (real programming languages).
# Terraform: provision a simulation compute cluster on AWS
resource "aws_instance" "sim_worker" {
  count         = 4
  ami           = "ami-0abcdef1234567890"
  instance_type = "c6i.8xlarge" # 32 vCPUs, 64 GiB RAM
  tags = {
    Name = "sim-worker-${count.index}"
    Team = "structural-analysis"
  }
}

resource "aws_s3_bucket" "simulation_results" {
  bucket = "my-firm-simulation-results-2026"
}

resource "aws_sqs_queue" "job_queue" {
  name                       = "simulation-jobs"
  message_retention_seconds  = 86400 # 24 hours
  visibility_timeout_seconds = 3600  # 1 hour per job attempt
}
Why IaC matters: Your compute cluster, storage, and network are now version-controlled alongside your solver code. Identical environments for dev, test, and production. Disaster recovery is: terraform apply. Auditing is: git log.
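The `visibility_timeout_seconds` setting on the job queue deserves a closer look: when a worker receives a message, SQS hides it for that many seconds; if the worker crashes without deleting the message, it becomes visible again and another worker picks up the job. A toy in-memory model of these semantics (this is an illustration of the behavior, not the SQS API):

```python
# Toy model of SQS-style visibility timeout: a received message is hidden
# for `visibility_timeout` seconds, then reappears unless deleted.
import time

class VisibilityQueue:
    def __init__(self, visibility_timeout):
        self.visibility_timeout = visibility_timeout
        self._messages = {}  # id -> (body, invisible_until)
        self._next_id = 0

    def send(self, body):
        self._messages[self._next_id] = (body, 0.0)
        self._next_id += 1

    def receive(self):
        now = time.monotonic()
        for mid, (body, invisible_until) in self._messages.items():
            if invisible_until <= now:
                # Hide the message while the worker processes it.
                self._messages[mid] = (body, now + self.visibility_timeout)
                return mid, body
        return None  # nothing currently visible

    def delete(self, mid):
        self._messages.pop(mid, None)

q = VisibilityQueue(visibility_timeout=0.2)
q.send("run simulation 42")
mid, body = q.receive()          # worker picks up the job
assert q.receive() is None       # hidden while the worker holds it
time.sleep(0.25)                 # worker crashed; timeout elapses...
assert q.receive() is not None   # ...and the job is redelivered
```

This is why the Terraform sets the timeout to one hour: it should comfortably exceed the longest expected simulation-job step, or healthy workers will have their jobs stolen mid-run.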
CI/CD Pipelines in Full
A complete CI/CD pipeline for a simulation platform:
Developer pushes to feature branch
│
▼
[CI: Lint + Type Check] (seconds)
│
▼
[CI: Unit Tests] (1–5 min)
│
▼
[CI: Integration Tests] (5–30 min)
│
▼
[CI: Build Docker Image] (2–5 min)
│
▼
[CD: Deploy to Staging] (2 min)
│
▼
[CD: Smoke Tests on Staging] (5 min)
│
▼ (manual approval gate)
[CD: Deploy to Production] (rolling: 5 min, zero downtime)
│
▼
[Monitoring: Check error rates, latency]
│
▼ (if metrics degrade)
[CD: Automated Rollback to previous version]
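The final gate needs a concrete decision rule: compare post-deploy metrics against the pre-deploy baseline and trigger a rollback when they degrade beyond a threshold. A hedged sketch; the specific metrics, ratios, and the 1% absolute error floor are illustrative choices:

```python
# Sketch of an automated-rollback gate: compare post-deploy metrics
# against the pre-deploy baseline. Thresholds are illustrative.
def should_rollback(baseline, current,
                    max_error_ratio=3.0, max_latency_ratio=1.5):
    """baseline/current: dicts with 'error_rate' (fraction of requests)
    and 'p99_latency_ms'. Returns True if the new version looks degraded."""
    # Error rate: flag a 3x relative increase, but never below 1% absolute,
    # so a jump from 0.01% to 0.02% does not trigger a false alarm.
    if current["error_rate"] > max(baseline["error_rate"] * max_error_ratio,
                                   0.01):
        return True
    # Tail latency: flag a 50% regression in p99.
    if current["p99_latency_ms"] > baseline["p99_latency_ms"] * max_latency_ratio:
        return True
    return False

baseline = {"error_rate": 0.001, "p99_latency_ms": 220}
assert not should_rollback(baseline, {"error_rate": 0.002, "p99_latency_ms": 240})
assert should_rollback(baseline, {"error_rate": 0.08, "p99_latency_ms": 230})
```

In practice this check runs repeatedly for a bake period (say, 15 minutes) after each deploy, since some failure modes only surface under sustained traffic.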
Deployment Strategies
Big-bang deployment: Take the old version down, put the new version up. Simple, but it causes downtime; acceptable only during planned maintenance windows.
Rolling deployment: Replace instances one at a time. At any moment, some run the old version, some run the new. No downtime, but old and new must be compatible simultaneously.
Blue/green deployment: Two identical environments (blue = current, green = new). Deploy to green, test it, then switch traffic. Rollback is instant (switch back to blue). Costs double the infrastructure.
Canary deployment: Route a small percentage of traffic (e.g., 5%) to the new version. Monitor. Gradually increase to 100% if metrics are healthy. Roll back if they degrade. Best for risk mitigation.
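Canary routing is usually made sticky: the same user should consistently hit the same version, otherwise they bounce between old and new behavior on every request. A common trick is to hash the user ID into buckets; a minimal sketch (the bucket count and function name are illustrative):

```python
# Sticky canary routing: hash the user ID into 100 buckets and send the
# first `canary_pct` buckets to the new version. Deterministic, so a
# given user sees a consistent version as the percentage ramps up.
import hashlib

def route(user_id: str, canary_pct: int) -> str:
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return "canary" if bucket < canary_pct else "stable"

# Ramping 5% -> 25% -> 100% only ever moves users stable -> canary,
# never back and forth.
users = [f"user-{i}" for i in range(500)]
assert all(route(u, 100) == "canary" for u in users)
assert all(route(u, 0) == "stable" for u in users)
canary_at_5 = {u for u in users if route(u, 5) == "canary"}
assert canary_at_5 <= {u for u in users if route(u, 25) == "canary"}
```

A nice property of bucketing: increasing the percentage is monotonic, so a user who has already seen the new version never silently reverts to the old one mid-rollout.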
Exercise 15.1: CI/CD Pipeline Design
Exercise: Design a CI/CD pipeline for a Python-based structural analysis API.
Constraints:
- Zero-downtime deployments (engineers use the API constantly during business hours)
- Simulation jobs in progress must not be interrupted by deployments
- The pipeline should catch performance regressions before production
- Rollback must be possible in < 5 minutes
Tasks:
- Draw the pipeline as a flowchart (stages, gates, parallel vs. sequential)
- Define what runs in CI vs. CD
- Choose a deployment strategy and justify it
- Define the monitoring metrics you would check after each deployment
- Describe your rollback procedure
- How do you handle in-progress simulation jobs during a deployment?
Quiz
After deploying a new version, error rates spike from 0.1% to 8%. What is the correct sequence of actions?
- A) Investigate the cause, fix the code, and deploy again
- B) Immediately roll back to the previous version, then investigate
- C) Scale up the number of instances to handle the increased error load
- D) Wait 30 minutes to see if the errors stabilize on their own
Answer
B) Immediately roll back to the previous version, then investigate.
In production, the priority is minimizing user impact. Rollback is fast (minutes) and definitive. Investigation and fix can happen safely on the previous stable version without continuing to impact users. The principle: restore service first, understand cause second, fix cause third.