The Cloud Service Model Spectrum
In Lesson 15, we introduced cloud computing and deployment pipelines. Now we go deeper: understanding the spectrum of cloud service models, how to reason about costs, and how to make informed decisions about where your engineering workloads should run.
Cloud services exist on a spectrum from “you manage everything” to “you manage nothing.” Here’s the landscape:
```
You Manage Everything                                Provider Manages Everything
←—————————————————————————————————————————————————————————————————————————————→

On-Premises      IaaS             PaaS             FaaS/Serverless  SaaS
+-------------+  +-------------+  +-------------+  +-------------+  +-------------+
| Application |  | Application |  | Application |  | Function    |  |             |
| Data        |  | Data        |  | Data        |  | Code        |  |             |
| Runtime     |  | Runtime     |  |             |  |             |  |             |
| Middleware  |  | Middleware  |  |             |  |             |  |    Fully    |
| OS          |  | OS          |  |             |  |             |  |   Managed   |
| Virtualizn  |  |             |  |             |  |             |  |             |
| Servers     |  |             |  |             |  |             |  |             |
| Storage     |  |             |  |             |  |             |  |             |
| Networking  |  |             |  |             |  |             |  |             |
+-------------+  +-------------+  +-------------+  +-------------+  +-------------+
You manage:      You manage:      You manage:      You manage:      You manage:
EVERYTHING       App + OS up      App + Data       Function code    Configuration
```
Each step to the right trades control for convenience. Each step to the left trades convenience for control. There is no universally correct position on this spectrum — the right choice depends on your workload, team, budget, and compliance requirements.
IaaS: Infrastructure as a Service
Examples: AWS EC2, Google Compute Engine, Azure Virtual Machines, DigitalOcean Droplets
IaaS gives you virtual machines (VMs) in the cloud. You get raw compute, storage, and networking. Everything from the operating system up is your responsibility.
| You Manage | Provider Manages |
|---|---|
| Application code | Physical servers |
| Runtime & dependencies | Networking infrastructure |
| Operating system & patches | Virtualization layer |
| Security configuration | Power, cooling, physical security |
| Scaling decisions | Hardware replacement |
Best for:
- Workloads requiring full OS control (custom kernels, GPU drivers, specific library versions)
- Legacy applications that can’t be easily containerized
- Engineering simulations requiring specific hardware configurations (e.g., 64-core machines with 512 GB RAM for FEM solvers)
- Compliance requirements mandating OS-level security controls
Trade-off: Maximum flexibility, but you’re responsible for patching, scaling, and availability. If your EC2 instance’s underlying hardware fails at 3 AM, AWS will migrate it — but your application needs to handle the restart gracefully.
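"Handle the restart gracefully" in practice usually means checkpointing: periodically persist enough state that a relaunched process resumes instead of restarting from zero. Here is a minimal sketch of the pattern; the file layout, step granularity, and the toy "solver" loop are illustrative, not from any particular solver:

```python
import json
import os


def load_checkpoint(path):
    """Resume from the last saved step, or start fresh."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"step": 0, "state": 0.0}


def save_checkpoint(path, step, state):
    """Write to a temp file, then rename, so a crash mid-write
    cannot corrupt the checkpoint."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"step": step, "state": state}, f)
    os.replace(tmp, path)  # atomic rename on POSIX


def run_solver(checkpoint_path, total_steps=100, checkpoint_every=10):
    """Toy iteration loop: each step stands in for one solver iteration."""
    ckpt = load_checkpoint(checkpoint_path)
    state = ckpt["state"]
    for step in range(ckpt["step"], total_steps):
        state += 1.0  # placeholder for real solver work
        if (step + 1) % checkpoint_every == 0:
            save_checkpoint(checkpoint_path, step + 1, state)
    save_checkpoint(checkpoint_path, total_steps, state)
    return state
```

If the instance is restarted mid-run, relaunching `run_solver` picks up at the last checkpoint rather than repeating hours of work. (For durability across instance loss, the checkpoint would go to S3 or EBS rather than local disk.)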
PaaS: Platform as a Service
Examples: Heroku, AWS Elastic Beanstalk, Google App Engine, Railway, Render
PaaS abstracts away the operating system. You deploy your application code, and the platform handles the runtime, scaling, and infrastructure.
| You Manage | Provider Manages |
|---|---|
| Application code | Operating system |
| Data & configuration | Runtime & middleware |
| Deployment configuration | Scaling infrastructure |
| | Load balancing |
| | OS patching & security |
Best for:
- Web applications and APIs where you don’t need OS-level control
- Rapid prototyping and MVPs
- Small teams without dedicated DevOps engineers
- Engineering dashboards, result viewers, and internal tools
Trade-off: Faster deployment and less operational burden, but you’re constrained to the platform’s supported languages, runtimes, and configurations. If your FEM solver requires a specific Fortran compiler with custom flags, PaaS probably won’t work.
FaaS / Serverless: Functions as a Service
Examples: AWS Lambda, Google Cloud Functions, Azure Functions, Cloudflare Workers
Serverless takes abstraction further: you deploy individual functions, and the platform handles everything else. You don’t think about servers at all — you think about events and responses.
Here’s a typical Lambda function for processing uploaded simulation results:
```python
import json

import boto3

s3 = boto3.client("s3")


def handler(event, context):
    """Process a newly uploaded simulation result file."""
    bucket = event["Records"][0]["s3"]["bucket"]["name"]
    key = event["Records"][0]["s3"]["object"]["key"]

    # Download the result file
    response = s3.get_object(Bucket=bucket, Key=key)
    data = json.loads(response["Body"].read().decode("utf-8"))

    # Extract summary statistics
    stress_values = data["stress_values"]
    summary = {
        "max_stress": max(stress_values),
        "min_stress": min(stress_values),
        "mean_stress": sum(stress_values) / len(stress_values),
        "node_count": len(stress_values),
        "source_file": key,
    }

    # Store summary in results bucket
    s3.put_object(
        Bucket="simulation-summaries",
        Key=key.replace(".json", "-summary.json"),
        Body=json.dumps(summary),
    )

    return {"statusCode": 200, "body": json.dumps(summary)}
```
Best for:
- Event-driven processing (file uploads, API requests, queue messages)
- Lightweight data transformations and validations
- Glue logic between services
- Infrequent workloads where you don’t want to pay for idle servers
Limitations you must understand:
| Constraint | Typical Limit (AWS Lambda) | Impact on Engineering Workloads |
|---|---|---|
| Cold start latency | 100ms – 10s (language-dependent) | Unacceptable for real-time control systems |
| Execution time limit | 15 minutes | Cannot run FEM simulations (hours/days) |
| Memory limit | 10 GB | Cannot load large mesh files in memory |
| CPU allocation | Proportional to memory (up to 6 vCPUs) | Not suitable for CPU-intensive solvers |
| Deployment package size | 250 MB (unzipped) | Large scientific libraries may not fit |
| No persistent local state | 512 MB /tmp | Cannot store intermediate results locally |
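A table like this is easy to encode as a pre-flight check, so a workload's fit (or non-fit) for Lambda is decided by code review rather than by a production timeout. A sketch using the limits quoted above (verify current quotas before relying on them; AWS raises these periodically):

```python
# Illustrative limits, matching the table above. Values change over time;
# treat them as configuration, not constants of nature.
LAMBDA_LIMITS = {
    "max_runtime_s": 15 * 60,   # 15-minute execution limit
    "max_memory_gb": 10,        # maximum memory allocation
    "max_package_mb": 250,      # unzipped deployment package
    "max_tmp_mb": 512,          # default ephemeral /tmp storage
}


def fits_in_lambda(runtime_s, memory_gb, package_mb, scratch_mb=0):
    """Return a list of violated constraints; an empty list means it fits."""
    violations = []
    if runtime_s > LAMBDA_LIMITS["max_runtime_s"]:
        violations.append("execution time exceeds 15-minute limit")
    if memory_gb > LAMBDA_LIMITS["max_memory_gb"]:
        violations.append("memory exceeds 10 GB limit")
    if package_mb > LAMBDA_LIMITS["max_package_mb"]:
        violations.append("deployment package exceeds 250 MB unzipped")
    if scratch_mb > LAMBDA_LIMITS["max_tmp_mb"]:
        violations.append("scratch space exceeds 512 MB /tmp")
    return violations
```

For the examples in this lesson: the result-summary handler above passes easily, while a 2-hour, 64 GB FEM solve violates every row of the table.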
SaaS: Software as a Service
Examples: GitHub, Slack, Jira, Salesforce, Google Workspace, ANSYS Cloud
SaaS is the far end of the spectrum: the provider manages everything. You use the software through a web browser or API. You manage only your data and configuration.
For engineering teams, SaaS is increasingly relevant: cloud-based FEM solvers (ANSYS Cloud, SimScale), project management (Jira, Linear), collaboration (Slack, Teams), and version control (GitHub, GitLab). The trade-off is stark: zero operational burden, but you’re completely dependent on the provider’s roadmap, pricing, and availability.
Cost Modeling: Thinking Like a Financial Engineer
Cloud costs are the most frequently underestimated aspect of cloud migration. Engineers who understand cloud cost drivers make better architectural decisions.
Key Cost Drivers
| Cost Category | What Drives It | How to Control It |
|---|---|---|
| Compute | Instance type, hours running, number of instances | Right-sizing, spot instances, auto-scaling, reserved instances |
| Storage | Volume (GB), storage class, IOPS requirements | Lifecycle policies, tiered storage, compression |
| Data Transfer | Data leaving the cloud (egress), cross-region transfer | CDN caching, keeping processing near data, compression |
| Managed Services | Databases, queues, load balancers, API gateways | Evaluate build-vs-buy for each service |
Rough Estimation Example
Scenario: An engineering firm runs 500 FEM simulation jobs per day. Each job requires 16 vCPUs and 64 GB RAM for approximately 2 hours.
Job Requirements:
- 500 jobs/day × 2 hours/job = 1,000 compute-hours/day
- Instance sizing: a c6i.4xlarge (16 vCPUs, 32 GB RAM) has enough cores but too little memory, while an r6i.4xlarge (16 vCPUs, 128 GB RAM) doubles the RAM we need.
- Chosen: two r6i.2xlarge instances (8 vCPUs, 64 GB RAM each) per job, giving 16 vCPUs effective — assuming the solver can run distributed across two nodes.
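This sizing step can be mechanized: for each candidate instance type, take enough instances to cover both the vCPU and the RAM requirement, then pick the cheapest fleet. A sketch with a hypothetical three-entry catalog (specs and approximate us-east-1 on-demand prices as used in this section; real catalogs have dozens of entries):

```python
import math

# Hypothetical mini-catalog; verify specs and prices against current listings.
CATALOG = [
    {"name": "c6i.4xlarge", "vcpus": 16, "ram_gb": 32, "usd_hr": 0.680},
    {"name": "r6i.2xlarge", "vcpus": 8, "ram_gb": 64, "usd_hr": 0.504},
    {"name": "r6i.4xlarge", "vcpus": 16, "ram_gb": 128, "usd_hr": 1.008},
]


def cheapest_fit(vcpus_needed, ram_gb_needed, catalog):
    """Cheapest homogeneous fleet covering both vCPU and RAM requirements.

    Returns (instance_name, instance_count, hourly_cost).
    """
    best = None
    for inst in catalog:
        # Instances needed to satisfy the binding constraint (CPU or RAM).
        count = max(math.ceil(vcpus_needed / inst["vcpus"]),
                    math.ceil(ram_gb_needed / inst["ram_gb"]))
        cost_hr = count * inst["usd_hr"]
        if best is None or cost_hr < best[2]:
            best = (inst["name"], count, cost_hr)
    return best
```

For the 16-vCPU, 64 GB job this selects two r6i.2xlarge at $1.008/hr combined, matching the choice above — with the same caveat that splitting one job across two nodes requires a distributed solver.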
Cost Comparison (approximate, us-east-1, 2026 pricing):
Option 1: On-Demand
$0.504/hr × 2 instances × 2 hrs × 500 jobs = $1,008/day
Monthly: ~$30,240
Option 2: Spot Instances (70% discount typical)
$0.151/hr × 2 instances × 2 hrs × 500 jobs = $302/day
Monthly: ~$9,060
Risk: Spot instances can be interrupted. Need checkpointing.
Option 3: Reserved Instances (1-year, no upfront, ~40% discount)
Demand: 2 instances × 2 hrs × 500 jobs = 2,000 instance-hours/day
Spread evenly over 24 hours, that is ~84 concurrent instances
$0.302/hr × 84 instances × 24 hrs × 30 days = ~$18,265/month
Cheaper than on-demand, but you pay 24/7: if jobs cluster into
business hours, idle reserved capacity erodes the savings.
Option 4: Spot + On-Demand hybrid
80% spot ($0.151/hr) + 20% on-demand ($0.504/hr) fallback
Blended rate: 0.8 × $0.151 + 0.2 × $0.504 ≈ $0.222/hr
Monthly: $0.222/hr × 2 instances × 2 hrs × 500 jobs × 30 days ≈ $13,300
Best reliability/cost balance
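The on-demand, spot, and hybrid figures above are straight multiplication, which is worth encoding so the assumptions (instances per job, hours per job, jobs per day) stay explicit and easy to vary. A minimal sketch using the approximate rates quoted above; Option 3 is omitted because it additionally depends on a concurrency assumption:

```python
def monthly_cost(rate_usd_hr, instances_per_job, hours_per_job,
                 jobs_per_day, days_per_month=30):
    """Monthly compute cost for a fleet of identical per-job instances."""
    return (rate_usd_hr * instances_per_job * hours_per_job
            * jobs_per_day * days_per_month)


ON_DEMAND_RATE = 0.504  # r6i.2xlarge on-demand, us-east-1 (approximate)
SPOT_RATE = 0.151       # ~70% spot discount (spot prices fluctuate)

on_demand = monthly_cost(ON_DEMAND_RATE, 2, 2, 500)  # ~$30,240
spot = monthly_cost(SPOT_RATE, 2, 2, 500)            # ~$9,060

# Hybrid: 80% of capacity on spot, 20% on-demand as interruption fallback.
blended_rate = 0.8 * SPOT_RATE + 0.2 * ON_DEMAND_RATE
hybrid = monthly_cost(blended_rate, 2, 2, 500)       # ~$13,300
```

Changing a single parameter (say, `jobs_per_day` doubling as the firm grows) immediately re-prices every option, which is exactly the sensitivity analysis a cost model should support.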
AI-Augmented Infrastructure Decisions
AI tools can generate infrastructure configurations quickly. But they often lack context about engineering workload characteristics. Here’s a real example of an engineer correcting AI-generated infrastructure:
Prompt to AI: “Design an AWS architecture for running FEM simulation jobs submitted via a web API.”
AI-generated suggestion: “Use AWS Lambda triggered by API Gateway. Store input files in S3. Lambda processes the simulation and stores results back in S3.”
Engineer’s correction:
```python
# AI suggested Lambda — but FEM jobs run for 2+ hours.
# Lambda has a 15-minute timeout. This architecture will fail.

# Corrected architecture:
# 1. API Gateway + Lambda: Receives job submission (lightweight, < 1 sec)
# 2. Lambda writes job metadata to SQS queue
# 3. ECS Fargate (or EC2 Spot) pulls jobs from SQS
# 4. Fargate runs the actual FEM solver (no time limit)
# 5. Results stored in S3
# 6. SNS notification sent when complete

# Why Fargate with Spot:
# - No time limit (unlike Lambda)
# - No server management (unlike raw EC2)
# - Spot pricing (~70% discount)
# - Auto-scales to zero when no jobs
# - Containers can include FEM solver + dependencies

# Why not Lambda:
# - 15-min timeout is a hard limit (cannot be changed)
# - 10 GB memory limit (large meshes need 64+ GB)
# - 6 vCPU max (FEM solvers need 16-32 cores)
# - Cold start latency wastes solver initialization time
```
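Step 2 of the corrected architecture — the submission Lambda writing job metadata to SQS — stays well under Lambda's limits precisely because it only validates and packages the job. A sketch of the message builder; the field names and priority levels are hypothetical, and the actual enqueue would be a `boto3` `sqs.send_message` call with this string as the message body:

```python
import json
import uuid
from datetime import datetime, timezone

VALID_PRIORITIES = ("low", "normal", "high")  # hypothetical priority levels


def build_job_message(client_id, input_key, vcpus, ram_gb, priority="normal"):
    """Validate a submission and build the SQS message body (a JSON string).

    Rejecting bad jobs here, in the lightweight API Lambda, keeps them
    from ever consuming Fargate capacity.
    """
    if vcpus <= 0 or ram_gb <= 0:
        raise ValueError("vcpus and ram_gb must be positive")
    if priority not in VALID_PRIORITIES:
        raise ValueError(f"unknown priority: {priority!r}")
    return json.dumps({
        "job_id": str(uuid.uuid4()),
        "client_id": client_id,
        "input_key": input_key,       # S3 key of the uploaded input file
        "vcpus": vcpus,
        "ram_gb": ram_gb,
        "priority": priority,
        "submitted_at": datetime.now(timezone.utc).isoformat(),
    })
```

Because the message carries the job's resource requirements, the Fargate worker (or a scheduler in front of it) can size the task per job rather than provisioning every job for the worst case.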
Exercise 16.1: Cloud Architecture Cost Model
Scenario: A structural engineering consultancy serves 100 client firms. Each firm submits an average of 10 simulation jobs per day. Each job requires between 8 and 32 vCPUs (varying by complexity) and runs for 1–4 hours. The system comprises five tasks:
- Job submission API — accepts job definitions via REST, validates input, stores files
- Job queue management — prioritizes jobs, manages scheduling
- Simulation execution — runs the actual FEM solver
- Result processing — extracts summaries, generates reports
- Client dashboard — web application showing job status and results
For each task, decide:
- Which cloud service model (IaaS, PaaS, FaaS, SaaS)?
- Which specific AWS/GCP/Azure service?
- Estimated monthly cost (rough order of magnitude)
Present your answer as a table with columns: Task, Service Model, Specific Service, Justification, and Estimated Monthly Cost.
Stretch goal: Calculate the total monthly cost and compare it against the cost of buying equivalent on-premises hardware (assume a 3-year amortization and 20% annual maintenance cost).
Quiz
Question: Your team runs a machine learning training job for 2 hours every day. The job requires a GPU instance costing $3.06/hour on-demand or $1,500/month for a reserved instance. Which pricing model is more cost-effective?
- a) Reserved instance, because you use it every day.
- b) On-demand, because 2 hours/day is only about 8% utilization.
- c) Spot instance, because ML training can always be interrupted.
- d) Serverless, because the job only runs for 2 hours.
Answer
b) On-demand, because 2 hours/day is only about 8% utilization.
Let’s do the math: On-demand cost = $3.06/hr × 2 hrs/day × 30 days = $183.60/month. Reserved cost = $1,500/month. On-demand is 8x cheaper for this usage pattern. Reserved instances only make sense when utilization is high enough that the discount overcomes the commitment to 24/7 payment. At 2 hours/day (~8% utilization), you’re paying for 22 hours of idle time with a reserved instance. Option (c) is tempting but incorrect because ML training jobs often cannot tolerate interruption mid-epoch without losing progress. Option (d) is wrong because GPU workloads typically exceed serverless constraints.
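The reasoning in this answer generalizes to a one-line break-even check: a reserved instance wins only when daily usage exceeds the reserved monthly price divided by the on-demand rate times days per month. A sketch:

```python
def break_even_hours_per_day(on_demand_rate_usd_hr, reserved_monthly_usd,
                             days_per_month=30):
    """Daily usage (hours) above which a reserved instance beats on-demand."""
    return reserved_monthly_usd / (on_demand_rate_usd_hr * days_per_month)


# Quiz numbers: $1,500 / ($3.06/hr x 30 days) is roughly 16.3 hours/day.
# At 2 hours/day the team is far below break-even, so on-demand wins.
```

The same check applied to the FEM fleet earlier in this lesson explains why reserved capacity only pays off for steady, near-continuous demand.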