Why Networking Matters for Engineers
Engineering software increasingly lives in distributed environments — cloud simulations, IoT sensor networks, collaborative analysis platforms, digital twins. Every interaction between these components crosses a network. Understanding how networks work, what can go wrong, and how distributed systems manage failure is essential.
The Networking Stack
Layer 4 — Transport (TCP/UDP)
- TCP (Transmission Control Protocol): Reliable, ordered delivery. The OS retransmits lost packets. Used for everything where data integrity matters (HTTP, database connections, file transfer).
- UDP (User Datagram Protocol): Fast, unreliable, unordered. No retransmission. Used where latency matters more than completeness (video streaming, DNS, some sensor telemetry).
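The contrast is easy to see with raw sockets. Below is a minimal sketch (port choices and payloads are illustrative): the TCP exchange requires a connection and returns the bytes reliably, while the UDP send is fire-and-forget with no acknowledgment at all.

```python
import socket
import threading

# TCP: connection-oriented, reliable, ordered byte stream.
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 0))          # port 0: let the OS pick a free port
server.listen(1)
port = server.getsockname()[1]

def echo_once():
    conn, _ = server.accept()
    conn.sendall(conn.recv(1024))      # echo the bytes back to the client
    conn.close()

threading.Thread(target=echo_once, daemon=True).start()

client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
client.connect(("127.0.0.1", port))    # the three-way handshake happens here
client.sendall(b"reading=42")
echoed = client.recv(1024)             # delivery is complete and in order

# UDP: connectionless datagram -- no handshake, no delivery guarantee.
udp = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
udp.sendto(b"reading=42", ("127.0.0.1", port + 1))  # fire and forget
```

Note that the UDP `sendto` succeeds even though nothing is listening: the sender gets no feedback, which is exactly the trade-off that makes UDP fast.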
Layer 7 — Application (HTTP/HTTPS, WebSocket, gRPC)
- HTTP/REST: Request-response. Client sends request, server responds. Stateless. The foundation of web APIs.
- WebSocket: Bidirectional, persistent connection. Ideal for real-time data (live sensor dashboards).
- gRPC: High-performance binary protocol (Protocol Buffers). Used for internal service-to-service communication.
HTTP and REST APIs
REST (Representational State Transfer) is a set of conventions for HTTP APIs:
GET /api/v1/jobs → List all jobs
GET /api/v1/jobs/42 → Get job 42
POST /api/v1/jobs → Create a new job
PUT /api/v1/jobs/42 → Update job 42 (full replacement)
PATCH /api/v1/jobs/42 → Partial update of job 42
DELETE /api/v1/jobs/42 → Delete job 42
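To make these conventions concrete, here is a sketch of a resource-oriented dispatcher over an in-memory jobs store. The `handle` function and the store are hypothetical illustrations, not a real framework:

```python
import re

# Toy in-memory "jobs" resource illustrating the REST conventions above.
jobs = {42: {"id": 42, "status": "running"}}
next_id = 43

def handle(method, path, body=None):
    """Route (method, path) to a handler; return (status_code, json_body)."""
    global next_id
    if method == "GET" and path == "/api/v1/jobs":
        return 200, list(jobs.values())
    if method == "POST" and path == "/api/v1/jobs":
        job = {"id": next_id, "status": "queued", **(body or {})}
        jobs[next_id] = job
        next_id += 1
        return 201, job                      # 201 Created for successful POST
    m = re.fullmatch(r"/api/v1/jobs/(\d+)", path)
    if m:
        job_id = int(m.group(1))
        if job_id not in jobs:
            return 404, {"error": "not found"}
        if method == "GET":
            return 200, jobs[job_id]
        if method == "DELETE":
            del jobs[job_id]
            return 200, {"deleted": job_id}
    return 404, {"error": "no route"}
```

The key REST idea the sketch captures: the URL names a resource, the HTTP method names the operation, and the status code reports the outcome.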
HTTP status codes you must know:
| Code | Meaning |
|---|---|
| 200 OK | Success |
| 201 Created | Resource created (response to POST) |
| 400 Bad Request | Client sent malformed data |
| 401 Unauthorized | Authentication required |
| 403 Forbidden | Authenticated but not permitted |
| 404 Not Found | Resource doesn’t exist |
| 500 Internal Server Error | Something broke on the server |
| 503 Service Unavailable | Server is down or overloaded |
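One practical use of these classes: deciding whether a failed call is worth retrying. A 4xx means the client must fix its request, while a 5xx may be transient. A small sketch (the `classify` helper is illustrative):

```python
def classify(status: int) -> str:
    """Map an HTTP status code to a coarse outcome class."""
    if 200 <= status < 300:
        return "success"
    if 400 <= status < 500:
        return "client_error"   # fix the request; retrying as-is won't help
    if 500 <= status < 600:
        return "server_error"   # may be transient; retry with backoff
    return "other"

def should_retry(status: int) -> bool:
    return classify(status) == "server_error"
```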
Latency and Its Implications
| Operation | Typical Latency |
|---|---|
| CPU cache access | ~1 ns |
| RAM access | ~100 ns |
| SSD read | ~100 μs |
| Network round-trip (same data center) | ~0.5 ms |
| HDD read | ~10 ms |
| Network round-trip (cross-continent) | ~150 ms |
Key takeaway: A loop that makes 1,000 API calls serially, at 50 ms each, takes 50 seconds. Made concurrently, the total wait is roughly one round trip: ~50 ms. This is why “fetch all sensor readings in a loop” is an anti-pattern. Use batch APIs or concurrent requests.
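A sketch of the concurrent version using `asyncio.gather`, with `asyncio.sleep` standing in for the 50 ms network round trip (the fetch function and its return value are illustrative):

```python
import asyncio
import time

async def fetch_reading(sensor_id):
    # Stand-in for a real API call: simulates a 50 ms network round trip.
    await asyncio.sleep(0.05)
    return {"sensor": sensor_id, "value": 21.5}

async def fetch_all(n):
    # All n requests are in flight at once, so the total wait is
    # roughly one round trip rather than n of them.
    return await asyncio.gather(*(fetch_reading(i) for i in range(n)))

start = time.perf_counter()
readings = asyncio.run(fetch_all(1000))
elapsed = time.perf_counter() - start
```

Run serially, the same 1,000 simulated calls would take 50 seconds; `elapsed` here stays close to a single round trip.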
Distributed Systems Challenges
When components run on different machines, new failure modes appear:
- Partial failure: Service A is up, Service B is down. The system is degraded — not fully up, not fully down.
- Network partition: A is up, B is up, but they can’t communicate. Both think the other is down.
- Clock skew: Clocks on different machines drift. “Event A happened before Event B” is surprisingly hard to determine.
- The CAP Theorem: During a network partition, you must choose: serve potentially stale data (Availability) or refuse to serve (Consistency).
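Because wall clocks can't be trusted across machines, distributed systems often order events with logical clocks instead. A minimal sketch of a Lamport clock, the classic technique (the class here is illustrative):

```python
class LamportClock:
    """Logical clock: orders events causally without trusting wall time."""
    def __init__(self):
        self.time = 0

    def tick(self):
        # Local event: advance the counter.
        self.time += 1
        return self.time

    def send(self):
        # Timestamp to attach to an outgoing message.
        return self.tick()

    def receive(self, msg_time):
        # On receive, jump past the sender's timestamp so the receive
        # event is ordered after the send, regardless of clock skew.
        self.time = max(self.time, msg_time) + 1
        return self.time

a, b = LamportClock(), LamportClock()
t_send = a.send()           # event on machine A
t_recv = b.receive(t_send)  # causally later event on machine B
```

The guarantee is one-directional: if event X caused event Y, then X's timestamp is smaller, even when B's wall clock lags A's by minutes.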
Fault Tolerance Patterns
Retry with Exponential Backoff
```python
import random
import time

class TemporaryError(Exception):
    """Transient failure worth retrying (e.g., a 503 response)."""

def retry_with_backoff(fn, max_retries=5, base_delay=1.0):
    for attempt in range(max_retries):
        try:
            return fn()
        except TemporaryError:
            if attempt == max_retries - 1:
                raise  # out of retries: propagate the error
            # Exponential backoff plus random jitter, so many clients
            # don't all retry at the same instant.
            delay = base_delay * (2 ** attempt) + random.uniform(0, 1)
            time.sleep(delay)
```
Circuit breaker: If a downstream service is failing repeatedly, stop calling it for a period and return a cached or default response.
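A minimal sketch of the idea (the thresholds and the fallback mechanism are illustrative choices):

```python
import time

class CircuitBreaker:
    """Open the circuit after `max_failures` consecutive failures and
    reject calls (returning the fallback) for `reset_after` seconds."""
    def __init__(self, max_failures=3, reset_after=30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None

    def call(self, fn, fallback):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                return fallback()        # circuit open: fail fast
            self.opened_at = None        # cool-down over: try again
            self.failures = 0
        try:
            result = fn()
            self.failures = 0            # success resets the count
            return result
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()  # trip the breaker
            return fallback()
```

While the circuit is open, the downstream service gets no traffic at all, which gives it room to recover instead of being hammered by retries.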
Timeout: Every network call must have a timeout. A call that never times out can hold resources indefinitely.
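The same principle applies to any blocking call: bound the wait, not just the request. A sketch using `concurrent.futures`, with `time.sleep` standing in for a hung network call:

```python
import concurrent.futures
import time

def slow_call():
    time.sleep(0.5)   # stand-in for a network call that hangs
    return "data"

pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
future = pool.submit(slow_call)
try:
    result = future.result(timeout=0.1)   # give up after 100 ms
except concurrent.futures.TimeoutError:
    result = "timed out"                  # caller moves on; resources freed
pool.shutdown(wait=False)
```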
Bulkhead: Isolate resource pools so one failing component doesn’t exhaust resources for others. Named after ship bulkheads that contain flooding.
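A bulkhead can be as simple as a bounded pool per downstream dependency; a sketch using semaphores (pool names and sizes are illustrative):

```python
import threading

# One bounded pool per downstream dependency: a slow "reports" service
# can exhaust at most its own 4 slots, never the "sensors" pool.
pools = {
    "sensors": threading.BoundedSemaphore(8),
    "reports": threading.BoundedSemaphore(4),
}

def call_with_bulkhead(pool_name, fn):
    sem = pools[pool_name]
    if not sem.acquire(timeout=0.1):      # pool exhausted: fail fast
        raise RuntimeError(f"{pool_name} bulkhead full")
    try:
        return fn()
    finally:
        sem.release()                     # always return the slot
```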
Exercise 13.1: API Design for a Simulation Platform
Exercise: Design the REST API for a cloud structural analysis platform. Specify the following endpoints (method, URL, request body, response body, status codes):
- Submit a new analysis job
- Get the status of a specific job
- List all jobs for the current user (with filtering by status and date range)
- Retrieve the results of a completed job
- Cancel a running job
- Upload a structural model file
For each endpoint: define the full URL, query parameters, JSON schema, all possible status codes, and what happens if the service is temporarily unavailable.
Quiz
An engineering dashboard makes 50 separate API calls to fetch sensor readings, one per sensor. Each call takes 80ms. The dashboard takes 4 seconds to load. What is the best fix?
- A) Upgrade the server hardware to reduce per-call latency
- B) Use a batch endpoint or fetch calls concurrently to reduce total wait time to ~80ms
- C) Cache the sensor readings client-side to avoid API calls
- D) Increase the API timeout threshold
Answer
B) Use a batch endpoint or fetch calls concurrently to reduce total wait time to ~80ms.
50 sequential calls × 80ms = 4,000ms = 4 seconds. The solution is either: (1) create a batch API endpoint (GET /api/sensors/batch?ids=1,2,3,...) that returns all 50 readings in one round trip (~80ms total), or (2) make all 50 calls concurrently with asyncio.gather() or Promise.all(), so they all execute in parallel and the total wait is ~80ms.