Networking & Distributed Systems Concepts

Why Networking Matters for Engineers

Engineering software increasingly lives in distributed environments — cloud simulations, IoT sensor networks, collaborative analysis platforms, digital twins. Every interaction between these components crosses a network. Understanding how networks work, what can go wrong, and how distributed systems manage failure is essential.

The Networking Stack

Layer 4 — Transport (TCP/UDP)

  • TCP (Transmission Control Protocol): Reliable, ordered delivery. The OS retransmits lost packets. Used for everything where data integrity matters (HTTP, database connections, file transfer).
  • UDP (User Datagram Protocol): Fast, unreliable, unordered. No retransmission. Used where latency matters more than completeness (video streaming, DNS, some sensor telemetry).
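The difference is visible at the socket level. The sketch below (standard-library socket, loopback only, invented payload) sends one UDP datagram: no handshake, no acknowledgment, no retransmission. A TCP version would instead use SOCK_STREAM with connect()/accept().

```python
import socket

# UDP: connectionless "fire and forget" datagrams.
receiver = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
receiver.bind(("127.0.0.1", 0))          # port 0: let the OS pick a free port
port = receiver.getsockname()[1]

sender = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sender.sendto(b"temp=21.4", ("127.0.0.1", port))  # one datagram, no connection

# On loopback this reliably arrives; over a real network it might not,
# and nothing in UDP would tell the sender.
data, addr = receiver.recvfrom(1024)
print(data)                              # b'temp=21.4'
sender.close()
receiver.close()
```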

Layer 7 — Application (HTTP/HTTPS, WebSocket, gRPC)

  • HTTP/REST: Request-response. Client sends request, server responds. Stateless. The foundation of web APIs.
  • WebSocket: Bidirectional, persistent connection. Ideal for real-time data (live sensor dashboards).
  • gRPC: High-performance binary protocol (Protocol Buffers). Used for internal service-to-service communication.

HTTP and REST APIs

REST (Representational State Transfer) is a set of conventions for HTTP APIs:

GET    /api/v1/jobs              → List all jobs
GET    /api/v1/jobs/42           → Get job 42
POST   /api/v1/jobs              → Create a new job
PUT    /api/v1/jobs/42           → Update job 42 (full replacement)
PATCH  /api/v1/jobs/42           → Partial update of job 42
DELETE /api/v1/jobs/42           → Delete job 42

HTTP status codes you must know:

Code                       Meaning
200 OK                     Success
201 Created                Resource created (response to POST)
400 Bad Request            Client sent malformed data
401 Unauthorized           Authentication required
403 Forbidden              Authenticated but not permitted
404 Not Found              Resource doesn't exist
500 Internal Server Error  Something broke on the server
503 Service Unavailable    Server is down or overloaded
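To make the conventions concrete, here is a hypothetical in-memory sketch of how a server might map methods and paths to status codes. The jobs store and handle() function are invented for illustration; a real service would use a web framework.

```python
# Hypothetical in-memory "jobs" resource to illustrate REST status codes.
jobs = {42: {"id": 42, "status": "running"}}

def handle(method, path):
    """Return (status_code, body) for a request, REST-style."""
    parts = path.strip("/").split("/")        # e.g. ["api", "v1", "jobs", "42"]
    if parts[:3] != ["api", "v1", "jobs"]:
        return 404, None                      # unknown resource
    if method == "GET" and len(parts) == 3:
        return 200, list(jobs.values())       # list all jobs
    if method == "POST" and len(parts) == 3:
        new_id = max(jobs, default=0) + 1
        jobs[new_id] = {"id": new_id, "status": "queued"}
        return 201, jobs[new_id]              # 201 Created
    if len(parts) == 4 and parts[3].isdigit():
        job_id = int(parts[3])
        if job_id not in jobs:
            return 404, None                  # resource doesn't exist
        if method == "GET":
            return 200, jobs[job_id]
        if method == "DELETE":
            del jobs[job_id]
            return 200, None
    return 400, None                          # malformed request

print(handle("GET", "/api/v1/jobs/42"))   # (200, {'id': 42, 'status': 'running'})
print(handle("POST", "/api/v1/jobs"))     # (201, {'id': 43, 'status': 'queued'})
print(handle("GET", "/api/v1/jobs/99"))   # (404, None)
```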

Latency and Its Implications

Operation                              Typical Latency
CPU cache access                       ~1 ns
RAM access                             ~100 ns
SSD read                               ~100 μs
Network round-trip (same data center)  ~0.5 ms
HDD read                               ~10 ms
Network round-trip (cross-continent)   ~150 ms

Key takeaway: a loop that makes 1,000 API calls serially (each 50 ms) takes 50 seconds. Made fully concurrently, the wall-clock time is roughly that of the slowest single call — about 50 ms. This is why "fetch all sensor readings in a loop" is an anti-pattern. Use batch APIs or concurrent requests.
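The arithmetic can be demonstrated with asyncio. The 50 ms "network call" here is simulated with asyncio.sleep and an invented payload; real code would await an HTTP client instead.

```python
import asyncio
import time

async def fetch_reading(sensor_id):
    # Stand-in for a real network call with ~50 ms round-trip latency.
    await asyncio.sleep(0.05)
    return {"sensor": sensor_id, "value": 21.4}

async def main():
    start = time.perf_counter()
    # Launch all requests at once; total wait ≈ one round trip, not the sum.
    readings = await asyncio.gather(*(fetch_reading(i) for i in range(100)))
    elapsed = time.perf_counter() - start
    print(f"{len(readings)} readings in {elapsed * 1000:.0f} ms")
    return readings, elapsed

readings, elapsed = asyncio.run(main())
```

Sequentially, 100 × 50 ms would be 5 seconds; concurrently, the whole batch completes in a small fraction of a second.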

Distributed Systems Challenges

When components run on different machines, new failure modes appear:

  • Partial failure: Service A is up, Service B is down. The system is degraded — not fully up, not fully down.
  • Network partition: A is up, B is up, but they can’t communicate. Both think the other is down.
  • Clock skew: Clocks on different machines drift. “Event A happened before Event B” is surprisingly hard to determine.
  • The CAP Theorem: During a network partition, you must choose: serve potentially stale data (Availability) or refuse to serve (Consistency).

Fault Tolerance Patterns

Retry with Exponential Backoff

import time
import random

class TemporaryError(Exception):
    """A transient failure worth retrying (e.g. timeout, 503 response)."""

def retry_with_backoff(fn, max_retries=5, base_delay=1.0):
    for attempt in range(max_retries):
        try:
            return fn()
        except TemporaryError:
            if attempt == max_retries - 1:
                raise  # out of retries: propagate the failure
            # Exponential backoff (1s, 2s, 4s, ...) plus random jitter,
            # so many failing clients don't all retry in lockstep.
            delay = base_delay * (2 ** attempt) + random.uniform(0, 1)
            time.sleep(delay)

Circuit breaker: If a downstream service is failing repeatedly, stop calling it for a period and return a cached or default response.
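A minimal circuit-breaker sketch follows; the thresholds, class name, and fallback value are arbitrary choices for illustration, not a standard API.

```python
import time

class CircuitBreaker:
    """Open the circuit after max_failures consecutive errors and
    short-circuit all calls for reset_after seconds."""

    def __init__(self, max_failures=3, reset_after=30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn, fallback=None):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                return fallback            # open: don't even attempt the call
            self.opened_at = None          # cool-down over: try again
            self.failures = 0
        try:
            result = fn()
            self.failures = 0              # success resets the counter
            return result
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()  # trip the breaker
            return fallback

def flaky():
    raise ConnectionError("downstream service unreachable")

breaker = CircuitBreaker(max_failures=3, reset_after=30.0)
for _ in range(5):
    print(breaker.call(flaky, fallback="cached value"))  # "cached value" x5
```

After the third failure the breaker opens, so the last two calls return the fallback without touching the failing service at all.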

Timeout: Every network call must have a timeout. A call that never times out can hold resources indefinitely.
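A sketch of why this matters, using a raw socket: without the settimeout() call, the recvfrom() below would block forever waiting for a datagram that never arrives, holding its thread and socket indefinitely.

```python
import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.bind(("127.0.0.1", 0))
sock.settimeout(0.1)               # without this, recvfrom() blocks forever

try:
    sock.recvfrom(1024)            # nobody ever sends to this socket
    timed_out = False
except socket.timeout:             # raised after ~100 ms
    timed_out = True
finally:
    sock.close()

print("timed out:", timed_out)     # timed out: True
```

HTTP clients expose the same knob (e.g. a timeout argument per request); the principle is identical at every layer.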

Bulkhead: Isolate resource pools so one failing component doesn’t exhaust resources for others. Named after ship bulkheads that contain flooding.
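A bulkhead can be sketched with a semaphore per downstream dependency, so a slow or failing service can only tie up its own small pool of workers. The class name and pool sizes here are invented for illustration.

```python
import threading

class Bulkhead:
    """Cap how many concurrent calls one dependency may consume."""

    def __init__(self, max_concurrent):
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def call(self, fn, fallback=None):
        # Non-blocking acquire: if this dependency's pool is exhausted,
        # fail fast instead of queueing up ever more waiting threads.
        if not self._slots.acquire(blocking=False):
            return fallback
        try:
            return fn()
        finally:
            self._slots.release()

# Separate pools: a hung reporting service can't starve the sensor API.
sensor_pool = Bulkhead(max_concurrent=10)
report_pool = Bulkhead(max_concurrent=2)

print(sensor_pool.call(lambda: "reading ok"))  # reading ok
```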

Exercise 13.1: API Design for a Simulation Platform

Exercise: Design the REST API for a cloud structural analysis platform, specifying the following endpoints (method, URL, request body, response body, status codes):

  1. Submit a new analysis job
  2. Get the status of a specific job
  3. List all jobs for the current user (with filtering by status and date range)
  4. Retrieve the results of a completed job
  5. Cancel a running job
  6. Upload a structural model file

For each endpoint: define the full URL, query parameters, JSON schema, all possible status codes, and what happens if the service is temporarily unavailable.

Quiz

An engineering dashboard makes 50 separate API calls to fetch sensor readings, one per sensor. Each call takes 80ms. The dashboard takes 4 seconds to load. What is the best fix?

  • A) Upgrade the server hardware to reduce per-call latency
  • B) Use a batch endpoint or fetch calls concurrently to reduce total wait time to ~80ms
  • C) Cache the sensor readings client-side to avoid API calls
  • D) Increase the API timeout threshold
Answer

B) Use a batch endpoint or fetch calls concurrently to reduce total wait time to ~80ms.

50 sequential calls × 80ms = 4,000ms = 4 seconds. The solution is either: (1) create a batch API endpoint (GET /api/sensors/batch?ids=1,2,3,...) that returns all 50 readings in one round trip (~80ms total), or (2) make all 50 calls concurrently with asyncio.gather() or Promise.all(), so they all execute in parallel and the total wait is ~80ms.