Testing, Quality Assurance & Reliability

Why Testing Is Engineering

No civil engineer would build a bridge and skip the load test. No structural engineer would submit a design without checking it against code requirements. Yet engineers who write software routinely ship code without tests, then act surprised when it breaks in production.

Testing isn’t a chore bolted onto development. It’s part of engineering. A bridge has safety factors. Software has tests. Both exist because humans make mistakes, requirements change, and systems interact in unexpected ways. The question isn’t “should we test?” — it’s “what is our testing strategy?”

The Testing Pyramid

The testing pyramid is a model for how to distribute your testing effort across different levels of granularity.

          /\
         /  \        E2E Tests (few)
        / E2E\       - Full system, browser/API
       /------\      - Slow, brittle, expensive
      /        \
     / Integra- \    Integration Tests (some)
    /   tion     \   - Multiple components together
   /--------------\  - Database, API calls, file I/O
  /                \
 /    Unit Tests    \ Unit Tests (many)
/____________________\- Single function/class
                      - Fast, isolated, cheap

The engineering analogy:

  • Unit tests are like testing individual steel members — does this beam handle the expected load? Does this column resist the expected compression? Fast, focused, and you run thousands of them.
  • Integration tests are like testing assemblies — does the beam-column connection transfer loads correctly? Do the foundation and superstructure work together? Slower, but catches interface problems.
  • End-to-end (E2E) tests are like the final load test on the completed bridge — does the whole system perform under realistic conditions? Slow, expensive, and you run a few critical ones.

Key takeaway: Most of your tests should be unit tests (fast, cheap, many). Fewer integration tests. Fewest E2E tests. If your pyramid is inverted (lots of E2E, few unit tests), your test suite is slow, brittle, and expensive to maintain.
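One way to keep the pyramid visible in a Python code base is pytest markers, which let you run each layer separately. A minimal sketch (the marker names `integration` and `e2e` are illustrative, and custom markers should be registered in pytest configuration to avoid warnings):

```python
import pytest


def test_stress_ratio_unit():
    # Unit layer: pure computation, no I/O, runs in microseconds.
    assert 150.0 / 250.0 == 0.6


@pytest.mark.integration
def test_job_record_round_trip():
    # Integration layer: would write to and read back from a real database.
    ...


@pytest.mark.e2e
def test_submit_job_through_api():
    # E2E layer: would drive the deployed system through its public API.
    ...
```

Then `pytest -m "not integration and not e2e"` runs only the fast unit layer on every save, while CI runs the slower layers less frequently.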

Unit Testing in Practice

A unit test verifies that a single function or method produces the correct output for a given input. It runs in isolation — no database, no network, no file system.

# stress_check.py
def calculate_stress_ratio(applied_stress, yield_strength):
    """Calculate the demand-to-capacity ratio for a structural member.

    Args:
        applied_stress: Applied stress in MPa (must be non-negative).
        yield_strength: Material yield strength in MPa (must be positive).

    Returns:
        Stress ratio (demand/capacity). Values > 1.0 indicate failure.

    Raises:
        ValueError: If inputs are invalid.
    """
    if applied_stress < 0:
        raise ValueError("Applied stress cannot be negative")
    if yield_strength <= 0:
        raise ValueError("Yield strength must be positive")

    return applied_stress / yield_strength

# test_stress_check.py
import pytest
from stress_check import calculate_stress_ratio


def test_normal_stress_ratio():
    """Member under normal load returns ratio less than 1."""
    ratio = calculate_stress_ratio(150.0, 250.0)
    assert ratio == pytest.approx(0.6)


def test_at_yield():
    """Member at exactly yield strength returns ratio of 1.0."""
    ratio = calculate_stress_ratio(250.0, 250.0)
    assert ratio == pytest.approx(1.0)


def test_over_yield():
    """Member beyond yield returns ratio greater than 1."""
    ratio = calculate_stress_ratio(300.0, 250.0)
    assert ratio == pytest.approx(1.2)


def test_zero_stress():
    """Zero applied stress returns zero ratio."""
    ratio = calculate_stress_ratio(0.0, 250.0)
    assert ratio == pytest.approx(0.0)


def test_negative_stress_raises():
    """Negative applied stress raises ValueError."""
    with pytest.raises(ValueError, match="cannot be negative"):
        calculate_stress_ratio(-10.0, 250.0)


def test_zero_yield_raises():
    """Zero yield strength raises ValueError."""
    with pytest.raises(ValueError, match="must be positive"):
        calculate_stress_ratio(150.0, 0.0)


def test_negative_yield_raises():
    """Negative yield strength raises ValueError."""
    with pytest.raises(ValueError, match="must be positive"):
        calculate_stress_ratio(150.0, -250.0)

Notice the pattern: each test has a descriptive name, tests one specific behavior, and checks both normal and edge cases. The pytest.approx call handles floating-point comparison properly — never compare floats with ==.

Tip: Use pytest.approx() for floating-point comparisons. 0.1 + 0.2 == 0.3 is False in Python (and every other IEEE 754 language). pytest.approx(0.3) handles this correctly with a configurable tolerance.
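The effect is easy to demonstrate with plain Python. The standard library's math.isclose applies the same relative-tolerance idea that pytest.approx uses:

```python
import math

total = 0.1 + 0.2
# Binary floating point cannot represent 0.1, 0.2, or 0.3 exactly,
# so the sum lands a hair away from 0.3.
assert total != 0.3
assert abs(total - 0.3) < 1e-15

# Relative-tolerance comparison is the right tool:
assert math.isclose(total, 0.3)
assert math.isclose(total, 0.3, rel_tol=1e-9)
```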

Test-Driven Development (TDD)

TDD inverts the usual workflow: you write the test before the code. The cycle has three steps:

  1. Red: Write a test for behavior that doesn’t exist yet. Run it. It fails (red).
  2. Green: Write the minimum code to make the test pass. Run it. It passes (green).
  3. Refactor: Clean up the code while keeping all tests green. Remove duplication, improve naming, simplify logic.

# Step 1: RED - Write the test first
def test_safety_factor():
    """Safety factor is yield_strength / applied_stress."""
    sf = calculate_safety_factor(applied_stress=150.0, yield_strength=250.0)
    assert sf == pytest.approx(1.667, rel=1e-3)

# Run: FAILS (function doesn’t exist yet)

# Step 2: GREEN - Write minimum code to pass
def calculate_safety_factor(applied_stress, yield_strength):
    return yield_strength / applied_stress

# Run: PASSES

# Step 3: REFACTOR - Add validation, improve
def calculate_safety_factor(applied_stress, yield_strength):
    if applied_stress <= 0:
        raise ValueError("Applied stress must be positive")
    if yield_strength <= 0:
        raise ValueError("Yield strength must be positive")
    return yield_strength / applied_stress

# Run: Still passes. Add tests for the new validation.
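Closing the loop on step 3: each new validation branch gets its own red-green pass. A sketch of those follow-up tests (the function is repeated from above so the block is self-contained; the test names are illustrative):

```python
import pytest


def calculate_safety_factor(applied_stress, yield_strength):
    # Refactored version from step 3 above.
    if applied_stress <= 0:
        raise ValueError("Applied stress must be positive")
    if yield_strength <= 0:
        raise ValueError("Yield strength must be positive")
    return yield_strength / applied_stress


def test_zero_applied_stress_raises():
    # Without the check, this would be a bare ZeroDivisionError with no context.
    with pytest.raises(ValueError, match="must be positive"):
        calculate_safety_factor(applied_stress=0.0, yield_strength=250.0)


def test_negative_yield_strength_raises():
    with pytest.raises(ValueError, match="must be positive"):
        calculate_safety_factor(applied_stress=150.0, yield_strength=-250.0)
```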

TDD isn’t about testing — it’s about design. Writing the test first forces you to think about the function’s interface before its implementation. What are the inputs? What are the outputs? What should happen with invalid inputs? You make these decisions when your mind is clearest — before you’re knee-deep in implementation details.

Key takeaway: TDD produces code that is testable by construction. If you can’t write a test for a function, the function’s design is probably wrong — it has too many responsibilities, hidden dependencies, or side effects.

Edge Cases: The Safety Factor Mindset

In structural engineering, you design for the worst case: maximum load, minimum material strength, worst-case temperature. In software, the equivalent is edge case testing — testing the boundaries and unusual inputs that the happy path ignores.

Consider a mesh loader that reads node coordinates from a file:

# Edge cases for a mesh file loader
def test_empty_file():
    """Empty file returns empty mesh, not a crash."""
    mesh = load_mesh("empty.msh")
    assert mesh.num_nodes == 0
    assert mesh.num_elements == 0


def test_single_node():
    """File with one node and zero elements is valid."""
    mesh = load_mesh("single_node.msh")
    assert mesh.num_nodes == 1
    assert mesh.num_elements == 0


def test_duplicate_node_ids():
    """Duplicate node IDs raise a clear error."""
    with pytest.raises(MeshError, match="Duplicate node ID"):
        load_mesh("duplicate_nodes.msh")


def test_missing_connectivity():
    """Element references non-existent node ID."""
    with pytest.raises(MeshError, match="Node ID .* not found"):
        load_mesh("bad_connectivity.msh")


def test_very_large_mesh():
    """Mesh with 1 million nodes loads within 10 seconds."""
    import time
    start = time.time()
    mesh = load_mesh("large_mesh.msh")
    elapsed = time.time() - start
    assert mesh.num_nodes == 1_000_000
    assert elapsed < 10.0


def test_nan_coordinates():
    """NaN in coordinates raises a clear error, not silent corruption."""
    with pytest.raises(MeshError, match="Invalid coordinate"):
        load_mesh("nan_coords.msh")


def test_mixed_element_types():
    """File with tris and quads loads both correctly."""
    mesh = load_mesh("mixed_elements.msh")
    assert mesh.num_triangles > 0
    assert mesh.num_quads > 0

Each of these tests represents a scenario that will happen in production. Files get corrupted. Meshes have errors. Users provide unexpected input. The edge case tests are your safety factors.

Tip: When writing tests, ask: “What is the worst input someone could give this function?” Empty input, None, negative numbers, NaN, extremely large values, duplicate values, special characters in strings. Test all of them.
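pytest.mark.parametrize makes this systematic: one test body, a table of hostile inputs. A sketch against the calculate_stress_ratio function from earlier (condensed and repeated here so the block stands alone):

```python
import math
import pytest


def calculate_stress_ratio(applied_stress, yield_strength):
    # Condensed from the unit-testing section above.
    if applied_stress < 0:
        raise ValueError("Applied stress cannot be negative")
    if yield_strength <= 0:
        raise ValueError("Yield strength must be positive")
    return applied_stress / yield_strength


@pytest.mark.parametrize("stress, strength", [
    (-10.0, 250.0),      # negative demand
    (150.0, 0.0),        # zero capacity
    (150.0, -250.0),     # negative capacity
    (-1e30, 250.0),      # absurdly large negative value
])
def test_worst_inputs_raise(stress, strength):
    with pytest.raises(ValueError):
        calculate_stress_ratio(stress, strength)


def test_nan_slips_through_the_checks():
    # NaN compares False against everything, so "stress < 0" does not
    # catch it; the NaN propagates silently into the result. A robust
    # implementation should reject NaN explicitly with math.isnan.
    assert math.isnan(calculate_stress_ratio(float("nan"), 250.0))
```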

Observability: Knowing What Your System Is Doing

Testing catches bugs before deployment. Observability catches problems after deployment. A production system without observability is a bridge without strain gauges — you only find out something is wrong when it collapses.

The three pillars of observability are logs, metrics, and traces.

Logging

Structured, leveled messages that record what the system is doing. Good logs tell you what happened, when, and in what context.

import logging

logger = logging.getLogger(__name__)

def run_solver(job_id, mesh_path, config):
    logger.info("Starting solver job",
                extra={"job_id": job_id, "mesh": mesh_path,
                       "num_elements": config.num_elements})
    try:
        result = solve(mesh_path, config)
        logger.info("Solver completed successfully",
                     extra={"job_id": job_id,
                            "wall_time_s": result.wall_time,
                            "iterations": result.iterations})
        return result
    except ConvergenceError as e:
        logger.error("Solver failed to converge",
                      extra={"job_id": job_id,
                             "iteration": e.iteration,
                             "residual": e.residual})
        raise
    except Exception as e:
        logger.exception("Unexpected solver failure",
                          extra={"job_id": job_id})
        raise

Log levels:

  • DEBUG: Detailed diagnostic information. Disabled in production.
  • INFO: Normal operations. “Job started,” “Job completed,” “File loaded.”
  • WARNING: Something unexpected but recoverable. “Retry attempt 2 of 3.”
  • ERROR: Something failed. “Solver failed to converge.”
  • CRITICAL: System is in a broken state. “Database connection lost.”

Key takeaway: Use structured logging (key-value pairs, not string concatenation). Structured logs can be searched, filtered, and aggregated by tools like ELK (Elasticsearch, Logstash, Kibana) or Grafana Loki. “Show me all ERROR logs for job_id=J-1234” is trivial with structured logs and impossible with unstructured ones.
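The standard library's logging module does not emit structured output by itself. A minimal JSON formatter sketch (field names are illustrative) shows the idea:

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    # Attributes every LogRecord carries; anything beyond these arrived
    # through the extra={...} argument and is a structured field.
    _STANDARD = set(vars(logging.makeLogRecord({})))

    def format(self, record):
        entry = {"level": record.levelname, "message": record.getMessage()}
        entry.update({key: value for key, value in vars(record).items()
                      if key not in self._STANDARD})
        return json.dumps(entry)


handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("solver")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("Solver completed", extra={"job_id": "J-1234", "iterations": 42})
# Emits one searchable line per record, e.g.:
# {"level": "INFO", "message": "Solver completed", "job_id": "J-1234", "iterations": 42}
```

Each line can then be indexed by ELK or Loki and queried by field.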

Metrics

Numeric measurements collected over time. Unlike logs (which record individual events), metrics track aggregates: request rates, error rates, latencies, queue depths, CPU usage.

Key metrics for engineering software:

  • Job throughput: Jobs completed per hour
  • Job failure rate: Percentage of jobs that fail (target: < 1%)
  • Queue depth: Number of jobs waiting (indicates capacity problems)
  • P95 latency: 95th percentile response time (better than average, which hides outliers)
  • Resource utilization: CPU, memory, disk I/O per service

Metrics power dashboards and alerts. “Alert me if the job failure rate exceeds 5% in any 10-minute window” is a metrics-based alert.
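The P95 point deserves a concrete illustration. A nearest-rank percentile sketch over invented latency numbers:

```python
import math


def p95(samples_ms):
    """95th percentile by the nearest-rank method (no interpolation)."""
    ordered = sorted(samples_ms)
    rank = math.ceil(0.95 * len(ordered)) - 1
    return ordered[rank]


# Invented workload: 90 fast requests, 10 stuck behind a backed-up queue.
latencies_ms = [100] * 90 + [5000] * 10

average = sum(latencies_ms) / len(latencies_ms)
print(average)            # 590.0 -> looks almost healthy
print(p95(latencies_ms))  # 5000  -> exposes the slow tail the average hides
```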

Distributed Traces

In a microservices system, a single user request may flow through 5 or 10 services. A distributed trace follows that request end-to-end, showing how long each service took and where the bottleneck is.

Each trace has a unique trace ID that propagates across service boundaries. When the job submission service calls the meshing service, which calls the solver service, all three log entries share the same trace ID.

Engineering context: “Why did job J-1234 take 3 hours instead of the expected 45 minutes?” The trace shows: 2 minutes in the meshing service, 15 minutes waiting in the solver queue (queue was backed up), 2 hours 40 minutes in the solver (mesh was 10x larger than expected), 3 minutes in post-processing. The bottleneck is clear.
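Within a single Python service, trace-ID propagation can be sketched with the standard library's contextvars; carrying the ID across a real service boundary (for example in a traceparent HTTP header, per the W3C Trace Context spec) is assumed rather than shown, and the function names are illustrative:

```python
import contextvars
import uuid

# Holds the trace ID for the request currently being handled.
trace_id = contextvars.ContextVar("trace_id", default="no-trace")


def log(message):
    # Every log line automatically carries the active trace ID.
    return f"trace={trace_id.get()} {message}"


def handle_job_submission():
    # The entry point mints the ID once...
    trace_id.set(uuid.uuid4().hex)
    lines = [log("job accepted")]
    lines.append(call_meshing_service())
    return lines


def call_meshing_service():
    # ...and downstream code reads it without explicit plumbing.
    return log("meshing started")


for line in handle_job_submission():
    print(line)
```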

Tip: Start with logging and metrics. Add distributed tracing when you move to microservices. A monolith usually doesn’t need tracing — a profiler does the same job.

Exercise 10.1: Test Plan for Eigenvalue Analysis

Write a Test Plan

Scenario: You are writing tests for a function compute_natural_frequencies(stiffness_matrix, mass_matrix) that performs eigenvalue analysis to find the natural frequencies of a structural model. The function:

  • Takes a stiffness matrix K and mass matrix M (both n × n, symmetric, positive semi-definite)
  • Solves the generalized eigenvalue problem Kφ = ω²Mφ
  • Returns a list of natural frequencies (Hz) in ascending order and the corresponding mode shapes

Task: Write a test plan covering:

  1. At least 3 unit tests for normal behavior (use known analytical solutions)
  2. At least 3 edge case tests
  3. At least 1 performance test
  4. At least 2 error handling tests

You don’t need to write the implementation — write the tests as if the function already exists. For known solutions, use a simple spring-mass system where the analytical natural frequency is ω = √(k/m).

Discussion Guide

Unit tests (normal behavior):

  • Single DOF spring-mass: K = [[k]], M = [[m]], expected frequency = √(k/m) / (2π) Hz. Use k=100, m=1 → f = 1.59 Hz.
  • Two DOF system: Known analytical solution for two masses connected by springs. Verify both frequencies match.
  • Frequencies returned in ascending order: For a multi-DOF system, verify freq[0] ≤ freq[1] ≤ freq[2] ≤ …
  • Mode shapes are orthogonal: Verify φᵢᵀ M φⱼ = 0 for i ≠ j.
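The single-DOF case has a closed-form answer, which makes it an ideal test oracle. A sketch of the expected value in pure Python (the compute_natural_frequencies call itself is left out, since the exercise assumes it exists):

```python
import math


def expected_frequency_hz(k, m):
    """Analytical natural frequency of a 1-DOF spring-mass system, in Hz."""
    return math.sqrt(k / m) / (2 * math.pi)


# k = 100 N/m, m = 1 kg, as in the first unit test above
f = expected_frequency_hz(100.0, 1.0)
print(round(f, 4))   # 1.5915, the "1.59 Hz" quoted above
```

A unit test would then assert that the first returned frequency matches this value, e.g. freqs[0] == pytest.approx(f, rel=1e-3).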

Edge cases:

  • Rigid body modes: A free-free beam has zero-frequency rigid body modes. The function should return frequencies near zero (not NaN or negative).
  • Repeated eigenvalues: Symmetric structures have repeated natural frequencies. The function should handle this gracefully.
  • Very large matrix (n = 10,000): Should complete without memory error. Performance test: should complete within a time limit.

Performance test:

  • n = 5,000 DOF matrix should complete within 30 seconds (or whatever your threshold is).

Error handling:

  • Non-square matrix: Should raise ValueError.
  • Mismatched dimensions: K is 100×100, M is 50×50. Should raise ValueError.
  • Non-symmetric matrix: Should raise ValueError (or at least a warning).
  • Singular mass matrix: M with zero diagonal entries. Should raise a meaningful error, not a cryptic LAPACK error.
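Those error-handling tests imply a validation prologue that runs before any eigensolver call. A pure-Python sketch over nested lists (function name illustrative; a production version would operate on NumPy arrays):

```python
def validate_eigen_inputs(K, M, tol=1e-9):
    """Fail fast with clear messages instead of a cryptic LAPACK error."""
    n = len(K)
    if any(len(row) != n for row in K):
        raise ValueError("Stiffness matrix must be square")
    if len(M) != n or any(len(row) != n for row in M):
        raise ValueError(f"Mass matrix must be {n}x{n} to match stiffness matrix")
    for i in range(n):
        for j in range(i + 1, n):
            if abs(K[i][j] - K[j][i]) > tol:
                raise ValueError("Stiffness matrix must be symmetric")
    for i in range(n):
        # For a positive semi-definite mass matrix, a zero diagonal entry
        # means a massless DOF and a singular generalized eigenproblem.
        if abs(M[i][i]) <= tol:
            raise ValueError(f"Singular mass matrix: zero mass at DOF {i}")
```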

Quiz

A team deploys an FEA post-processor that works correctly for small meshes (< 10,000 elements) but produces wrong results for large meshes (> 500,000 elements). Their unit tests all pass. What is the most likely root cause, and what type of testing would have caught it?

Answer

Root cause: a test coverage gap at scale. The unit tests only exercise small, synthetic meshes. The bug likely involves integer overflow in an index or counter, memory allocation failure, or numerical precision loss at scale, such as an accumulation loop whose floating-point rounding error is negligible on small inputs but corrupts results on large ones.

What would catch it:

  • Integration tests with realistic mesh sizes — test with actual large meshes from production, not just toy examples.
  • Property-based testing — generate random meshes of varying sizes and verify invariants (e.g., stress results should be independent of element numbering order).
  • Performance/regression tests — compare results against a reference solver for large meshes. If the post-processor diverges from the reference above a certain size, the test fails.

This is a classic example of why the testing pyramid matters: unit tests alone are insufficient. Integration tests with realistic data sizes catch problems that small synthetic unit tests miss.