A lightweight simulator for GPU cluster capacity planning for AI workloads.
Models utilization, queueing delay, packing / fragmentation, and a simplified cost view.
This project is intentionally infrastructure-focused.
It demonstrates how AI compute systems behave under load, rather than how to train or optimize ML models.
Given a cluster configuration and workload arrival patterns, the simulator estimates:

- GPU utilization over time
- queueing delay and per-job wait time
- packing efficiency and fragmentation
- a simplified cost view

This mirrors the real tradeoffs faced in AI infrastructure, platform, and MLOps environments.
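The packing and fragmentation dynamics described above can be sketched with a toy first-fit placer. This is an illustrative sketch, not the simulator's actual API; the `Node` and `first_fit` names are invented for this example:

```python
from dataclasses import dataclass

@dataclass
class Node:
    """A node with a fixed number of GPUs; free counts shrink as jobs pack in."""
    name: str
    free_gpus: int

def first_fit(nodes, gpus_needed):
    """Place a job on the first node with enough free GPUs.

    Returns the chosen node's name, or None if the job must queue.
    """
    for node in nodes:
        if node.free_gpus >= gpus_needed:
            node.free_gpus -= gpus_needed
            return node.name
    return None

# Fragmentation in action: 4 GPUs are free in total, yet a 3-GPU job
# cannot be placed because no single node can host it.
cluster = [Node("n0", 2), Node("n1", 2)]
print(first_fit(cluster, 3))  # None: capacity is fragmented across nodes
print(first_fit(cluster, 2))  # "n0": a 2-GPU job still fits
```

Fragmentation is why a cluster can report spare capacity while jobs keep queueing: free GPUs exist, but not in placeable shapes.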
## Execution flow

```bash
python -m venv .venv
source .venv/bin/activate
python -m pip install -U pip
pip install -e ".[dev]"

gpu-sim run --config configs/base.yaml \
  --out results/base_tick.csv \
  --out-jobs results/base_jobs.csv
```
## Outputs

- `results/*_tick.csv` → utilization, queue depth, running/completed jobs
- `results/*_jobs.csv` → per-job wait time, assignment, SLA violations

## Scenarios

| Config | Purpose |
|---|---|
| `configs/base.yaml` | Balanced training + inference baseline |
| `configs/mixed_workloads.yaml` | Contention and fragmentation under mixed loads |
| `configs/inference_spike.yaml` | SLA pressure during inference bursts |
| `configs/nvl_vs_pcie.yaml` | Sensitivity to reduced interconnect efficiency |
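A scenario config might take roughly the following shape. This is a hypothetical sketch only; every field name here is an assumption, so consult `configs/base.yaml` for the real schema:

```yaml
# Hypothetical shape, not the actual schema.
cluster:
  nodes: 8
  gpus_per_node: 8
workloads:
  - kind: training
    arrival_rate: 0.5   # jobs per tick
    gpus: 4
  - kind: inference
    arrival_rate: 4.0
    gpus: 1
    sla_ticks: 2        # max acceptable wait before an SLA violation
```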
Run all scenarios:

```bash
gpu-sim run --config configs/base.yaml --out results/base_tick.csv --out-jobs results/base_jobs.csv
gpu-sim run --config configs/mixed_workloads.yaml --out results/mixed_tick.csv --out-jobs results/mixed_jobs.csv
gpu-sim run --config configs/inference_spike.yaml --out results/spike_tick.csv --out-jobs results/spike_jobs.csv
gpu-sim run --config configs/nvl_vs_pcie.yaml --out results/pcie_tick.csv --out-jobs results/pcie_jobs.csv
```
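The tick CSVs can be summarized with the standard library alone. The column names below (`tick`, `utilization`, `queue_depth`) are assumptions based on the outputs described above; check the header of an actual `results/*_tick.csv` before reusing this:

```python
import csv
import io

# A few rows in the assumed tick-CSV shape (inline here so the sketch
# is self-contained; in practice, open results/base_tick.csv instead).
sample = """tick,utilization,queue_depth
0,0.50,0
1,0.75,2
2,1.00,5
"""

rows = list(csv.DictReader(io.StringIO(sample)))
mean_util = sum(float(r["utilization"]) for r in rows) / len(rows)
peak_queue = max(int(r["queue_depth"]) for r in rows)
print(f"mean utilization: {mean_util:.2f}, peak queue depth: {peak_queue}")
# → mean utilization: 0.75, peak queue depth: 5
```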
Run an interactive dashboard to compare scenarios and visualize utilization and queueing behavior:

```bash
pip install -r requirements.txt
streamlit run app.py
```
The dashboard supports side-by-side scenario comparison of utilization and queueing behavior.
Run the test suite and linter:

```bash
pytest -q
ruff check .
```
CI runs automatically via GitHub Actions.
AI infrastructure performance is rarely a pure “GPU count” problem. This simulator highlights why real-world outcomes depend on:

- scheduling and packing efficiency (fragmentation can strand capacity)
- queueing behavior under bursty arrivals
- SLA pressure from latency-sensitive inference
- interconnect efficiency (e.g., NVLink vs. PCIe)

These considerations are central to modern AI platforms, cloud infrastructure, and MLOps systems.
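Why queueing dominates once utilization gets high can be shown with a classic textbook result. The simulator's queue model is not necessarily M/M/1; this is an illustrative worked example of the general effect:

```python
# For an M/M/1 queue, mean time in system W = 1 / (μ - λ), where λ is the
# arrival rate and μ the service rate. W blows up as utilization ρ = λ/μ
# approaches 1, which is why a "fully utilized" cluster can still deliver
# terrible wait times.
def mm1_time_in_system(arrival_rate, service_rate):
    """Mean time in system for an M/M/1 queue (stable only when λ < μ)."""
    assert arrival_rate < service_rate, "queue is unstable at utilization >= 1"
    return 1.0 / (service_rate - arrival_rate)

service = 1.0  # one job per tick
for rho in (0.5, 0.9, 0.99):
    w = mm1_time_in_system(rho * service, service)
    print(f"utilization {rho:.2f}: mean time in system = {w:.1f} ticks")
# → utilization 0.50: mean time in system = 2.0 ticks
# → utilization 0.90: mean time in system = 10.0 ticks
# → utilization 0.99: mean time in system = 100.0 ticks
```

Pushing utilization from 90% to 99% buys 10% more throughput but 10× the latency, which is exactly the tradeoff the inference-spike scenario is designed to expose.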