> Samsi, S., Zhao, D., McDonald, J., Li, B., Michaleas, A., Jones, M., ... & Gadepally, V. (2023, September). From words to watts: Benchmarking the energy costs of large language model inference. In 2023 IEEE High Performance Extreme Computing Conference (HPEC) (pp. 1-9). IEEE. – https://arxiv.org/pdf/2310.03003

- inference has been less optimized than training
- but inference calls happen far more frequently
- the paper benchmarks the energy costs of LLaMA inference on HPC hardware

*setup*

- model: meta llama (decoder only, transformer based)
  - 7b, 13b, 65b (largest model)
  - batch size = 64
  - max tokens = 128
- datasets: alpaca, gsm8k
  - 4,096 samples
- multi-gpu model sharding, up to 32 gpus
  - pytorch fairscale
  - temperature $\tau$ = 0.8, top-p = 0.95 (common values, no tuning)
  - no quantization
- MIT supercloud hpc system:
  - 448 compute nodes
  - xeon cpus
  - 2x v100 32gb gpus (250w) → for 8, 16, 32 shards
  - 4x a100 80gb gpus (300w) → for smaller shard counts
  - maximum power draw capped at 250w
  - omnipath, 25gb ethernet
- metrics:
  - performance, latency, energy cost (but not correctness/quality)
  - total energy consumption divided by the number of nodes (not fine-grained); see the sketch below
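
a minimal sketch of one benchmark run, assuming a hugging face-style `generate()` api and per-gpu NVML power polling rather than the paper's fairscale/llama harness and node-level energy accounting; the model identifier, prompts, and polling thread are illustrative, not from the paper:

```python
import threading
import time

import pynvml
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# assumption: an HF checkpoint stands in for the paper's fairscale-sharded LLaMA
MODEL = "some-org/llama-7b"  # hypothetical identifier
tok = AutoTokenizer.from_pretrained(MODEL)
tok.pad_token = tok.pad_token or tok.eos_token
tok.padding_side = "left"  # decoder-only models generate past the right edge of the prompt
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.float16, device_map="auto")

# poll GPU 0 power draw in the background; the paper instead divides
# node-level energy by the number of nodes
pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
samples, stop = [], threading.Event()

def poll(period_s: float = 0.1) -> None:
    while not stop.is_set():
        samples.append(pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0)  # mW -> W
        time.sleep(period_s)

prompts = ["Explain what a watt measures."] * 64  # batch size 64, as in the setup
inputs = tok(prompts, return_tensors="pt", padding=True).to(model.device)

threading.Thread(target=poll, daemon=True).start()
t0 = time.time()
out = model.generate(**inputs, do_sample=True, temperature=0.8, top_p=0.95,
                     max_new_tokens=128)  # sampling params from the setup
elapsed = time.time() - t0
stop.set()

# approximate count: every sequence is assumed to reach the padded output length
new_tokens = (out.shape[-1] - inputs["input_ids"].shape[-1]) * out.shape[0]
avg_watts = sum(samples) / max(len(samples), 1)
print(f"{new_tokens / elapsed:.1f} tokens/s, {avg_watts:.0f} W avg, "
      f"{avg_watts * elapsed / new_tokens:.2f} J/token")
```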

*A1 - fig 2: tokens/s vs. gpu vs. model size vs. dataset*

- a100 >> v100 (roughly 1x-2x the throughput)
- gsm8k dataset seems easier (higher tokens/s)
- llama7b >> llama65b (3x-5x the throughput)

*A2 - fig 3: energy/s vs. gpu vs. model size vs. dataset*

- joules per second, i.e. watts
- a100 >> v100
- both datasets use about the same energy
- llama65b uses much more energy than llama7b, disproportionate to the difference in performance
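
fig. 2 reports tokens/s and fig. 3 reports J/s (i.e. watts); dividing the two gives energy per generated token, which is what the later figures look at. the numbers below are made up for illustration, not values from the paper:

```python
# J/token = (J/s) / (tokens/s); purely illustrative numbers
joules_per_second = 900.0   # hypothetical power draw of a sharded llama65b run
tokens_per_second = 30.0    # hypothetical generation throughput
print(joules_per_second / tokens_per_second, "J per generated token")  # -> 30.0
```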

*B - fig 4, 5: shards vs. batch size vs. model size vs. dataset*

- larger batch sizes don't increase energy per generated token
- more shards need more energy (roughly proportional to the batch size increase)
- llama65b power draw ranges from 300w to 1000w

*C - fig 6, 7: shards vs. token size vs. energy vs. dataset*

- llama65b only
- max generation length doesn't matter much
- there is a sweet spot where energy per generated token drops with increasing batch size (see the sweep sketch below)
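
a rough sketch of how the batch-size sweet spot could be searched for, assuming a hypothetical `run_inference(batch_size)` helper that runs one pass over the dataset and returns total joules and generated tokens; the paper instead sweeps batch size per shard count and dataset and reads the minimum off figs. 6/7:

```python
def run_inference(batch_size: int) -> tuple[float, int]:
    """Hypothetical stand-in for the benchmark harness: returns (total_joules, tokens_generated)."""
    raise NotImplementedError

best = None
for bs in (1, 4, 16, 32, 64, 128):  # candidate batch sizes
    joules, tokens = run_inference(bs)
    j_per_token = joules / tokens
    print(f"batch={bs:4d}  {j_per_token:.2f} J/token")
    if best is None or j_per_token < best[1]:
        best = (bs, j_per_token)
print("sweet spot:", best)
```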

*D - fig 8, 9: shards vs. token size vs. energy vs. dataset*

- gsm8k, 512 generation length, 16 shards, max batch size → best energy efficiency

*E - power cap*

- capping from 250w to 175w: 6.7% slower inference
- capping from 175w to 150w: 19.49% slower inference
- static power cap not recommended
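
a static cap like the ones above can be applied per gpu through NVML (equivalently `nvidia-smi -pl <watts>`); this is a sketch of the mechanism only, not necessarily the tooling used on supercloud, and it requires root privileges:

```python
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

# query the supported range, then cap the board at 175 W (NVML works in milliwatts)
min_mw, max_mw = pynvml.nvmlDeviceGetPowerManagementLimitConstraints(handle)
target_mw = 175_000
assert min_mw <= target_mw <= max_mw, "requested cap outside the supported range"
pynvml.nvmlDeviceSetPowerManagementLimit(handle, target_mw)  # needs root
print("power limit:", pynvml.nvmlDeviceGetPowerManagementLimit(handle) / 1000, "W")
```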

*F - gpu utilization in distributed inference*

- 94-98% $\pm$ 23-27% utilization
- higher with longer generation lengths
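
per-gpu utilization like this can be sampled with NVML; a minimal sketch (sampling interval and averaging are illustrative, and the paper's collection pipeline may differ):

```python
import statistics
import time

import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

# sample instantaneous GPU utilization (%) once a second for a minute
readings = []
for _ in range(60):
    readings.append(pynvml.nvmlDeviceGetUtilizationRates(handle).gpu)
    time.sleep(1.0)

print(f"mean {statistics.mean(readings):.1f}% ± {statistics.stdev(readings):.1f}%")
```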

*conclusion*

- power capping can be an effective tool for reducing inference energy

*outlook*

- hyperparam search
- a single gpu could be shared by multiple models, with minimal degradation
- model quantization, distillation, sparsification
- custom, energy-efficient hardware