> Samsi, S., Zhao, D., McDonald, J., Li, B., Michaleas, A., Jones, M., ... & Gadepally, V. (2023, September). From words to watts: Benchmarking the energy costs of large language model inference. In 2023 IEEE High Performance Extreme Computing Conference (HPEC) (pp. 1-9). IEEE. – https://arxiv.org/pdf/2310.03003

- inference has been less optimized than training
- but inference calls happen far more frequently
- the paper benchmarks the energy costs of large language model inference

*setup*

- model: Meta LLaMA (decoder-only, transformer-based)
    - 7B, 13B, 65B parameters (65B being the largest)
- batch size = 64
- max tokens = 128
- datasets: Alpaca, GSM8K
    - 4,096 samples
- multi-GPU model sharding, up to 32 GPUs
    - PyTorch FairScale
- temperature $\tau$ = 0.8, top-p = 0.95 (common values, no tuning; see the sampling sketch after this list)
- no quantization
- MIT SuperCloud HPC system:
    - 448 compute nodes
    - Xeon CPUs
    - 2x V100 32GB GPUs (250W) → for 8, 16, 32 shards
    - 4x A100 80GB GPUs (300W) → for smaller shard counts
        - maximum power draw capped at 250W
    - OmniPath, 25Gb Ethernet
- metrics:
    - performance, latency, energy cost (but not correctness/quality)
    - total energy consumption divided by the number of nodes (not fine-grained)
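
As a reference for the sampling settings above, a minimal sketch of temperature plus top-p (nucleus) sampling in PyTorch; the function and variable names are illustrative, not from the paper:

```python
import torch

def sample_next_token(logits, temperature=0.8, top_p=0.95):
    """Temperature + nucleus (top-p) sampling over a 1-D logits tensor."""
    probs = torch.softmax(logits / temperature, dim=-1)  # tau < 1 sharpens, tau > 1 flattens
    sorted_probs, sorted_idx = torch.sort(probs, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    # keep the smallest prefix of tokens whose probability mass reaches top_p
    nucleus = cumulative - sorted_probs < top_p
    kept = sorted_probs * nucleus
    kept = kept / kept.sum()  # renormalize over the nucleus
    return sorted_idx[torch.multinomial(kept, num_samples=1)]
```
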
*A1 - fig 2: tokens/s vs. gpu vs. model size vs. dataset*

- A100 >> V100 (by 1x-2x)
- the GSM8K dataset seems easier
- LLaMA-7B >> LLaMA-65B (by 3x-5x)

*A2 - fig 3: energy/s vs. gpu vs. model size vs. dataset*

- joules per second (watts)
- A100 >> V100
- both datasets use the same energy
- LLaMA-7B >> LLaMA-65B (the larger model draws disproportionately more energy relative to its performance gain)

*B - fig 4, 5: shards vs. batch size vs. model size vs. dataset*

- larger batches don't require more energy per token
- more shards require more energy (roughly proportional to the batch size increase)
- LLaMA-65B ranges from 300W to 1000W

*C - fig 6, 7: shards vs. token size vs. energy vs. dataset*

- LLaMA-65B only
- max generation length doesn't matter
- there is a sweet spot where energy per generated token drops with increasing batch size

*D - fig 8, 9: shards vs. token size vs. energy vs. dataset*

- GSM8K, 512 generation length, 16 shards, max batch size → best energy efficiency

*E - power cap*

- capping from 250W to 175W: 6.7% slower inference
- capping from 175W to 150W: 19.49% slower inference
- a static power cap is not recommended (see the NVML sketch after this list)
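
Per-GPU power caps like the ones benchmarked here are the kind of limit NVML exposes; a minimal sketch using the NVML Python bindings (assumes the `nvidia-ml-py` package and root privileges; the device index and wattage are illustrative):

```python
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # first GPU

# NVML takes limits in milliwatts: cap the GPU at 175 W
pynvml.nvmlDeviceSetPowerManagementLimit(handle, 175_000)

# current draw in milliwatts, useful for logging energy over time
print(pynvml.nvmlDeviceGetPowerUsage(handle))

pynvml.nvmlShutdown()
```
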
*F - gpu utilization in distributed inference*

- 94-98% $\pm$ 23-27% utilization
- higher with longer token lengths

*conclusion*

- power capping can be an effective tool for reducing inference energy

*outlook*

- hyperparameter search
- a single GPU could be shared by multiple models, with minimal degradation
- model quantization, distillation, sparsification
- custom, energy-efficient hardware
# performance engineering

*goal*

- memory and compute efficiency, at training and inference
- improved performance, cost, sustainability, …
- lower resource consumption = power, carbon footprint, CO₂ emissions, water usage, …

*metrics*

- datacenter power usage effectiveness (pue)
    - = total energy use (incl. cooling) / computing energy use
    - measures datacenter efficiency
    - 1 means no overhead, 1.67 on average
- datacenter carbon intensity
    - = tCO₂e/MWh
    - = tons of CO₂-equivalent greenhouse gases per megawatt-hour of energy used
    - cleanliness of the datacenter's energy
- efficiency
    - = flops / joule
    - = floating point ops per second / watt
- training energy consumption (worked example after this list)
    - = $\text{pue} \cdot t \cdot e$
    - where $t$ is the training time and $e$ is the hardware's average power draw (energy per unit time)
    - for federated learning: sum across all rounds and all devices used in each round
    - for network routers: consumption depends on MB up/down
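
A worked example with assumed numbers (not from the lecture): 100 hours of training on hardware drawing 300 W on average, in a datacenter with the average pue of 1.67 and a carbon intensity of 0.4 tCO₂e/MWh:

$$E = \text{pue} \cdot t \cdot e = 1.67 \cdot 100\,\mathrm{h} \cdot 300\,\mathrm{W} \approx 50.1\,\mathrm{kWh} \approx 0.05\,\mathrm{MWh}$$

$$0.05\,\mathrm{MWh} \cdot 0.4\,\mathrm{tCO_2e/MWh} \approx 20\,\mathrm{kg\,CO_2e}$$
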
*overview*

- using renewable energy:
    - moving datacenters closer to water
- reducing power consumption:
    - sparse models (instead of dense models)
    - zero-shot models
- improving efficiency:
    - model optimization
    - infrastructure optimization (accelerator hardware, cloud, edge → more optimized than on-prem)

*inference optimization*

- just as important as training: less compute per call, but many invocations
- quantization (first sketch after this list)
    - lower precision of weights
    - pre-training vs. post-training quantization
- pruning (second sketch)
    - drops unnecessary nodes and edges (structured) vs. individual weights (unstructured)
- distillation (third sketch)
    - student learns logits from the teacher model, instead of hard labels
- cascading
    - chaining models, in increasing complexity
- context-aware model selection
    - choosing a specialized model at runtime
- self-adaptive systems
    - automatically selecting model and infrastructure, based on traffic
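
A minimal sketch of post-training (dynamic) quantization, assuming PyTorch; the model and layer sizes are illustrative:

```python
import torch

# toy float32 model; in practice this would be the trained network
model = torch.nn.Sequential(
    torch.nn.Linear(512, 512),
    torch.nn.ReLU(),
    torch.nn.Linear(512, 10),
)

# post-training dynamic quantization: Linear weights are stored as int8,
# activations are quantized on the fly at inference time
quantized = torch.ao.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)
```
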
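A minimal sketch of unstructured vs. structured pruning with PyTorch's pruning utilities; the layer and sparsity amounts are illustrative:

```python
import torch
import torch.nn.utils.prune as prune

layer = torch.nn.Linear(512, 512)

# unstructured: zero out the 30% of individual weights with the smallest L1 magnitude
prune.l1_unstructured(layer, name="weight", amount=0.3)

# structured: remove 20% of entire output neurons (rows), ranked by L2 norm
prune.ln_structured(layer, name="weight", amount=0.2, n=2, dim=0)

# fold the pruning masks into the weight tensor permanently
prune.remove(layer, "weight")
```
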
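A minimal sketch of the distillation loss, assuming PyTorch; the student matches the teacher's softened logits instead of hard labels (the temperature value is illustrative):

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL divergence between softened teacher and student distributions."""
    soft_targets = F.softmax(teacher_logits / temperature, dim=-1)
    log_student = F.log_softmax(student_logits / temperature, dim=-1)
    # temperature**2 keeps gradient magnitudes comparable across temperatures
    return F.kl_div(log_student, soft_targets, reduction="batchmean") * temperature**2
```
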
*training optimization*

- federated learning (FedAvg sketch after this list)
    - utilizes otherwise idle devices
    - inefficient: high synchronization overhead, multiple rounds, …
- edge computing
- batching
- setting a GPU power limit
    - trades off against training time
    - energy bloat = finishing training faster than necessary, at a higher energy cost
    - intrinsic bloat = GPUs with less work finish too fast
    - extrinsic bloat = e.g. a hardware failure (straggler) limits throughput, so the other GPUs finish too fast
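
A minimal sketch of the aggregation step in federated averaging (FedAvg), assuming clients return PyTorch state dicts of float parameters; the function name and sample-count weighting are illustrative:

```python
import torch

def fedavg(client_states, num_samples):
    """Sample-weighted average of client parameters for one FL round."""
    total = sum(num_samples)
    return {
        key: sum(n * state[key] for n, state in zip(num_samples, client_states)) / total
        for key in client_states[0]
    }
```
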
# ml4climate

- using models to help with climate change
- Sustainable Development Goals (SDGs) by the United Nations
- decision making, forecasting, physics simulations, analytical models → can outperform traditional models in both accuracy and efficiency (e.g. FourCastNet)
- monitoring using edge devices, TinyML → strong privacy guarantees
- geospatial AI

*challenges*

- lack of data, inaccurate data
- insufficient data, not enough historical data
- lack of interpretability
