> Samsi, S., Zhao, D., McDonald, J., Li, B., Michaleas, A., Jones, M., ... & Gadepally, V. (2023, September). From words to watts: Benchmarking the energy costs of large language model inference. In 2023 IEEE High Performance Extreme Computing Conference (HPEC) (pp. 1-9). IEEE. – https://arxiv.org/pdf/2310.03003

- inference has been less optimized than training
- but inference calls happen far more frequently
- the paper benchmarks the energy costs of LLaMA inference on HPC hardware

*setup*

- model: meta llama (decoder only, transformer based)
  - 7b, 13b, 65b (largest model)
  - batch size = 64
  - max tokens = 128
- datasets: alpaca, gsm8k
  - 4,096 samples
- multi-gpu model sharding, up to 32 gpus
  - pytorch fairscale
  - temperature $\tau$ = 0.8, top-p = 0.95 (common values, no tuning)
  - no quantization
- MIT supercloud hpc system:
  - 448 compute nodes
  - xeon cpus
  - 2x v100 32gb gpus (250w) → for 8, 16, 32 shards
  - 4x a100 80gb gpus (300w) → for smaller shard counts
  - maximum power draw capped at 250w
  - omnipath, 25gb ethernet
- metrics:
  - performance, latency, energy cost (but not correctness/quality)
  - total energy consumption divided by the number of nodes (not fine-grained); see the sketch below
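
a minimal sketch of one benchmark run, assuming a hugging face-style `generate()` api and per-gpu NVML power polling rather than the paper's fairscale/llama harness and node-level energy accounting; the model identifier, prompts, and polling thread are illustrative, not from the paper:

```python
import threading
import time

import pynvml
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# assumption: an HF checkpoint stands in for the paper's fairscale-sharded LLaMA
MODEL = "some-org/llama-7b"  # hypothetical identifier
tok = AutoTokenizer.from_pretrained(MODEL)
tok.pad_token = tok.pad_token or tok.eos_token
tok.padding_side = "left"  # decoder-only models generate past the right edge of the prompt
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.float16, device_map="auto")

# poll GPU 0 power draw in the background; the paper instead divides
# node-level energy by the number of nodes
pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
samples, stop = [], threading.Event()

def poll(period_s: float = 0.1) -> None:
    while not stop.is_set():
        samples.append(pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0)  # mW -> W
        time.sleep(period_s)

prompts = ["Explain what a watt measures."] * 64  # batch size 64, as in the setup
inputs = tok(prompts, return_tensors="pt", padding=True).to(model.device)

threading.Thread(target=poll, daemon=True).start()
t0 = time.time()
out = model.generate(**inputs, do_sample=True, temperature=0.8, top_p=0.95,
                     max_new_tokens=128)  # sampling params from the setup
elapsed = time.time() - t0
stop.set()

# approximate count: every sequence is assumed to reach the padded output length
new_tokens = (out.shape[-1] - inputs["input_ids"].shape[-1]) * out.shape[0]
avg_watts = sum(samples) / max(len(samples), 1)
print(f"{new_tokens / elapsed:.1f} tokens/s, {avg_watts:.0f} W avg, "
      f"{avg_watts * elapsed / new_tokens:.2f} J/token")
```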

*A1 - fig 2: tokens/s vs. gpu vs. model size vs. dataset*

- a100 >> v100 (roughly 1x-2x the throughput)
- gsm8k dataset seems easier (higher tokens/s)
- llama7b >> llama65b (3x-5x the throughput)

*A2 - fig 3: energy/s vs. gpu vs. model size vs. dataset*

- joules per second, i.e. watts
- a100 >> v100
- both datasets use about the same energy
- llama65b uses much more energy than llama7b, disproportionate to the difference in performance
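
fig. 2 reports tokens/s and fig. 3 reports J/s (i.e. watts); dividing the two gives energy per generated token, which is what the later figures look at. the numbers below are made up for illustration, not values from the paper:

```python
# J/token = (J/s) / (tokens/s); purely illustrative numbers
joules_per_second = 900.0   # hypothetical power draw of a sharded llama65b run
tokens_per_second = 30.0    # hypothetical generation throughput
print(joules_per_second / tokens_per_second, "J per generated token")  # -> 30.0
```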

*B - fig 4, 5: shards vs. batch size vs. model size vs. dataset*

- larger batch sizes don't increase energy per generated token
- more shards need more energy (roughly proportional to the batch size increase)
- llama65b power draw ranges from 300w to 1000w

*C - fig 6, 7: shards vs. token size vs. energy vs. dataset*

- llama65b only
- max generation length doesn't matter much
- there is a sweet spot where energy per generated token drops with increasing batch size (see the sweep sketch below)
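
a rough sketch of how the batch-size sweet spot could be searched for, assuming a hypothetical `run_inference(batch_size)` helper that runs one pass over the dataset and returns total joules and generated tokens; the paper instead sweeps batch size per shard count and dataset and reads the minimum off figs. 6/7:

```python
def run_inference(batch_size: int) -> tuple[float, int]:
    """Hypothetical stand-in for the benchmark harness: returns (total_joules, tokens_generated)."""
    raise NotImplementedError

best = None
for bs in (1, 4, 16, 32, 64, 128):  # candidate batch sizes
    joules, tokens = run_inference(bs)
    j_per_token = joules / tokens
    print(f"batch={bs:4d}  {j_per_token:.2f} J/token")
    if best is None or j_per_token < best[1]:
        best = (bs, j_per_token)
print("sweet spot:", best)
```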

*D - fig 8, 9: shards vs. token size vs. energy vs. dataset*

- gsm8k, 512 generation length, 16 shards, max batch size → best energy efficiency

*E - power cap*

- capping from 250w to 175w: 6.7% slower inference
- capping from 175w to 150w: 19.49% slower inference
- static power cap not recommended
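
a static cap like the ones above can be applied per gpu through NVML (equivalently `nvidia-smi -pl <watts>`); this is a sketch of the mechanism only, not necessarily the tooling used on supercloud, and it requires root privileges:

```python
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

# query the supported range, then cap the board at 175 W (NVML works in milliwatts)
min_mw, max_mw = pynvml.nvmlDeviceGetPowerManagementLimitConstraints(handle)
target_mw = 175_000
assert min_mw <= target_mw <= max_mw, "requested cap outside the supported range"
pynvml.nvmlDeviceSetPowerManagementLimit(handle, target_mw)  # needs root
print("power limit:", pynvml.nvmlDeviceGetPowerManagementLimit(handle) / 1000, "W")
```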

*F - gpu utilization in distributed inference*

- 94-98% $\pm$ 23-27% utilization
- higher with longer generation lengths
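
per-gpu utilization like this can be sampled with NVML; a minimal sketch (sampling interval and averaging are illustrative, and the paper's collection pipeline may differ):

```python
import statistics
import time

import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

# sample instantaneous GPU utilization (%) once a second for a minute
readings = []
for _ in range(60):
    readings.append(pynvml.nvmlDeviceGetUtilizationRates(handle).gpu)
    time.sleep(1.0)

print(f"mean {statistics.mean(readings):.1f}% ± {statistics.stdev(readings):.1f}%")
```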

*conclusion*

- power capping can be an effective tool for reducing inference energy

*outlook*

- hyperparam search
- a single gpu could be shared by multiple models, with minimal degradation
- model quantization, distillation, sparsification
- custom, energy-efficient hardware