In this example, we demonstrate how to lower a pretrained Vision Transformer (ViT) from TIMM and run inference with AITemplate. We tested two variants: the Base model with 224x224 input and patch size 16, and the Large model with 384x384 input and patch size 16.
## Code structure

- `modeling/`
  - `vision_transformer.py`: ViT definition using AIT's frontend API
  - `weight_utils.py`: Utilities to convert TIMM ViT weights to AIT
- `verification.py`: Numerical verification between TIMM and AIT
- `benchmark_pt.py`: Benchmark code for PyTorch
- `benchmark_ait.py`: Benchmark code for AITemplate
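At a high level, `verification.py` and `benchmark_ait.py` build the ViT graph with AIT's frontend API, compile it into a dynamic library, bind the converted TIMM weights, and run inference. The sketch below illustrates that flow using core AITemplate APIs; the `VisionTransformer` import, its constructor arguments, and the tensor names are assumptions made for illustration, so treat the scripts above as the source of truth.

```python
import torch
from aitemplate.compiler import compile_model
from aitemplate.frontend import Tensor
from aitemplate.testing import detect_target

# Assumed class name and constructor arguments; the real definition lives in
# modeling/vision_transformer.py.
from modeling.vision_transformer import VisionTransformer

batch_size = 1

# Build the AIT graph. AIT operators expect channels-last (NHWC) inputs.
ait_model = VisionTransformer(img_size=224, patch_size=16, embed_dim=768,
                              depth=12, num_heads=12)
x = Tensor(shape=[batch_size, 224, 224, 3], dtype="float16",
           name="input0", is_input=True)
y = ait_model(x)
y._attrs["is_output"] = True
y._attrs["name"] = "output0"

# Codegen + compile the graph into a dynamic library under ./tmp.
module = compile_model(y, detect_target(), "./tmp", "vit_base_patch16_224")

# Pretrained TIMM weights converted by weight_utils.py would be bound to the
# compiled module here (see benchmark_ait.py for the exact helper calls).

# Run inference on random data.
x_in = torch.randn(batch_size, 224, 224, 3).cuda().half()
y_out = torch.empty(batch_size, 1000).cuda().half()  # 1000 ImageNet classes
module.run_with_tensors({"input0": x_in}, {"output0": y_out})
```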
## Reference Speed vs PyTorch Eager
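In the tables below, QPS is consistent with `batch_size / latency`; for example, 32 / 15.95 ms ≈ 2006 im/s for the PyTorch baseline of `vit_base_patch16_224` on A100 at batch size 32. A minimal fp16 timing loop in the spirit of `benchmark_pt.py` is sketched below; the warm-up and iteration counts are placeholder choices, not the exact settings used for the tables.

```python
import time
import torch
import timm

def benchmark_pt_eager(model_name="vit_base_patch16_224", batch_size=1,
                       resolution=224, iters=100, warmup=20):
    """Rough PyTorch-eager latency/QPS measurement in fp16 on GPU."""
    model = timm.create_model(model_name, pretrained=False).cuda().half().eval()
    x = torch.randn(batch_size, 3, resolution, resolution,
                    device="cuda", dtype=torch.half)
    with torch.no_grad():
        for _ in range(warmup):          # warm up to stabilize clocks/caches
            model(x)
        torch.cuda.synchronize()
        start = time.time()
        for _ in range(iters):
            model(x)
        torch.cuda.synchronize()
    latency_ms = (time.time() - start) / iters * 1000
    qps = batch_size / latency_ms * 1000  # images per second
    return latency_ms, qps
```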
### A100-40GB / CUDA 11.6.2

PT = PyTorch 1.12 Eager
#### vit_base_patch16_224

| Batch size | PT Latency (ms) | PT QPS (im/s) | AIT Latency (ms) | AIT QPS (im/s) |
|---|---|---|---|---|
| 1 | 4.95 | 202.15 | 1.02 | 979.31 |
| 2 | 5.26 | 380.43 | 1.15 | 1735.64 |
| 4 | 5.51 | 726.08 | 1.57 | 2543.72 |
| 8 | 5.56 | 1439.03 | 2.20 | 3642.16 |
| 16 | 8.59 | 1863.35 | 3.64 | 4396.74 |
| 32 | 15.95 | 2006.62 | 6.51 | 4916.93 |
| 64 | 31.48 | 2032.77 | 12.67 | 5052.52 |
| 128 | 59.86 | 2138.35 | 25.10 | 5099.77 |
| 256 | 115.00 | 2226.10 | 48.55 | 5273.03 |
#### vit_large_patch16_384

| Batch size | PT Latency (ms) | PT QPS (im/s) | AIT Latency (ms) | AIT QPS (im/s) |
|---|---|---|---|---|
| 1 | 9.88 | 101.17 | 3.84 | 260.21 |
| 2 | 11.90 | 168.02 | 5.87 | 340.98 |
| 4 | 21.20 | 188.66 | 11.49 | 348.09 |
| 8 | 39.33 | 203.43 | 19.09 | 419.07 |
| 16 | 76.00 | 210.54 | 36.19 | 442.08 |
| 32 | 147.24 | 217.33 | 70.03 | 456.93 |
| 64 | 291.00 | 219.93 | 135.25 | 473.21 |
| 128 | 578.99 | 221.08 | 267.09 | 479.24 |
| 256 | 1204.16 | 212.60 | 538.97 | 474.98 |
### MI-250 / ROCm 5.2.3 / HIPCC-10736

PT = PyTorch 1.12 Eager

#### 1 GCD
##### vit_base_patch16_224

| Batch size | PT Latency (ms) | PT QPS (im/s) | AIT Latency (ms) | AIT QPS (im/s) |
|---|---|---|---|---|
| 1 | 3.54 | 282.12 | 3.49 | 286.26 |
| 2 | 4.43 | 451.73 | 3.78 | 528.84 |
| 4 | 6.09 | 657.02 | 4.05 | 986.95 |
| 8 | 9.65 | 829.27 | 5.31 | 1507.06 |
| 16 | 16.62 | 962.98 | 8.50 | 1882.72 |
| 32 | 29.87 | 1071.25 | 14.43 | 2218.07 |
| 64 | 56.58 | 1131.08 | 26.52 | 2413.45 |
| 128 | 110.28 | 1160.73 | 51.62 | 2479.69 |
| 256 | 217.07 | 1179.35 | 102.82 | 2489.89 |
##### vit_large_patch16_384

| Batch size | PT Latency (ms) | PT QPS (im/s) | AIT Latency (ms) | AIT QPS (im/s) |
|---|---|---|---|---|
| 1 | 12.90 | 77.51 | 9.70 | 103.05 |
| 2 | 22.42 | 89.19 | 13.40 | 149.29 |
| 4 | 38.16 | 104.83 | 22.12 | 180.86 |
| 8 | 70.58 | 113.35 | 38.46 | 208.00 |
| 16 | 136.28 | 117.40 | 70.44 | 227.15 |
| 32 | 261.97 | 122.15 | 138.14 | 231.65 |
| 64 | 541.90 | 118.10 | 270.01 | 237.02 |
| 128 | 1108.36 | 115.49 | 534.97 | 239.27 |
| 256 | 2213.09 | 115.68 | 1063.24 | 240.77 |
#### 2 GCDs

##### vit_base_patch16_224

| Batch size | PT Latency (ms) | PT QPS (im/s) | AIT Latency (ms) | AIT QPS (im/s) |
|---|---|---|---|---|
| 1 | | | | |
| 2 | 3.49 | 572.95 | 3.59 | 556.55 |
| 4 | 4.11 | 974.26 | 3.97 | 1006.80 |
| 8 | 5.88 | 1359.64 | 4.23 | 1889.44 |
| 16 | 9.75 | 1641.06 | 5.71 | 2800.69 |
| 32 | 17.55 | 1823.03 | 9.34 | 3426.32 |
| 64 | 31.31 | 2043.79 | 16.24 | 3940.53 |
| 128 | 60.33 | 2121.64 | 30.97 | 4133.14 |
| 256 | 117.96 | 2170.29 | 59.82 | 4279.21 |
##### vit_large_patch16_384

| Batch size | PT Latency (ms) | PT QPS (im/s) | AIT Latency (ms) | AIT QPS (im/s) |
|---|---|---|---|---|
| 1 | | | | |
| 2 | 12.73 | 157.07 | 10.52 | 190.13 |
| 4 | 22.97 | 174.12 | 14.94 | 267.82 |
| 8 | 39.78 | 201.08 | 24.55 | 325.85 |
| 16 | 74.95 | 213.48 | 43.95 | 364.07 |
| 32 | 146.18 | 218.91 | 82.04 | 390.06 |
| 64 | 283.04 | 226.12 | 162.62 | 393.55 |
| 128 | 583.03 | 219.54 | 313.34 | 408.51 |
| 256 | 1197.56 | 213.77 | 621.71 | 411.77 |
## Note for Performance Results

- For NVIDIA A100, our test cluster does not allow locking the GPU frequency. We use a longer warm-up to collect more stable results, but a small variance relative to locked-frequency results is expected.
- To benchmark MI-250, first run `python3 benchmark_ait.py` to generate all the necessary model dynamic libraries on a single GCD. Then run `./benchmark_mi250.sh {batch_size}` to simulate data-parallel execution on 2 GCDs, with each GCD processing half of the batch (see the sketch after these notes).
- To benchmark MI-250 on 1 GCD, we lock the frequency with `rocm-smi -d x --setperfdeterminism 1700`, where `x` is the GPU id.
- To benchmark MI-250 on 2 GCDs, we observed a performance regression with ROCm's perf-determinism mode, so the 2-GCD numbers were collected with that mode disabled via `rocm-smi -d x --resetperfdeterminism`, where `x` is the GPU id.
- The PyTorch Eager results do not reflect BetterTransformer, mainly because the BetterTransformer integration into the TIMM/Transformers packages has not landed yet.
- Performance results are what we can reproduce; they should not be used for other purposes.
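For convenience, the MI-250 steps above can be chained together. The driver below is only an illustrative sketch of that sequence; the GPU id and the batch size passed to `benchmark_mi250.sh` are placeholder values, and the authoritative commands are the ones listed in the notes.

```python
import subprocess

# Illustrative driver for the MI-250 benchmarking sequence described above.
# GPU_ID and the batch size passed to benchmark_mi250.sh are placeholders.
GPU_ID = "0"

# 1 GCD: lock the frequency, then build the model libraries and benchmark
# on a single GCD.
subprocess.run(["rocm-smi", "-d", GPU_ID, "--setperfdeterminism", "1700"], check=True)
subprocess.run(["python3", "benchmark_ait.py"], check=True)

# 2 GCDs: disable perf-determinism mode, then simulate data-parallel
# execution with each GCD processing half of the batch.
subprocess.run(["rocm-smi", "-d", GPU_ID, "--resetperfdeterminism"], check=True)
subprocess.run(["./benchmark_mi250.sh", "256"], check=True)
```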