Benchmark backprop of a te.TransformerLayer. #2956

Open
wants to merge 2 commits into
base: main

wujingyue
Collaborator

@wujingyue wujingyue commented Sep 18, 2024

$ mpirun -np 1 pytest tests/python/test_transformer_engine.py --only-mpi

------------------------------------------------------------------------------------------------- benchmark: 2 tests ------------------------------------------------------------------------------------------------
Name (time in us)                           Min                   Max                  Mean              StdDev                Median                 IQR            Outliers       OPS            Rounds  Iterations
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_transformer_layer[forward]        952.7383 (1.0)      1,200.1079 (1.0)      1,016.5755 (1.0)      103.4595 (1.0)        976.7041 (1.0)       79.6057 (1.0)           1;1  983.6948 (1.0)           5           1
test_transformer_layer[backward]     1,070.4808 (1.12)     1,347.0454 (1.12)     1,188.3112 (1.17)     136.6049 (1.32)     1,125.9178 (1.15)     257.2080 (3.23)          1;0  841.5304 (0.86)          5           1
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

@wujingyue
Collaborator Author

!build

@wujingyue
Collaborator Author

!build

Base automatically changed from wjy/te to main September 18, 2024 18:56
@wujingyue wujingyue changed the base branch from main to wjy/fix September 18, 2024 22:25
@wujingyue
Collaborator Author

!build

Base automatically changed from wjy/fix to multidevice_typo_fix September 18, 2024 22:50
Base automatically changed from multidevice_typo_fix to main September 19, 2024 01:27
@wujingyue
Collaborator Author

I'm getting a weird error from jit_python_distributed_tests_17_A100 on this PR. I'll fix it before merging. Unfortunately, I haven't been able to reproduce it locally yet.

@wujingyue
Collaborator Author

I believe the error is related to calling init_process_group and destroy_process_group in each test. Because of a race condition, some rank may call init_process_group for the second test before all ranks have finished destroying the default process group.
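
One possible way to avoid per-test setup and teardown is to create the default process group once for the whole test session and destroy it once at the end. The following is a minimal sketch under stated assumptions, not the fix adopted in this PR: `shared_process_group` is a hypothetical helper name, `dist` is assumed to be the `torch.distributed` module, and the commented-out fixture at the bottom assumes a `conftest.py` that the repository may organize differently.

```python
from contextlib import contextmanager


@contextmanager
def shared_process_group(dist, **init_kwargs):
    """Create the default process group once and destroy it once.

    `dist` is expected to be the `torch.distributed` module. It is passed in
    as a parameter so the sketch can also be exercised with a stand-in object.
    `init_kwargs` are forwarded to `dist.init_process_group` unchanged.
    """
    # Only the caller that actually created the group tears it down.
    owns_group = not dist.is_initialized()
    if owns_group:
        dist.init_process_group(**init_kwargs)
    try:
        yield
    finally:
        if owns_group:
            # Wait until every rank has finished its tests before any rank
            # starts destroying the group.
            dist.barrier()
            dist.destroy_process_group()


# Hypothetical use from conftest.py, so that all tests share one group:
#
# import pytest
# import torch.distributed as dist
#
# @pytest.fixture(scope="session", autouse=True)
# def process_group():
#     with shared_process_group(dist, backend="nccl"):
#         yield
```

With a session-scoped fixture, no test re-initializes the group, so the init/destroy interleaving described above cannot occur between tests. Note that the barrier can hang if one rank exits early because of an exception, so a real implementation may want a timeout.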

@cowanmeg
Collaborator

How about we don't add this to CI for now? We can benchmark locally and then work on fixing the CI problems.

2 participants