Add automatic triage utility #793

Open: 1 commit to merge into main

olupton (Contributor) commented May 2, 2024

This is a small Python program that implements a two-stage bisection, identifying a commit in JAX or XLA that caused a test case to start failing.

An example use-case was the JetTest.test_dot test, which had been failing in the JAX unit tests on A100 for an extended period.

$ .github/triage/triage --container=jax test-jax.sh jet_test_gpu
[INFO] 2024-05-01 06:24:23 Checking end-of-range failure in 2024-04-30
[INFO] 2024-05-01 06:25:13 Ran test case in 2024-04-30 in 48.6s
[INFO] 2024-05-01 06:25:14 Starting coarse search with 2024-04-29 based on --end-date=None and end_date=2024-04-30
[INFO] 2024-05-01 06:26:07 Ran test case in 2024-04-29 in 52.4s
[INFO] 2024-05-01 06:26:58 Ran test case in 2024-04-28 in 49.8s
[INFO] 2024-05-01 06:27:52 Ran test case in 2024-04-26 in 51.4s
[INFO] 2024-05-01 06:28:50 Ran test case in 2024-04-22 in 54.7s
[INFO] 2024-05-01 06:29:44 Ran test case in 2024-04-14 in 51.4s
[INFO] 2024-05-01 06:30:36 Ran test case in 2024-03-29 in 49.7s
[INFO] 2024-05-01 06:31:31 Ran test case in 2024-02-26 in 52.5s
[INFO] 2024-05-01 06:32:40 Ran test case in 2023-12-24 in 68.0s
[INFO] 2024-05-01 06:34:23 Ran test case in 2023-08-18 in 101.6s
[INFO] 2024-05-01 06:34:23 Coarse container-level search yielded [2023-08-18, 2023-12-24]...
[INFO] 2024-05-01 06:35:28 Ran test case in 2023-10-20 in 61.4s
[INFO] 2024-05-01 06:35:28 Refined container-level range to [2023-10-20, 2023-12-24]
[INFO] 2024-05-01 06:36:32 Ran test case in 2023-11-21 in 63.4s
[INFO] 2024-05-01 06:36:32 Refined container-level range to [2023-10-20, 2023-11-21]
[INFO] 2024-05-01 06:37:38 Ran test case in 2023-11-03 in 61.5s
[INFO] 2024-05-01 06:37:38 Refined container-level range to [2023-11-03, 2023-11-21]
[INFO] 2024-05-01 06:38:45 Ran test case in 2023-11-12 in 66.4s
[INFO] 2024-05-01 06:38:45 Refined container-level range to [2023-11-03, 2023-11-12]
[INFO] 2024-05-01 06:39:49 Ran test case in 2023-11-09 in 59.5s
[INFO] 2024-05-01 06:39:49 Refined container-level range to [2023-11-03, 2023-11-09]
[INFO] 2024-05-01 06:39:52 Could not adjust 2023-11-06 00:00:00 given before=2023-11-09 and after=2023-11-03
[INFO] 2024-05-01 06:39:53 Bisecting JAX [db07f402333997d5d570a2d0478f9a15f79bf8b2, 9b1572c02859287294dec7a6b3df9018f462c8aa] and XLA [049a3e6caf3f60b7edcced4c05f579bf2c3f26dc, 0bc02d71f75ce5a4ea93ade1957a79c3227ce466] using ghcr.io/nvidia/jax:nightly-2023-11-09
[INFO] 2024-05-01 06:39:53 Building in the range-ending 2023-11-09 container...
[INFO] 2024-05-01 06:39:54 Checking out XLA 0bc02d71f75ce5a4ea93ade1957a79c3227ce466 JAX 9b1572c02859287294dec7a6b3df9018f462c8aa
[INFO] 2024-05-01 06:41:14 Build completed in 80.0s
[INFO] 2024-05-01 06:42:14 Test completed in 60.4s
[INFO] 2024-05-01 06:42:14 Verified test failure after rebuilding in 2023-11-09
[INFO] 2024-05-01 06:42:14 Checking out XLA 049a3e6caf3f60b7edcced4c05f579bf2c3f26dc JAX db07f402333997d5d570a2d0478f9a15f79bf8b2
[INFO] 2024-05-01 06:43:14 Build completed in 59.3s
[INFO] 2024-05-01 06:44:10 Test completed in 56.0s
[INFO] 2024-05-01 06:44:10 Test passes after rebuilding commits from 2023-11-03 in 2023-11-09
[INFO] 2024-05-01 06:44:10 Checking out XLA adee86839deffae1f789b483178eaf6454d0bd9b JAX 5a1731c16fd62ad1b6c3fbd059a65332949f4f15
[INFO] 2024-05-01 06:45:07 Build completed in 56.6s
[INFO] 2024-05-01 06:45:43 Test completed in 36.3s
[INFO] 2024-05-01 06:45:43 Checking out XLA 27c69deb22cafafb1226c6ed027d5917ef29f538 JAX 1c1dd7c8c7ff2e5790159d9cdbbbb1f029a92d4b
[INFO] 2024-05-01 06:46:37 Build completed in 53.7s
[INFO] 2024-05-01 06:47:28 Test completed in 51.6s
[INFO] 2024-05-01 06:47:29 Checking out XLA d257360bcc1ebc3cdca939baebfe8c48134e2b6a JAX 390022a227f7271fc0caebed3c4c066a667b8628
[INFO] 2024-05-01 06:48:22 Build completed in 53.4s
[INFO] 2024-05-01 06:48:59 Test completed in 37.5s
[INFO] 2024-05-01 06:49:00 Checking out XLA 75bd8cc70166bb532c5e7d98d929c2004b32b540 JAX dda76733e835788c611b0687f0477e469a92881b
[INFO] 2024-05-01 06:49:29 Build completed in 29.3s
[INFO] 2024-05-01 06:50:11 Test completed in 42.6s
[INFO] 2024-05-01 06:50:12 Checking out XLA e112baca8630aca294f996a5d95028e59ade56e5 JAX 1e810983fa7331a1ff18941e484b2a732c785e9d
[INFO] 2024-05-01 06:50:40 Build completed in 28.1s
[INFO] 2024-05-01 06:51:18 Test completed in 38.4s
[INFO] 2024-05-01 06:51:18 Checking out XLA 7fba2ad0bc4c21f8d62f09f07876ce2616c73349 JAX bfbf9e1c3313cae2dfd71e39e64c50c8f42017ef
[INFO] 2024-05-01 06:51:44 Build completed in 25.7s
[INFO] 2024-05-01 06:52:22 Test completed in 37.8s
[INFO] 2024-05-01 06:52:22 Checking out XLA a58070090a025db5cbdd4b596f9ef714fa13641a JAX 1126945da8e3d60b2ca670b78957ca38e92d23d6
[INFO] 2024-05-01 06:52:48 Build completed in 26.1s
[INFO] 2024-05-01 06:53:26 Test completed in 37.9s
[INFO] 2024-05-01 06:53:26 Checking out XLA 7fba2ad0bc4c21f8d62f09f07876ce2616c73349 JAX 1126945da8e3d60b2ca670b78957ca38e92d23d6
[INFO] 2024-05-01 06:53:52 Build completed in 25.9s
[INFO] 2024-05-01 06:54:30 Test completed in 37.9s
[INFO] 2024-05-01 06:54:30 Bisected failure to XLA 7fba2ad0bc4c21f8d62f09f07876ce2616c73349..a58070090a025db5cbdd4b596f9ef714fa13641a with JAX 1126945da8e3d60b2ca670b78957ca38e92d23d6

This points the finger at openxla/xla@a580700, the commit at the end of the bisected XLA range. The failure is worked around in jax-ml/jax#21035.
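
For readers unfamiliar with the approach, the container-level stage visible in the log above (step back 1, 2, 4, ... days from the failing end date until a passing nightly is found, then bisect the resulting date range) can be sketched roughly as follows. This is a minimal illustration and not the code in this PR: `fails`, `coarse_search` and `refine` are hypothetical names, and the sketch assumes a container exists for every date, which the "Could not adjust" log line suggests is not always true in practice.

```python
import datetime
from typing import Callable, Tuple

Date = datetime.date


def coarse_search(
    end: Date, fails: Callable[[Date], bool], max_days: int = 1024
) -> Tuple[Date, Date]:
    """Step back 1, 2, 4, ... days from a failing `end` date.

    Returns (passing_date, earliest_known_failing_date).
    `fails` is a hypothetical callback that runs the test case in the
    container for the given date and reports whether it failed.
    """
    if not fails(end):
        raise ValueError("the test must fail at the end of the range")
    bad = end
    step = 1
    while step <= max_days:
        candidate = end - datetime.timedelta(days=step)
        if not fails(candidate):
            return candidate, bad
        bad = candidate
        step *= 2  # double the distance from `end` each time
    raise RuntimeError("no passing date found within max_days")


def refine(good: Date, bad: Date, fails: Callable[[Date], bool]) -> Tuple[Date, Date]:
    """Bisect [good, bad] until the two dates are adjacent days."""
    while (bad - good).days > 1:
        mid = good + datetime.timedelta(days=(bad - good).days // 2)
        if fails(mid):
            bad = mid
        else:
            good = mid
    return good, bad


if __name__ == "__main__":
    # Stand-in for running the test: pretend every date from
    # 2023-11-09 onwards fails (an assumption for illustration only).
    first_bad = Date(2023, 11, 9)
    good, bad = coarse_search(Date(2024, 4, 30), lambda d: d >= first_bad)
    print("coarse range:", good, bad)
    print("refined range:", *refine(good, bad, lambda d: d >= first_bad))
```

The doubling step keeps the number of container runs logarithmic in the age of the regression, which matters when each run takes around a minute as it does in the log.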

olupton requested a review from DwarKapex May 3, 2024 12:19
olupton requested a review from gspschmid May 14, 2024 07:16

olupton (Contributor, Author) commented May 14, 2024

A different test case: running .github/triage/triage --container jax --bazel-cache ... test-jax.sh linear_search_test_gpu points to openxla/xla@a3f9a7e as the cause of the failure in https://github.com/NVIDIA/JAX-Toolbox/actions/runs/9060572476/job/24891187791#step:7:772

[INFO] 2024-05-13 23:15:24 Checking end-of-range failure in 2024-05-13
[INFO] 2024-05-13 23:16:15 Ran test case in 2024-05-13 in 47.5s
[INFO] 2024-05-13 23:17:05 Starting coarse search with 2024-05-12 based on --end-date=None and end_date=2024-05-13
[INFO] 2024-05-13 23:17:52 Ran test case in 2024-05-12 in 42.6s
[INFO] 2024-05-13 23:19:25 Ran test case in 2024-05-11 in 43.2s
[INFO] 2024-05-13 23:21:05 Ran test case in 2024-05-09 in 45.9s
[INFO] 2024-05-13 23:21:05 Coarse container-level search yielded [2024-05-09, 2024-05-11]...
[INFO] 2024-05-13 23:22:46 Ran test case in 2024-05-10 in 46.9s
[INFO] 2024-05-13 23:22:46 Refined container-level range to [2024-05-10, 2024-05-11]
[INFO] 2024-05-13 23:22:47 Bisecting JAX [f21e3e82c78454ef83975fc3964998b48451519f, 3b03e5497d70a7d0745f1bd421e4cb4f9dd56ebd] and XLA [9b6f2cbb95482d7c69ec56125ebf392b8a2faad1, 93fe4f0d20fa1ad29fee664f7842d7e427dc6cf1] using ghcr.io/nvidia/jax:jax-2024-05-11
[INFO] 2024-05-13 23:22:47 Building in the range-ending 2024-05-11 container...
[INFO] 2024-05-13 23:22:48 Checking out XLA 93fe4f0d20fa1ad29fee664f7842d7e427dc6cf1 JAX 3b03e5497d70a7d0745f1bd421e4cb4f9dd56ebd
[INFO] 2024-05-13 23:35:41 Build completed in 773.3s
[INFO] 2024-05-13 23:36:06 Test completed in 24.9s
[INFO] 2024-05-13 23:36:06 Verified test failure after rebuilding in 2024-05-11
[INFO] 2024-05-13 23:36:06 Checking out XLA 9b6f2cbb95482d7c69ec56125ebf392b8a2faad1 JAX f21e3e82c78454ef83975fc3964998b48451519f
[INFO] 2024-05-13 23:45:28 Build completed in 562.5s
[INFO] 2024-05-13 23:45:50 Test completed in 21.9s
[INFO] 2024-05-13 23:45:50 Test passes after rebuilding commits from 2024-05-10 in 2024-05-11
[INFO] 2024-05-13 23:45:50 Checking out XLA cd35e5c5c1cffcdf3e914a94141d42163e215951 JAX 0a3e4327451a1825047cb883813ceb6c1b36f5ac
[INFO] 2024-05-13 23:49:19 Build completed in 208.5s
[INFO] 2024-05-13 23:49:51 Test completed in 31.8s
[INFO] 2024-05-13 23:49:51 Checking out XLA a3f9a7e68b82df174fee22b80a6ec1185ebca719 JAX 17444fc8fab32426f5ff8bbbe7be6507ec1641ea
[INFO] 2024-05-13 23:53:12 Build completed in 200.9s
[INFO] 2024-05-13 23:53:41 Test completed in 29.0s
[INFO] 2024-05-13 23:53:41 Checking out XLA 48f4ca5dd3b0900acac5f645f0f0911efe42350c JAX c231cd51eb074554b4c2abd115caa00f8bba3665
[INFO] 2024-05-13 23:54:55 Build completed in 74.4s
[INFO] 2024-05-13 23:55:27 Test completed in 31.5s
[INFO] 2024-05-13 23:55:27 Checking out XLA 4f6e8d670df87c114f7dc5e210b3846e25ef7932 JAX 17444fc8fab32426f5ff8bbbe7be6507ec1641ea
[INFO] 2024-05-13 23:56:16 Build completed in 48.8s
[INFO] 2024-05-13 23:56:47 Test completed in 31.3s
[INFO] 2024-05-13 23:56:47 Checking out XLA f9258de517de7ddfa912ac629eea2b827cc20575 JAX 17444fc8fab32426f5ff8bbbe7be6507ec1641ea
[INFO] 2024-05-13 23:57:47 Build completed in 59.7s
[INFO] 2024-05-13 23:58:18 Test completed in 31.3s
[INFO] 2024-05-13 23:58:18 Checking out XLA f9258de517de7ddfa912ac629eea2b827cc20575 JAX 17444fc8fab32426f5ff8bbbe7be6507ec1641ea
[INFO] 2024-05-13 23:58:53 Build completed in 34.2s
[INFO] 2024-05-13 23:59:24 Test completed in 31.4s
[INFO] 2024-05-13 23:59:24 Bisected failure to XLA f9258de517de7ddfa912ac629eea2b827cc20575..a3f9a7e68b82df174fee22b80a6ec1185ebca719 with JAX 17444fc8fab32426f5ff8bbbe7be6507ec1641ea
xla f9258de517de7ddfa912ac629eea2b827cc20575 a3f9a7e68b82df174fee22b80a6ec1185ebca719 jax 17444fc8fab32426f5ff8bbbe7be6507ec1641ea
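
Both logs end with a commit range in one repository and a single fixed commit in the other. One way to arrive at a result of that shape is to merge the two histories into one time-ordered sequence of (JAX, XLA) states and bisect over that sequence. The sketch below illustrates only that idea; whether the triage script orders commits this way is an assumption, and `merge_histories`, `bisect_states` and `passes` are hypothetical names that do not come from this PR.

```python
from typing import Callable, List, Sequence, Tuple

Commit = Tuple[int, str]  # (commit timestamp, sha), oldest first
State = Tuple[str, str]   # (jax sha, xla sha)


def merge_histories(jax: Sequence[Commit], xla: Sequence[Commit]) -> List[State]:
    """Interleave two histories into a time-ordered list of states.

    The first entry of each input is the known-good starting commit.
    Every later commit, taken in timestamp order, advances exactly one
    of the two repositories and so produces one new state.
    """
    events = sorted(
        [(t, "jax", sha) for t, sha in jax[1:]]
        + [(t, "xla", sha) for t, sha in xla[1:]]
    )
    current = {"jax": jax[0][1], "xla": xla[0][1]}
    states = [(current["jax"], current["xla"])]
    for _, repo, sha in events:
        current[repo] = sha
        states.append((current["jax"], current["xla"]))
    return states


def bisect_states(
    states: Sequence[State], passes: Callable[[str, str], bool]
) -> Tuple[State, State]:
    """Return the adjacent (last passing, first failing) states.

    Assumes states[0] passes and states[-1] fails. `passes` is a
    hypothetical callback that checks out, builds and tests one state.
    """
    lo, hi = 0, len(states) - 1
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if passes(*states[mid]):
            lo = mid
        else:
            hi = mid
    return states[lo], states[hi]


if __name__ == "__main__":
    # Made-up histories; pretend XLA commit "x2" introduced the failure.
    jax_log = [(0, "j0"), (2, "j1"), (5, "j2")]
    xla_log = [(0, "x0"), (3, "x1"), (4, "x2")]
    states = merge_histories(jax_log, xla_log)
    print(bisect_states(states, lambda j, x: x in ("x0", "x1")))
```

Because adjacent states differ in exactly one repository, the final pair names the offending commit in that repository together with the fixed commit of the other, which is the shape of the "Bisected failure to XLA ... with JAX ..." lines above.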
