Internal Server Error encountered during training process #239
Comments
We consistently hit this error after the model has been running for several hours (almost always about 4 hours after the SL container gets stuck waiting for merging), so the training never finishes. From our sentinel node: For one of the other nodes we are getting the same error.
We just had another failure with the same setup. A different error message appears, but the behavior is similar. The training had been going well for 4 hours when one of the nodes failed and stopped the training process completely. I have checked their environment: the network is completely fine, and no other tasks were running at the same time. Logs from the sentinel node: Logs from the failing node:
We are facing the same problem again, even after extending the SL failure timeout well beyond the expected merge time. One of the nodes froze during the merging process and the other nodes are receiving this error: I am attaching the logs here: Other node: In addition, we have experienced a similar node-freeze problem before, with an older SL version (below 2.0.0), the fastai platform, and a very small model.
Please add the SL container logs. The attached logs are from the ML container.
Logs from the failing node:
@Ultimate-Storm Can you please re-upload the logs? I am getting a 404 when downloading the attached logs.
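For anyone gathering the SL container logs requested above, the following is a minimal sketch rather than an official procedure. It uses only the standard `docker ps` and `docker logs` commands; the function name `collect_container_logs` and the output directory are hypothetical. It saves the logs of every container on the host, since the names that SWOP assigns to SL containers are not stated in this thread.

```shell
#!/usr/bin/env bash
# Hypothetical helper: dump the logs of every container on this host,
# one file per container, so the SL container logs can be attached.
collect_container_logs() {
    local out_dir=$1 name
    mkdir -p "$out_dir" || return 1
    # "docker ps -a" also lists exited containers that have not been removed.
    docker ps -a --format '{{.Names}}' | while read -r name; do
        # docker logs writes to both stdout and stderr; capture both.
        docker logs "$name" > "$out_dir/$name.log" 2>&1
    done
}

# Example call (directory name is an assumption):
#   collect_container_logs ./swarm-logs
```

Note that the start commands in this issue pass `--rm`, and Docker deletes a container started with `--rm` together with its logs once it exits, so the logs would need to be collected while the containers are still running.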
Issue description
issue description: We encountered an "Internal Server Error" during joint training across 3 nodes. The training process had successfully completed 21 epochs and around 20 merge rounds before the error appeared.
occurrence - consistent or rare: consistent
error messages:
2024-02-24 23:04:51,605 : SwarmCallback : INFO : Starting Swarm merging round ...
2024-02-25 04:16:08,067 : SwarmCallback : ERROR : Sync Swarm call to SL container failed - SL error: (500)
Reason: INTERNAL SERVER ERROR
HTTP response headers: HTTPHeaderDict({'Server': 'TwistedWeb/21.7.0', 'Date': 'Sun, 25 Feb 2024 04:16:04 GMT', 'Content-Type': 'application/problem+json', 'Content-Length': '251'})
HTTP response body: {
"detail": "The server encountered an internal error and was unable to complete your request. Either the server is overloaded or there is an error in the application.",
"status": 500,
"title": "Internal Server Error",
"type": "about:blank"
}
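The two SwarmCallback timestamps above give the time between the start of the merging round and the failure. As a rough sketch (assuming GNU `date` is available; the function name `log_gap_seconds` is hypothetical), the gap can be computed like this:

```shell
#!/usr/bin/env bash
# Hypothetical helper: seconds between two log timestamps of the form
# "YYYY-MM-DD HH:MM:SS,mmm". Assumes GNU date (the -d option).
log_gap_seconds() {
    local start end
    # Strip the ",mmm" millisecond suffix, then convert to epoch seconds.
    # -u treats both timestamps as UTC so DST changes cannot skew the result.
    start=$(date -u -d "${1%%,*}" +%s) || return 1
    end=$(date -u -d "${2%%,*}" +%s) || return 1
    echo $(( end - start ))
}

gap=$(log_gap_seconds "2024-02-24 23:04:51,605" "2024-02-25 04:16:08,067")
printf '%dh %dm %ds\n' $(( gap / 3600 )) $(( gap % 3600 / 60 )) $(( gap % 60 ))
# prints "5h 11m 17s"
```

The gap is a little over five hours. The log excerpt alone does not show whether the sync call was blocked for that whole interval.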
commands used for starting containers:
docker logs [APLS, SPIRE, SN, SL, SWCI]:
Run the SN container
sudo $script_dir/../../swarm_learning_scripts/run-sn \
    -d --rm \
    --name=sn_node \
    --network=host-net \
    --host-ip="$ip_addr" \
    "$sn_command" \
    --sn-p2p-port=30303 \
    --sn-api-port=30304 \
    --key=cert/sn-"$host_index"-key.pem \
    --cert=cert/sn-"$host_index"-cert.pem \
    --capath=cert/ca/capath \
    --apls-ip="$sentinel"
Run the SWOP container
sudo $script_dir/../../swarm_learning_scripts/run-swop --rm -d \
    --name=swop"$ip_addr" \
    --network=host-net \
    --sn-ip="$sentinel" \
    --sn-api-port=30304 \
    --usr-dir=workspace/"$workspace"/swop \
    --profile-file-name=swop_profile_"$ip_addr".yaml \
    --key=cert/swop-"$host_index"-key.pem \
    --cert=cert/swop-"$host_index"-cert.pem \
    --capath=cert/ca/capath \
    -e http_proxy= -e https_proxy= \
    --apls-ip="$sentinel" \
    -e SWOP_KEEP_CONTAINERS=True
Start the SWCI container
sudo "$script_dir/../../swarm_learning_scripts/run-swci" \
    -d --rm --name="swci-$ip_addr" \
    --network="host-net" \
    --usr-dir="workspace/$workspace/swci" \
    --init-script-name="swci-init" \
    --key="cert/swci-$host_index-key.pem" \
    --cert="cert/swci-$host_index-cert.pem" \
    --capath="cert/ca/capath" \
    -e "http_proxy=" -e "https_proxy=" --apls-ip="$sentinel" \
    -e "SWCI_RUN_TASK_MAX_WAIT_TIME=5000" \
    -e "SWCI_GENERIC_TASK_MAX_WAIT_TIME=5000"
Swarm Learning Version:
2.2.0
OS and ML Platform
3 machines, all of them running SN and SWCI nodes. We are hosting the SWCI node
Quick Checklist: Respond [Yes/No]
Additional notes
NOTE: Create an archive with supporting artifacts and attach to issue, whenever applicable.