Internal Server Error encountered during training process #239

Ultimate-Storm · 2024-02-26T09:36:49Z

Issue description

issue description: We hit an "Internal Server Error" during joint training across 3 nodes. Training ran successfully for 21 epochs and about 20 merge rounds before the error appeared.
occurrence - consistent or rare: consistent
error messages:
2024-02-24 23:04:51,605 : SwarmCallback : INFO : Starting Swarm merging round ...
2024-02-25 04:16:08,067 : SwarmCallback : ERROR : Sync Swarm call to SL container failed - SL error: (500)
Reason: INTERNAL SERVER ERROR
HTTP response headers: HTTPHeaderDict({'Server': 'TwistedWeb/21.7.0', 'Date': 'Sun, 25 Feb 2024 04:16:04 GMT', 'Content-Type': 'application/problem+json', 'Content-Length': '251'})
HTTP response body: {
"detail": "The server encountered an internal error and was unable to complete your request. Either the server is overloaded or there is an error in the application.",
"status": 500,
"title": "Internal Server Error",
"type": "about:blank"
}
commands used for starting containers:
docker logs [APLS, SPIRE, SN, SL, SWCI]:

Run the SN container

sudo "$script_dir/../../swarm_learning_scripts/run-sn" \
-d --rm \
--name=sn_node \
--network=host-net \
--host-ip="$ip_addr" \
"$sn_command" \
--sn-p2p-port=30303 \
--sn-api-port=30304 \
--key=cert/sn-"$host_index"-key.pem \
--cert=cert/sn-"$host_index"-cert.pem \
--capath=cert/ca/capath \
--apls-ip="$sentinel"

Run the SWOP container

sudo "$script_dir/../../swarm_learning_scripts/run-swop" --rm -d \
--name=swop"$ip_addr" \
--network=host-net \
--sn-ip="$sentinel" \
--sn-api-port=30304 \
--usr-dir=workspace/"$workspace"/swop \
--profile-file-name=swop_profile_"$ip_addr".yaml \
--key=cert/swop-"$host_index"-key.pem \
--cert=cert/swop-"$host_index"-cert.pem \
--capath=cert/ca/capath \
-e http_proxy= -e https_proxy= \
--apls-ip="$sentinel" \
-e SWOP_KEEP_CONTAINERS=True

Start the SWCI container

sudo "$script_dir/../../swarm_learning_scripts/run-swci" \
-d --rm --name="swci-$ip_addr" \
--network="host-net" \
--usr-dir="workspace/$workspace/swci" \
--init-script-name="swci-init" \
--key="cert/swci-$host_index-key.pem" \
--cert="cert/swci-$host_index-cert.pem" \
--capath="cert/ca/capath" \
-e "http_proxy=" -e "https_proxy=" --apls-ip="$sentinel" \
-e "SWCI_RUN_TASK_MAX_WAIT_TIME=5000" \
-e "SWCI_GENERIC_TASK_MAX_WAIT_TIME=5000"

Swarm Learning Version:

Find the docker tag of the Swarm images ( $ docker images | grep hub.myenterpriselicense.hpe.com/hpe_eval/swarm-learning )
2.2.0

OS and ML Platform

details of host OS: Ubuntu 22.04.4 LTS
details of ML platform used: Quadro RTX 6000
details of Swarm learning Cluster (Number of machines, SL nodes, SN nodes):
3 machines, all of them running SN and SWCI nodes. We are hosting the SWCI node.

Quick Checklist: Respond [Yes/No]

APLS server web GUI shows available Licenses? /
If Multiple systems are used, can each system access every other system? Yes
Is Password-less SSH configuration setup for all the systems?
If GPU or other protected resources are used, does the account have sufficient privileges to access and use them?
Is the user id a member of the docker group?
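Two of the checklist items above can be verified from a shell on each machine. The following is only a minimal sketch: the function names are hypothetical helpers, not part of Swarm Learning, and the host names in the usage comment are placeholders.

```shell
# True if the space-separated list in $1 contains the exact word $2.
has_word() {
    case " $1 " in
        *" $2 "*) return 0 ;;
        *) return 1 ;;
    esac
}

# Is the current user a member of the docker group?
user_in_docker_group() {
    has_word "$(id -nG)" docker
}

# Can we log in to host $1 over SSH without a password prompt?
# BatchMode=yes makes ssh fail instead of asking for a password.
passwordless_ssh_ok() {
    ssh -o BatchMode=yes -o ConnectTimeout=5 "$1" true
}

# Example usage (host names are placeholders):
#   user_in_docker_group || echo "current user is not in the docker group"
#   for h in node-a node-b; do
#       passwordless_ssh_ok "$h" || echo "password-less SSH to $h failed"
#   done
```

Running the SSH check from every machine towards every other machine also covers the "can each system access every other system" item, at least for the SSH port.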

Additional notes

Are you running documented example without any modification? Yes
Add any additional information about use case or any notes which supports for issue investigation:

NOTE: Create an archive with supporting artifacts and attach to issue, whenever applicable.

Ultimate-Storm · 2024-02-26T09:40:58Z

We always hit this error after the model has been running for several hours (almost always about 4 hours after the SL node gets stuck waiting for merging), so we can never finish the training.
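The length of the stall can be read off the two SwarmCallback timestamps quoted in the original report. Below is a minimal sketch, assuming GNU `date` (as on Ubuntu) and assuming those two log lines really bracket the stalled merge round; `stall_seconds` is a hypothetical helper name.

```shell
# Seconds between two log timestamps of the form "YYYY-MM-DD HH:MM:SS,mmm".
# The millisecond part after the comma is dropped before parsing.
stall_seconds() {
    ss_start=$(TZ=UTC date -d "${1%,*}" +%s) || return 1
    ss_end=$(TZ=UTC date -d "${2%,*}" +%s) || return 1
    echo $(( ss_end - ss_start ))
}

# Timestamps copied from the "Starting Swarm merging round" and
# "Sync Swarm call to SL container failed" lines in the report.
secs=$(stall_seconds "2024-02-24 23:04:51,605" "2024-02-25 04:16:08,067")
printf 'gap between merge start and error: %dh %dm %ds\n' \
    $(( secs / 3600 )) $(( secs % 3600 / 60 )) $(( secs % 60 ))
```

For the lines quoted in the report this gives a gap of a little over 5 hours, which can be compared against whatever timeout values are configured.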

From our sentinel node:
swop.log
sn.log
sl.log
ml.log

One of the other nodes shows the same error.

Logs from the remaining node:
ml.log
sl.log
sn.log

Ultimate-Storm · 2024-03-05T08:19:49Z

We just had another failure with the same setup. A different error message appears, but the behavior is similar. Training had been running well for 4 hours when one of the nodes failed and stopped the training process completely. I checked their environment: the network is fine, and no other tasks were running at the same time.

Logs from the sentinel node:
swop_fail.log
swci_fail.log
sn_fail.log
sl_fail.log
ml_fail.log

Logs from the failing node:
ml_fail.log
sl_fail.log
sn_fail.log
swop_fail.log
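To compare what happened on the two nodes, the ERROR lines from several log files can be merged into one timeline. This is a rough sketch only: it assumes every log uses the `timestamp : component : LEVEL : message` layout seen in the SwarmCallback lines of the report (the SN, SWOP and SWCI logs may differ), and that the clocks of the machines are reasonably synchronized. `error_timeline` is a hypothetical helper name.

```shell
# Print all ERROR lines from the given log files, sorted by their leading
# timestamp, with the source file name appended in brackets.
error_timeline() {
    awk '/ : ERROR : /{ print $0 "  [" FILENAME "]" }' "$@" | sort
}

# Example usage with the file names attached above:
#   error_timeline sl_fail.log ml_fail.log sn_fail.log swop_fail.log
```

Because each line starts with a `YYYY-MM-DD HH:MM:SS,mmm` timestamp, a plain lexicographic `sort` is enough to order the entries chronologically.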

Ultimate-Storm · 2024-03-06T15:13:24Z

We are facing the same problem again, even after extending the SL failure timeout to a sufficiently long value. One of the nodes froze during the merging process, and the other nodes are receiving this error:
Sync Swarm call to SL container failed - SL error: [Errno 2] No such file or directory: '/platform/swarm/SMLNODE/fs/sync/MP_STMERGED_CLdefaultbb.cqdb.sml.hpe_ID172.24.4.73_SY6.bin'
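One way to follow up on this error is to pull the missing path out of the log line and look at the sync directory on each node. The sketch below only does the extraction; `missing_path` is a hypothetical helper, and the assumption that the path lives inside the SL container's filesystem is not confirmed by the logs.

```shell
# Read log lines on stdin and print the path quoted after
# "No such file or directory:", if any.
missing_path() {
    sed -n "s/.*No such file or directory: '\([^']*\)'.*/\1/p"
}

# Error line copied from this comment.
line="Sync Swarm call to SL container failed - SL error: [Errno 2] No such file or directory: '/platform/swarm/SMLNODE/fs/sync/MP_STMERGED_CLdefaultbb.cqdb.sml.hpe_ID172.24.4.73_SY6.bin'"

path=$(printf '%s\n' "$line" | missing_path)
sync_dir=$(dirname "$path")
echo "missing file:   $path"
echo "sync directory: $sync_dir"
```

With the directory in hand, something like `docker exec <sl-container> ls -l "$sync_dir"` on each node (the container name is a placeholder) would show which merged-parameter files actually exist there.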

I am attaching the logs here:
Frozen node:
fail_sl.log

Other node:
fail_sl.log

In addition, we experienced a similar frozen-node problem before, with an older SL version (below 2.0.0), the fastai platform, and a very small model.
Each merging round used to finish within 1 minute, but by chance one node could end up waiting for the merge forever.

RadhakrishnaJ · 2024-03-06T16:22:18Z

Please add the SL container logs. The attached logs are from the ML container.

Ultimate-Storm · 2024-03-08T07:43:55Z

swarm_logs.zip

Ultimate-Storm · 2024-03-08T11:50:57Z

logs from the failing node:
swarm_logs (1).zip

htjain · 2024-03-13T06:25:11Z

@Ultimate-Storm Can you please re-upload the logs? I am getting a 404 error when downloading the attached logs.
