
Spindle could not connect to session #44

Open
vsoch opened this issue Jul 21, 2021 · 9 comments

Comments


vsoch commented Jul 21, 2021

I'm getting errors in testing and attempted usage that Spindle cannot connect to some session. I'm installing as follows:

./configure --with-munge-dir=/etc/munge --enable-sec-munge --with-slurm-dir=/etc/slurm --with-testrm=slurm
make
make install

I've tried that with both slurm and openmpi as the "testrm" value. Then I build and run the tests:

cd testsuite
make
./runTests

but no matter what I do (using the slurm or openmpi template, both of which I have) I see this error:

Running: ./run_driver --partial --session
ERROR: Spindle could not connect to session tn2VYQ

I saw this same error when trying to use spindle directly, so I've gone back to the tests to debug. Note that I do have a /tmp area:

 ls /tmp/
ccFjQGLR.s  ks-script-eC059Y  spin.kT6PPu  spin.tn2VYQ  spin.Un7RTL  yum.log

Update: I think it could be that the nodes need to see the same /tmp area - so I'm rebuilding the containers with a shared /tmp and will report back.
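For concreteness, here's a minimal docker-compose sketch of what I mean by a shared /tmp (the service and image names are placeholders, not the real ones from my setup):

```yaml
# Hypothetical docker-compose excerpt: mount one named volume at /tmp
# in every node container so spindle sessions see the same directory.
services:
  node1:
    image: slurm-node        # placeholder image name
    volumes:
      - shared-tmp:/tmp
  node2:
    image: slurm-node
    volumes:
      - shared-tmp:/tmp

volumes:
  shared-tmp:
```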


vsoch commented Jul 21, 2021

Okay, I have the spindle tests running, and I think I might not have enough resources, because my tiny cluster hangs on:

# ./runTests 
Running: ./run_driver --dependency --push
srun: Requested partition configuration not available now
srun: job 3 queued and waiting for resources

What resources does spindle require for testing with slurm?


vsoch commented Jul 21, 2021

Going to try openmpi now.


vsoch commented Jul 21, 2021

When I try testing with openmpi:

Spindle Error: Could not identify system job launcher in command line
Running: ./run_driver --dlopen --preload

and then the same error about not being able to connect to a session.

mplegendre (Member) commented

If you were using spindle with slurm 20.11+, then I just pushed a fix for running spindle with that version of slurm to devel. The issue could have produced the hang you were seeing.


vsoch commented Aug 14, 2021

Quick test of a build and I'm seeing:

#18 1.985 checking slurm version for compatibility... no
#18 1.994 configure: error: Slurm support was requested, but slurm 20.11.8, which is later than 20.11, was detected.  This version of slurm breaks spindle daemon launch.  You can disable this error message and build spindle with slurm-based daemon launching anyways by explicitly passing the --with-slurm-launch option (you might still be able to get spindle to work by running jobs with srun's --overlap option).  Or you could switch to having spindle launch daemons with rsh/ssh by passing the --with-rsh-launch option, and ensuring that rsh/ssh to nodes works on your cluster.

I'll follow the message's advice and try out those various options (probably not right now, because I'm tired), but will update here with what I find.


vsoch commented Aug 14, 2021

Okay - so I gave rebuilding a shot, adding the --with-slurm-launch option for 20.11.8. That compiled correctly, removing the previous error message, but I had other issues getting the slurm cluster working in docker-compose. I didn't want to confound those possible new issues with spindle, so I fell back to an older version of slurm, 18.x.x. Since my previous job 3 was queued and waiting for resources, I tried again and looked at the queue:

$ docker exec -it slurmdbd bash
[root@slurmdbd /]# squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
                 3    normal spindle_     root PD       0:00      2 (PartitionNodeLimit)

So can I ask again - how many concurrent nodes are required for spindle to run tests with slurm?
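For reference, squeue's (PartitionNodeLimit) reason suggests the 2-node request exceeds the partition's MaxNodes. A slurm.conf sketch (node names are hypothetical) of a partition that would admit a 2-node job:

```
# Hypothetical slurm.conf excerpt: the pending test job asks for 2 nodes,
# so the partition must allow at least 2.
NodeName=c[1-2] CPUs=1 State=UNKNOWN
PartitionName=normal Nodes=c[1-2] Default=YES MaxNodes=2 State=UP
```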


vsoch commented Aug 14, 2021

What command do you usually use for pynamic? I can try that instead.


vsoch commented Aug 14, 2021

Okay, this looks to work for pynamic, although it's still a no-go with spindle added.

$ time python config_pynamic.py 30 1250 -e -u 350 1250 -n 150

************************************************
summary of pynamic-sdb-pyMPI executable and 10 shared libraries
Size of aggregate total of shared libraries: 2.5MB
Size of aggregate texts of shared libraries: 6.8MB
Size of aggregate data of shared libraries: 408.4KB
Size of aggregate debug sections of shared libraries: 0B
Size of aggregate symbol tables of shared libraries: 0B
Size of aggregate string table size of shared libraries: 0B
************************************************

real	21m33.556s
user	14m54.538s
sys	3m31.206s

mplegendre (Member) commented

What's happening here is that there's a bug/feature in Slurm 20.11+ that prevents Spindle from launching its daemons with Slurm. The "checking slurm version for compatibility... no" message means you're hitting that. There are two autoconf-level options:

  1. Build with "--with-slurm-launch", which tells Spindle to build anyways and still try to use slurm. But without a slurm fix, this is unlikely to get anywhere.
  2. Use Spindle's rsh launching mode with the "--with-rsh-launch" option. If you have multiple nodes in your cluster, and configure them so that rsh or ssh can execute commands without passwords across nodes, then Spindle can use this to start its daemons.

You'll probably have to use option 2 here. Or you could downgrade your slurm version.
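For option 2, the configure line from the first comment might look something like this (same paths and flags as in that comment, plus the rsh flag named in the configure error; an untested sketch, not a verified build recipe):

```shell
./configure --with-munge-dir=/etc/munge --enable-sec-munge \
            --with-slurm-dir=/etc/slurm --with-testrm=slurm \
            --with-rsh-launch
make
make install
```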

And I'd usually run pynamic based on the README.md commands in its repo. So something like: srun pyMPI pynamic_driver.py `date +%s`
