Connectable #1249

Open
wants to merge 3 commits into
base: master

Conversation

stevenrbrandt
Member

Make it possible to connect new resources to a running Phylanx calculation.

This doesn't actually work. I need guidance.

Problems:
(1) Not sure how to use params with the longer version of hpx::init()
(2) I can only use it to add one locality
./build.Release/bin/physl --connect sleep.p # works
./build.Release/bin/physl --hpx:hpx=127.0.0.1:7910 --connect sleep.p # dies instantly
(3) My guess is you don't want sleep() implemented as a math function, though it works fine.

@hkaiser hkaiser added this to the 0.0.1 milestone Sep 9, 2020
@hkaiser
Member

hkaiser commented Sep 9, 2020

Ok, I think the main problem is that the way HPX TCP connections work is not documented adequately. Here are the basics:

Every HPX locality needs to listen to a unique TCP/IP address (IP-address/hostname + port number). Most of the time, HPX can make sure this condition is met (in SLURM or PBS batch environments, for instance), but sometimes you need to help by providing the necessary information.

If the HPX localities run on different nodes this is easily achieved, as every locality uses hostname:7910 as their default, which is unique by definition.

If the localities run on the same node, locality zero uses hostname:7910 (again, by default) but all other localities have to be told to use some other port.

Connecting localities use hostname:7909 as their default, which is why you can have one connecting locality without problems.

To have a second locality or more, you will have to make sure they use a unique port, for instance hostname:7908, etc.
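The uniqueness requirement above can be checked before anything is launched. The following is a minimal sketch, not part of HPX or Phylanx: the function name `check_unique_ports` is hypothetical, and it only compares the port numbers you intend to pass via --hpx:hpx on a single node. It does not test whether some unrelated process already holds a port.

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight check: every locality on one node needs its own port.
# Takes the planned port numbers as arguments and fails if any is repeated.
check_unique_ports() {
    local dups
    # uniq -d prints only values that occur more than once in sorted input
    dups=$(printf '%s\n' "$@" | sort -n | uniq -d)
    if [ -n "$dups" ]; then
        echo "duplicate port(s): $dups" >&2
        return 1
    fi
    return 0
}

# Base locality on 7910, two connecting localities on 7909 and 7908
check_unique_ports 7910 7909 7908 && echo "ports are unique"

# A connecting locality that reuses the base locality's port
check_unique_ports 7910 7910 2>/dev/null || echo "clash: two localities would share a port"
```

A clash like the second call is one plausible reason for a connecting locality to die right away when it is told to listen on 7910 while locality zero already uses that port.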

HPX has two command-line options to specify the IP addresses to bind their sockets to: --hpx:hpx defines the address a locality will use to listen for incoming parcels, and --hpx:agas defines the address a locality should use to connect to AGAS (usually locality zero).

That said, to run a base locality and two connecting localities on the same node, you could do:

./base_locality --hpx:agas=hostname:7910 --hpx:hpx=hostname:7910
./connecting_locality --hpx:agas=hostname:7910 --hpx:hpx=hostname:7909
./connecting_locality --hpx:agas=hostname:7910 --hpx:hpx=hostname:7908

(note, some of the options could be left out to utilize the built-in defaults, but I have listed the full set to clarify things).

By the way, there is also the command-line option --hpx:connect that can be passed to any locality to instruct it to connect to a running application. IOW, you could do:

./base_locality --hpx:agas=hostname:7910 --hpx:hpx=hostname:7910
./base_locality --hpx:agas=hostname:7910 --hpx:hpx=hostname:7909 --hpx:connect
./base_locality --hpx:agas=hostname:7910 --hpx:hpx=hostname:7908 --hpx:connect

(note, all three launch the same executable) and it should still work.
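The pattern in these command lines can be generated rather than typed by hand. The sketch below only prints the commands (a dry run), so it can be inspected before anything is started. The helper name `locality_cmd` is hypothetical, and counting ports down from 7910 is just the convention used in the examples above; any unused port would do.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print (not run) the command line for one locality.
#   $1 = executable, $2 = host name, $3 = locality index (0 = base locality)
locality_cmd() {
    local exe=$1 host=$2 index=$3
    local agas_port=7910                 # address of locality zero / AGAS
    local port=$((agas_port - index))    # assumption: count down from 7910
    local cmd="$exe --hpx:agas=$host:$agas_port --hpx:hpx=$host:$port"
    # Every locality except the base one joins the running application
    if [ "$index" -gt 0 ]; then
        cmd="$cmd --hpx:connect"
    fi
    echo "$cmd"
}

# Base locality plus two connecting localities on the same node
for i in 0 1 2; do
    locality_cmd ./base_locality hostname "$i"
done
```

Replacing `echo "$cmd"` with an actual launch is left out on purpose, since how the processes should be backgrounded and waited on depends on the application.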

@hkaiser
Member

hkaiser commented Sep 9, 2020

Problems:
(1) Not sure how to use params with the longer version of hpx::init()

This is probably what you're looking for: https://hpx-docs.stellar-group.org/latest/html/libs/init_runtime/api.html?highlight=init_params#_CPPv4N3hpx4initEiPPcRK11init_params

(2) I can only use it to add one locality
./build.Release/bin/physl --connect sleep.p # works
./build.Release/bin/physl --hpx:hpx=127.0.0.1:7910 --connect sleep.p # dies instantly

See my comment above for some explanations.

(3) My guess is you don't want sleep() implemented as a math function, though it works fine.

Correct, I don't think we should do that. We need a simpler way to add primitives (we have one, but I don't like it ;-), so I'll think about it (see #1250).

@stevenrbrandt
Member Author

@hkaiser
Note that

./build.Release/bin/physl --hpx:hpx=127.0.0.1:7909 --connect

also exits instantly. Somehow, the HPX arguments seem to be incompatible with the physl arguments. I'm not sure why.

@hkaiser
Member

hkaiser commented Sep 10, 2020

./build.Release/bin/physl --hpx:hpx=127.0.0.1:7909 --connect

I don't know anything about the --connect option. I'm not sure what you mean.

@stevenrbrandt
Member Author

@hkaiser the connect option was something I added for the PR, so that I could call finalize instead of disconnect, etc.

@stevenrbrandt
Member Author

I now see that the modification to physl wasn't needed; the --hpx:connect option already does this.

@stevenrbrandt
Member Author

OK, I can connect localities, but I cannot use them. I have a main process which waits for 4 localities, then tries to run a cannon product (which requires 4 localities) using this script, can.p:

define(
    cannon,
    size,
    block(
        define(
            nl,
            num_localities()
        ),
        while(
            __lt(nl, 4),
            block(
                cout(nl),
                sleep(1),
                store(
                    nl,
                    num_localities()
                )
            )
        ),
        cout("cannon!"),
        define(
            array1,
            random_d(
                list(size, size),
                find_here(),
                num_localities()
            )
        ),
        define(
            array2,
            random_d(
                list(size, size),
                find_here(),
                num_localities()
            )
        ),
        cannon_product_d(array1, array2)
    )
)
cannon(120)

Then I have some other processes which just run sleep.p

sleep(10)

I then try to orchestrate things by calling this script: run.sh

./build.Release/bin/physl --hpx:ini=hpx.parcel.tcp.enable=1 \
    --hpx:threads=2 --hpx:expect-connecting-localities can.p &

sleep 2

echo attach procs
for port in 7913 7911 7912
do
    echo PORT $port
    ./build.Release/bin/physl --hpx:threads=2 --hpx:ini=hpx.parcel.tcp.enable=1 --hpx:hpx=127.0.0.1:$port --hpx:connect sleep.p &
done

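# Wait for every background physl process to exit. A bare `wait` returns 0
# once all children are done, so looping on it would never terminate.
wait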
echo "DONE"

The 4 localities are obtained, but when the cannon product is attempted, the code hangs. Thoughts?
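While a hang like this is being investigated, it can help to put a deadline on each process so the reproduction script always returns. This is a minimal sketch, assuming GNU coreutils `timeout` is installed. The helper name `run_with_deadline` is hypothetical and not part of the PR.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run a command with a deadline and report a hang.
# GNU coreutils `timeout` exits with status 124 when the deadline expires.
run_with_deadline() {
    local seconds=$1
    shift
    timeout "$seconds" "$@"
    local status=$?
    if [ "$status" -eq 124 ]; then
        echo "timed out after ${seconds}s: $*" >&2
    fi
    return "$status"
}

# Stand-in command for illustration; in run.sh this would wrap a physl call
run_with_deadline 1 sleep 5 || echo "exit status: $?"
```

A timeout only makes the hang visible and bounded. It says nothing about why the cannon product stalls.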

@stevenrbrandt
Member Author

@hkaiser I also attempted to have all localities run the same code, i.e. can.p. They all print cannon! and then all hang.

@hkaiser
Member

hkaiser commented Sep 10, 2020

@hkaiser I also attempted to have all localities run the same code, i.e. can.p. They all print cannon! and then all hang.

That's progress, I guess ;-)


2 participants