Add Restore for OperatorField data #1646

Conversation

jeremylt
Member

Fixes #1639

What do we think @jrwrigh. You can see how this does get a bit invasive.

@jeremylt
Member Author

ToDo: Track down all the CPU memory leaks, add GPU restores, language wrappers

Collaborator

@jrwrigh jrwrigh left a comment

General question: is the extra check during the restore worth the new API, or should we just require users to call *Destroy on the objects obtained from CeedOperatorFieldGet*?

I'm not sure what bugs the extra CeedCheck(*vec == op_field->vec, ...) would catch.

See later comment for the fluids code that drove me to the above question.
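
To make the trade-off in this question concrete, here is a minimal sketch under stated assumptions. `MockVector`, `MockOperatorField`, and the `Mock*` functions are hypothetical stand-ins invented for illustration; they are not the libCEED types or the API in this PR. The sketch only shows what a field-aware restore could catch (a handle given back to the wrong field) that a generic destroy cannot.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins; these are NOT the libCEED types or functions. */
typedef struct {
  int ref_count; /* number of live handles to this vector */
} MockVector;

typedef struct {
  MockVector *vec; /* vector held by the operator field */
} MockOperatorField;

/* Hand out the field's vector and record the extra reference. */
int MockFieldGetVector(MockOperatorField *field, MockVector **vec) {
  field->vec->ref_count++;
  *vec = field->vec;
  return 0;
}

/* Option A: dedicated restore that checks the handle came from this field. */
int MockFieldRestoreVector(MockOperatorField *field, MockVector **vec) {
  if (*vec != field->vec) return 1; /* handle returned to the wrong field */
  (*vec)->ref_count--;
  *vec = NULL;
  return 0;
}

/* Option B: generic destroy that only drops one reference, with no field check. */
int MockVectorDestroy(MockVector **vec) {
  if (!*vec) return 0;
  (*vec)->ref_count--;
  *vec = NULL;
  return 0;
}

/* Returns 0 when both options leave the reference counts balanced. */
int DemoGetRestore(void) {
  MockVector        u = {1}, v = {1};
  MockOperatorField field_u = {&u}, field_v = {&v};
  MockVector       *handle = NULL;

  /* Option A, correct use */
  MockFieldGetVector(&field_u, &handle);
  assert(u.ref_count == 2);
  if (MockFieldRestoreVector(&field_u, &handle)) return 1;

  /* Option A, misuse: the restore check rejects a handle from another field */
  MockFieldGetVector(&field_u, &handle);
  if (MockFieldRestoreVector(&field_v, &handle) == 0) return 2;
  if (MockFieldRestoreVector(&field_u, &handle)) return 3;

  /* Option B: destroy balances the count but cannot detect a field mix-up */
  MockFieldGetVector(&field_v, &handle);
  MockVectorDestroy(&handle);

  return (u.ref_count == 1 && v.ref_count == 1 && handle == NULL) ? 0 : 4;
}
```

In this mock, the only misuse the extra check detects is restoring to a different field than the one the handle came from; both options balance the reference count equally well.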

backends/ref/ceed-ref-operator.c (outdated)
backends/ref/ceed-ref-operator.c (outdated)
examples/fluids/src/setuplibceed.c (outdated)
interface/ceed-preconditioning.c (outdated)
@jeremylt jeremylt force-pushed the jeremy/get-restore branch 2 times, most recently from 2361dc6 to 54e15cd Compare August 19, 2024 21:37
@jeremylt
Member Author

Ok, rebased to work off the branch with the lingering Operator memory fix. Impl for CPU ref is in, rest in progress

@jeremylt jeremylt force-pushed the jeremy/get-restore branch 3 times, most recently from 4502c7f to 02da2a3 Compare August 20, 2024 16:05
@jeremylt
Member Author

CPU blocked and ref in, GPU non-gen next. (would need to rebase in another open PR to do gen)

OCCA will be annoying so I'm saving it for last

@jeremylt jeremylt force-pushed the jeremy/get-restore branch 13 times, most recently from fc5c85c to b377dfe Compare August 23, 2024 21:20
@jeremylt jeremylt force-pushed the jeremy/get-restore branch 4 times, most recently from 6feaff7 to 5862f7b Compare September 6, 2024 18:20
@jeremylt
Member Author

jeremylt commented Sep 9, 2024

Ok, should be good to review now

Collaborator

@jrwrigh jrwrigh left a comment

I'll run on Sunspot real quick to verify the SYCL changes.


CeedCallBackend(CeedOperatorFieldGetVector(output_fields[i], &vec));
if (vec == CEED_VECTOR_ACTIVE) {
CeedCallBackend(CeedOperatorFieldGetBasis(output_fields[i], &basis_out));
CeedCallBackend(CeedOperatorFieldGetElemRestriction(output_fields[i], &rstr_out));
CeedCheck(!rstr_out || rstr_out == rstr_in, ceed, CEED_ERROR_BACKEND, "Backend does not implement multi-field non-composite operator assembly");
Collaborator

Is the change in check here deliberate?

Collaborator

Obviously it aligns more with the CUDA and HIP backends, but just making sure we're at least mostly certain that the check isn't necessary anymore.

Member Author

The SYCL backend is a huge problem IMHO. During development, changes were made from the CUDA/HIP source that didn't have well-captured reasons, and since then the CUDA/HIP source has changed, so it's really tricky to make good changes here, especially with the limited access to a good development environment. I don't see a good fix in the near term.

Collaborator

Yeah, I wonder if it's worthwhile getting you access to at least sunspot (even if it isn't Aurora right now) just for the backend support.

Collaborator

@jedbrown Thoughts on getting Jeremy access to Sunspot? CI would be nice too, but these kinds of dev changes would be easier with a shorter debug loop.

Member Author

Giving me a way to build and test would be very helpful - I'd like to make a number of code quality fixes and transfer some CUDA and HIP improvements over

Member

I don't have authority to grant Sunspot access (cc: @KennethEJansen ). I would consider buying a (modestly priced) Intel GPU to put in Noether for dev/testing. Cursory web search hasn't been good at finding actual vendors and I don't know what pricing on the Max 1100 looks like.

Member Author

Yeah, one way or another it would be hugely helpful to the health of those backends to get me access to Sunspot or to get an Intel card somewhere ourselves, otherwise they run the risk of becoming like the OCCA backend

@jrwrigh
Collaborator

jrwrigh commented Sep 11, 2024

I've pushed up more fixes to jrwrigh/get-restore, but there are still a lot of failures to do with the assembly. The failing tests are:

Test: t537-operator
Test: t569-operator
Test: t537-operator
Test: t569-operator
Test: t536-operator-f
Test: t533-operator
Test: t506-operator
Test: t534-operator-f
Test: t536-operator
Test: t570-operator
Test: t534-operator
Test: t533-operator-f
Test: t535-operator-f
Test: t535-operator
Test: t538-operator
Test: t506-operator-f

t570-operator fails with:

/home/jrwrigh/software/libCEED/backends/sycl-ref/ceed-sycl-ref-operator.sycl.cpp:1086 in CeedSingleOperatorAssembleSetup_Sycl(): Cannot assemble operator without inputs/outputs

where num_eval_mode_out = 0 is what triggers the failure.

t504-operator fails with:

Large Problem, Component 1: Computed Area 1.254524 != True Area 1.0

The rest are correctness failures in assembly.

I'd try limiting the SYCL changes to the absolute minimum required to get this PR merged.

@jeremylt
Member Author

Those changes in that branch look good, so feel free to push them to this branch

@jeremylt jeremylt force-pushed the jeremy/get-restore branch 2 times, most recently from 8a95c7e to be77d12 Compare September 17, 2024 17:20
@jeremylt
Member Author

@jrwrigh I think I put back the pieces that were accidentally lost

@jrwrigh
Collaborator

jrwrigh commented Sep 17, 2024

Needed to apply the following diff to compile:

diff --git i/backends/sycl-ref/ceed-sycl-ref-operator.sycl.cpp w/backends/sycl-ref/ceed-sycl-ref-operator.sycl.cpp
index 9d2ecec6..a784e1e5 100644
--- i/backends/sycl-ref/ceed-sycl-ref-operator.sycl.cpp
+++ w/backends/sycl-ref/ceed-sycl-ref-operator.sycl.cpp
@@ -662,7 +662,7 @@ static inline int CeedOperatorAssembleDiagonalSetup_Sycl(CeedOperator op) {
       CeedBasis           basis;

       CeedCallBackend(CeedOperatorFieldGetElemRestriction(op_fields[i], &elem_rstr));
-      CeedCheck(!rstr_in || rstr_in == rstr, ceed, CEED_ERROR_BACKEND,
+      CeedCheck(!rstr_in || rstr_in == elem_rstr, ceed, CEED_ERROR_BACKEND,
                 "Backend does not implement multi-field non-composite operator diagonal assembly");
       if (!rstr_in) CeedCallBackend(CeedElemRestrictionReferenceCopy(elem_rstr, &rstr_in));
       CeedCallBackend(CeedOperatorFieldGetBasis(op_fields[i], &basis));

But I'm still seeing correctness failures for:

t535-operator
t536-operator
t537-operator
t533-operator
t533-operator-f
t506-operator
t536-operator-f
t535-operator-f
t506-operator-f
t538-operator
t534-operator-f
t537-operator
t534-operator

All assembly errors though, so you managed to squash some of the issues.
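
For readers following the corrected check in the diff above, this is a minimal sketch of the pattern it appears to implement: the first field's restriction becomes the reference, and every later field must match it. `MockRestriction` and `FindCommonRestriction` are hypothetical names used only for illustration, not libCEED code, and the real routine also takes a reference copy of the restriction, which is omitted here.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for an element restriction handle. */
typedef struct {
  int id;
} MockRestriction;

/* Returns 0 and sets *common if every field shares one restriction, 1 on a mismatch. */
int FindCommonRestriction(MockRestriction *const *field_rstrs, size_t num_fields, MockRestriction **common) {
  *common = NULL;
  for (size_t i = 0; i < num_fields; i++) {
    MockRestriction *elem_rstr = field_rstrs[i];

    /* Compare against the restriction fetched for this field, as in the corrected check */
    if (*common && *common != elem_rstr) return 1;
    /* The first field seen establishes the reference restriction */
    if (!*common) *common = elem_rstr;
  }
  return 0;
}
```

Comparing against a different variable than the one just fetched, as in the pre-fix line, would defeat the purpose of the check even where it happens to compile.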

@jeremylt
Member Author

I'll keep plunking away, thanks for the cross-check!

@jeremylt
Member Author

t506 is the odd one, as that's using the same QF for 2 different CeedOperators

@jeremylt
Member Author

Note

make prove -j BACKENDS="/gpu/sycl/ref /gpu/sycl/gen" PROVE_OPTS=-v

will give the details on the failures for just those

@jrwrigh
Collaborator

jrwrigh commented Sep 17, 2024

Fixed t506, but the others are still failing (sorted this time):

t533-operator
t533-operator-f
t534-operator
t534-operator-f
t535-operator
t535-operator-f
t536-operator
t536-operator-f
t537-operator
t537-operator
t537-operator
t538-operator

They fail with all the SYCL backends (ref, shared, and gen). Not too surprising since the assembly always goes through the ref backend, but just another sanity check.

@jeremylt
Member Author

jeremylt commented Sep 17, 2024

Good that t506 is fixed. That narrows my search. Just operator diagonal left to dig into

It's the same routine for all of them, so now you only need to look at /gpu/sycl/ref

@jeremylt
Member Author

Found another difference. Any useful error messages in the test suite output with PROVE_OPTS=-v or is it just "entry x is different" sort of stuff?

@jrwrigh
Collaborator

jrwrigh commented Sep 17, 2024

I've been running with make junit to get the error messages. But yeah, they're just:

[0] Error in assembly: 0.000000 != 0.002963
[1] Error in assembly: 0.000000 != 0.011852

or similar.

@jeremylt
Member Author

Interesting, I wouldn't have expected getting 0s

@jrwrigh
Collaborator

jrwrigh commented Sep 17, 2024

Yeah, all the errors are in the form Error in assembly: 0.0000 != ....

@jeremylt
Member Author

Huzzah, at long last. I think this is good to add now, and I can spin up the companion PR to MFEM. Mildly disruptive but not a huge deal as it just leaks if not fixed downstream
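
As a rough illustration of the "just leaks if not fixed downstream" remark, the sketch below assumes a getter that changes from returning a borrowed pointer to returning an owned reference. `MockObject`, `MockGetOwned`, `MockRelease`, and the two `Downstream*` functions are hypothetical and not part of libCEED or MFEM. Downstream code that never releases still works, but the count never reaches zero.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical reference-counted object; not a libCEED type. */
typedef struct {
  int ref_count;
} MockObject;

/* Assumed new-style getter: returns an owned reference the caller must release. */
MockObject *MockGetOwned(MockObject *held_by_operator) {
  held_by_operator->ref_count++;
  return held_by_operator;
}

/* Drop one reference; returns 1 if the object would now be freed. */
int MockRelease(MockObject **obj) {
  int freed = (--(*obj)->ref_count == 0);

  *obj = NULL;
  return freed;
}

/* Downstream code written against the old borrowed-pointer behavior: never releases. */
int DownstreamNotUpdated(MockObject *held_by_operator) {
  MockObject *obj = MockGetOwned(held_by_operator);

  (void)obj; /* ... use obj ..., but no release */
  return held_by_operator->ref_count;
}

/* Downstream code after the fix: releases what it obtained. */
int DownstreamUpdated(MockObject *held_by_operator) {
  MockObject *obj = MockGetOwned(held_by_operator);

  /* ... use obj ... */
  MockRelease(&obj);
  return held_by_operator->ref_count;
}
```

In the unpatched case, the operator's own release later finds a count of 1 instead of 0, so the object is never freed. Nothing crashes, which matches the "mildly disruptive" framing.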

Development

Successfully merging this pull request may close these issues.

Create CeedOperatorFieldGet*Read routines with const parameters
3 participants