
Faster mont_sqr making use of symmetry #19

Open
sethtroisi opened this issue Nov 4, 2021 · 3 comments

Comments

sethtroisi commented Nov 4, 2021

GMP performs basecase squaring roughly 1.5x faster than a general multiplication a*b by noticing that the cross products are symmetric (a[i]*a[j] = a[j]*a[i]) and skipping the duplicated multiplications.
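For reference, the symmetry can be sketched in plain integer arithmetic (a Python illustration of the idea, not GMP's internals; the limb width is an assumption):

```python
# Sketch of basecase squaring symmetry: each cross product A[i]*A[j]
# with i < j is computed once and doubled, instead of computing both
# A[i]*A[j] and A[j]*A[i] as a general multiplication would.
W = 32                     # limb width in bits (illustrative)
MASK = (1 << W) - 1

def to_limbs(x, n):
    return [(x >> (W * i)) & MASK for i in range(n)]

def sqr_basecase(A):
    acc = 0
    for i in range(len(A)):
        acc += A[i] * A[i] << (2 * W * i)            # diagonal terms, once each
        for j in range(i + 1, len(A)):
            acc += 2 * A[i] * A[j] << (W * (i + j))  # cross terms, doubled
    return acc

x = 0x123456789ABCDEF0FEDCBA9876543210
assert sqr_basecase(to_limbs(x, 4)) == x * x
```

Only n*(n+1)/2 limb products are needed instead of n^2, which is where the ~1.5x comes from.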

I've read over the core_mont.cu code, and I believe core_mont.cu is used when TPI = limbs. In that case it's not clear how to do the fast squaring.

Have you thought about this possibility?
Have you seen anyone take advantage of this on a GPU?

@sethtroisi

It looks like this is covered in your paper (http://www.acsel-lab.com/arithmetic/arith23/data/1616a047.pdf). I'm working on grokking the relevant section; the new question is: was this determined to be possible but not implemented, or impossible with the row/column-oriented design that CGBN uses?

sethtroisi commented Nov 4, 2021

I'm currently looking at core_mont_xmad.cu because that's the code that executes on my 1080 Ti (CUDA_ARCH = 610).

I see the doubly nested for loop handling each limb of a and b, and it's much clearer what the path forward is.

sethtroisi commented Nov 12, 2021

I cloned mont_mul from core_mont_imad.cu and specialized it for squaring.

I used an approach similar to your xmad two-stage 16-bit alignment trick.
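The idea behind that kind of 16-bit split can be checked in integer arithmetic (a hedged sketch of the principle only; the actual PTX instruction sequence differs):

```python
# A 32x32-bit product built from 16-bit halves, so each partial product
# fits in 32 bits and aligns on a 16-bit boundary, which is what the
# xmad-style instructions exploit.
def mul32_via_16(a, b):
    al, ah = a & 0xFFFF, a >> 16
    bl, bh = b & 0xFFFF, b >> 16
    lo = al * bl                 # aligned at bit 0
    mid = al * bh + ah * bl      # aligned at bit 16
    hi = ah * bh                 # aligned at bit 32
    return lo + (mid << 16) + (hi << 32)

assert mul32_via_16(0xDEADBEEF, 0xCAFEBABE) == 0xDEADBEEF * 0xCAFEBABE
```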

My code only works when each thread uses all of its limbs (i.e. BITS is a multiple of 32 * TPI); a little more work would be needed in the case that some limbs are padding.

  1. I calculate (A[i] * A[i] >> 1) + sum(A[i] * A[j] for j > i), then double that summation at the end.
    • This has several advantages:
      1. Only a single carry pass for adding both the squared and non-squared terms.
      2. The lowest shifted bit ((A[i] * A[i]) & 1 = A[i] & 1) can be handled by the lower thread easily without passing around data.
    • And some disadvantages:
      1. For odd input A[] I handle the low bit of the first squaring by calculating 2^-1 mod N = N/2 + 1 and initializing x[] = N/2 + 1.
        • Alternatively I could add R^-1 mod N at the very end, but I don't have this sum precomputed.
      2. At the end I have to double x[], which can cause overflow that can't be easily handled.
        • e.g. 2*x[] - N can still be > R
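The halved-accumulation bookkeeping above can be checked in plain integers (a host-side Python sketch under my assumptions, not the CUDA code; W = 32 is the limb width):

```python
# Halved accumulation: keep A[i]^2 >> 1 plus the cross products counted
# once. The diagonal bit dropped by the shift has weight 2^(2*W*i), so
# after halving it is still an integer (weight 2^(2*W*i - 1)) for i > 0
# and can be reinserted locally; only the i = 0 half-bit is fractional,
# and it is absorbed by seeding the accumulator with 2^-1 mod N =
# (N + 1) // 2 before the final doubling.
W = 32
MASK = (1 << W) - 1

def sqr_mod_via_halving(A, N):              # A as limbs, N odd
    acc = (N + 1) // 2 if A[0] & 1 else 0   # seed: 2^-1 mod N if A is odd
    for i in range(len(A)):
        acc += (A[i] * A[i]) >> 1 << (2 * W * i)
        if i > 0:
            acc += (A[i] & 1) << (2 * W * i - 1)  # reinserted half-bit
        for j in range(i + 1, len(A)):
            acc += A[i] * A[j] << (W * (i + j))   # cross terms, once each
    return (2 * acc) % N                    # one doubling at the end

N = (1 << 127) - 1                          # any odd modulus
x = 0x0123456789ABCDEF0123456789ABCDEF
A = [(x >> (W * i)) & MASK for i in range(4)]
assert sqr_mod_via_halving(A, N) == x * x % N
```

Doubling gives 2*acc = A^2 - (A[0] & 1) plus the doubled seed, and 2 * (N + 1)/2 ≡ 1 (mod N) restores the lost odd bit.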

I'm not sure this is workable because of #15, where the final x[] > N and 2*x[] > R, which requires hundreds (or thousands) of subtractions of N. Reading other papers I thought I was guaranteed x[] < 2*N, but this doesn't appear to be the case (maybe because I start with some value in the summation?).
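For what it's worth, the x[] < 2N bound usually quoted is for plain REDC with a zero-initialized accumulator; seeding the accumulator or doubling afterwards falls outside that argument. A textbook sketch of where the bound comes from (not CGBN's reduction):

```python
# Textbook Montgomery reduction: for T < N*R, REDC(T) = T*R^-1 mod N is
# computed as (T + m*N) / R with m = T*N' mod R. The result is < 2N
# because T/R < N and m*N/R < N, so one conditional subtraction suffices.
def redc(T, N, R, Nprime):
    m = (T * Nprime) % R
    t = (T + m * N) // R
    return t - N if t >= N else t

W, n = 32, 4
R = 1 << (W * n)
N = (1 << 127) - 1                    # odd modulus < R
Nprime = (-pow(N, -1, R)) % R         # N' = -N^-1 mod R
a = 0x0123456789ABCDEF0123456789ABCDEF % N
T = a * a                             # T < N^2 < N*R, so the bound holds
assert redc(T, N, R, Nprime) == (T * pow(R, -1, N)) % N
```

If the accumulator starts at a nonzero value (or the output is doubled), T < N*R no longer holds, which would explain seeing x[] well past 2N.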

Even if the shortcomings could be addressed, my code ended up being 10% slower! My guess is that this is due to thread divergence, so fixing the problem with 2*x[] > R would just be an intellectual exercise.
