
Faster mont_sqr making use of symmetry #19

Open
sethtroisi opened this issue Nov 4, 2021 · 3 comments

Comments

sethtroisi commented Nov 4, 2021

GMP performs basecase squaring roughly 1.5x faster than a general multiplication a*b by noticing that the cross products are symmetric (a[i]*a[j] = a[j]*a[i]) and skipping the duplicated multiplications.
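For reference, the symmetry can be sketched in plain integer arithmetic (a Python illustration of the idea, not GMP's internals; the limb width is an assumption):

```python
# Sketch of basecase squaring symmetry: each cross product A[i]*A[j]
# with i < j is computed once and doubled, instead of computing both
# A[i]*A[j] and A[j]*A[i] as a general multiplication would.
W = 32                     # limb width in bits (illustrative)
MASK = (1 << W) - 1

def to_limbs(x, n):
    return [(x >> (W * i)) & MASK for i in range(n)]

def sqr_basecase(A):
    acc = 0
    for i in range(len(A)):
        acc += A[i] * A[i] << (2 * W * i)            # diagonal terms, once each
        for j in range(i + 1, len(A)):
            acc += 2 * A[i] * A[j] << (W * (i + j))  # cross terms, doubled
    return acc

x = 0x123456789ABCDEF0FEDCBA9876543210
assert sqr_basecase(to_limbs(x, 4)) == x * x
```

Only n*(n+1)/2 limb products are needed instead of n^2, which is where the ~1.5x comes from.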

I've read over the core_mont.cu code, and I believe core_mont.cu is used when TPI = limbs. In that case it's not clear how to do the fast squaring.

Have you thought about this possibility?
Have you seen anyone take advantage of this on a GPU?

@sethtroisi

It looks like this is covered in your paper (http://www.acsel-lab.com/arithmetic/arith23/data/1616a047.pdf). I'm working on grokking the relevant section; the new question is: was this determined to be possible but not implemented, or impossible with the row/column-oriented design that CGBN uses?

sethtroisi commented Nov 4, 2021

I'm currently looking at core_mont_xmad.cu because that's the code that executes on my 1080 Ti (CUDA_ARCH = 610).

I see the doubly nested for loop handling each limb of a and b, and it's much clearer what the path forward is.

sethtroisi commented Nov 12, 2021

I cloned mont_mul from core_mont_imad.cu and specialized it for squaring.

I used an approach similar to your xmad two-stage 16-bit alignment trick.
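The idea behind that kind of 16-bit split can be checked in integer arithmetic (a hedged sketch of the principle only; the actual PTX instruction sequence differs):

```python
# A 32x32-bit product built from 16-bit halves, so each partial product
# fits in 32 bits and aligns on a 16-bit boundary, which is what the
# xmad-style instructions exploit.
def mul32_via_16(a, b):
    al, ah = a & 0xFFFF, a >> 16
    bl, bh = b & 0xFFFF, b >> 16
    lo = al * bl                 # aligned at bit 0
    mid = al * bh + ah * bl      # aligned at bit 16
    hi = ah * bh                 # aligned at bit 32
    return lo + (mid << 16) + (hi << 32)

assert mul32_via_16(0xDEADBEEF, 0xCAFEBABE) == 0xDEADBEEF * 0xCAFEBABE
```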

My code only works when each thread uses all of its limbs (i.e. BITS is a multiple of 32 * TPI); a little more work would be needed in the case that some limbs are padding.

  1. I calculate (A[i] * A[i] >> 1) + sum(A[i] * A[j] for j > i), then double that summation at the end.
    • This has several advantages:
      1. Only a single carry pass for adding both the squared and non-squared terms.
      2. The lowest shifted bit ((A[i] * A[i]) & 1 = A[i] & 1) can be handled by the lower thread easily without passing around data.
    • And some disadvantages:
      1. For odd input A[] I handle the low bit of the first squaring by calculating 2^-1 mod N = N/2 + 1 and initializing x[] = N/2 + 1.
        • Alternatively I could add R^-1 mod N at the very end, but I don't have this sum precomputed.
      2. At the end I have to double x[], which can cause overflow that can't be easily handled.
        • e.g. 2*x[] - N can still be > R
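The halved-accumulation bookkeeping above can be checked in plain integers (a host-side Python sketch under my assumptions, not the CUDA code; W = 32 is the limb width):

```python
# Halved accumulation: keep A[i]^2 >> 1 plus the cross products counted
# once. The diagonal bit dropped by the shift has weight 2^(2*W*i), so
# after halving it is still an integer (weight 2^(2*W*i - 1)) for i > 0
# and can be reinserted locally; only the i = 0 half-bit is fractional,
# and it is absorbed by seeding the accumulator with 2^-1 mod N =
# (N + 1) // 2 before the final doubling.
W = 32
MASK = (1 << W) - 1

def sqr_mod_via_halving(A, N):              # A as limbs, N odd
    acc = (N + 1) // 2 if A[0] & 1 else 0   # seed: 2^-1 mod N if A is odd
    for i in range(len(A)):
        acc += (A[i] * A[i]) >> 1 << (2 * W * i)
        if i > 0:
            acc += (A[i] & 1) << (2 * W * i - 1)  # reinserted half-bit
        for j in range(i + 1, len(A)):
            acc += A[i] * A[j] << (W * (i + j))   # cross terms, once each
    return (2 * acc) % N                    # one doubling at the end

N = (1 << 127) - 1                          # any odd modulus
x = 0x0123456789ABCDEF0123456789ABCDEF
A = [(x >> (W * i)) & MASK for i in range(4)]
assert sqr_mod_via_halving(A, N) == x * x % N
```

Doubling gives 2*acc = A^2 - (A[0] & 1) plus the doubled seed, and 2 * (N + 1)/2 ≡ 1 (mod N) restores the lost odd bit.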

I'm not sure this is workable because of #15, where the final x[] > N and 2*x[] > R, which requires hundreds (or thousands) of subtractions of N. Reading other papers I thought I was guaranteed x[] < 2*N, but this doesn't appear to be the case (maybe because I start with some value in the summation?).
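For what it's worth, the x[] < 2N bound usually quoted is for plain REDC with a zero-initialized accumulator; seeding the accumulator or doubling afterwards falls outside that argument. A textbook sketch of where the bound comes from (not CGBN's reduction):

```python
# Textbook Montgomery reduction: for T < N*R, REDC(T) = T*R^-1 mod N is
# computed as (T + m*N) / R with m = T*N' mod R. The result is < 2N
# because T/R < N and m*N/R < N, so one conditional subtraction suffices.
def redc(T, N, R, Nprime):
    m = (T * Nprime) % R
    t = (T + m * N) // R
    return t - N if t >= N else t

W, n = 32, 4
R = 1 << (W * n)
N = (1 << 127) - 1                    # odd modulus < R
Nprime = (-pow(N, -1, R)) % R         # N' = -N^-1 mod R
a = 0x0123456789ABCDEF0123456789ABCDEF % N
T = a * a                             # T < N^2 < N*R, so the bound holds
assert redc(T, N, R, Nprime) == (T * pow(R, -1, N)) % N
```

If the accumulator starts at a nonzero value (or the output is doubled), T < N*R no longer holds, which would explain seeing x[] well past 2N.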

Even if the shortcomings could be addressed, my code ended up being 10% slower! My guess is that this is due to thread divergence, so fixing the problem with 2*x[] > R would just be an intellectual exercise.
