Faster mont_sqr making use of symmetry #19
Comments
It looks like this is covered in your paper (http://www.acsel-lab.com/arithmetic/arith23/data/1616a047.pdf). I'm still working on grokking the relevant section. The new question is: was this determined to be possible but not implemented, or determined to be impossible with the row/column-oriented design that CGBN uses?
I'm currently looking at core_mont_xmad.cu because that's the code that executes on my 1080 Ti (I see the double nested for loop handling each limb of [...]
I cloned [...]. I used an approach similar to your xmad two-stage 16-bit alignment trick. My code only works when each thread uses all limbs (e.g. [...]
I'm not sure this is workable because of #15, where the final [...]. Even if the shortcomings could be addressed, my code ended up being 10% slower! My guess is that this is due to thread divergence, so fixing the problem with [...]
GMP performs basecase squaring ~1.5x faster than a general multiplication a*b by noticing that the cross products are symmetric (each a[i]*a[j] with i != j appears twice, once on each side of the diagonal) and skipping the duplicated multiplications.

I've read over the core_mont.cu code, and I believe that core_mont.cu is used when TPI=limbs. In this case it's not clear how to do fast exp.
Have you thought about this possibility?
Have you seen anyone take advantage of this on a GPU?
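To make the symmetry concrete, here is a minimal single-threaded sketch of the idea in plain C++. It is not GMP's or CGBN's code: the `Limbs` type and the function names `mul_basecase`, `sqr_basecase` and `check_example` are made up for illustration, and it assumes little-endian 32-bit limbs. The squaring routine computes each off-diagonal product a[i]*a[j] (i < j) once, doubles the partial sum, then adds the diagonal squares a[i]*a[i].

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical representation: little-endian vector of 32-bit limbs.
using Limbs = std::vector<uint32_t>;

// Schoolbook product: a.size() * b.size() limb multiplications.
Limbs mul_basecase(const Limbs& a, const Limbs& b) {
    Limbs r(a.size() + b.size(), 0);
    for (size_t i = 0; i < a.size(); ++i) {
        uint64_t carry = 0;
        for (size_t j = 0; j < b.size(); ++j) {
            uint64_t t = (uint64_t)a[i] * b[j] + r[i + j] + carry;
            r[i + j] = (uint32_t)t;
            carry = t >> 32;
        }
        r[i + b.size()] = (uint32_t)carry;
    }
    return r;
}

// Squaring with symmetry: n*(n-1)/2 cross products plus n diagonal squares.
Limbs sqr_basecase(const Limbs& a) {
    const size_t n = a.size();
    Limbs r(2 * n, 0);

    // 1) Off-diagonal products a[i]*a[j] for i < j, each computed once.
    for (size_t i = 0; i < n; ++i) {
        uint64_t carry = 0;
        for (size_t j = i + 1; j < n; ++j) {
            uint64_t t = (uint64_t)a[i] * a[j] + r[i + j] + carry;
            r[i + j] = (uint32_t)t;
            carry = t >> 32;
        }
        r[i + n] = (uint32_t)carry;
    }

    // 2) Double the partial sum (shift left by one bit across all limbs).
    uint32_t top = 0;
    for (size_t k = 0; k < 2 * n; ++k) {
        uint32_t next = r[k] >> 31;
        r[k] = (r[k] << 1) | top;
        top = next;
    }

    // 3) Add the diagonal squares a[i]^2 at limb position 2*i.
    uint64_t carry = 0;
    for (size_t i = 0; i < n; ++i) {
        uint64_t sq = (uint64_t)a[i] * a[i];
        uint64_t lo = (uint64_t)r[2 * i] + (uint32_t)sq + carry;
        r[2 * i] = (uint32_t)lo;
        uint64_t hi = (uint64_t)r[2 * i + 1] + (sq >> 32) + (lo >> 32);
        r[2 * i + 1] = (uint32_t)hi;
        carry = hi >> 32;
    }
    return r;
}

// Small self-check: squaring must agree with the general product.
inline void check_example() {
    Limbs a = {0xFFFFFFFFu, 0x12345678u, 0xDEADBEEFu};
    assert(sqr_basecase(a) == mul_basecase(a, a));
}
```

For n limbs this does n*(n+1)/2 limb multiplications instead of n*n. The doubling pass and the extra carry handling are overhead that a general multiply does not have, so the saving in multiplications does not translate into a full 2x. How these three passes would map onto a layout where limbs are spread across threads is exactly the open question in this issue, and the sketch does not address it.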