--- Log opened Sun Aug 01 00:00:25 2021
04:15 < elichai2> What's the upside of #967? Is it faster?
06:32 < sipa> elichai2: hopefully
06:35 < elichai2> I'm curious about the reasons that led you to test that (using the full 64 bits is somewhat the "natural" choice, but I remember people saying that you should use 52-bit elements because of performance)
06:37 < sipa> elichai2: the reason to test it is that kaushik, who is working on writing optimized assembly, suggested it
06:41 < sipa> the tradeoff is that with a 5-limb representation (5x52) multiplications are more expensive (there are 25 cross products of limbs to compute), but everything else is cheaper, because e.g. additions don't need a carry operation
06:42 < sipa> with a 4-limb representation, multiplication is cheaper (only 16 cross products), but you need to perform carries and mod reductions all the time
06:43 < sipa> this 5x64 representation is something in between: it does need carries for every operation, but no reduction
06:44 < sipa> and then before multiplication, we reduce both arguments from 5 to 4 limbs, and use the 16-product multiplication algorithm
06:45 < sipa> the downside of this approach is that it strongly relies on having fast addition-carry chains
06:45 < sipa> which you can't portably write in C (at least not with current compilers)
06:46 < sipa> so switching from this 5x52 to 5x64 representation without asm optimization generally makes things slower
06:46 < sipa> but with sufficient optimization it does appear to be actually faster
07:00 < elichai2> That's interesting, thanks!
07:01 < elichai2> I thought these days llvm/gcc do produce adc correctly, but I guess I hoped for too much haha
07:03 < sipa> it does, for things that correspond to C code
07:04 < sipa> but if you need hacks in C to express simple operations in a complicated way, the compiler can't see that it actually could use a simple instruction
07:05 < sipa> say you have four 64-bit integers (c0,c1,c2,c3) that represent a 256-bit number, and you want to add the value of a (64 bits) to it
07:05 < sipa> on the cpu that's just {add c0,a; adc c1,$0; adc c2,$0; adc c3,$0}
07:07 < sipa> but in C you need something like: c0 += a; carry = (c0 < a); c1 += carry; carry = (c1 < carry); c2 += carry; carry = (c2 < carry); c3 += carry; or you can do something like uint128 acc = c0; acc += a; c0 = acc; acc >>= 64; acc += c1; c1 = acc; acc >>= 64; acc += c2; c2 = acc; acc >>= 64; c3 += acc;
07:10 < sipa> but both of these compile to way more instructions than the simple 4 that are necessary
07:11 < sipa> gcc even has a __builtin_add_overflow, which does an in-place update while returning the carry - but that still produces far more complicated code than needed
07:11 < sipa> if you find a magic formula that compiles to just add,adc,adc,adc, let me know!
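A minimal, compilable C sketch of the three formulations sipa describes above (comparison-based carries, a 128-bit accumulator, and __builtin_add_overflow); the function names and the 4x64 limb array are invented for illustration and are not code from the library:

    /* Add a 64-bit value a into a 256-bit number held as four 64-bit limbs c[0..3]. */
    #include <stdint.h>

    /* Variant 1: comparison-based carry chain. */
    static void add_cmp(uint64_t c[4], uint64_t a) {
        c[0] += a;
        uint64_t carry = (c[0] < a);
        c[1] += carry; carry = (c[1] < carry);
        c[2] += carry; carry = (c[2] < carry);
        c[3] += carry;
    }

    /* Variant 2: 128-bit accumulator (gcc/clang extension), as in sipa's uint128 example. */
    static void add_u128(uint64_t c[4], uint64_t a) {
        unsigned __int128 acc = (unsigned __int128)c[0] + a;
        c[0] = (uint64_t)acc; acc >>= 64;
        acc += c[1]; c[1] = (uint64_t)acc; acc >>= 64;
        acc += c[2]; c[2] = (uint64_t)acc; acc >>= 64;
        c[3] += (uint64_t)acc;
    }

    /* Variant 3: gcc/clang __builtin_add_overflow, which returns the carry directly. */
    static void add_builtin(uint64_t c[4], uint64_t a) {
        unsigned char carry;
        carry = __builtin_add_overflow(c[0], a, &c[0]);
        carry = __builtin_add_overflow(c[1], (uint64_t)carry, &c[1]);
        carry = __builtin_add_overflow(c[2], (uint64_t)carry, &c[2]);
        (void)__builtin_add_overflow(c[3], (uint64_t)carry, &c[3]);
    }

Per sipa's comments above, none of these reliably compiles down to the bare add/adc/adc/adc sequence.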
07:28 < elichai2> apparently in rust it works: https://godbolt.org/z/rqdfb6r4G
07:29 < elichai2> using arrays: https://godbolt.org/z/d8nj4qboe
07:29 < elichai2> but I can't get it to work in C
07:40 < elichai2> so using builtins it does look possible: https://godbolt.org/z/q5n5rr1bo
07:42 < sipa> interesting
07:42 < elichai2> gcc seems to produce much worse code
07:43 < elichai2> in clang I can do this even without intrinsics: https://godbolt.org/z/qbMehEasz
07:45 < sipa> surprising
08:03 -!- jesseposner [~jesse@2601:647:0:89:dc4d:9de6:d972:6985] has quit [Ping timeout: 252 seconds]
14:46 -!- belcher_ is now known as belcher
16:03 -!- meshcollider [meshcollid@meshcollider.jujube.ircnow.org] has quit [Remote host closed the connection]
16:25 -!- meshcollider [meshcollid@jujube.ircnow.org] has joined #secp256k1
17:14 -!- belcher_ [~belcher@user/belcher] has joined #secp256k1
17:15 -!- ariard__ is now known as ariard
17:17 -!- belcher [~belcher@user/belcher] has quit [Ping timeout: 272 seconds]
23:41 -!- belcher_ is now known as belcher
--- Log closed Mon Aug 02 00:00:26 2021
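The godbolt snippets elichai2 links are not reproduced in this log. One plausible shape for the builtin-based version is the x86-64 _addcarry_u64 intrinsic from <immintrin.h>, which clang (and, per the discussion above, less reliably gcc) can lower to an add/adc chain. This sketch is a guess for illustration, not the code behind the links:

    #include <stdint.h>
    #include <immintrin.h>  /* _addcarry_u64; x86-64 gcc/clang */

    /* Illustrative: add a 64-bit value a into a 256-bit number held as four
     * 64-bit limbs, letting the intrinsic propagate the carry flag. */
    static void add_64_to_256(uint64_t c[4], uint64_t a) {
        unsigned long long t;
        unsigned char k;
        k = _addcarry_u64(0, c[0], a, &t); c[0] = t;
        k = _addcarry_u64(k, c[1], 0, &t); c[1] = t;
        k = _addcarry_u64(k, c[2], 0, &t); c[2] = t;
        (void)_addcarry_u64(k, c[3], 0, &t); c[3] = t;
    }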