--- Log opened Sun Aug 01 00:00:25 2021
04:15 < elichai2> What's the upside of #967? Is it faster?
06:32 < sipa> elichai2: hopefully
06:35 < elichai2> I'm curious about the reasons that led you to test that (using the full 64 bits is somewhat the "natural" choice, but I remember people saying that you should use 52-bit elements because of performance)
06:37 < sipa> elichai2: the reason to test it is that kaushik, who is working on writing optimized assembly, suggested it
06:41 < sipa> the tradeoff is that with a 5-limb representation (5x52) multiplications are more expensive (there are 25 cross products of limbs to compute), but everything else is cheaper, because e.g. additions don't need a carry operation
06:42 < sipa> with a 4-limb representation, multiplication is cheaper (only 16 cross products), but you need to perform carries and mod reductions all the time
06:43 < sipa> this 5x64 representation is something in between: it does need carries for every operation, but no reduction
06:44 < sipa> and then before multiplication, we reduce both arguments from 5 to 4 limbs, and use the 16-product multiplication algorithm
06:45 < sipa> the downside of this approach is that it strongly relies on having fast addition-carry chains
06:45 < sipa> which you can't portably write in C (at least not with current compilers)
06:46 < sipa> so switching from this 5x52 to 5x64 representation without asm optimization generally makes things slower
06:46 < sipa> but with sufficient optimization it does appear to be actually faster
07:00 < elichai2> That's interesting, thanks!
07:01 < elichai2> I thought these days llvm/gcc do produce adc correctly, but I guess I hoped for too much haha
07:03 < sipa> it does, for things that correspond to C code
07:04 < sipa> but if you need hacks in C to express simple operations in a complicated way, the compiler can't see that it actually could use a simple instruction
07:05 < sipa> say you have four 64-bit integers (c0,c1,c2,c3) that represent a 256-bit number, and you want to add the value of a (64 bits) to it
07:05 < sipa> on the cpu that's just {add c0,a; adc c1,$0; adc c2,$0; adc c3,$0}
07:07 < sipa> but in C you need something like: c0 += a; carry = (c0 < a); c1 += carry; carry = (c1 < carry); c2 += carry; carry = (c2 < carry); c3 += carry; or you can do something like uint128 acc = c0; acc += a; c0 = acc; acc >>= 64; acc += c1; c1 = acc; acc >>= 64; acc += c2; c2 = acc; acc >>= 64; c3 += acc;
07:10 < sipa> but both of these compile to way more instructions than the simple 4 that are necessary
07:11 < sipa> gcc even has a __builtin_add_overflow, which does an in-place update while returning the carry - but that still produces far more complicated code than needed
07:11 < sipa> if you find a magic formula that compiles to just add,adc,adc,adc, let me know!
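A minimal, compilable C sketch of the three formulations sipa describes above (comparison-based carries, a 128-bit accumulator, and __builtin_add_overflow); the function names and the 4x64 limb array are invented for illustration and are not code from the library:

    /* Add a 64-bit value a into a 256-bit number held as four 64-bit limbs c[0..3]. */
    #include <stdint.h>

    /* Variant 1: comparison-based carry chain. */
    static void add_cmp(uint64_t c[4], uint64_t a) {
        c[0] += a;
        uint64_t carry = (c[0] < a);
        c[1] += carry; carry = (c[1] < carry);
        c[2] += carry; carry = (c[2] < carry);
        c[3] += carry;
    }

    /* Variant 2: 128-bit accumulator (gcc/clang extension), as in sipa's uint128 example. */
    static void add_u128(uint64_t c[4], uint64_t a) {
        unsigned __int128 acc = (unsigned __int128)c[0] + a;
        c[0] = (uint64_t)acc; acc >>= 64;
        acc += c[1]; c[1] = (uint64_t)acc; acc >>= 64;
        acc += c[2]; c[2] = (uint64_t)acc; acc >>= 64;
        c[3] += (uint64_t)acc;
    }

    /* Variant 3: gcc/clang __builtin_add_overflow, which returns the carry directly. */
    static void add_builtin(uint64_t c[4], uint64_t a) {
        unsigned char carry;
        carry = __builtin_add_overflow(c[0], a, &c[0]);
        carry = __builtin_add_overflow(c[1], (uint64_t)carry, &c[1]);
        carry = __builtin_add_overflow(c[2], (uint64_t)carry, &c[2]);
        (void)__builtin_add_overflow(c[3], (uint64_t)carry, &c[3]);
    }

Per sipa's comments above, none of these reliably compiles down to the bare add/adc/adc/adc sequence.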
07:28 < elichai2> apparently in rust it works: https://godbolt.org/z/rqdfb6r4G
07:29 < elichai2> using arrays: https://godbolt.org/z/d8nj4qboe
07:29 < elichai2> but I can't get it to work in C
07:40 < elichai2> so using builtins it does look possible: https://godbolt.org/z/q5n5rr1bo
07:42 < sipa> interesting
07:42 < elichai2> gcc seems to produce much worse code
07:43 < elichai2> in clang I can do this even without intrinsics: https://godbolt.org/z/qbMehEasz
07:45 < sipa> surprising
08:03 -!- jesseposner [~jesse@2601:647:0:89:dc4d:9de6:d972:6985] has quit [Ping timeout: 252 seconds]
14:46 -!- belcher_ is now known as belcher
16:03 -!- meshcollider [meshcollid@meshcollider.jujube.ircnow.org] has quit [Remote host closed the connection]
16:25 -!- meshcollider [meshcollid@jujube.ircnow.org] has joined #secp256k1
17:14 -!- belcher_ [~belcher@user/belcher] has joined #secp256k1
17:15 -!- ariard__ is now known as ariard
17:17 -!- belcher [~belcher@user/belcher] has quit [Ping timeout: 272 seconds]
23:41 -!- belcher_ is now known as belcher
--- Log closed Mon Aug 02 00:00:26 2021
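The godbolt snippets elichai2 links are not reproduced in this log. One plausible shape for the builtin-based version is the x86-64 _addcarry_u64 intrinsic from <immintrin.h>, which clang (and, per the discussion above, less reliably gcc) can lower to an add/adc chain. This sketch is a guess for illustration, not the code behind the links:

    #include <stdint.h>
    #include <immintrin.h>  /* _addcarry_u64; x86-64 gcc/clang */

    /* Illustrative: add a 64-bit value a into a 256-bit number held as four
     * 64-bit limbs, letting the intrinsic propagate the carry flag. */
    static void add_64_to_256(uint64_t c[4], uint64_t a) {
        unsigned long long t;
        unsigned char k;
        k = _addcarry_u64(0, c[0], a, &t); c[0] = t;
        k = _addcarry_u64(k, c[1], 0, &t); c[1] = t;
        k = _addcarry_u64(k, c[2], 0, &t); c[2] = t;
        (void)_addcarry_u64(k, c[3], 0, &t); c[3] = t;
    }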