--- Log opened Tue Nov 12 00:00:05 2019
01:04 -!- jonatack [~jon@2a01:e35:8aba:8220:6627:dad:d967:649d] has quit [Ping timeout: 245 seconds]
01:38 -!- afk11 [~afk11@gateway/tor-sasl/afk11] has quit [Remote host closed the connection]
01:38 -!- arubi [~ese168@gateway/tor-sasl/ese168] has quit [Write error: Broken pipe]
01:38 -!- afk11 [~afk11@gateway/tor-sasl/afk11] has joined #secp256k1
01:39 -!- arubi [~ese168@gateway/tor-sasl/ese168] has joined #secp256k1
01:45 -!- jonatack [~jon@134.19.179.203] has joined #secp256k1
05:42 < elichai2> Did anyone look at the btcd implementation of secp256k1? (safety, speed etc)
05:58 < gmaxwell> elichai2: a long time ago, it's pretty slow and not even trying to resist timing sidechannels, IIRC.
05:59 < gmaxwell> I recall it has a number of really weird optimizations, like extra functions to optimize cases that would only occur 1 in 2^256 times.
05:59 < gmaxwell> maybe whoever was writing it was getting paid by the line of code.
06:01 < elichai2> So any go project is better off writing bindings?
06:02 < gmaxwell> There already are bindings.
06:05 < gmaxwell> anyways, when evaluating these things I usually go look and see if there was a reasonable attempt to eliminate privkey sidechannels in keygen/signing; and I look at the tests to see if there is any chance they could have caught a bug in the field arithmetic (like a missing carry) -- the latter because there have been a number of crappy ECC libraries for bitcoin users that would actually generate
06:05 < gmaxwell> the wrong pubkey for your private key, which is a pretty devastating flaw.
06:16 < elichai2> Yeah, I'll try to review it. The first thing though is to make sure it's even verifying that the pubkeys are on the curve :)
06:17 < elichai2> Thanks for the advice
06:18 < gmaxwell> hm, https://github.com/btcsuite/btcd/blob/master/btcec/btcec.go#L268 purports to have a 7mul,4sqr gej_add_ge. We tried the same one (I think) but needed an extra normalize which made it a performance loss. Their field is also 10x26... so either they're faulty or perhaps it would have actually worked for us.
06:19 < gmaxwell> (seems like in general they don't realize that normalizing is slow, they do it a lot for no obvious reason, and don't account for it in their numbers)
06:26 -!- jonatack [~jon@134.19.179.203] has quit [Quit: jonatack]
06:26 -!- jonatack [~jon@134.19.179.203] has joined #secp256k1
06:26 < elichai2> I still don't quite get normalization and using limbs bigger than the field. And there's sadly no good resource for bignum field arithmetic
06:27 < elichai2> I should probably ask roasbeef why he doesn't replace the secp with bindings
06:27 < gmaxwell> elichai2: imagine you make a bignum the naive way.
06:28 < gmaxwell> so you've got like 8 32-bit words for your 256-bit value, right?
06:28 < gmaxwell> and you want to add two of those together.
06:29 < gmaxwell> Every time you add you need to go propagate all the carries between all the words, ... then if the amount overflows, handle the modulus, and propagate whatever that leaves over.
06:29 < gmaxwell> That carry propagation is slow because you end up having to branch on single bits between the words (or some constant-time equivalent).
06:30 < gmaxwell> So all the overcomplete representation does is give you extra room in each of the digits so you can defer and batch performing the carries.
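To make the lazy-carry idea above concrete, here is a minimal sketch of a 10x26-style field element in C. The names and the simplified reduction are illustrative only, not libsecp256k1's actual code: addition and small-constant multiplication skip carry propagation entirely, and a later sweep pays the carry cost once for a whole batch of operations.

```c
#include <stdint.h>

typedef struct {
    uint32_t n[10];              /* value = sum of n[i] * 2^(26*i) */
} fe_t;

/* Carry-free addition: just add limb-wise, letting limbs grow past 26 bits
 * into the 6 bits of headroom each 32-bit word provides. */
static void fe_add(fe_t *r, const fe_t *a) {
    for (int i = 0; i < 10; i++) r->n[i] += a->n[i];
}

/* Carry-free multiply by a small constant (valid while limbs still fit in
 * 32 bits afterwards). */
static void fe_mul_int(fe_t *r, uint32_t k) {
    for (int i = 0; i < 10; i++) r->n[i] *= k;
}

/* Sweep the deferred carries so every limb is back below 2^26.  A real
 * implementation also folds the excess above 2^256 back into the low limbs
 * (2^256 = 2^32 + 977 mod p for secp256k1) and does a final "still >= p?"
 * check; both are omitted here to keep the sketch short. */
static void fe_carry_sweep(fe_t *r) {
    uint64_t carry = 0;
    for (int i = 0; i < 10; i++) {
        uint64_t t = (uint64_t)r->n[i] + carry;
        r->n[i] = (uint32_t)(t & 0x3FFFFFF);  /* keep the low 26 bits */
        carry = t >> 26;                      /* defer the rest upward */
    }
    (void)carry;  /* multiple of 2^260: would be reduced mod p here */
}
```

This is why repeated additions are "almost free" in this representation: fe_add above is ten independent word adds with no serialized carry chain, and the normalization cost is paid once rather than per operation.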
06:30 < gmaxwell> this makes operations like repeated additions or multiplications with small constants almost free compared to the cost of doing it with a fully packed number.
06:31 < gmaxwell> for general multiplication and squaring the overcomplete representation isn't a help... but any time there are multiple additions or multiplies with small integers it's a big win.
06:33 < gmaxwell> (there are other number representations that are more efficient for lots of multiplies in a row without additions, but basically the only places we do that are the inv and sqrt ladders)
06:36 < gmaxwell> It's more obvious when you think about what addition and multiplication do to the bit-lengths of numbers. Adding two N-bit numbers gives you at most an N+1 bit number. Multiplying an N and an M bit number gives you an N+M bit number... etc.
06:37 < elichai2> hmm, makes sense. although that requires having a mental model of what the real number actually is (because if you just have e.g. 2 128-bit integers, then it's F[0] + F[1]*2^128, but when it's overcomplete it's hard to calculate that in your head)
06:38 < gmaxwell> well it's not that different--- same kinds of exponents, but they're closer than the size of F implies. like for 10x26 it's like you have digits with 26 bits...
06:39 < gmaxwell> It's just that they're allowed to overflow. :P so there are multiple ways to represent most numbers.
06:40 < gmaxwell> in secp256k1 there are several kinds of normalization... obviously full normalization puts it back to 52- or 26-bit digits, and makes it under the modulus...
06:40 < gmaxwell> but there are other, weaker kinds of normalization used where full normalization isn't necessary.
06:40 < elichai2> so the first digit can be half of it X*1 and the other half X*2^26? lol
06:41 < elichai2> gmaxwell: if I'm honest most of the time I normalize if the `VERIFY` complains hehe
06:42 < gmaxwell> like in base ten I would write the number 12 as 1 2 .... but overcomplete I could instead write 0 12, and after normalizing it would become 1 2.
06:43 < gmaxwell> elichai2: in VERIFY builds we add a magnitude field to every field element that tracks how much bit-inflation has possibly happened, given the operations.. and there are checks that yell.
06:45 < gmaxwell> but it's not hard to anticipate the result: addition adds one unit of magnitude, a mul_int adds as many as the number you're multiplying by has bits...
06:46 < gmaxwell> also take care, 5x52 has more free space than 10x26... so if you're not careful you can write something that 'works' but then doesn't work on 32-bit.
06:46 < elichai2> yeah the multiplication is a bit harder, because you need to know the size of the numbers
06:47 < gmaxwell> well you know their maximum.
06:47 < gmaxwell> like if you don't know the maximum you can't use mul_int regardless, you have to use a full field multiply.
06:48 < elichai2> wait, I thought both 5x52 and 10x26 can go to 320 bits
06:48 < elichai2> (because it's 5x64 and 10x32)
06:48 < gmaxwell> the maximum magnitude would be 64-52 and 32-26 respectively -- how much extra headroom is in each digit.
06:50 < elichai2> so why 5x52 instead of 3x86? (i.e. 3 __int128)
06:50 < gmaxwell> if you want your head to really hurt, there is some ed25519 stuff out there where they use mixed radix... where the bases of different digits are different.
06:51 < gmaxwell> elichai2: there isn't a real 128-bit register on the computers we use. __int128 is a bit of compiler-provided bignum. (we use it in the codebase to get access to the 64x64->128 multiply that 64-bit CPUs have).
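A one-liner illustrating the __int128 point above; the helper name is made up for this sketch, but the widening operation it wraps is exactly what 5x52 multiplication is built from.

```c
#include <stdint.h>

/* unsigned __int128 is not a CPU register; it just lets the compiler emit
 * the native 64x64->128 multiply (MUL/MULX on x86-64).  There is no
 * corresponding 128x128->256 instruction, which is why a 3x86 layout would
 * have to synthesize its multiplies out of smaller pieces and end up slower. */
static inline void mul64_wide(uint64_t a, uint64_t b, uint64_t *hi, uint64_t *lo)
{
    unsigned __int128 t = (unsigned __int128)a * b;  /* one widening multiply */
    *lo = (uint64_t)t;
    *hi = (uint64_t)(t >> 64);
}
```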
06:51 -!- jonatack [~jon@134.19.179.203] has quit [Ping timeout: 276 seconds]
06:51 < gmaxwell> you could write an implementation that used 3x__int128, but it would be slower.
06:52 < elichai2> well yeah as you said x86_64 can compile some of these down to bigger instructions, but I would think the compiler can do a good job at it
06:52 < gmaxwell> also it would be a pita to write, because somewhere inside it, it would need a 128x128->256 multiply, and there isn't one of those.
06:52 < elichai2> I really want to benchmark llvm's arbitrary precision numbers
06:53 < gmaxwell> somewhere between extremely slow and extraordinarily slow, by our standards.
06:53 < gmaxwell> I'm sure.
06:53 < elichai2> really?
06:53 < elichai2> the compiler guys are really that bad at bignum?
06:53 < gmaxwell> general bignum code is _slow_ because carries suck. It's not that compilers are not good at it.
06:54 < elichai2> well gmp is pretty damn fast
06:54 < gmaxwell> if you go write bignum code, you'll see.. to handle carries you're shifting and bitmasking and doing extra adds.. and the operations are all sequential.
06:55 < elichai2> yeah I started working on one: https://github.com/elichai/ecc-secp256k1/blob/master/src/u256.rs
06:55 < elichai2> but maybe I should give 5x52 a try and implement one myself
06:55 < gmaxwell> elichai2: GMP has a lot of extraordinary algorithms in it. But for basic add/mul_int sort of stuff it's pretty slow. plus function call overheads.
06:56 < elichai2> (this code works but screws up carries sometimes lol)
06:56 < gmaxwell> lol, see also my earlier comment about reviewing implementations for testing sufficient to find carry bugs.
06:56 < elichai2> yep
06:57 < elichai2> currently my toy secp uses gmp in the background for everything so I wanted to remove it (potentially to add it to the libsecp test framework), but my current impl has a bunch of bugs and I just left it
06:57 < gmaxwell> in any case, general arbitrary precision int stuff isn't in a position to use tricks like overcomplete numbers... so they're stuck with the normal reasons that bignums are slow.
06:58 < elichai2> makes sense
07:00 < gmaxwell> if it interests you, intel came out with some additional operations that speed up bignum math. the field arithmetic in secp256k1 that has asm could probably be made a bit faster using them -- intel ADX.
07:01 < gmaxwell> basically it just gives you two carry flags for the adders, so you interleave using one or the other so the cpu can do more of the additions in parallel when otherwise it couldn't due to contention for access to the carry flag.
07:03 < gmaxwell> regarding using other code as a comparison point, it's somewhat better if it works differently--- reduced chances of it making the same mistake.
07:03 < elichai2> intel ADX? i'll look at it. I tried to think before if we could somehow utilize SSE/AVX in field arithmetic but it requires some parallelization
07:03 < elichai2> "it's somewhat better if it works differently" what do you mean?
07:04 < elichai2> you mean implement without looking at the current implementation in place?
07:04 < gmaxwell> I mean if you make your code 5x52 there is an increased risk that libsecp256k1 and your code both make the same error and would get consistently wrong results. So for testing purposes it's a little better if you do ... basically anything else. :P
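A small illustration of the "carries are sequential" point discussed above: a plain 256-bit add over 4x64 limbs (a hypothetical helper, not taken from any of the codebases mentioned). Each limb's add needs the carry out of the previous one, which is the serialized dependency chain that ADX's second carry flag (ADCX/ADOX) lets hand-written asm split into two interleaved chains.

```c
#include <stdint.h>

/* 256-bit addition with explicit carry handling.  Every iteration consumes
 * the carry produced by the previous one, so the adds cannot overlap. */
static uint64_t add256(uint64_t r[4], const uint64_t a[4], const uint64_t b[4])
{
    uint64_t carry = 0;
    for (int i = 0; i < 4; i++) {
        uint64_t t = a[i] + carry;
        uint64_t c1 = (t < carry);       /* carry out of the first add  */
        r[i] = t + b[i];
        uint64_t c2 = (r[i] < t);        /* carry out of the second add */
        carry = c1 + c2;                 /* at most 1 in total          */
    }
    return carry;                        /* carry out of bit 256        */
}
```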
07:05 < gmaxwell> even if you don't look, if you implement the same algorithm you're likely to make the same mistakes... various studies of software development show very high levels of correlation between bugs in 'independent' implementations.
07:06 < elichai2> oh hehe yeah :) although I already found bugs in RFCs due to me implementing from the RFC and not the code (even though I then found I'm not the first to notice lol)
07:06 < gmaxwell> (which is why we can't write ultra robust software by hiring eleven freelancers to write independent versions then take the majority :P )
07:06 < elichai2> anyhow, are you talking about adding it as a test or adding it in asm/C to make things faster?
07:06 < elichai2> gmaxwell: lol. I wish that would solve everything
07:07 < gmaxwell> sorry, I switched topics back on you. For the subject of adding your code as a test it's better if your code uses a different algo. I mentioned that because you mentioned maybe trying 5x52.
07:08 < elichai2> right ok yeah I guess you're right. I would continue working on the u256 implementation, but it seems the PR lost traction (probably because it's rust)
07:09 < elichai2> i'll go read on intel adx :)
07:09 < gmaxwell> as far as AVX2 goes, it would be a relatively straightforward (lol) task to port the 10x26 field arithmetic to run 8-way parallel on AVX2. From that you could then make an 8-way parallel version of gej_add_ge.... this would likely be a big speedup for batch verify... and useless elsewhere.
07:09 < gmaxwell> no one has done it because it's just a lot of not terribly interesting work.
07:09 < elichai2> (although after I went to implement siphash with SSE4.2 just to see it's slower i'm pretty wary of manual instructions lol)
07:10 < gmaxwell> the 10x26 field code is a LOT slower, but running a bunch of it in parallel can make up for a lot.
07:10 < gmaxwell> but the only place where we can really make that much use of parallelism is in the pippenger code (big batches).
07:21 < gmaxwell> hm. actually. A smaller but potentially really useful AVX2 project would be to make a many-way parallel sqrt. It requires fewer operations than the group law (just mul and sqr)... and for big batch schnorr, sqrt() time is pretty considerable.
07:22 < gmaxwell> (because it's needed to decompress pubkeys and R values)
07:23 < gmaxwell> also because the sqrt is all muls and squares, it's already a candidate for using montgomery multiplication.
07:24 < elichai2> aren't we using sqrt in schnorr just for quad residue?
07:24 < elichai2> (for which a fast jacobi impl would probably be better than parallelizing it)
07:24 < gmaxwell> elichai2: no, (also sqrt for the jacobi symbol is a super inefficient way of doing that...)
07:24 < gmaxwell> sqrt is needed to get the y values of the pubkey and R.
07:25 < elichai2> oh gej->ge right
07:25 < gmaxwell> no, x only to ge.
07:26 < gmaxwell> gej->ge uses an inversion, but we don't ever actually need to go to GE for schnorr validation, we can do the final test in projective space. Also inversions are batchable so even if we had any, they wouldn't be a problem. :)
07:27 < elichai2> yeah in gej the x is scaled by z^2 and then you divide. finding Y for X requires solving `y^2 = x^3 - C` so you need sqrt
07:27 < gmaxwell> elichai2: when we get a pubkey or R we only get the x value of the point. We can't add points with only their X values, the group law functions take x,y (or x,y,z)
07:27 < elichai2> * + 7 :)
07:27 < elichai2> so `y = sqrt(x^3+7)`
07:28 < gmaxwell> exactly. sqrt(x^3+7) .. the sqrt has two answers but by spec, the one used on every R and P in bip-schnorr is the default one. :P
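For the x-only decompression being discussed, here is a sketch of the square root as a single exponentiation, written with GMP (which the log already mentions as elichai2's fallback) rather than libsecp256k1's field code; the function name is made up. Because secp256k1's p is congruent to 3 mod 4, sqrt(a) = a^((p+1)/4) mod p whenever a is a square, which is why the whole operation is nothing but muls and squares.

```c
#include <gmp.h>

/* Recover a y with y^2 = x^3 + 7 (mod p); returns 0 if no such y exists
 * (i.e. x is not the x-coordinate of a curve point).  The caller supplies
 * p = 2^256 - 2^32 - 977. */
static int lift_x(mpz_t y, const mpz_t x, const mpz_t p)
{
    mpz_t rhs, e, chk;
    mpz_inits(rhs, e, chk, NULL);

    mpz_powm_ui(rhs, x, 3, p);           /* x^3 mod p                  */
    mpz_add_ui(rhs, rhs, 7);             /* x^3 + 7                    */
    mpz_mod(rhs, rhs, p);

    mpz_add_ui(e, p, 1);                 /* e = (p + 1) / 4            */
    mpz_fdiv_q_ui(e, e, 4);
    mpz_powm(y, rhs, e, p);              /* candidate square root      */

    mpz_powm_ui(chk, y, 2, p);           /* verify y^2 == x^3 + 7      */
    int ok = (mpz_cmp(chk, rhs) == 0);

    mpz_clears(rhs, e, chk, NULL);
    return ok;
}
```

The exponentiation returns one of the two roots ±y; bip-schnorr's rule for which root is "the default one" would be applied on top of this.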
07:29 < elichai2> oh I haven't even thought of that optimization heh, I only thought about not needing to do gej->ge, not that there's no conditional negation on the y (even though it's negligible lol)
07:29 < gmaxwell> in any case, for non-batch verification there is only one sqrt (for the P). Also the ecmult is slower per signature for non-batch... so sqrt time is less of a concern there.
07:31 < gmaxwell> but for big batches the ecmult per input becomes pretty small, and the sqrt starts mattering more.
07:32 < gmaxwell> so that might be a fun project, since it's pretty self contained, could result in a non-trivial speedup, and it's relatively easy to review.
08:36 -!- jonatack [~jon@2a01:e35:8aba:8220:6627:dad:d967:649d] has joined #secp256k1
09:36 < nickler> elichai2: an intro to bignums https://cryptojedi.org/peter/data/pairing-20131122.pdf
09:58 < elichai2> nickler: Thanks :)
12:42 -!- reallll [~belcher@unaffiliated/belcher] has joined #secp256k1
12:43 -!- belcher [~belcher@unaffiliated/belcher] has quit [Ping timeout: 240 seconds]
13:54 -!- reallll is now known as belcher
--- Log closed Wed Nov 13 00:00:06 2019