--- Log opened Tue Feb 08 00:00:53 2022
05:19 < real_or_random> I'd be happy to hear thoughts about https://github.com/bitcoin-core/secp256k1/issues/1065#issuecomment-1027878787
06:17 -!- elsirion [~quassel@gateway/tor-sasl/elsirion] has quit [Quit: No Ping reply in 180 seconds.]
06:19 -!- elsirion [~quassel@gateway/tor-sasl/elsirion] has joined #secp256k1
06:57 -!- hg [~halosghos@user/halosghost] has joined #secp256k1
07:02 -!- elsirion [~quassel@gateway/tor-sasl/elsirion] has quit [Remote host closed the connection]
07:02 -!- elsirion [~quassel@gateway/tor-sasl/elsirion] has joined #secp256k1
09:57 < sipa> @kaushik You had some question about secp256k1_gej_add_ge?
09:59 < kaushik> In the function secp256k1_gej_add_ge within the file group_impl.h there is a sequence of the following functions.
09:59 < kaushik> secp256k1_fe_sqr(&rr, &t);
09:59 < kaushik> secp256k1_fe_negate(&m_alt, &u2, 1);
09:59 < kaushik> secp256k1_fe_mul(&tt, &u1, &m_alt);
09:59 < kaushik> secp256k1_fe_add(&rr, &tt);
10:00 < kaushik> If I avoid the negation and do a subtraction instead of an addition at the end, should I get the same result?
10:02 < sipa> Sounds plausible (if the m_alt value isn't used any more later), though currently the field interface doesn't offer any subtraction function.
10:02 < sipa> Do you think a field subtraction function would be worth having?
10:03 < kaushik> I would like to apply it in my 4-way vectorized implementation.
10:04 < sipa> Note that there are a few open PRs which simplify the gej_add_ge function already.
10:04 < sipa> And there was a comment recently about simplifying it further.
10:05 < sipa> See https://github.com/bitcoin-core/secp256k1/pull/1033
10:07 < hg> y
10:07 < hg> s/y//
10:08 < sipa> @kaushik You may want to check whether the changes there aren't also beneficial for your vectorized code.
10:15 < kaushik> I think I will first develop the vectorized code using the existing implementation of the function secp256k1_gej_add_ge. Once I have a working module, it can be altered further without much problem.
10:15 < sipa> Okay.
10:15 < sipa> In any case, if it's just for the vectorized code that the negation is useful, go for it; it won't impact much other code.
10:16 < sipa> But perhaps it's also worth investigating whether it can be applied beneficially more generally.
10:17 < kaushik> It seems to me that avoiding the negation, and instead doing a subtraction after the multiplication, would be better in the vectorized setup.
10:19 < kaushik> Later on there is a function call secp256k1_fe_add(&m_alt, &u1) which modifies the value of m_alt. I think it can be done by simply subtracting u2 from u1, right?
10:19 < sipa> But subtraction is more expensive than addition, I would think.
10:20 < sipa> No?
10:20 < sipa> It may not be the case in a 4x64 or 5x64 field representation.
10:20 < sipa> In 5x52 it is.
10:23 < kaushik> I'm doing a 10x26 implementation using 4-way AVX2 instructions.
10:23 < kaushik> https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#text=vpaddq&ig_expand=7401,5235,6374,488,6979,6979,115,6979,115&techs=AVX2
10:23 < kaushik> https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#text=vpsubq&ig_expand=7401,5235,6374,488,6979,6979,115,6979&techs=AVX2
10:24 < kaushik> Check these two for the latencies of vpaddq and vpsubq. They are the same.
10:26 < sipa> Yes, sure, but you need an overflow/borrow mod p. I would imagine that subtraction is a bit more complicated.
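
For reference, a minimal sketch of the rewrite being discussed, assuming a hypothetical secp256k1_fe_sub(r, a, b, m) that computes r = a - b when b has magnitude at most m (no such function exists in the library today; the name and signature are illustrative only, mirroring the magnitude argument of secp256k1_fe_negate):

    /* Current sequence in secp256k1_gej_add_ge (group_impl.h): rr = t^2 + u1*(-u2). */
    secp256k1_fe_sqr(&rr, &t);
    secp256k1_fe_negate(&m_alt, &u2, 1);
    secp256k1_fe_mul(&tt, &u1, &m_alt);
    secp256k1_fe_add(&rr, &tt);

    /* Equivalent sequence with the hypothetical subtraction: rr = t^2 - u1*u2.
     * This only helps if m_alt (= -u2) is not needed afterwards, which is the
     * caveat raised above (a later secp256k1_fe_add(&m_alt, &u1) reuses it). */
    secp256k1_fe_sqr(&rr, &t);
    secp256k1_fe_mul(&tt, &u1, &u2);
    secp256k1_fe_sub(&rr, &rr, &tt, 1);
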
10:27 < sipa> For the existing 5x52 field it is, because addition is just limb-wise addition without carry, but subtraction also involves adding a multiple of p.
10:27 < kaushik> To perform a-b, we can do it as a+2p-b.
10:27 < sipa> I guess the difference is much smaller for 4x64. I'm not sure about 5x64.
10:28 < sipa> Right, but isn't a+2p-b more instructions than a+b?
10:30 < kaushik> It is a-b.
10:30 < sipa> I'm not sure what you're trying to say.
10:32 < sipa> The 10x26 field implementation right now uses ((n+1)*p - a) to compute -a, where n is the magnitude of a (= how much the field values can exceed 2^26).
10:35 < kaushik> So in the negation we have 10 limb-muls and 10 limb-subs. After that, while doing the addition, we have 10 more limb-adds, which makes a total of 30 linear operations to perform a-b, right?
10:37 < kaushik> Sorry, mul is not linear. But we still have 30 limb operations involved.
10:37 < sipa> Most of the limb-muls are shared (because it's just (n+1)*(2^26-1)).
10:38 < kaushik> Agreed.
10:38 < kaushik> But we still have 20+ limb operations.
10:38 < sipa> Assuming those constants are relatively cheap, there are only 10 ops for computing a negation.
10:39 < sipa> for i in 0..9:
10:39 < sipa> out[i] = (precomputed value) - in[i]
10:40 < kaushik> If we compute it as a+2p-b we simply have 10 limb additions and 10 limb subtractions. In the vectorized setup, since the latencies of vpaddq and vpsubq are the same, it seems to be better.
10:40 < sipa> You can't use 2p without knowing the magnitude of b.
10:40 < sipa> b may be larger than p
10:40 < sipa> unless you normalize it first
10:41 < kaushik> We can't do it in fewer than 20 operations, I guess.
10:41 < sipa> Right.
10:41 < kaushik> I think using at most 4p and 8p would suffice for the purpose.
10:42 < sipa> Look at the existing code!
10:42 < sipa> It already has these magnitudes.
10:42 < sipa> Every call to fe_negate takes as input the maximum magnitude of the argument.
10:43 < sipa> My point is this: I can believe that a field subtraction can be made more efficient than a field negation + field addition.
10:43 < sipa> But I would be surprised if 2 field subtractions with the same second argument are faster than a negation plus two additions.
10:50 < kaushik> I think the difference will actually arise from the use of precomputed values of n*p. Since the function for negation is a general high-level function, it incurs more cost.
10:51 < sipa> I'm almost certain it gets inlined, and the values are precomputed by the compiler.
10:52 < kaushik> Then I suppose the difference won't be much.
10:53 < sipa> In any case, no need to stick to the same generic function approach.
10:53 < sipa> I just mean: we already know exactly what multiples of p to add for every negation call site.
10:55 < kaushik> I'm not sticking to it. From what I have investigated so far while developing the vectorized assembly implementation of the function secp256k1_gej_double, it seems to me that using 2p, 4p and 8p for the subtractions would suffice. I believe things would be similar for the function secp256k1_gej_add_ge.
10:56 < sipa> Ok, it may be useful to do the same for the non-vector code.
10:57 < kaushik> Might be.
10:57 < kaushik> Please confirm a few more things for me.
10:58 < kaushik> The value of the variable "degenerate" is binary within the function secp256k1_gej_add_ge, right?
10:58 < sipa> I believe so.
11:00 < kaushik> So, in the 4-way implementation we will have 4 different values of this variable, right?
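
To make the limb-count comparison above concrete, here is a minimal sketch of the fused subtraction for the 10x26 representation. The function is hypothetical (not library API); the per-limb constants are the same ones secp256k1_fe_negate uses in field_10x26_impl.h, and m bounds the magnitude of b:

    /* Hypothetical fused subtraction (illustrative only): r = a - b, where b has
     * magnitude at most m.  Each limb adds the same multiple of p that
     * secp256k1_fe_negate adds in field_10x26_impl.h, so every result limb stays
     * non-negative under the same assumption; this is 10 adds + 10 subs, versus
     * 10 subs for the negation followed by 10 adds for the separate fe_add.
     * Assumes small magnitudes so the 32-bit limbs do not overflow. */
    static void secp256k1_fe_sub(secp256k1_fe *r, const secp256k1_fe *a,
                                 const secp256k1_fe *b, int m) {
        r->n[0] = a->n[0] + 0x3FFFC2FUL * 2 * (m + 1) - b->n[0];
        r->n[1] = a->n[1] + 0x3FFFFBFUL * 2 * (m + 1) - b->n[1];
        r->n[2] = a->n[2] + 0x3FFFFFFUL * 2 * (m + 1) - b->n[2];
        r->n[3] = a->n[3] + 0x3FFFFFFUL * 2 * (m + 1) - b->n[3];
        r->n[4] = a->n[4] + 0x3FFFFFFUL * 2 * (m + 1) - b->n[4];
        r->n[5] = a->n[5] + 0x3FFFFFFUL * 2 * (m + 1) - b->n[5];
        r->n[6] = a->n[6] + 0x3FFFFFFUL * 2 * (m + 1) - b->n[6];
        r->n[7] = a->n[7] + 0x3FFFFFFUL * 2 * (m + 1) - b->n[7];
        r->n[8] = a->n[8] + 0x3FFFFFFUL * 2 * (m + 1) - b->n[8];
        r->n[9] = a->n[9] + 0x03FFFFFUL * 2 * (m + 1) - b->n[9];
    }
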
11:00 < kaushik> secp256k1_fe_cmov(&rr_alt, &rr, !degenerate);
11:00 < kaushik> secp256k1_fe_cmov(&m_alt, &m, !degenerate);
11:00 < sipa> indeed
11:01 < kaushik> So, in these two functions, while implementing the 4-way conditional move, some moves should actually occur and some should not, right?
11:02 < sipa> indeed
11:05 < kaushik> In the function secp256k1_ecmult_const within https://github.com/bitcoin-core/secp256k1/blob/master/src/ecmult_const_impl.h we will have 4 different values of a and scalar. So we will have 4 different tables for pre_a, pre_a_lam, wnaf_1, wnaf_lam, right?
11:06 < kaushik> Also, we will have 4 different values of skew_1 and skew_lam, right?
11:08 < sipa> Right, every variable will exist 4 times, with independent values.
11:09 < kaushik> I am considering the value of the parameter "size" to be 256 for all 4 instances, okay?
11:10 < sipa> Is it ever called with a value other than 256 in the current codebase?
11:12 < kaushik> The argument is hardcoded to 256.
11:12 < kaushik> secp256k1_ecmult_const(&res, &pt, &s, 256);
11:12 < kaushik> In https://github.com/bitcoin-core/secp256k1/blob/master/src/modules/ecdh/main_impl.h
11:12 < sipa> Sure, then.
11:13 < kaushik> Okay.
11:16 < kaushik> There is a portion /* Correct for wNAF skew */ within the function secp256k1_ecmult_const. There is a group operation secp256k1_gej_add_ge(r, r, &correction) involved there.
11:16 < kaushik> The value of the ge correction should be the same for all four instances, right?
11:16 < sipa> No, it depends on the scalar.
11:17 < kaushik> Is it? Let me check.
11:17 < sipa> It can be equal to P or 2P, if I recall correctly.
11:18 < sipa> But some of that code changed recently too. I may be wrong.
11:19 < kaushik> It is not independent. But I think it depends on the point "a" rather than the scalar.
11:20 < sipa> Right, it certainly depends on the point.
11:22 < kaushik> Thanks Pieter. That's all for now, I suppose.
11:23 < sipa> Cool!
11:23 < kaushik> I will work on the assembly implementation of secp256k1_gej_add_ge and complete it. It is more challenging than the function secp256k1_gej_double.
15:51 -!- hg [~halosghos@user/halosghost] has quit [Quit: WeeChat 3.4]
--- Log closed Wed Feb 09 00:00:54 2022
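
On the 4-way conditional move question discussed at 11:00-11:02: a minimal sketch with AVX2, assuming a transposed layout in which limb i of all four instances occupies the four 64-bit lanes of one __m256i (the fe4 type, fe4_cmov name and flag-array interface are illustrative assumptions, not library code). Each lane independently keeps or takes its value, which is the "some moves occur and some do not" behaviour, and like secp256k1_fe_cmov it uses a mask-based select rather than a data-dependent branch.

    #include <immintrin.h>
    #include <stdint.h>

    /* Hypothetical 4-way field element: limb i of all 4 instances in one vector. */
    typedef struct { __m256i n[10]; } fe4;

    /* For each instance k: if flag[k] is nonzero, copy a's limbs into r;
     * otherwise leave r unchanged (flag[k] plays the role of !degenerate). */
    static void fe4_cmov(fe4 *r, const fe4 *a, const int flag[4]) {
        /* All-ones / all-zeros mask per 64-bit lane, built from the four flags. */
        const __m256i mask = _mm256_set_epi64x(-(int64_t)flag[3], -(int64_t)flag[2],
                                               -(int64_t)flag[1], -(int64_t)flag[0]);
        int i;
        for (i = 0; i < 10; i++) {
            /* Lane k takes a->n[i] where its mask bytes are set, else keeps r->n[i]. */
            r->n[i] = _mm256_blendv_epi8(r->n[i], a->n[i], mask);
        }
    }
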