--- Log opened Tue Feb 08 00:00:53 2022
05:19 < real_or_random> I'd be happy to hear thoughts about https://github.com/bitcoin-core/secp256k1/issues/1065#issuecomment-1027878787
06:17 -!- elsirion [~quassel@gateway/tor-sasl/elsirion] has quit [Quit: No Ping reply in 180 seconds.]
06:19 -!- elsirion [~quassel@gateway/tor-sasl/elsirion] has joined #secp256k1
06:57 -!- hg [~halosghos@user/halosghost] has joined #secp256k1
07:02 -!- elsirion [~quassel@gateway/tor-sasl/elsirion] has quit [Remote host closed the connection]
07:02 -!- elsirion [~quassel@gateway/tor-sasl/elsirion] has joined #secp256k1
09:57 < sipa> @kaushik You had some question about secp256k1_gej_add_ge?
09:59 < kaushik> In the function secp256k1_gej_add_ge within the file group_impl.h there is a sequence of the following functions.
09:59 < kaushik> secp256k1_fe_sqr(&rr, &t);
09:59 < kaushik> secp256k1_fe_negate(&m_alt, &u2, 1);
09:59 < kaushik> secp256k1_fe_mul(&tt, &u1, &m_alt);
09:59 < kaushik> secp256k1_fe_add(&rr, &tt);
10:00 < kaushik> If I avoid the negation and do a subtraction instead of an addition at the end, should I get the same result?
10:02 < sipa> Sounds plausible (if the m_alt value isn't used any more later), though currently the field interface doesn't offer any subtraction function.
10:02 < sipa> Do you think a field subtraction function would be worth having?
10:03 < kaushik> I would like to apply it in my 4-way vectorized implementation.
10:04 < sipa> Note that there are a few open PRs which simplify the gej_add_ge function already.
10:04 < sipa> And there was a comment recently about simplifying it further.
10:05 < sipa> See https://github.com/bitcoin-core/secp256k1/pull/1033
10:07 < hg> y
10:07 < hg> s/y//
10:08 < sipa> @kaushik You may want to check whether the changes there aren't also beneficial for your vectorized code.
10:15 < kaushik> I think I will first develop the vectorized code using the existing implementation of the function secp256k1_gej_add_ge. Once I have a working module, it can be altered further without much problem.
10:15 < sipa> Okay.
10:15 < sipa> In any case, if it's just for the vectorized code that the negation is useful, go for it; it won't impact much other code.
10:16 < sipa> But perhaps it's also worth investigating whether it can be applied beneficially more generally.
10:17 < kaushik> It seems to me that avoiding the negation, and instead doing a subtraction after the multiplication, would be better in the vectorized setup.
10:19 < kaushik> Later on there is a function call secp256k1_fe_add(&m_alt, &u1) which modifies the value of m_alt. I think it can be done by simply subtracting u2 from u1, right?
10:19 < sipa> But subtraction is more expensive than addition, I would think.
10:20 < sipa> No?
10:20 < sipa> It may not be the case in a 4x64 or 5x64 field representation.
10:20 < sipa> In 5x52 it is.
10:23 < kaushik> I'm doing a 10x26 implementation using 4-way AVX2 instructions.
10:23 < kaushik> https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#text=vpaddq&ig_expand=7401,5235,6374,488,6979,6979,115,6979,115&techs=AVX2
10:23 < kaushik> https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#text=vpsubq&ig_expand=7401,5235,6374,488,6979,6979,115,6979&techs=AVX2
10:24 < kaushik> Check these two for the latencies of vpaddq and vpsubq. They are the same.
10:26 < sipa> Yes, sure, but you need an overflow/borrow mod p. I would imagine that subtraction is a bit more complicated.
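
For reference, a minimal sketch of the rewrite being discussed, assuming a hypothetical secp256k1_fe_sub(r, a, b, m) that computes r = a - b when b has magnitude at most m (no such function exists in the library today; the name and signature are illustrative only, mirroring the magnitude argument of secp256k1_fe_negate):

    /* Current sequence in secp256k1_gej_add_ge (group_impl.h): rr = t^2 + u1*(-u2). */
    secp256k1_fe_sqr(&rr, &t);
    secp256k1_fe_negate(&m_alt, &u2, 1);
    secp256k1_fe_mul(&tt, &u1, &m_alt);
    secp256k1_fe_add(&rr, &tt);

    /* Equivalent sequence with the hypothetical subtraction: rr = t^2 - u1*u2.
     * This only helps if m_alt (= -u2) is not needed afterwards, which is the
     * caveat raised above (a later secp256k1_fe_add(&m_alt, &u1) reuses it). */
    secp256k1_fe_sqr(&rr, &t);
    secp256k1_fe_mul(&tt, &u1, &u2);
    secp256k1_fe_sub(&rr, &rr, &tt, 1);
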
10:27 < sipa> For the existing 5x52 field it is, because addition is just limb-wise addition without carry, but subtraction also involves adding a multiple of p.
10:27 < kaushik> To perform a-b, we can do it as a+2p-b.
10:27 < sipa> I guess the difference is much smaller for 4x64. I'm not sure about 5x64.
10:28 < sipa> Right, but isn't a+2p-b more instructions than a+b?
10:30 < kaushik> It is a-b.
10:30 < sipa> I'm not sure what you're trying to say.
10:32 < sipa> The 10x26 field implementation right now uses ((n+1)*p - a) to compute -a, where n is the magnitude of a (= how much the field values can exceed 2^26).
10:35 < kaushik> So in the negation we have 10 limb-muls and 10 limb-subs. After that, while doing the addition, we have 10 more limb-adds, which makes a total of 30 linear operations to perform a-b, right?
10:37 < kaushik> Sorry, mul is not linear. But we still have 30 limb operations involved.
10:37 < sipa> Most of the limb-muls are shared (because it's just (n+1)*(2^26-1)).
10:38 < kaushik> Agreed.
10:38 < kaushik> But we still have 20+ limb operations.
10:38 < sipa> Assuming those constants are relatively cheap, there are only 10 ops for computing a negation.
10:39 < sipa> for i in 0..9:
10:39 < sipa> out[i] = (precomputed value) - in[i]
10:40 < kaushik> If we compute it as a+2p-b we simply have 10 limb additions and 10 limb subtractions. In the vectorized setup, since the latencies of vpaddq and vpsubq are the same, it seems to be better.
10:40 < sipa> You can't use 2p without knowing the magnitude of b.
10:40 < sipa> b may be larger than p
10:40 < sipa> unless you normalize it first
10:41 < kaushik> We can't do it in fewer than 20 operations, I guess.
10:41 < sipa> Right.
10:41 < kaushik> I think using at most 4p and 8p would suffice for the purpose.
10:42 < sipa> Look at the existing code!
10:42 < sipa> It already has these magnitudes.
10:42 < sipa> Every call to fe_negate takes as input the maximum magnitude of the argument.
10:43 < sipa> My point is this: I can believe that a field subtraction can be made more efficient than a field negation + field addition.
10:43 < sipa> But I would be surprised if 2 field subtractions with the same second argument are faster than a negation plus two additions.
10:50 < kaushik> I think the difference will actually arise from the use of precomputed values of n*p. Since the function for negation is a general high-level function, it incurs more cost.
10:51 < sipa> I'm almost certain it gets inlined, and the values are precomputed by the compiler.
10:52 < kaushik> Then I suppose the difference won't be much.
10:53 < sipa> In any case, no need to stick to the same generic function approach.
10:53 < sipa> I just mean: we already know exactly what multiples of p to add for every negation call site.
10:55 < kaushik> I'm not sticking to it. From what I have investigated so far while developing the vectorized assembly implementation of the function secp256k1_gej_double, it seems to me that using 2p, 4p and 8p for the subtractions would suffice. I believe things would be similar for the function secp256k1_gej_add_ge.
10:56 < sipa> Ok, it may be useful to do the same for the non-vector code.
10:57 < kaushik> Might be.
10:57 < kaushik> Please confirm a few more things for me.
10:58 < kaushik> The value of the variable "degenerate" is binary within the function secp256k1_gej_add_ge, right?
10:58 < sipa> I believe so.
11:00 < kaushik> So, in the 4-way implementation we will have 4 different values of this variable, right?
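
To make the limb-count comparison above concrete, here is a minimal sketch of the fused subtraction for the 10x26 representation. The function is hypothetical (not library API); the per-limb constants are the same ones secp256k1_fe_negate uses in field_10x26_impl.h, and m bounds the magnitude of b:

    /* Hypothetical fused subtraction (illustrative only): r = a - b, where b has
     * magnitude at most m.  Each limb adds the same multiple of p that
     * secp256k1_fe_negate adds in field_10x26_impl.h, so every result limb stays
     * non-negative under the same assumption; this is 10 adds + 10 subs, versus
     * 10 subs for the negation followed by 10 adds for the separate fe_add.
     * Assumes small magnitudes so the 32-bit limbs do not overflow. */
    static void secp256k1_fe_sub(secp256k1_fe *r, const secp256k1_fe *a,
                                 const secp256k1_fe *b, int m) {
        r->n[0] = a->n[0] + 0x3FFFC2FUL * 2 * (m + 1) - b->n[0];
        r->n[1] = a->n[1] + 0x3FFFFBFUL * 2 * (m + 1) - b->n[1];
        r->n[2] = a->n[2] + 0x3FFFFFFUL * 2 * (m + 1) - b->n[2];
        r->n[3] = a->n[3] + 0x3FFFFFFUL * 2 * (m + 1) - b->n[3];
        r->n[4] = a->n[4] + 0x3FFFFFFUL * 2 * (m + 1) - b->n[4];
        r->n[5] = a->n[5] + 0x3FFFFFFUL * 2 * (m + 1) - b->n[5];
        r->n[6] = a->n[6] + 0x3FFFFFFUL * 2 * (m + 1) - b->n[6];
        r->n[7] = a->n[7] + 0x3FFFFFFUL * 2 * (m + 1) - b->n[7];
        r->n[8] = a->n[8] + 0x3FFFFFFUL * 2 * (m + 1) - b->n[8];
        r->n[9] = a->n[9] + 0x03FFFFFUL * 2 * (m + 1) - b->n[9];
    }
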
11:00 < kaushik> secp256k1_fe_cmov(&rr_alt, &rr, !degenerate);
11:00 < kaushik> secp256k1_fe_cmov(&m_alt, &m, !degenerate);
11:00 < sipa> indeed
11:01 < kaushik> So, in these two functions, while implementing the 4-way conditional move, some moves should actually occur and some should not, right?
11:02 < sipa> indeed
11:05 < kaushik> In the function secp256k1_ecmult_const within https://github.com/bitcoin-core/secp256k1/blob/master/src/ecmult_const_impl.h we will have 4 different values of a and scalar. So we will have 4 different tables for pre_a, pre_a_lam, wnaf_1, wnaf_lam, right?
11:06 < kaushik> Also, we will have 4 different values of skew_1 and skew_lam, right?
11:08 < sipa> Right, every variable will exist 4 times, with independent values.
11:09 < kaushik> I am considering the value of the parameter "size" to be 256 for all 4 instances, okay?
11:10 < sipa> Is it ever called with a value other than 256 in the current codebase?
11:12 < kaushik> The argument is hardcoded to 256.
11:12 < kaushik> secp256k1_ecmult_const(&res, &pt, &s, 256);
11:12 < kaushik> In https://github.com/bitcoin-core/secp256k1/blob/master/src/modules/ecdh/main_impl.h
11:12 < sipa> Sure, then.
11:13 < kaushik> Okay.
11:16 < kaushik> There is a portion /* Correct for wNAF skew */ within the function secp256k1_ecmult_const. There is a group operation secp256k1_gej_add_ge(r, r, &correction) involved there.
11:16 < kaushik> The value of the ge correction should be the same for all four instances, right?
11:16 < sipa> No, it depends on the scalar.
11:17 < kaushik> Is it? Let me check.
11:17 < sipa> It can be equal to P or 2P, if I recall correctly.
11:18 < sipa> But some of that code changed recently too. I may be wrong.
11:19 < kaushik> It is not independent. But I think it depends on the point "a" rather than the scalar.
11:20 < sipa> Right, it certainly depends on the point.
11:22 < kaushik> Thanks Pieter. That's all for now, I suppose.
11:23 < sipa> Cool!
11:23 < kaushik> I will work on the assembly implementation of secp256k1_gej_add_ge and complete it. It is more challenging than the function secp256k1_gej_double.
15:51 -!- hg [~halosghos@user/halosghost] has quit [Quit: WeeChat 3.4]
--- Log closed Wed Feb 09 00:00:54 2022
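
On the 4-way conditional move question discussed at 11:00-11:02: a minimal sketch with AVX2, assuming a transposed layout in which limb i of all four instances occupies the four 64-bit lanes of one __m256i (the fe4 type, fe4_cmov name and flag-array interface are illustrative assumptions, not library code). Each lane independently keeps or takes its value, which is the "some moves occur and some do not" behaviour, and like secp256k1_fe_cmov it uses a mask-based select rather than a data-dependent branch.

    #include <immintrin.h>
    #include <stdint.h>

    /* Hypothetical 4-way field element: limb i of all 4 instances in one vector. */
    typedef struct { __m256i n[10]; } fe4;

    /* For each instance k: if flag[k] is nonzero, copy a's limbs into r;
     * otherwise leave r unchanged (flag[k] plays the role of !degenerate). */
    static void fe4_cmov(fe4 *r, const fe4 *a, const int flag[4]) {
        /* All-ones / all-zeros mask per 64-bit lane, built from the four flags. */
        const __m256i mask = _mm256_set_epi64x(-(int64_t)flag[3], -(int64_t)flag[2],
                                               -(int64_t)flag[1], -(int64_t)flag[0]);
        int i;
        for (i = 0; i < 10; i++) {
            /* Lane k takes a->n[i] where its mask bytes are set, else keeps r->n[i]. */
            r->n[i] = _mm256_blendv_epi8(r->n[i], a->n[i], mask);
        }
    }
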