To follow up on the remarkable work Greg announced from Benedikt Bünz
(Stanford) and Jonathan Bootle (UCL) on Bulletproofs:

    https://eprint.iacr.org/2017/1066

Summary
=========

Over the last couple of weeks, along with Jonas Nick, Pieter Wuille,
Greg Maxwell and Peter Dettmann, I've implemented the single-output
version of Bulletproofs at

    https://github.com/ElementsProject/secp256k1-zkp/pull/16

and have some performance numbers. All of these benchmarks were
performed on one core of an Intel i7-6820HQ throttled to 2.00GHz, and
reflect verification of a single 64-bit rangeproof.

    Old Rangeproof   14.592 ms    (with endo: 10.304 ms)
    Bulletproof       4.208 ms    (with endo:  4.031 ms)
    ECDSA verify      0.117 ms    (with endo:  0.084 ms)

Here "with endo" refers to use of the GLV endomorphism supported by the
curve secp256k1, which libsecp256k1 (and therefore Bitcoin) supports
but does not enable by default, out of an abundance of caution
regarding potential patents.

As we can see, without the endomorphism this is a 3.47x speedup over
the verification speed of the old rangeproofs. Because Bulletproof
verification scales as O(N/log(N)) while the old rangeproofs scale as
O(N), we can extrapolate forward to say that a 2-output aggregate would
verify at 4.10x the speed of the old rangeproofs.

By the way, even without aggregation, we can verify two Bulletproofs
nearly 15% faster than verifying one twice (a 3.95x speedup over the
old proofs), because the nature of the verification equation makes it
amenable to batch verification. This number improves the more proofs
you verify simultaneously (assuming you have enough RAM); for example,
you can batch-verify 10000 Bulletproofs 9.9 times as fast as you could
verify 10000 of the old proofs.

While this is a remarkable speedup which greatly improves the
feasibility of CT for Bitcoin (though still not to the point where I'd
expect a serious proposal to get anywhere, IMHO), the concerns
highlighted by Greg regarding unconditional versus computational
soundness remain. I won't expand on that beyond what has already been
discussed in this thread; I just want to tamp down any irrational
exuberance about these results.

People who only care about numbers can stop reading here. What follows
is a discussion of how this speedup is possible and why we weren't
initially sure that we'd get any speedup at all.

Details
=========

Section 6 of the linked preprint discusses performance versus our old
rangeproofs. As Greg mentioned, it is possible to fit two 64-bit
Bulletproofs into 738 bytes, with logarithmic scaling. (So one proof
would take 674 bytes, but eight proofs only 866 bytes.)
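For intuition on where this logarithmic scaling comes from: per the
paper's proof layout, a proof consists of a constant number of group
elements and scalars (four commitments and five scalars), plus two
extra group elements for each halving round of the inner-product
argument, of which there are ceil(log2(64*m)) for an aggregate of m
64-bit proofs. The following back-of-the-envelope Python sketch
reproduces the byte counts above; the 32-byte element encoding and the
two trailing bytes of packed parity bits are my assumptions about the
serialization, not a specification of it.

    from math import ceil, log2

    def bulletproof_size(num_outputs, nbits=64):
        # 4 commitments (A, S, T1, T2), 5 scalars (tau_x, mu, t, a, b),
        # plus an (L_i, R_i) pair per inner-product halving round.
        rounds = ceil(log2(nbits * num_outputs))
        elements = 4 + 5 + 2 * rounds
        # assume 32 bytes per element plus 2 bytes of packed parity bits
        return 32 * elements + 2

    for m in (1, 2, 8):
        print(m, bulletproof_size(m))   # -> 674, 738, 866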
However, this section does not give performance numbers, because at the
time the preprint was written there was no optimized implementation to
benchmark on. It was known that verification time would be roughly
linear in the size of the proof: 141 scalar multiplications for a
64-bit proof, 270 for an aggregate of two proofs, and so on [*]. Our
old rangeproofs required only 128 multiplications for a 64-bit proof,
then 256 for two, and so on. So naively we were concerned that the new
Bulletproofs, despite being fantastically smaller than the original
rangeproofs, might wind up taking a bit longer to verify.

For reference, an ordinary ECDSA signature verification involves 2
multiplications, so roughly speaking, the naive expectation was that an
N-bit rangeproof would cost as much CPU time as N signature
verifications, even with this new research. Worse, we initially
expected Bulletproofs to require 1.5x this much, which we avoided with
a trick that I'll describe at the end of this mail.

As you can see in the numbers above, the old rangeproofs actually
perform worse than this expectation, while the new Bulletproofs perform
significantly **better**. Both deviations have the same explanation:
when computing a sum of scalar multiplications of the form

    a*G + b*H + c*I + ...

where G, H, I are curvepoints and a, b, c are scalars, it is possible
to compute the sum much more quickly than by computing a*G, b*H, c*I
separately and then adding the results. Signature validation takes
advantage of this, using a technique called Strauss' algorithm to
compute the sum of two multiplications in much less than twice the time
of one. Similarly, as we have learned, the 141 scalar multiplications
in a single-output Bulletproof can all be done as one such sum. By
contrast, the old rangeproofs required each multiplication to be done
separately, since the result of one was hashed to determine the scalar
for the next.

libsecp256k1 has supported Strauss' algorithm for two points since its
inception in 2013, since this was needed for ECDSA verification.
Extending it to many points was a nontrivial task which Pieter, Greg
and Jonas Nick took on this year as part of our aggregate signatures
project. Of the algorithms we tested, we found that Strauss was fastest
up to about 100 points, beyond which Pippenger's algorithm won out. You
can see our initial benchmarks here

    https://user-images.githubusercontent.com/2582071/32731185-12c0f108-c881-11e7-83c7-c2432b5fadf5.png

though this does not reflect some optimizations from Peter Dettmann in
the last week. It was a happy coincidence that the Bulletproofs paper
was published at nearly the same time that we had working multi-point
code to test with.

Finally, the Bulletproof verification process, as written in the paper,
is a recursive process which does not appear to be expressible as a
single multiproduct; in fact it appears to require nearly twice as many
multiplications as I claimed above. I want to draw attention to two
optimizations in particular which made the above numbers possible.

1. By expanding out the recursive process, one can see that the
   inner-product argument (Protocol 1 in the paper) is actually one
   multiproduct: you hash each (L_i, R_i) pair to obtain
   logarithmically many scalars, invert these, and then each scalar in
   the final multiproduct is a product containing either the inverse or
   the original of each of them. Peter Dettmann found a way to reduce
   this to one scalar inversion, from which every scalar is obtainable
   by a single multiplication or squaring of a previous result (see the
   sketch after this list). I was able to implement this in a way that
   caches only logarithmically many previous results.

2. Next, line (62) of the Bulletproofs paper appears to require N
   multiplications beyond the 2N already done in the recursive step.
   But since these multiplications use the same basepoints as the
   recursive step, we could use the distributive property to combine
   them. This sounds trivial but took a fair bit of care to ensure that
   all the right data was still committed to at the right stage of
   proof verification.
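To illustrate the first optimization: the scalar attached to each of
the 2^k basepoints in the inner-product multiproduct is a product of
all k challenges, each appearing as either itself or its inverse.
Inverting the product of all k challenges costs one inversion and
yields the all-inverses scalar; every other scalar is then reachable by
multiplying a previously computed one by the square of a single
challenge, which flips that challenge from inverted to non-inverted.
Here is a toy Python sketch of the idea; it is my own illustration, not
the actual libsecp256k1 code (which, among other differences, caches
only logarithmically many previous results rather than storing the full
table as this sketch does):

    # Scalars live in the field of integers mod the secp256k1 group order.
    P = 0xFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFEBAAEDCE6AF48A03BBFD25E8CD0364141

    def scalars_from_challenges(xs):
        """Given k challenge scalars, return all 2^k products of the
        form prod_j xs[j]^(+1 or -1), using one modular inversion."""
        k = len(xs)
        prod = 1
        for x in xs:
            prod = prod * x % P
        s = [0] * (1 << k)
        s[0] = pow(prod, P - 2, P)    # the only inversion: all exponents -1
        sq = [x * x % P for x in xs]  # multiplying by xs[j]^2 flips its sign
        for i in range(1, 1 << k):
            j = (i & -i).bit_length() - 1       # lowest set bit of i
            s[i] = s[i ^ (1 << j)] * sq[j] % P  # one multiply per scalar
        return s

With k = 6 challenges this yields the 64 scalars of a single 64-bit
proof, at a cost of one inversion and roughly 2^k multiplications.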
Further Work
=========

There are still a few open issues I plan to help resolve in the coming
month:

- Bulletproof aggregation is not compatible with Confidential Assets,
  where each output has a unique asset tag associated with it. There
  are a couple of possible solutions to this, but nothing public-ready.

- Bulletproofs, as described in the paper, work only when proving
  2^n-many bits. I believe there is a straightforward and
  verifier-efficient way to extend them to support non-powers-of-2, but
  this requires some work to modify the proof in the paper.

- Bulletproofs are actually much more general than rangeproofs. They
  can be used to prove results of arbitrary arithmetic circuits, which
  is something we are very interested in implementing.

[*] By "and so on", I mean that N bits require 2N + 2log_2(N) + 6
scalar multiplications.

Cheers
Andrew

--
Andrew Poelstra
Mathematics Department, Blockstream
Email: apoelstra at wpsoftware.net
Web:   https://www.wpsoftware.net/andrew

"A goose alone, I suppose, can know the loneliness of geese
 who can never find their peace,
 whether north or south or west or east"
       --Joanna Newsom