Re: [bitcoin-dev] A Better MMR Definition

From: Bram Cohen <bram@bittorrent•com>
To: Peter Todd <pete@petertodd•org>
Cc: Bitcoin Protocol Discussion <bitcoin-dev@lists•linuxfoundation.org>
Subject: Re: [bitcoin-dev] A Better MMR Definition
Date: Fri, 24 Feb 2017 14:20:19 -0800
Message-ID: <CA+KqGkpi4GvgU-K6vt-U5ZN4AkpjZ0rruzddoJS4-V0TcnyqUQ@mail.gmail.com>
In-Reply-To: <20170224043613.GA32502@savin.petertodd.org>

So your idea is to cluster entries by entry time because newer things are
more likely to leave and updating multiple things near each other is
cheaper?

That can be done with my tool. Instead of using hashes for the values being
stored, you use position entries. The first entry gets a value of all
zeros, the next one a one followed by all zeros, then the next two
correspond to the first two with the second bit flipped to one, then the
next four the first four with the third bit flipped to one, etc. It
probably performs a little bit better to do it two bits at a time instead
of one so that the entries are 00, 01, 10, 11, 0001, 0010, 0011, 0101,
0110, 0111, 1001, etc. If you were to really use this you'd probably want
to add some optimizations to use the fact that the terminals fit in 64
bits instead of 256, but it mostly works unchanged, and gets whatever
benefits there are to this clustering plus the high performance
implementation tricks I've built which I keep complaining that nobody's
giving feedback on.
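
A minimal sketch of the one-bit-at-a-time position entries described above, assuming they amount to writing the insertion index least-significant bit first into the key. The names position_key and key_bits are made up for illustration, and the two-bits-at-a-time variant is not covered:

```python
def position_key(index: int, width: int = 256) -> int:
    """Turn an insertion index into a trie key (hypothetical helper).

    Bit k of the index (k = 0 is the least significant bit) is written to
    the k-th bit from the left of the key, so entry 0 is all zeros, entry 1
    is a one followed by zeros, and so on.
    """
    if index < 0 or index >= 1 << width:
        raise ValueError("index does not fit in the key width")
    key = 0
    bit = 0
    while index:
        if index & 1:
            key |= 1 << (width - 1 - bit)
        index >>= 1
        bit += 1
    return key


def key_bits(index: int, width: int = 8) -> str:
    """Render a key as a bit string; width 8 only keeps the output short."""
    return format(position_key(index, width), f"0{width}b")


for i in range(8):
    print(i, key_bits(i))
```

With an 8 bit width the first entries come out as 00000000, 10000000, 01000000, 11000000, which matches the sequence in the paragraph above. A real version would use the full 256 bits, or 64 with the optimization mentioned.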

I'm not sold on this being a win: The empirical access patterns are
unknown, it requires an extra cache miss per lookup to find the entry
number, it may be that everything is optimized well enough without it for
there to be no meaningful gains, and it's a bunch of extra complexity. What
should be done is that a plain vanilla UTXO set solution is optimized as
well as it can be first, and then the insertion ordering trick is tried as
an optimization to see if it's an improvement. Without that baseline
there's no meaningful basis for comparison, and I'm quite confident that a
naive implementation which just allocates individual nodes will
underperform the thing I've come up with, even without adding optimizations
related to fitting in 64 bits.

On Thu, Feb 23, 2017 at 8:36 PM, Peter Todd <pete@petertodd•org> wrote:

> On Thu, Feb 23, 2017 at 07:32:43PM -0800, Bram Cohen wrote:
> > On Thu, Feb 23, 2017 at 7:15 PM, Peter Todd <pete@petertodd•org> wrote:
> >
> > >
> > > Glad we're on the same page with regard to what's possible in TXO
> > > commitments.
> > >
> > > Secondly, am I correct in saying your UTXO commitments scheme requires
> > > random access? While you describe it as a "merkle set", obviously to be
> > > merkelized it'll have to have an ordering of some kind. What do you
> > > propose that ordering to be?
> > >
> >
> > The ordering is by the bits in the hash. Technically it's a Patricia Trie.
> > I'm using 'merkle tree' to refer to basically anything with a hash root.
>
> The hash of what? The values in the set?
>
> > > Maybe more specifically, what exact values do you propose to be in the
> > > set?
> > >
> > That is unspecified in the implementation, it just takes a 256 bit value
> > which is presumably a hash of something. The intention is to nail down a
> > simple format and demonstrate good performance and leave those semantics
> > to a higher layer. The simplest thing would be to hash together the txid
> > and output number.
>
> Ok, so let's assume the values in the set are the unspent outpoints.
>
> Since we're ordering by the hash of the values in the set, outpoints will be
> distributed uniformly in the set, and thus the access pattern of data in the
> set is uniform.
>
> Now let's fast-forward 10 years. For the sake of argument, assume that for
> every 1 UTXO in the set that corresponds to funds in someone's wallet that
> are likely to be spent, there are 2^12 = 4096 UTXO's that have been
> permanently lost (and/or created in spam attacks) and thus will never be
> spent.
>
> Since lost UTXO's are *also* uniformly distributed, if I'm processing a new
> block that spends 2^12 = 4096 UTXO's, on average for each UTXO spent, I'll
> have to update log2(4096) = 12 more digests than I would have had those
> "dead" UTXO's not existed.
>
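
The arithmetic behind the 12 extra digests can be checked with a short sketch. It assumes a balanced tree over uniformly distributed keys, so the depth is about log2 of the total number of entries; extra_levels is a name made up for this example:

```python
import math


def extra_levels(dead_per_live: int) -> float:
    """Extra tree depth caused by dead entries (illustrative only).

    With L live entries and dead_per_live dead entries for each live one,
    the set holds L * (1 + dead_per_live) entries, so the depth grows from
    log2(L) to log2(L) + log2(1 + dead_per_live).
    """
    return math.log2(1 + dead_per_live)


# 4096 dead UTXOs per live one, as in the scenario above
print(round(extra_levels(4096), 3))
```

The result is only very slightly above 12, so rounding to 12 more digests per spent UTXO is fair under this assumption.
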
> Concretely, imagine our UTXO set had just 8 values in it, and we were
> updating two of them:
>
>                #
>               / \
>              /   \
>             /     \
>            /       \
>           /         \
>          #           #
>         / \         / \
>        /   \       /   \
>       #     .     .     #
>      / \   / \   / \   / \
>     .   X .   . .   . X   .
>
> To mark two coins as spent, we've had to update 5 inner nodes.
>
>
> Now let's look at what happens in an insertion-ordered TXO commitment
> scheme. For sake of argument, let's assume the best possible case, where
> every UTXO spent in that same block was recently created. Since the UTXO's
> are recently created, chances are almost every single one of those "dead"
> UTXO's will have been created in the past. Thus, since this is an
> insertion-ordered data structure, those UTXO's exist in an older part of the
> data structure that our new block doesn't need to modify at all.
>
> Concretely, again let's imagine a TXO commitment with 8 values in it, and
> two of them being spent:
>
>                #
>               / \
>              /   \
>             /     \
>            /       \
>           /         \
>          .           #
>         / \         / \
>        /   \       /   \
>       .     .     .     #
>      / \   / \   / \   / \
>     .   . .   . .   . X   X
>
> To mark two coins as spent, we've only had to update 3 inner nodes; while
> our tree is higher with those lost coins, those extra inner nodes are
> amortised across all the coins we have to update.
>
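
The node counts in the two 8-value diagrams can be reproduced with a small sketch. It assumes a perfect binary tree with leaves numbered 0 to 7 from left to right; updated_inner_nodes is a hypothetical helper, not part of either proposal:

```python
def updated_inner_nodes(depth: int, changed_leaves) -> int:
    """Count the distinct inner nodes on the paths from the changed leaves
    to the root of a perfect binary tree with 2**depth leaves."""
    dirty = set()
    for leaf in changed_leaves:
        if not 0 <= leaf < 1 << depth:
            raise ValueError("leaf index out of range")
        node = leaf
        for level in range(1, depth + 1):
            node >>= 1                 # move up to the parent
            dirty.add((level, node))   # (height above leaves, position)
    return len(dirty)


# Hash-ordered case: the two spent coins sit far apart (leaves 1 and 6).
print(updated_inner_nodes(3, [1, 6]))
# Insertion-ordered best case: both spent coins are the newest leaves.
print(updated_inner_nodes(3, [6, 7]))
```

The first call counts 5 inner nodes and the second 3, matching the diagrams: adjacent leaves share every ancestor, while scattered leaves share only the nodes near the root.
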
>
> The situation gets even better when we look at the *new* UTXO's that our
> block creates. Suppose our UTXO set has size n. To mark a single coin as
> spent, we have to update log2(n) inner nodes. We do get to amortise this a
> bit at the top levels in the tree, but even if we assume the amortisation is
> totally free, we're updating at least log2(n) - log2(m) inner nodes "under"
> the amortised nodes at the top of the tree for *each* new node, where m is
> the number of nodes the block touches.
>
> Meanwhile with an insertion-ordered TXO commitment, each new UTXO added to
> the data set goes in the same place - the end. So almost none of the
> existing data needs to be touched to add the new UTXOs. Equally, the hashing
> required for the new UTXO's can be done in an incremental fashion that's
> very L1/L2 cache friendly.
>
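
Appending at the end can be sketched as a list of subtree roots ("peaks") in the style of a merkle mountain range. This is an illustration under assumptions, not the definition discussed in this thread: SHA-256 with plain concatenation is a stand-in hash, and AppendOnlyPeaks is a made-up name.

```python
import hashlib


def hash_pair(left: bytes, right: bytes) -> bytes:
    # Stand-in for whatever inner-node hash a real commitment would use.
    return hashlib.sha256(left + right).digest()


class AppendOnlyPeaks:
    """Keeps only the roots of the perfect subtrees built so far."""

    def __init__(self):
        self.peaks = []        # (height, digest) pairs, heights decreasing
        self.inner_hashes = 0  # number of inner-node hashes computed
        self.size = 0          # number of leaves appended

    def append(self, leaf: bytes) -> None:
        node = (0, hashlib.sha256(leaf).digest())
        # Merge with the newest peak while both subtrees have equal height.
        while self.peaks and self.peaks[-1][0] == node[0]:
            height, left = self.peaks.pop()
            node = (height + 1, hash_pair(left, node[1]))
            self.inner_hashes += 1
        self.peaks.append(node)
        self.size += 1


acc = AppendOnlyPeaks()
for i in range(8):
    acc.append(i.to_bytes(8, "big"))
print([height for height, _ in acc.peaks], acc.inner_hashes)
```

Each append only reads and writes the newest peaks, so older subtrees are never touched. After 2^k appends exactly 2^k - 1 inner hashes have been computed, which is less than one per leaf on average.
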
>
> tl;dr: Precisely because access patterns in TXO commitments are *not*
> uniform, I think we'll find that from an L1/L2/etc cache perspective alone,
> TXO commitments will result in better performance than UTXO commitments.
>
>
> Now it is true that Bitcoin's current design means we'll need a map of
> confirmed outpoints to TXO insertion order indexes. But it's not
> particularly hard to add that "metadata" to transactions on the P2P layer in
> the same way that segwit added witnesses to transactions without modifying
> how txids were calculated; if you only connect to peers who provide you with
> TXO index information in blocks and transactions, you don't need to keep
> that map yourself.
>
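
The map from confirmed outpoints to insertion-order indexes could look roughly like the sketch below. It is an in-memory illustration only; TxoIndexMap and its methods are hypothetical names, and a real node would need persistent storage and reorg handling.

```python
class TxoIndexMap:
    """Assigns each confirmed outpoint the next insertion-order index."""

    def __init__(self):
        self._index_of = {}   # (txid, output number) -> insertion index
        self._next_index = 0

    def add(self, txid: bytes, vout: int) -> int:
        outpoint = (txid, vout)
        if outpoint in self._index_of:
            raise ValueError("outpoint already recorded")
        index = self._next_index
        self._index_of[outpoint] = index
        self._next_index += 1
        return index

    def lookup(self, txid: bytes, vout: int) -> int:
        # Raises KeyError for outpoints that were never confirmed.
        return self._index_of[(txid, vout)]


txo_map = TxoIndexMap()
txid = bytes(32)              # placeholder txid, all zeros
txo_map.add(txid, 0)
txo_map.add(txid, 1)
print(txo_map.lookup(txid, 1))
```

This is the extra lookup a node pays for when it keeps the map itself; peers that supply the index alongside blocks and transactions would make it unnecessary, as the paragraph above notes.
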
> Finally, note how this makes transactions *smaller* in many circumstances:
> it's just an 8-byte max index rather than a 40 byte outpoint.
>
> --
> https://petertodd.org 'peter'[:-1]@petertodd.org
>
