Re: [bitcoin-dev] A Better MMR Definition

From: Peter Todd <pete@petertodd•org>
To: Bram Cohen <bram@bittorrent•com>
Cc: Bitcoin Protocol Discussion <bitcoin-dev@lists•linuxfoundation.org>
Subject: Re: [bitcoin-dev] A Better MMR Definition
Date: Fri, 24 Feb 2017 23:12:02 -0500
Message-ID: <20170225041202.GA11152@savin.petertodd.org>
In-Reply-To: <CA+KqGkpi4GvgU-K6vt-U5ZN4AkpjZ0rruzddoJS4-V0TcnyqUQ@mail.gmail.com>

On Fri, Feb 24, 2017 at 02:20:19PM -0800, Bram Cohen wrote:
> So your idea is to cluster entries by entry time because newer things are
> more likely to leave and updating multiple things near each other is
> cheaper?

Yes, exactly.

> That can be done with my tool. Instead of using hashes for the values being
> stored, you use position entries. The first entry gets a value of all
> zeros, the next one a one followed by all zeros, then the next two
> correspond to the first two with the second bit flipped to one, then the
> next four the first four with the third bit flipped to one, etc. It
> probably performs a little bit better to do it two bits at a time instead
> of one so that the entries are 00, 01, 10, 11, 0001, 0010, 0011, 0101,
> 0110, 0111, 1001, etc. If you were to really use this you'd probably want
> to add some optimizations to use the fact that the terminals fit in 64
> bits instead of 256, but it mostly works unchanged, and gets whatever

So to be clear, what you're proposing there is to use the insertion order as
the index - once you go that far you've almost entirely re-invented my
proposal!
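
For readers following along, the one-bit-at-a-time key assignment described in the quoted text can be sketched as below. This is only an illustration of that description, not code from either implementation; the function name `position_key` and the `width` parameter are hypothetical.

```python
def position_key(index: int, width: int = 256) -> str:
    """Return the key for the index-th inserted entry as a bit string.

    Bit k of the insertion index (counting from the least significant
    bit) becomes the k-th bit of the key (counting from the left), so
    entry 0 is all zeros, entry 1 is a one followed by zeros, and so on.
    """
    if not 0 <= index < (1 << width):
        raise ValueError("index does not fit in the key width")
    bits = format(index, "0{}b".format(width))  # most significant bit first
    return bits[::-1]  # reverse so the low-order index bits lead the key


# First eight keys at a small width, in insertion order.
for i in range(8):
    print(i, position_key(i, width=4))
```

The key is a pure function of the insertion position, which is why the scheme amounts to using the insertion order as the index.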

In fact, when I was working on my proofchains/proofmarshal libraries I put some
thought into whether or not I could leave out the MMR merkelized list
implementation and use only the key-value map I also wrote. I decided to
include both as they aren't quite the same data structure - using a list for a
list has advantages.
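
As a rough illustration of what "a list for a list" means here, the following is a minimal sketch of an MMR-style append-only list. It is not the proofmarshal implementation: the class name `TinyMMR`, the helper `h`, the domain-separation tags, and the way the peaks are combined into one root are all assumptions made for the example.

```python
import hashlib


def h(*parts: bytes) -> bytes:
    """SHA-256 over the concatenated parts."""
    return hashlib.sha256(b"".join(parts)).digest()


class TinyMMR:
    """Append-only merkelized list that keeps only the current peaks."""

    def __init__(self) -> None:
        self.peaks = []  # (height, digest) pairs, tallest subtree first
        self.size = 0    # number of items appended so far

    def append(self, item: bytes) -> int:
        """Add an item and return its insertion-order index."""
        height, digest = 0, h(b"leaf", item)
        # Merge equal-height subtrees, like carrying in binary addition.
        while self.peaks and self.peaks[-1][0] == height:
            _, left = self.peaks.pop()
            height, digest = height + 1, h(b"node", left, digest)
        self.peaks.append((height, digest))
        self.size += 1
        return self.size - 1

    def root(self) -> bytes:
        """Commit to every peak; a simplification of real peak bagging."""
        return h(b"root", *(digest for _, digest in self.peaks))


mmr = TinyMMR()
for n in range(5):
    mmr.append(str(n).encode())
print([height for height, _ in mmr.peaks])
```

The index returned by `append` is the insertion position itself, so no separate key is needed, and an append only touches the right-hand edge of the structure.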

> benefits there are to this clustering plus the high performance
> implementation tricks I've built which I keep complaining that nobody's
> giving feedback on.

Your merkle-set implementation is 1500 lines of densely written Python with
almost no comments, and fewer than 100 lines of (also uncommented) tests. By
comparison, my Python MMR implementation is 300 lines of very readable Python
with lots of comments, a 200 line explanation at the top, and 200 lines of
(commented) tests. Yet no-one is taking the (still considerable) effort to
understand and comment on my implementation. :)

Fact is, what you've written is really daunting to review, and given it's not
in the final language, it's unclear what basis to review it on anyway. I
suspect you'd get more feedback if the codebase were better commented and
written in a production language, and if you had actual real-world benchmarks
and performance figures.

In particular, while at the top of merkle_set.py you have a list of advantages,
and a bunch of TODOs, you don't explain *why* the code has any of these
advantages. To figure that out, I'd have to read and understand 1500 lines of
densely written Python. Without a human-readable pitch, not many people are
going to do that, myself included.

> I'm not sold on this being a win: The empirical access patterns are
> unknown,

Lost coins alone guarantee that access patterns will be biased towards new
coins being more likely to be spent. That basis alone is sufficient to justify
an insertion-ordered data structure. Additionally, people have done graphs of
the average age of UTXOs when spent, and that data clearly shows that newer
coins are more likely to be spent than older coins.
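
To make the cost argument concrete, here is a small sketch that counts how many tree nodes must be rehashed when a batch of leaves changes. It assumes a perfect binary tree and uses evenly spread leaves as a stand-in for hash-ordered keys; the function `nodes_touched` and the chosen sizes are hypothetical, and this is not a benchmark of any real implementation.

```python
def nodes_touched(leaf_indexes, depth):
    """Count distinct nodes rehashed when the given leaves change.

    The tree is a perfect binary tree with 2**depth leaves. A node is
    identified by (level, position), with level 0 being the leaves and
    level `depth` being the root.
    """
    touched = set()
    for index in leaf_indexes:
        for level in range(depth + 1):
            touched.add((level, index >> level))
    return len(touched)


depth = 20  # about a million leaves
count = 16  # number of entries updated in one batch

# The sixteen most recently inserted leaves, adjacent in an
# insertion-ordered tree.
recent = range((1 << depth) - count, 1 << depth)

# Sixteen leaves spread evenly across the tree, standing in for
# entries whose position is determined by a hash.
spread = range(0, 1 << depth, (1 << depth) // count)

print(nodes_touched(recent, depth))
print(nodes_touched(spread, depth))
```

With these numbers the adjacent batch touches 47 nodes and the spread batch touches 287, because adjacent leaves share almost all of their path to the root.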

> unknown, it requires an extra cache miss per lookup to find the entry
> number,

Like I mentioned in the email you're replying to, that extra lookup can be
easily avoided with a change to how transactions/blocks are serialized; if all
your peers support TXO commitments you can even discard the lookup database
entirely, as it's only a backwards compatibility measure.

> it may be that everything is optimized well enough without it for
> there to be no meaningful gains, and it's a bunch of extra complexity. What

Optimization is itself extra complexity. If your data structure has worse
inherent performance, and you have to make up the difference with a highly
optimized implementation, that's likely to lead to more overall complexity than
using a data structure with better inherent performance.

Your current merkle-set implementation definitely _is_ very complex. An
apples-to-apples comparison is with my merkelized key:value tree(1), also a
patricia tree, which like the MMR is only about 300 lines of well-commented and
straightforward code.

1) https://github.com/proofchains/python-proofmarshal/blob/master/proofmarshal/merbinnertree.py

> should be done is that a plain vanilla UTXO set solution is optimized as
> well as it can be first, and then the insertion ordering trick is tried as
> an optimization to see if it's an improvement. Without that baseline
> there's no meaningful basis for comparison, and I'm quite confident that a
> naive implementation which just allocates individual nodes will
> underperform the thing I've come up with, even without adding optimizations
> related to fitting in 64 bits.

To be clear, "insertion ordering" isn't a simple trick, it's a fundamental
change to what the data structure is. Once you do that, you're talking about my
proposal.

-- 
https://petertodd.org 'peter'[:-1]@petertodd.org
