On Sat, Jun 18, 2016 at 4:01 PM, Peter Todd <pete@petertodd.org> wrote:

>
>
Have you seen how BLAKE2 omits padding when the data to be hashed happens
> to be
> exactly one block in size? It's significantly faster than SHA256, and
> that's a
> standard part of the algorithm already.
>

That's very convenient! I didn't know it, but had 'look up how blake2 does
padding' in my list of stuff to do. I'm leaning heavily towards using
blake2b at this point, at least for internal hashing.


>
> > At the root there's a branch block. It consists of all nodes up to some
> > fixed depth - let's say 12 - with that depth set so that it roughly fits
> > within a single memory page. Branch blocks are arranged with the nodes in
> > fixed position defined by the prefix they correspond to, and the
> terminals
> > have outpointers to other blocks. Because they're all clustered
> together, a
> > lookup or update will only require a single
>
> A single....?
>

Okay, I've figured out the root cause of general confusion here. It's
mostly my fault.

There are a few different media on which data can be stored, with different
properties in terms of how long it takes to retrieve data from them, and
how much of a readahead they typically have. I was misreading the l2 cache
size as the main memory readahead amount, which is... probably wrong? The
readahead properties of memory aren't well documented and apparently vary a
lot. On SSDs it typically pulls down a kilobyte at once and they call them
pages, hence my use of that term above. But since my real point is that my
implementation should act as a silver bullet which can have acceptable
performance even on extremely bad devices, I'll give an analysis of how
well it works when everything is stored on a regular spinning hard drive.

Let's say you're storing 100 million items, which will fit within 10
gigabytes. If you set the block depths to about 10 bits they'll be about
32K, and if you set the size of leaf blocks to be about the same then
memory efficiency will be good because the leaf blocks will store twigs of
about 2^7 in size while having 2^10 space so they'll fit reasonably. Almost
everything will be three blocks from root, so updates will generally
require two disk seeks (plus one more for a write but those are generally
faster because they get batched.)

For latency numbers, I'm going off these:
https://gist.github.com/jboner/2841832

If the blockchain is very full of simple transactions and a disk seek takes
15 milliseconds, then going with the assumption that a block is about 600
seconds and the blockchain can handle 4 transactions per second and each of
them is 3 updates (one utxo spent plus two new ones created) that's 15 *
600 * 4 * 3 * 2 milliseconds per block, or about 200 seconds per block, so
if the uxto roots trail by a few blocks even a ludicrously underpowered
node could keep up.

On an SSD keeping up is completely trivial, the problem becomes one of how
quickly you can validate an entire blockchain history. There a read takes
about 0.15 milliseconds and you have to do 5 of them instead of 2 because
the natural memory block size is 4k, so it's about 1 millisecond per
update, or 600 * 4 * 3 total time for each block, which is about 7 seconds.
That's large enough that making the utxo root trail by two blocks is still
a good idea, but small enough that it isn't a big factor in the cost of
running a node. It's enough that validating a complete block history might
take a while though, and even validating just the last year would take a
few days. This is very conservative and it's assuming that *everything* is
kept on an SSD though. If the numbers work better and a few layers are kept
in regular memory validating a whole year of history might only take a few
hours.

Hopefully that all makes a fairly good case that raw merkle tree utxo root
trailing by a few blocks is a viable strategy. The data structures in the
MMR proposal are fairly complicated and the analysis of them talks in
somewhat vague terms about things being fast and slow. A similar analysis
of the MMR proposal specifying storage media and expectations of latency
numbers would clarify the reasoning a lot.

(By the way, sorry for the slow response - I got preempted by a bunch of
other work duties.)