[Bitcoin-development] BIP proposal: Authenticated prefix trees

* [Bitcoin-development] BIP proposal: Authenticated prefix trees
@ 2013-12-20  1:47 Mark Friedenbach
  2013-12-20  6:48 ` Jeremy Spilman
                   ` (3 more replies)
  0 siblings, 4 replies; 14+ messages in thread
From: Mark Friedenbach @ 2013-12-20  1:47 UTC
  To: Bitcoin Dev

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Hello fellow bitcoin developers. Included below is the first draft of
a BIP for a new Merkle-compressed data structure. The need for this
data structure arose out of the misnamed "Ultimate blockchain
compression" project, but it has since been recognized to have many
other applications.

In addition to this BIP I am preparing three additional BIPs
describing the use of this data structure in stateless validation &
mining, the UBC address index for "SPV+" operating modes, document
timestamping and merged mining.

A Python implementation of this data structure is available here:

https://github.com/monetizeio/python-bitcoin

A C++ implementation is being worked on.

As per the BIP-1 procedure, I am submitting this rough draft to the
community for discussion. I welcome all comments and criticisms of
both form and content.

- -Mark

==Abstract==

This BIP describes a [http://en.wikipedia.org/wiki/Hash_tree Merkle
hash tree] variant of the [http://en.wikipedia.org/wiki/Trie
prefix-tree data structure], ideally suited for encoding key-value
indices which support memory-efficient proofs.

==Motivation==

There are a number of applications which would benefit from having a
data structure with the following properties:

* '''Arbitrary mapping of keys to values.''' A ''key'' can be any
bytestring, and its ''value'' any other bytestring.
* '''Duplicate keys disallowed.''' Every key has one, and only one,
value associated with it. Some applications demand assurance that no
key value is reused, and that this constraint can be checked without
requiring access to the entire data structure.
* '''Efficient look-up by key.''' The data structure should support
sub-linear lookup operations with respect to the number of keys in the
mapping. Logarithmic time or linear with respect to the length of the
key should be achievable and would be sufficient for realistic
applications.
* '''Merkle compression of mapping structure.''' It should be possible
to produce a reduced description of the tree consisting of a single
root hash value which is deterministically calculated from the mapping
structure.
* '''Efficient proofs of inclusion.''' It should be possible to
extract a proof of key/value mapping which is limited in size and
verification time by the length of the key in the worst case.
* '''Computation of updates using local information.''' Given a set of
inclusion proofs, it should be possible to calculate adjustments to
the local mapping structure (update or deletion of included mappings,
or insertion between two included mappings which are adjacent in the
global structure).

Such applications include committed validation indices which enable
stateless mining nodes, committed wallet indices which enable
trust-less querying of the unspent transaction output set by
<code>scriptPubKey</code>, efficient document time-stamping, and
secure & efficient merged mining. This BIP describes an authenticated
prefix tree which has the above properties, but leaves the myriad
applications to be formalized in future BIPs.

==Data structure==

This BIP defines a binary prefix tree. Such a structure provides a
mapping of bitstrings (the ''keys'') to bytestrings (the ''values'').
It is an acyclic binary tree which implicitly encodes keys within the
traversal path -- a "left" branch is a 0, and a "right" branch is a 1.
Each node is reachable by only one unique path, and reading off the
branches taken (0 for each left, 1 for each right) as one follows the
path from root to target yields the node's key.
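
As a small illustration of the path encoding described above, the
following sketch expands a bytestring key into its left/right path. It
is not part of the proposal: the helper names <code>key_to_bits</code>
and <code>path_of</code> are made up here, and the most-significant-bit
first ordering is taken from the worked serialization examples later in
this draft.

```python
def key_to_bits(key: bytes) -> list:
    """Expand a bytestring key into a list of bits, most-significant bit first."""
    return [(octet >> shift) & 1 for octet in key for shift in range(7, -1, -1)]


def path_of(key: bytes) -> str:
    """Render the traversal path: 'L' for a 0 bit (left), 'R' for a 1 bit (right)."""
    return "".join("R" if bit else "L" for bit in key_to_bits(key))


# 'a' is 0b01100001, so the path starts left, right, right, ...
print(path_of(b"a"))
```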

The particular binary prefix tree defined by this BIP is a hybrid
PATRICIA / de la Briandais tree structure.
[http://en.wikipedia.org/wiki/Radix_tree PATRICIA trees] compress a
long sequence of non-branching nodes into a single interior node with
a per-branch ''skip prefix''. This achieves significant savings in
storage space, root hash calculation, and traversal time.

A de la Briandais trie achieves compression by only storing branches
actually taken in a node. The space savings are minimal for a binary
tree, but place the serialized size of a non-branching interior node
under the SHA-256 block size, thereby reducing the number of hash
operations required to perform updates and validate proofs.

This BIP describes the authenticated prefix tree and its many
variations in terms of its serialized representation. Additional BIPs
describe the application of authenticated prefix trees to such
applications as committed indices, document time-stamping, and merged
mining.

==Serialization format==

As a hierarchical structure, the serialization of an entire tree is
the serialization of its root node. A serialized node is the
concatenation of five structures:

    node := flags || VARCHAR(extra) || value || left || right

The <code>flags</code> is a single byte field whose composite values
determine the bytes that follow.

    flags = (left_flags  << 0) |
            (right_flags << 2) |
            (has_value   << 4) |
            (prune_left  << 5) |
            (prune_right << 6) |
            (prune_value << 7)

The <code>left_flags</code> and <code>right_flags</code> are special
2-bit enumeration fields. A value of 0 indicates that the node does
not branch in this direction, and the corresponding <code>left</code>
or <code>right</code> branch is missing (replaced with the empty
string in the node serialization). A value of 1 indicates a single bit
key prefix for this branch, implicitly 0 for <code>left</code> and 1
for <code>right</code>. A 2 indicates up to 7 bits of additional skip
prefix (beyond the implicit first bit, making 8 bits total) are stored
in a compact single-byte format. A 3 indicates a skip prefix with
greater than 7 additional bits, stored length-prefix encoded.

The single bit <code>has_value</code> indicates whether the node
stores a data bytestring, the value associated with its key prefix.
Since keys may be any value or length, including one key being a
prefix of another, it is possible for interior nodes in addition to
leaf nodes to have values associated with them, and therefore an
explicit value-existence bit is required.

The remaining three bits are used for proof extraction, and are masked
away prior to hash operations. <code>prune_left</code> indicates that
the entire left branch has been pruned. <code>prune_right</code> has
similar meaning for the right branch. If <code>has_value</code> is
set, <code>prune_value</code> may be set to exclude the node's value
from the encoded proof. This is a necessary field for interior nodes,
since it is possible that their values may be pruned while their
children are not.
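
The bit layout of the <code>flags</code> byte can be sketched as
follows. The names <code>Flags</code>, <code>unpack_flags</code>,
<code>pack_flags</code> and <code>hashable_flags</code> are
illustrative only; the bit positions are the ones given in the
definition above.

```python
from typing import NamedTuple


class Flags(NamedTuple):
    left_flags: int    # 2-bit enumeration, 0..3
    right_flags: int   # 2-bit enumeration, 0..3
    has_value: bool
    prune_left: bool
    prune_right: bool
    prune_value: bool


def unpack_flags(octet: int) -> Flags:
    """Split a flags byte into its fields."""
    return Flags(
        left_flags=octet & 0x03,
        right_flags=(octet >> 2) & 0x03,
        has_value=bool(octet & 0x10),
        prune_left=bool(octet & 0x20),
        prune_right=bool(octet & 0x40),
        prune_value=bool(octet & 0x80),
    )


def pack_flags(flags: Flags) -> int:
    """Inverse of unpack_flags."""
    return (
        (flags.left_flags << 0)
        | (flags.right_flags << 2)
        | (int(flags.has_value) << 4)
        | (int(flags.prune_left) << 5)
        | (int(flags.prune_right) << 6)
        | (int(flags.prune_value) << 7)
    )


def hashable_flags(octet: int) -> int:
    """Mask away the three prune bits, as is done prior to hash operations."""
    return octet & 0x1F


print(unpack_flags(0x0E))
```

For instance, the pruned root <code>0x2e</code> that appears in the
operational proof example and the unpruned root <code>0x0e</code>
differ only in <code>prune_left</code>, so both reduce to the same
value once the prune bits are masked.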

The <code>value</code> field is only present if the bit
<code>flags.has_value</code> is set, in which case it is a
<code>VARCHAR</code> bytestring:

    switch flags.has_value:
      case 0:
        value := ε
      case 1:
        value := VARCHAR(node.value)

The <code>extra</code> field is always present, and takes on a
bytestring value defined by the particular application. Use of the
<code>extra</code> field is application dependent, and will not be
covered in this specification. It can be set to the empty bytestring
(serialized as a single zero byte) if the application has no use for
the <code>extra</code> field.

    extra := calculate_extra(node)
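
A minimal sketch of how a leaf (a node with a value and no branches)
serializes under the rules above. It assumes that <code>VARINT</code>
is the usual bitcoin variable-length integer and that
<code>VARCHAR</code> is a <code>VARINT</code> length followed by the
raw bytes, which is consistent with the examples in this draft; the
function names are illustrative.

```python
def varint(n: int) -> bytes:
    """Bitcoin-style variable-length integer (assumed meaning of VARINT)."""
    if n < 0xFD:
        return bytes([n])
    if n <= 0xFFFF:
        return b"\xfd" + n.to_bytes(2, "little")
    if n <= 0xFFFFFFFF:
        return b"\xfe" + n.to_bytes(4, "little")
    return b"\xff" + n.to_bytes(8, "little")


def varchar(data: bytes) -> bytes:
    """Length-prefixed bytestring (assumed meaning of VARCHAR)."""
    return varint(len(data)) + data


def serialize_leaf(value: bytes, extra: bytes = b"") -> bytes:
    """flags || VARCHAR(extra) || VARCHAR(value), with no left or right branch."""
    flags = 0x10  # has_value set, left_flags = right_flags = 0
    return bytes([flags]) + varchar(extra) + varchar(value)


print(serialize_leaf(b"\xff").hex())
print(serialize_leaf(varint(1898)).hex())
```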

The <code>left</code> and <code>right</code> non-terminals are only
present if the corresponding <code>flags.left_flags</code> or
<code>flags.right_flags</code> are non-zero. The format depends on the
value of this flags setting:

    switch branch_flags:
      case 0:
        branch := ε
      case 1:
        branch := branch_node_or_hash
      case 2:
        prefix  = prefix >> 1
        branch := int_to_byte(1 << len(prefix) | bits_to_int(prefix)) ||
                  branch_node_or_hash
      case 3:
        prefix  = prefix >> 1
        branch := VARINT(len(prefix) - 8) ||
                  bits_to_string(prefix) ||
                  branch_node_or_hash

<code>branch_flags</code> is a stand-in meant to describe either
<code>left_flags</code> or <code>right_flags</code>, and likewise
everywhere else in the above pseudocode <code>branch</code> can be
replaced with either <code>left</code> or <code>right</code>.

<code>prefix</code> is the key bits between the current node and the
next branching, terminal, and/or leaf node, including the implicit
leading bit for the branch (0 for the left branch, 1 for the right
branch). In the above code, <code>len(prefix)</code> returns the
number of bits in the bitstring, and <code>prefix >> 1</code> drops
the first bit reducing the size of the bitstring by one and
renumbering the indices accordingly.

The function <code>int_to_byte</code> takes an integer in the range
[0, 255] and returns the octet representing that value. This is a NOP
in many languages, but present in this pseudocode so as to be explicit
about what is going on.

The function <code>bits_to_int</code> interprets a sequence of bits as
a little-endian integer value. This is analogous to the following
pseudocode:

    def bits_to_int(bits):
        # bits[0] becomes the least-significant bit of the result
        result = 0
        for idx in range(len(bits)):
            if bits[idx] == 1:
                result |= 1 << idx
        return result

The function <code>bits_to_string</code> serializes a sequence of bits
into a binary string. It uses little-endian bit and byte order, as
demonstrated by the following pseudocode:

    def bits_to_string(bits):
        # bit idx lands in byte idx // 8, at bit position idx % 8
        octets = [0] * ((len(bits) + 7) // 8)
        for idx in range(len(bits)):
            if bits[idx] == 1:
                octets[idx // 8] |= 1 << (idx % 8)
        return bytes(octets)
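
Putting the three branch cases together, the skip-prefix encoding can
be sketched as below. This is an illustration rather than the
reference implementation: <code>encode_branch_prefix</code> is a
made-up name, <code>VARINT</code> is assumed to be the bitcoin
variable-length integer, and the long form stores the remaining
length minus 8, which is what the worked examples in this draft use.

```python
def varint(n: int) -> bytes:
    """Bitcoin-style variable-length integer (assumed meaning of VARINT)."""
    if n < 0xFD:
        return bytes([n])
    if n <= 0xFFFF:
        return b"\xfd" + n.to_bytes(2, "little")
    if n <= 0xFFFFFFFF:
        return b"\xfe" + n.to_bytes(4, "little")
    return b"\xff" + n.to_bytes(8, "little")


def bits_to_int(bits):
    """Interpret bits as a little-endian integer (bits[0] is least significant)."""
    return sum(bit << idx for idx, bit in enumerate(bits))


def bits_to_string(bits):
    """Pack bits into bytes using little-endian bit and byte order."""
    octets = bytearray((len(bits) + 7) // 8)
    for idx, bit in enumerate(bits):
        octets[idx // 8] |= bit << (idx % 8)
    return bytes(octets)


def encode_branch_prefix(prefix):
    """Return (branch_flags, encoded skip prefix) for a non-empty prefix.

    `prefix` includes the implicit leading bit (0 for left, 1 for right).
    """
    rest = prefix[1:]  # drop the implicit first bit
    if not rest:
        return 1, b""
    if len(rest) <= 7:
        # single byte: a marker bit above the payload records the length
        return 2, bytes([(1 << len(rest)) | bits_to_int(rest)])
    return 3, varint(len(rest) - 8) + bits_to_string(rest)


flags, encoded = encode_branch_prefix([0, 1, 1, 0, 0, 0, 0, 1])  # the key 'a'
print(flags, encoded.hex())
```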

<code>branch_node_or_hash</code> is either the serialized child node
or its SHA-256 hash and associated meta-data. Context determines which
value to use: during digest calculations, disk/database serialization,
and when the branch is pruned the hash value is used and serialized in
the same way as other SHA-256 values in the bitcoin protocol (note
however that it is single-SHA-256, not the double-SHA-256 more
commonly used in bitcoin). The number of terminal (value-containing)
nodes and the serialized size in bytes of the fully unpruned branch
are suffixed to the branch hash. When serializing a proof or
snapshotting tree state and the branch is not pruned, the serialized
child node is included directly and the count and size are omitted as
they can be derived from the serialization.

    if branch_pruned or SER_HASH:
        branch_node_or_hash := SHA-256(branch) ||
                               count(branch) ||
                               size(branch)
    else:
        branch_node_or_hash := serialize(branch)

As an example, here is the serialization of a prefix tree mapping the
names of men and women of science to the year of their greatest
publication:

    >>> dict = AuthTree()
    >>> dict['Curie'] = VARINT(1898)
    >>> dict['Einstein'] = VARINT(1905)
    >>> dict['Fleming'] = VARINT(1928)
    >>> dict['中本'] = VARINT(2009)
    >>> dict.serialize()
    # A bytestring, broken out into parts:

    # . Root node:
    0x0e # left_flags: 2, right_flags: 3, has_value: 0
    0x00 # extra: ε

    # .l Inner node: 0b01000
    0x11 # 0b01000
    0x07 # left_flags: 3, right_flags: 1
    0x00 # extra: ε

    # .l.l Inner node: 0b01000011 0b01110101 0b01110010 0b01101001
    #                  'C'        'u'        'r'        'i'
    #                  0b01100101
    #                  'e'
    0x1abb3a599a02 # 0b1101110101011100100110100101100101
    0x10           # has_value: 1
    0x00           # extra: ε
    0x03fd6a07     # value: VARINT(1898)

    # .l.r Inner node: 0b010001
    0x0f # left_flags: 3, right_flags: 3
    0x00 # extra: ε

    # .l.r.l Inner node: 0b01000101 0b01101001 0b01101110 0b01110011
    #                    'E'        'i'        'n'        's'
    #                    0b01110100 0b01100101 0b01101001 0b01101110
    #                    't'        'e'        'i'        'n'
    0x312ded9c5d4c2ded00 # 0b1011010010110111
                         # 0b0011100110111010
                         # 0b0011001010110100
                         # 0b101101110
    0x10                 # has_value: 1
    0x00                 # extra: ε
    0x03fd7107           # value: VARINT(1905)

    # .l.r.r Inner node: 0b01000110 0b01101100 0b01100101 0b01101101
    #                    'F'        'l'        'e'        'm'
    #                    0b01101001 0b01101110 0b01100111
    #                    'i'        'n'        'g'
    0x296c4c6d2dedcc01 # 0b0011011000110010
                       # 0b1011011010110100
                       # 0b10110111001100111
    0x10               # has_value: 1
    0x00               # extra: ε
    0x03fd8807         # value: VARINT(1928)

    # .r Inner node: 0b11100100 0b10111000 0b10101101
    #                '中'
    #                0b11100110 0b10011100 0b10101100
    #                '本'
    0x27938edab39c1a # 0b1100100101110001
                     # 0b0101101111001101
                     # 0b001110010101100
    0x10             # has_value: 1
    0x00             # extra: ε
    0x03fdd907       # value: VARINT(2009)
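
To check the byte-level breakdown above, here is a rough decoder for
an unpruned serialized tree. It is a sketch under assumptions, not the
reference implementation: it handles no pruned branches, it treats
<code>VARINT</code>/<code>VARCHAR</code> as the bitcoin encodings, it
reads the long-form prefix length as the stored value plus 8 (matching
the example bytes), and all function names are illustrative.

```python
TREE_HEX = (
    "0e001107001abb3a599a02100003fd6a070f00312ded9c5d4c2ded00100003fd"
    "7107296c4c6d2dedcc01100003fd880727938edab39c1a100003fdd907"
)


def read_varint(buf, pos):
    first = buf[pos]
    if first < 0xFD:
        return first, pos + 1
    width = {0xFD: 2, 0xFE: 4, 0xFF: 8}[first]
    return int.from_bytes(buf[pos + 1:pos + 1 + width], "little"), pos + 1 + width


def read_varchar(buf, pos):
    length, pos = read_varint(buf, pos)
    return buf[pos:pos + length], pos + length


def read_prefix(buf, pos, branch_flags, first_bit):
    """Return the branch's key bits (implicit first bit included) and new offset."""
    bits = [first_bit]
    if branch_flags == 2:
        packed = buf[pos]
        pos += 1
        length = packed.bit_length() - 1  # marker bit sits just above the payload
        bits += [(packed >> i) & 1 for i in range(length)]
    elif branch_flags == 3:
        stored, pos = read_varint(buf, pos)
        length = stored + 8
        raw = buf[pos:pos + (length + 7) // 8]
        pos += len(raw)
        bits += [(raw[i // 8] >> (i % 8)) & 1 for i in range(length)]
    return bits, pos


def bits_to_key(bits):
    """Pack path bits (most-significant bit first) back into a bytestring key."""
    return bytes(
        sum(bit << (7 - i) for i, bit in enumerate(bits[start:start + 8]))
        for start in range(0, len(bits), 8)
    )


def read_node(buf, pos, path, out):
    flags = buf[pos]
    _extra, pos = read_varchar(buf, pos + 1)
    if flags & 0x10:  # has_value
        value, pos = read_varchar(buf, pos)
        out[bits_to_key(path)] = value
    for shift, first_bit in ((0, 0), (2, 1)):  # left branch, then right branch
        branch_flags = (flags >> shift) & 0x03
        if branch_flags:
            bits, pos = read_prefix(buf, pos, branch_flags, first_bit)
            pos = read_node(buf, pos, path + bits, out)
    return pos


def decode_tree(buf):
    out = {}
    end = read_node(buf, 0, [], out)
    assert end == len(buf), "trailing bytes after the root node"
    return out


mapping = decode_tree(bytes.fromhex(TREE_HEX))
years = {key.decode("utf-8"): read_varint(value, 0)[0] for key, value in mapping.items()}
print(years)
```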

==Hashing==

There are two variations of the authenticated prefix tree presented in
this draft BIP. They differ only in the way in which hash values of a
node and its left/right branches are constructed. The variations,
discussed below, trade off computational resources for the ability to
compose operational proofs. Whether the performance hit is
significant, and whether or not the added features are worth the
tradeoff depends very much on the application.

===Variation 1: Level-compressed hashing===

In this variation the referenced child node's hash is used in
construction of an interior node's hash digest. The interior node is
serialized just as described (using the child node's digest instead of
inline serialization), the resulting bytestring is passed through one
round of SHA-256, and the digest that comes out of that is the hash
value of the node. This is very efficient to calculate, requiring the
absolute minimum number of SHA-256 hash operations, and achieving
level-compression of computational resources in addition to reduction
of space usage.

For example:

    >>> dict = AuthTree()
    >>> dict['a'] = 0xff
    >>> dict.serialize()
    0x0200c3100001ff
    >>> dict.root
    AuthTreeNode(
        left_prefix = 0b01100001,
        left_hash   =
0xbafa0e2bba3396c5e9804b6cbe61be82bc442c1121aed81f8d5de36e9b20dc2f,
        left_count  = 1,
        left_size   = 4)
    >>> dict.hash
    0xb4837376022a7c9ddaa7d685ad183bcbd5d16c362b81fa293a7b9e911766cf3c

Assuming uniform distribution of key values, level-compressed hashing
has time-complexity logarithmic with respect to the number of keys in
the prefix tree. The disadvantage is that it is not possible in
general to "rebase" an operational proof on top of a sibling,
particularly if that sibling deletes branches that result in
reorganization and level compression of internal nodes used by the
rebased proof.
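
As a rough sketch of the hashing step for the simplest case, a leaf
node: mask the prune bits out of the flags byte, then apply a single
round of SHA-256 to the serialization. The name
<code>node_hash</code> is illustrative. Interior nodes are deliberately
left out, because they also need the child digest, count and size
substituted for the inline child, and the byte encoding of the count
and size is not spelled out in this draft.

```python
import hashlib

PRUNE_MASK = 0x1F  # clears prune_left, prune_right and prune_value


def node_hash(serialized_node: bytes) -> bytes:
    """Single SHA-256 over a serialized node with its prune bits masked away."""
    flags = serialized_node[0] & PRUNE_MASK
    return hashlib.sha256(bytes([flags]) + serialized_node[1:]).digest()


# The leaf from the example above: flags 0x10, empty extra, value 0xff.
leaf = bytes.fromhex("100001ff")
print(node_hash(leaf).hex())
```

Note that the printed digest is in raw byte order; whether it should be
displayed byte-reversed, as bitcoin block and transaction hashes
usually are, depends on the convention the implementation follows.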

===Variation 2: Proof-updatable hashing===

In this variation, level-compressed branches are expanded into a
series of chained single-branch internal nodes, each including the
hash of its direct child. For a branch with a prefix N bits in length,
this requires N chained hashes. Thanks to node-compression (excluding
empty branches from the serialization), it is possible for each hash
operation + padding to fit within a single SHA-256 block.

Note that the serialization semantics are unchanged! The variation
only changes the procedure for calculating the hash values of interior
nodes. The serialization format remains the same (modulo differing
hash values standing in for pruned branches).

Using the above example, calling <code>dict.hash</code> causes the
following internal nodes to be constructed:

    >>> node1 = AuthTreeNode(
        right_prefix = 0b1,
        right_hash   =
0xbafa0e2bba3396c5e9804b6cbe61be82bc442c1121aed81f8d5de36e9b20dc2f,
        right_count  = 1,
        right_size   = 4)
    >>> node2 = AuthTreeNode( left_prefix=0b0,  left_hash=node1.hash,
 left_count=1,  left_size=4)
    >>> node3 = AuthTreeNode( left_prefix=0b0,  left_hash=node2.hash,
 left_count=1,  left_size=4)
    >>> node4 = AuthTreeNode( left_prefix=0b0,  left_hash=node3.hash,
 left_count=1,  left_size=4)
    >>> node5 = AuthTreeNode( left_prefix=0b0,  left_hash=node4.hash,
 left_count=1,  left_size=4)
    >>> node6 = AuthTreeNode(right_prefix=0b1, right_hash=node5.hash,
right_count=1, right_size=4)
    >>> node7 = AuthTreeNode(right_prefix=0b1, right_hash=node6.hash,
right_count=1, right_size=4)
    >>> node8 = AuthTreeNode( left_prefix=0b0,  left_hash=node7.hash,
 left_count=1,  left_size=4,
                              value=0xff)
    >>> dict.hash == node8.hash
    True
    >>> dict.hash
    0xc3a9328eff06662ed9ff8e82aa9cc094d05f70f0953828ea8c643c4679213895

The advantage of proof-updatable hashing is that any operational proof
may be "rebased" onto the tree resulting from a sibling proof, using
only the information locally available in the proofs, even in the
presence of deletion operations that result in level-compression of
the serialized form. The disadvantage is performance: validating an
updatable proof requires a number of hash operations lower-bounded by
the length of the key in bits.
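
The chain of single-branch nodes in the example above can be read off
mechanically from the key bits. The sketch below only derives the
branch direction of each chained node, deepest node first; it does not
compute any hashes, and the function names are illustrative.

```python
def key_to_bits(key: bytes) -> list:
    """Expand a bytestring key into bits, most-significant bit first."""
    return [(octet >> shift) & 1 for octet in key for shift in range(7, -1, -1)]


def chain_directions(key: bytes) -> list:
    """Directions of the chained single-branch nodes, deepest node first.

    Each key bit contributes one node and therefore one hash operation.
    """
    return ["right" if bit else "left" for bit in reversed(key_to_bits(key))]


directions = chain_directions(b"a")
for number, direction in enumerate(directions, start=1):
    print("node%d: %s_prefix" % (number, direction))
```

For the one-byte key 'a' this yields eight nodes, which illustrates
the cost stated above: the number of hash operations grows with the
key length in bits rather than with the number of branching points.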

==Inclusion proofs==

An inclusion proof is a prefix tree pruned to contain a subset of its
keys. The serialization of an inclusion proof takes the following form:

    inclusion_proof := variant || root_hash || tree || checksum

Where <code>variant</code> is a single-byte value indicating the
presence of level-compression (0 for proof-updatable hashing, 1 for
level-compressed hashing). <code>root_hash</code> is the Merkle
compression hash of the tree, the 32-byte SHA-256 hash of the root
node. <code>tree</code> is the possibly pruned, serialized
representation of the tree. And finally, <code>checksum</code> is the
first 4 bytes of the SHA-256 checksum of <code>variant</code>,
<code>root_hash</code>, and <code>tree</code>.

For ease of transport, the standard envelope for display of an
inclusion proof is internet-standard base64 encoding in the following
format:

- -----BEGIN INCLUSION PROOF-----
ATzPZheRnns6KfqBKzZs0dXLOxithdan2p18KgJ2c4O0DgARBwAauzpZmgIQAAP9agcPADEt7Zxd
TC3tABAAA/1xBylsTG0t7cwBEAAD/YgHJ5OO2rOcGhAAA/3ZByEg+2g=
- -----END INCLUSION PROOF-----

Decoded, it looks like this:

    0x01 # Level-compressed hashing
    # Merkle root:
    0x3ccf6617919e7b3a29fa812b366cd1d5cb3b18ad85d6a7da9d7c2a02767383b4
    # Serialized tree (unpruned):
    0x0e001107001abb3a599a02100003fd6a070f00312ded9c5d4c2ded00100003fd
    0x7107296c4c6d2dedcc01100003fd880727938edab39c1a100003fdd907
    # Checksum:
    0x2120fb68
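
The decoding above can be reproduced with a few lines of code. This
sketch only splits the envelope into its fields using the fixed sizes
given in the text (1-byte variant, 32-byte root hash, 4-byte
checksum); it does not parse the tree or verify the checksum, and the
function name is illustrative.

```python
import base64

ARMORED = """-----BEGIN INCLUSION PROOF-----
ATzPZheRnns6KfqBKzZs0dXLOxithdan2p18KgJ2c4O0DgARBwAauzpZmgIQAAP9agcPADEt7Zxd
TC3tABAAA/1xBylsTG0t7cwBEAAD/YgHJ5OO2rOcGhAAA/3ZByEg+2g=
-----END INCLUSION PROOF-----"""


def split_inclusion_proof(armored: str) -> dict:
    """Strip the envelope lines, base64-decode, and slice out the fields."""
    body = "".join(
        line.strip()
        for line in armored.splitlines()
        if line.strip() and not line.startswith("-")  # skip BEGIN/END lines
    )
    raw = base64.b64decode(body)
    return {
        "variant": raw[0],
        "root_hash": raw[1:33],
        "tree": raw[33:-4],
        "checksum": raw[-4:],
    }


fields = split_inclusion_proof(ARMORED)
print(fields["variant"], fields["root_hash"].hex(), fields["checksum"].hex())
```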

==Operational proofs==

An operational proof is a list of insert/update and delete operations
suffixed to an inclusion proof which contains the pathways necessary
to perform the specified operations. The inclusion proof must contain
the key values to be updated or deleted, and the nearest adjacent key
values for each insertion. The serialization of an operational proof
takes the following form:

    operational_proof := variant || root_hash || tree ||
                         VARLIST(delete*) || VARLIST(update*) ||
                         new_hash || checksum

    delete := VARCHAR(key)
    update := VARCHAR(key) || VARCHAR(value)

The first three fields, <code>variant</code>, <code>root_hash</code>,
and <code>tree</code> are the inclusion proof, and take the same
values described in the previous section. <code>deletes</code> is a
list of key values to be deleted; each key value in this list must
exist in the inclusion proof. <code>updates</code> is a list of key,
value mappings which are to be inserted into the tree, possibly
replacing any mapping for the key which already exists; either the key
itself if it exists (update), or the two lexicographically closest
keys on either side if it does not (insert) must be present in the
inclusion proof. <code>new_hash</code> is the resulting Merkle root
after the insertions, updates, and deletes are performed, and
<code>checksum</code> is the initial 4 bytes of the SHA-256 hash of
the preceding fields.

Just like inclusion proofs, an operational proof is encoded in base64
for display and transport. Here is a proof against the same tree as
above that deletes the key '中本':

- -----BEGIN OPERATIONAL PROOF-----
ATzPZheRnns6KfqBKzZs0dXLOxithdan2p18KgJ2c4O0LgARaIsVaQi/GdhOPOgA8p4Pu4PiEfEg
lcmy3j7bOc7hXw0DLSeTjtqznBoQAAP92QcBMOS4reacrACzuZJbyP7fqIOf5VEk4iarG4+uPoZC
oun8BztQMQBy0LHVeSY=
- -----END OPERATIONAL PROOF-----

Decoded and broken into its constituent fields:

    0x01 # Level-compressed hashing
    # Original Merkle root:
    0x3ccf6617919e7b3a29fa812b366cd1d5cb3b18ad85d6a7da9d7c2a02767383b4
    # Serialized tree (included keys: '中本'):
    0x2e0011688b156908bf19d84e3ce800f29e0fbb83e211f12095c9b2de3edb39ce
    0xe15f0d032d27938edab39c1a100003fdd907
    # Deletion list ['中本']:
    0x01
    0x30e4b8ade69cac
    # Insertion list []:
    0x00
    # New Merkle root:
    0xb3b9925bc8fedfa8839fe55124e226ab1b8fae3e8642a2e9fc073b50310072d0
    # Checksum:
    0xb1d57926
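
The deletion list in the decoded proof can be reproduced as follows.
This is a sketch resting on two assumptions inferred from the example
bytes rather than stated in the text: a <code>VARLIST</code> is an
item count followed by the items, and the length prefix of a key
counts bits, not bytes (0x30 is 48 for the six-byte key '中本'). It
handles only keys that are a whole number of bytes, and the function
names are illustrative.

```python
def varint(n: int) -> bytes:
    """Bitcoin-style variable-length integer (assumed meaning of VARINT)."""
    if n < 0xFD:
        return bytes([n])
    if n <= 0xFFFF:
        return b"\xfd" + n.to_bytes(2, "little")
    if n <= 0xFFFFFFFF:
        return b"\xfe" + n.to_bytes(4, "little")
    return b"\xff" + n.to_bytes(8, "little")


def serialize_delete_list(keys) -> bytes:
    """Item count, then each key as a bit-length prefix plus the key bytes."""
    out = varint(len(keys))
    for key in keys:
        out += varint(len(key) * 8) + key  # assumption: prefix is in bits
    return out


print(serialize_delete_list(["中本".encode("utf-8")]).hex())
```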

~End of File~
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.14 (GNU/Linux)
Comment: GPGTools - http://gpgtools.org
Comment: Using GnuPG with Thunderbird - http://www.enigmail.net/

iQIcBAEBAgAGBQJSs6HIAAoJEAdzVfsmodw4gooQAJm7XNsZjgdeTSpKIvUIU38f
tQx2FD08hQdLl48me5mDUbHJgGlYINsKAgoZ8Mqwi/kHEEYhuIlLIX1p6Ovigidb
21BiVoOLdG1egGOwxp17DuwYaDPTppFTlN9TBjZzW6WKc7+4aNvyc1KtrbHIhtj/
04ekFyAn4U5UH0ht7CI79j0u3Kp85p5D4PyYZB2m82mzti6OxpSM4tXlMkDW7ihg
QJwiZSjzejqTd7WF0zr0SLeGVRSN2A0dzUCoVsI98eIa3hkw2N4ae6dRkibyStOT
V8VEDvHArEDlvu8jiryajhsom5mvtOOclNDkVXWAf/Te4gj05iYdTIvNvDEJtqsP
XDbmw6GgV1kBLlLo0mp//t/+wr+nIvy+sVAP+eqtM/0vjaVXBkXxkUMqqNkrtVpB
f3whq7nFahssUMSoWE93jgob1ayAax2XUALVMAXYsJl7b2MqBGlhiTZ8FQZ+TW4A
tIpKeUprPmDvA18rO3SCbmLMQryZqYiH0sRyvUc5kvn3qCRHrISZNkEuK591eS+x
BO1eOluPzVqeXPPSK1jvGeY0FNJtwzbov4nI1mzOvzQHLCvkHn5PhUFCK5tL5tAe
b0Z5qwDV+SvVs7W1R7ejYBzEj77U1zuzZ9AtikOuvy+bNGrkIlpI49EyXHijm7C3
Q6JacTuI0PelYji2gaBJ
=BbDs
-----END PGP SIGNATURE-----

* Re: [Bitcoin-development] BIP proposal: Authenticated prefix trees
  2013-12-20  1:47 [Bitcoin-development] BIP proposal: Authenticated prefix trees Mark Friedenbach
@ 2013-12-20  6:48 ` Jeremy Spilman
  2013-12-20 11:21   ` Mark Friedenbach
  2013-12-20 10:48 ` Peter Todd
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 14+ messages in thread
From: Jeremy Spilman @ 2013-12-20  6:48 UTC
  To: Bitcoin Dev

Wow there's a lot here to think about. I'm pretty sure I haven't grasped  
the full implications yet.

I see it proposes to also introduce additional BIPs describing the use of
the data structure for stateless validation & mining, the UBC address index
for "SPV+" operating modes, document timestamping and merged mining.

Can the BIP stand alone as a BIP without some specific changes to the
protocol or end-user accessible features defined within it? It seems like
an extremely useful data structure, but as I understand it the purpose of
BIPs is defining interoperability points, not implementation details?

Unless the tree itself is becoming part of the protocol, seems like its
spec, test vectors, and reference implementation can live elsewhere, but I
would love to read about BIPs which use this tree to accomplish some
amazing scalability or security benefits.

* Re: [Bitcoin-development] BIP proposal: Authenticated prefix trees
  2013-12-20  6:48 ` Jeremy Spilman
@ 2013-12-20 11:21   ` Mark Friedenbach
  2013-12-20 13:17     ` Peter Todd
  0 siblings, 1 reply; 14+ messages in thread
From: Mark Friedenbach @ 2013-12-20 11:21 UTC
  To: bitcoin-development

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Hi Jeremy, Let's give a preview of the application-oriented BIPs I
mentioned:

Stateless validation and mining involves prefixing transaction and
block messages with proofs of their UTxO state changes. These are the
"operational proofs" I describe in the draft, and they work on prefix
trees whose root hashes are committed to the coinbase in a soft-fork
upgrade of the validation rules.

"Ultimate blockchain compression" involves consensus over an address
index, which can be queried over the p2p network by lightweight nodes.
The structure of the index is an authenticated prefix tree, and the
result of such a query is an inclusion proof.

Document time-stamping and this new method of merged mining use the
same structure: a prefix tree whose root hash value is committed to a
pruneable output of the coinbase transaction. Document timestamp
proofs and merged mining proofs-of-work are inclusion proofs over this
tree.

I hope that shows how the BIP directly affects interoperability of the
bitcoin protocol and clients which use these applications. I released
this BIP first to get some feedback on the structure itself, which
will be used by all of the application-specific BIPs which follow.

Stepping back and speaking generically, the purpose of a BIP as I see
it is to standardize details which affect interoperability between
clients. In fact, at a cursory glance only about half of the BIPs deal
with protocol issues directly - the rest deal with local /
user-interface issues like key derivation or JSON-RPC APIs. Even if
none of the applications involved protocol changes, I still think BIPs
like this would be of value in that they serve to standardize things
which are or will seek to become commonly used and widely implemented.

Cheers,
Mark

On 12/19/2013 10:48 PM, Jeremy Spilman wrote:
> Wow there's a lot here to think about. I'm pretty sure I haven't
> grasped the full implications yet.
> 
> I see it proposes to also introduce additional BIPs describing the
> use of the data structure for stateless validation & mining, the UBC
> address index for "SPV+" operating modes, document timestamping and
> merged mining.
> 
> Can the BIP stand alone as a BIP without some specific changes to
> the protocol or end-user accessible features defined within it? It
> seems like an extremely useful data structure, but as I understand
> it the purpose of BIPs is defining interoperability points, not
> implementation details?
> 
> Unless the tree itself is becoming part of the protocol, seems like
> its spec, test vectors, and reference implementation can live
> elsewhere, but I would love to read about BIPs which use this tree
> to accomplish some amazing scalability or security benefits.
> 
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.14 (GNU/Linux)
Comment: GPGTools - http://gpgtools.org
Comment: Using GnuPG with Thunderbird - http://www.enigmail.net/

iQIcBAEBAgAGBQJStChCAAoJEAdzVfsmodw42DcQAIlgkukh5K/XYloIiT5pgaHS
xCZXtOvxpNUep8x35rvEO1ePjvPvUkbUE2jRw2se1rSMkhzw3PpHHtXV/gIOGqUe
WVKeeIM5pZX56sEcEdUQ1pTwB2rmtSNeyCuHl8fLatk8eLhcAHcpv/7esLuAjWCr
EE840s8+q3ltuzKi3nqxK84bwIohgSMKhncfonNp5lMAtug8Itqopq3DPDoxwiK/
qUwSz5UCEMH6oNHnywzhKGUhBErqo4q8IgAKcZYBZZ9n4BRjf4ngoCw9n5wCef8v
tyTvwrg0nSQTQa67cg7RCsY7SisrI9gaMvCYTSvEMKdw9X0aqAX1p0yZpTbV+dIr
Q2ZT6gJmg2sD22zKY1/58oq+PiNO+nRS81OG2znZofsIfhOVW0bIZAQ8+zZtFW40
vXxMuHCNieCK8e7f9A6LLv/Zz7rmNxdQ6cHBEL1nIs1Y4d1FpHJVI2LHi54QmVXf
C5PKF/e7K2eD3LZMNxS818rZaiJJ7qmpjS3rkG2owHyJHEhBJIlkYXfI1YCraT+b
R5AzAh2Oz0Nyb5ChP2VSaecJNjGvRMo7Z6HCytmgAGOEcDDZkxSv0kkprbvchqXx
XziFgA4iSajBKYWPiPLGMADfMPT6zd4fhDjyaN8+LPO3d3ZK1VwmQDLRQ3DxfeIP
RgchHR/pS77XI7hCFwtN
=ao17
-----END PGP SIGNATURE-----

* Re: [Bitcoin-development] BIP proposal: Authenticated prefix trees
  2013-12-20 11:21   ` Mark Friedenbach
@ 2013-12-20 13:17     ` Peter Todd
  2013-12-20 18:41       ` Mark Friedenbach
  0 siblings, 1 reply; 14+ messages in thread
From: Peter Todd @ 2013-12-20 13:17 UTC
  To: Mark Friedenbach; +Cc: bitcoin-development

On Fri, Dec 20, 2013 at 03:21:38AM -0800, Mark Friedenbach wrote:
> Hi Jeremy, Let's give a preview of the application-oriented BIPs I
> mentioned:
> 
> Stateless validation and mining involves prefixing transaction and
> block messages with proofs of their UTxO state changes. These are the
> "operational proofs" I describe in the draft, and they work on prefix
> trees whose root hashes are committed to the coinbase in a soft-fork
> upgrade of the validation rules.
> 
> "Ultimate blockchain compression" involves consensus over an address
> index, which can be queried over the p2p network by lightweight nodes.
> The structure of the index is an authenticated prefix tree, and the
> result of such a query is an inclusion proof.

I've thought about this for a while and come to the conclusion that UTXO
commitments are a really bad idea. I myself wanted to see them
implemented about a year ago for fidelity bonded banks, but I've changed
my mind and I hope you do too.

They force miners and every full node with SPV clients to store the
entire UTXO set in perpetuity. This is bad by itself, but then they make
it even worse by making Bitcoin really useful and convenient to use as a
decentralized database; UTXO commitments make it easy and convenient to
implement systems like Namecoin on top of Bitcoin, yet we don't have the
UTXO expiration that might make such uses reasonable. Right now the UTXO
set is reasonably small - ~300MB - but that can and will change if we
make it an attractive way to store data. UTXO commitments do exactly
that.

You're also *not* giving users what they actually want: the transactions
associated with their wallets. Even though Electrum could easily work
via a pure UTXO database they implemented transaction lookup instead;
Electrum servers cough up every transaction associated with a user's
wallet. If you're going to do that, it's just as easy to do per-block
lookup trees which don't force the UTXO set to be stored.

There's also a more subtle issue: the security model of UTXO commitments
sucks. It encourages wallets to essentially trust single confirmations
because it's unlikely that nodes will want to store the multiple copies
of the UTXO set required to provide proof of multiple confirmations.
Basically the issue is when you start your wallet you get a proof of
UTXO set for the most recent block; that's just one confirmation. To get
more confirmations you have to wait for subsequent blocks, checking the
set on each block. Per block indexes on the other hand naturally lead
wallets to count confirmations properly.

IMO you should take this technology to Namecoin instead. For them the
fast lookups are probably worth the trade-offs, and they expire domains
so the total set size doesn't grow unbounded.

-- 
'peter'[:-1]@petertodd.org
00000000000000028fd077fb1e33e942e3e875aa29cec134fed89d650242c577

* Re: [Bitcoin-development] BIP proposal: Authenticated prefix trees
  2013-12-20 13:17     ` Peter Todd
@ 2013-12-20 18:41       ` Mark Friedenbach
  0 siblings, 0 replies; 14+ messages in thread
From: Mark Friedenbach @ 2013-12-20 18:41 UTC
  To: Peter Todd; +Cc: bitcoin-development

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

(Sorry Peter, this was meant for the whole list:)

On 12/20/2013 05:17 AM, Peter Todd wrote:
> I've thought about this for a while and come to the conclusion that
> UTXO commitments are a really bad idea. I myself wanted to see them
> implemented about a year ago for fidelity bonded banks, but I've
> changed my mind and I hope you do too.
> 
> They force miners and every full node with SPV clients to store the
> entire UTXO set in perpetuity.

This is incorrect. If the slower proof-updatable hashes are used, then
mining only requires what I've called "operational proofs" to be
attached to received transactions and blocks.

Access to the UTXO set is required to make new transactions, at least
for the outputs of the transaction, but I do not believe this is as
significant a problem as you do. It is a service that can be
outsourced for a minimal fee - include an explicit output of the
necessary amount to a scriptPubKey specified by the archival node, and
they will make sure the proper proofs are attached.

> This is bad by itself, but then they make it even worse by making 
> Bitcoin really useful and convenient to use as a decentralized 
> database; UTXO commitments make it easy and convenient to
> implement systems like Namecoin on top of Bitcoin, yet we don't
> have the UTXO expiration that might make such uses reasonable.
> Right now the UTXO set is reasonably small - ~300MB - but that can
> and will change if we make it an attractive way to store data.
> UTXO commitments do exactly that.

You might have to explain this to me, but it is not clear to me how
the validation index could be twisted into providing a Namecoin-like
system. Or the address index either, which I presume is what you are
referring to. Namecoin works by assigning domains to outputs, and then
tracking ownership and configuration of that domain through chains of
outputs. But the UTXO set doesn't contain connecting information. At
best all it would be is a glorified and expensive time-stamper,
unattractive because there are already better solutions.

> You're also *not* giving users what they actually want: the 
> transactions associated with their wallets. Even though Electrum 
> could easily work via a pure UTXO database they implemented 
> transaction lookup instead; Electrum servers cough up every 
> transaction associated with a user's wallet. If you're going to do 
> that, it's just as easy to do per-block lookup trees which don't 
> force the UTXO set to be stored.

At the cost of having the supposedly lightweight client query for each
of its coins on every single block, to construct a negative
proof-of-spend.

> There's also a more subtle issue: the security model of UTXO 
> commitments sucks. It encourages wallets to essentially trust 
> single confirmations because it's unlikely that nodes will want to 
> store the multiple copies of the UTXO set required to provide
> proof of multiple confirmations. Basically the issue is when you
> start your wallet you get a proof of UTXO set for the most recent
> block; that's just one confirmation. To get more confirmations you
> have to wait for subsequent blocks, checking the set on each block.
> Per block indexes on the other hand naturally lead wallets to
> count confirmations properly.

I don't think this is true, or at least you are not considering
available optimizations. You certainly don't need to store multiple
copies of the UTXO set.

I'm a little confused as to the exact situation you are describing.
When a key is loaded into a wallet, or a wallet comes online after a
significant absence, it looks for coins in the current UTXO set. If
any coins are found, their attached transaction record has a block
height field, so the confirmation count can be derived from that. As
blocks go by that count is naturally increased. I'm not sure how this
is different from the current situation.
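
As a rough illustration of the point above, the confirmation count
follows directly from the block height stored with a coin and the height
of the current chain tip. This is only a sketch: the function and
parameter names are hypothetical and not taken from any wallet
implementation.

```python
def confirmations(coin_height, tip_height):
    """Confirmations for a coin first included in the block at coin_height,
    given the height of the current best block (tip_height)."""
    if coin_height > tip_height:
        # The including block is not (or is no longer) on our best chain.
        return 0
    # The including block itself counts as the first confirmation.
    return tip_height - coin_height + 1


print(confirmations(275000, 275000))  # 1
print(confirmations(275000, 275005))  # 6
```

Each new block raises tip_height by one, so the count grows without the
wallet having to query the index again for the same coin.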

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.14 (GNU/Linux)
Comment: GPGTools - http://gpgtools.org
Comment: Using GnuPG with Thunderbird - http://www.enigmail.net/

iQIcBAEBAgAGBQJStI9aAAoJEAdzVfsmodw4IooP/1uK9cvP1vxXyQRbAHf9oFXw
AmZ8p1+t8f6MHUpjkv/Xn0poFNU8qSnNz65drQdq8ErcJnqe4V3Wt6G32/uCxvZs
6AX6bRYQIfhHY0DBPgfacO5/ALdlnS4NdjWFCA5hHDgLd30BpbU1WK1ze985TXrd
+ucQkzcMYEDW2lb+sFvfhpi9ZPFd34ZrNzH//oS794eYKWAmj7jXqdgxk5AKat61
Xileq5beE4xom3pChXc3PtIJKsoil5SjE20/FW52wcCdyaEFG0kwl937pEGjQnlP
mylK/ilfZ6cvRC8MmVnl/6AC4V2hjB4Ncej03jG3JI2FdaJEOHuHg0uh8/Zl1I4A
YVIKyrHQhQb/VGsfXtW3zokHzDeEtJwlx+PPFaLc9QurFirNjSnenhbw4Vpbg3Xt
dH1Qd9xWcT85a19Oz8Q4rt3z7UmX9J/geZrUHCuPtr47yXU0e1Cc6ZP9zDYNtfKU
q6MjNZiaLJ/Wp0n4IeQ/4/wqy0rM/psP9i5d6IdP96tayVM9aKj5Lh9lU/Od5wGO
2PPB4kvhJfMbx3o+S7UK5vra7ysZzULpoVGDpUR3xRM72l//vlNhSLK5nILVO3r8
sIC5+3WoZLUKvuNo45/BDxXHZajrWLCU84WrwHVm1u7SHfBQcoES/rhcx2zlgyx0
/Iwxsgb7Fznl+eM2bEpZ
=TtaV
-----END PGP SIGNATURE-----

* Re: [Bitcoin-development] BIP proposal: Authenticated prefix trees
  2013-12-20  1:47 [Bitcoin-development] BIP proposal: Authenticated prefix trees Mark Friedenbach
  2013-12-20  6:48 ` Jeremy Spilman
@ 2013-12-20 10:48 ` Peter Todd
       [not found]   ` <52B425BA.6060304@monetize.io>
  2013-12-20 19:48 ` Gregory Maxwell
  2014-01-05 18:43 ` Thomas Voegtlin
  3 siblings, 1 reply; 14+ messages in thread
From: Peter Todd @ 2013-12-20 10:48 UTC (permalink / raw)
  To: Mark Friedenbach; +Cc: Bitcoin Dev

On Thu, Dec 19, 2013 at 05:47:52PM -0800, Mark Friedenbach wrote:
> This BIP describes the authenticated prefix tree and its many
> variations in terms of its serialized representation. Additional BIPs
> describe the application of authenticated prefix trees to such
> applications as committed indices, document time-stamping, and merged
> mining.

Could you expand more on how prefix trees could be used for
time-stamping and merged mining?


>     >>> dict = AuthTree()
>     >>> dict['Curie'] = VARINT(1898)
>     >>> dict['Einstein'] = VARINT(1905)
>     >>> dict['Fleming'] = VARINT(1928)
>     >>> dict['中本'] = VARINT(2009)

I'd be inclined to leave the unicode out of the code examples as many
editors and shells still don't copy-and-paste it nicely. Using it in BIP
documents themselves is fine and often has advantages re: typesetting,
but using it in crypto examples like this just makes it harder to
reproduce the results by hand unnecessarily.

-- 
'peter'[:-1]@petertodd.org
0000000000000002d7a0c56ae2c5b2b3322d5017cfef847455d4d86a6bc12280

[parent not found: <52B425BA.6060304@monetize.io>]

* Re: [Bitcoin-development] BIP proposal: Authenticated prefix trees
       [not found]   ` <52B425BA.6060304@monetize.io>
@ 2013-12-20 12:47     ` Peter Todd
  0 siblings, 0 replies; 14+ messages in thread
From: Peter Todd @ 2013-12-20 12:47 UTC (permalink / raw)
  To: Mark Friedenbach, bitcoin-development

On Fri, Dec 20, 2013 at 03:10:50AM -0800, Mark Friedenbach wrote:
> On 12/20/2013 02:48 AM, Peter Todd wrote:
> > On Thu, Dec 19, 2013 at 05:47:52PM -0800, Mark Friedenbach wrote:
> >> This BIP describes the authenticated prefix tree and its many 
> >> variations in terms of its serialized representation. Additional
> >> BIPs describe the application of authenticated prefix trees to
> >> such applications as committed indices, document time-stamping,
> >> and merged mining.
> > 
> > Could you expand more on how prefix trees could be used for 
> > time-stamping and merged mining?
> 
> The root hash of a prefix tree is placed in the coinbase at a location
> standardized by convention.

Right, last txout in an OP_RETURN like we discussed.

> For document time-stamping, the key can be
> the hash of the document.

Don't you mean the value is the hash of the document and the key is
irrelevant?

> For merged mining, the key is the hash of
> the genesis block of the altchain, and the value is the hash of the
> aux-pow (for p2pool, the share hash).

What's the advantage over the direction-based system I proposed before?
Seems to me the code required to validate the proof is significantly
more complex in your scheme.

http://www.mail-archive.com/bitcoin-development@lists.sourceforge.net/msg03149.html

> In the system I have in mind this adds 43 bytes to the coinbase
> transaction,

By 43 bytes you mean the whole op_return txout right?

> >>     >>> dict = AuthTree()
> >>     >>> dict['Curie'] = VARINT(1898)
> >>     >>> dict['Einstein'] = VARINT(1905)
> >>     >>> dict['Fleming'] = VARINT(1928)
> >>     >>> dict['中本'] = VARINT(2009)
> > 
> > I'd be inclined to leave the unicode out of the code examples as
> > many editors and shells still don't copy-and-paste it nicely. Using
> > it in BIP documents themselves is fine and often has advantages re:
> > typesetting, but using it in crypto examples like this just makes
> > it harder to reproduce the results by hand unnecessarily.
> 
> Thanks for the feedback, I rather agree. When I was creating that
> example for some reason I wanted the right branch of the root node to
> be used, which is difficult when only 7-bit ASCII keys are used. But I
> don't think the illustrative point I had in mind ended up being
> particularly relevant, so I'll rework it.

That example is python, so I'd suggest just using escape sequences
myself. You probably also should include the "b" prefix to make the
strings explicitly binary for py3 compatibility, ie dict[b'\xbe\xef']
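
For illustration, the non-ASCII key from the quoted example can be
spelled as an escaped byte string. This sketch only shows the key
spelling, assuming the key was UTF-8 encoded; it leaves out AuthTree and
VARINT, which come from the proposal's Python implementation.

```python
# The two-character key from the quoted example, written as raw bytes.
# Assumption: the original key was encoded as UTF-8.
key = b'\xe4\xb8\xad\xe6\x9c\xac'

# Same bytes as encoding the unicode text, but safe to copy and paste.
assert key == u'\u4e2d\u672c'.encode('utf-8')

# The first byte has its high bit set, which no 7-bit ASCII key can have.
assert key[0:1] == b'\xe4'

print(len(key))  # 6
```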

-- 
'peter'[:-1]@petertodd.org
000000000000000216e3750a9ad9584395352d728a3c543844eab3bfc9ce1073

* Re: [Bitcoin-development] BIP proposal: Authenticated prefix trees
  2013-12-20  1:47 [Bitcoin-development] BIP proposal: Authenticated prefix trees Mark Friedenbach
  2013-12-20  6:48 ` Jeremy Spilman
  2013-12-20 10:48 ` Peter Todd
@ 2013-12-20 19:48 ` Gregory Maxwell
  2013-12-20 22:04   ` Mark Friedenbach
  2014-01-05 18:43 ` Thomas Voegtlin
  3 siblings, 1 reply; 14+ messages in thread
From: Gregory Maxwell @ 2013-12-20 19:48 UTC (permalink / raw)
  To: Mark Friedenbach; +Cc: Bitcoin Dev

On Thu, Dec 19, 2013 at 5:47 PM, Mark Friedenbach <mark@monetize•io> wrote:
> Hello fellow bitcoin developers. Included below is the first draft of
> a BIP for a new Merkle-compressed data structure. The need for this
> data structure arose out of the misnamed "Ultimate blockchain
> compression" project, but it has since been recognized to have many
> other applications.

A couple of very early comments: I shared some of these with you on IRC
but I thought I'd post them so they are less likely to get lost.

What's a VARCHAR()? A zero-terminated string? A length-prefixed
string? How is the length encoded? Hopefully not in a way that has
redundancy, since things that don't survive a serialization round trip
are a major trap.

Is the 'middle' the best place for the extradata? Have you
contemplated the possibility that some applications might use midstate
compression?

On that general subject: since the structure here pretty much always
guarantees two compression function invocations, SHA512/256 might
actually be faster in this application.

Re: using sha256 instead of sha256^2, we need to think carefully about
the implications of Merkle-Damgard generic length extension attacks.
It would be unfortunate to introduce them here, even though they're
currently mostly theoretical for sha256.

WRT hash function performance, hash functions are so ludicrously fast
(and will be more so as processors get SHA2 instructions) that the
performance of the raw compression function would hardly ever be a
performance consideration unless you're using a slow interpreted
language (... and that sounds like a personal problem to me). So I
don't think CPU performance should be a major consideration in this
BIP.

What I do think should be a consideration is the cost of validating
the structure under a zero-knowledge proof. An example application is
a blind proof for a SIN or a proof of how much coin you control... or
even a proof that a block was a correctly validated one, and in these
cases additional compression function calls are indeed pretty
expensive. But they're not the only cost: any conditional logic in the
hash tree evaluation is expensive, and in particular I think that any
place where data from children will be combined with a variable offset
(especially if it's not word aligned) would potentially be rather
expensive.

I'm unconvinced about the prefix tree compressed applications, since
they break compact update proofs.  If we used them in the Bitcoin
network they could only be used for data where all verifying nodes had
all their data under the tree. I think they add a lot of complexity to
the BIP (esp from people reading the wrong section), so perhaps they
should be split into another document?

In any case, I want to thank you for taking the time to write this
up. You've been working on this stuff for a while and I think it will
lead to useful results, even if we don't end up using it how it was
originally envisioned.

* Re: [Bitcoin-development] BIP proposal: Authenticated prefix trees
  2013-12-20 19:48 ` Gregory Maxwell
@ 2013-12-20 22:04   ` Mark Friedenbach
  0 siblings, 0 replies; 14+ messages in thread
From: Mark Friedenbach @ 2013-12-20 22:04 UTC (permalink / raw)
  To: Gregory Maxwell, Bitcoin Dev

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 12/20/2013 11:48 AM, Gregory Maxwell wrote:
> A couple of very early comments: I shared some of these with you on
> IRC but I thought I'd post them so they are less likely to get
> lost.

I got the inputs from IRC, but thank you for posting to the list so
that others can see and review.

> What's a VARCHAR()? A zero-terminated string? A length-prefixed
> string? How is the length encoded? Hopefully not in a way that
> has redundancy, since things that don't survive a serialization
> round trip are a major trap.

A length-prefixed string, using the shortest representation VARINT for
the length. Same as how scripts are serialized in transactions.
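
A minimal sketch of that encoding, assuming VARINT is the same
variable-length integer ("CompactSize") used for script lengths in
serialized transactions. The function names are illustrative and not
taken from the reference implementation. The decoder rejects
non-shortest length prefixes, so each string has exactly one valid
encoding and survives a serialization round trip.

```python
import struct

# Marker byte -> (struct format of the wide integer, smallest value
# that is allowed to use that width).
_WIDE = {
    b'\xfd': ('<H', 0xfd),
    b'\xfe': ('<I', 0x10000),
    b'\xff': ('<Q', 0x100000000),
}


def ser_varint(n):
    """Shortest CompactSize encoding of a non-negative integer."""
    if n < 0xfd:
        return struct.pack('<B', n)
    if n <= 0xffff:
        return b'\xfd' + struct.pack('<H', n)
    if n <= 0xffffffff:
        return b'\xfe' + struct.pack('<I', n)
    return b'\xff' + struct.pack('<Q', n)


def deser_varint(data):
    """Return (value, bytes_consumed); reject non-shortest encodings."""
    marker = data[0:1]
    if marker not in _WIDE:
        return struct.unpack('<B', marker)[0], 1
    fmt, minimum = _WIDE[marker]
    size = struct.calcsize(fmt)
    (n,) = struct.unpack(fmt, data[1:1 + size])
    if n < minimum:
        raise ValueError('non-canonical VARINT')
    return n, 1 + size


def ser_varchar(s):
    """Length-prefixed bytestring."""
    return ser_varint(len(s)) + s


def deser_varchar(data):
    n, used = deser_varint(data)
    if len(data) < used + n:
        raise ValueError('truncated VARCHAR')
    return data[used:used + n]


print(ser_varchar(b'') == b'\x00')       # True: empty string is one zero byte
print(deser_varchar(ser_varchar(b'abc')))
```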

> Is the 'middle' the best place for the extradata? Have you 
> contemplated the possibility that some applications might use
> midstate compression?

Yes I considered midstate compression which is why the branch hashes
come last, but "extra" was an oversight. In every application I've
considered it's either not used (and therefore a single byte), or
updated whenever the node or its children updates.

Honestly I don't expect midstate compression to offer much since in
the nodes that are updated frequently it is unlikely that there will
be enough static data at the front to fill even a 512 bit block of the
smaller hash function.

But it doesn't hurt to prepare just in case. I'll move it to the end.

> On that general subject: since the structure here pretty much
> always guarantees two compression function invocations, SHA512/256
> might actually be faster in this application.

Yes, this is a great suggestion. Moving to SHA-512/256 will let most
inner nodes fit inside a single block, so long as the "extra" field is
not too long. Also apparently SHA-512 is faster on 64-bit CPUs, which
is a nice advantage. I didn't know that.
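
To make the block-count argument concrete, here is a small sketch that
counts compression function calls from the padding rules of the two hash
families: SHA-256 uses 64-byte blocks and appends at least 1 + 8 bytes of
padding, while SHA-512 and its truncated variants use 128-byte blocks and
append at least 1 + 16 bytes. The 66-byte node size is an assumption for
illustration only (one flags byte, an empty "extra" field, and two
32-byte branch hashes, ignoring any count and size suffixes).

```python
def compression_calls(msg_len, block_size, length_field):
    """Compression function invocations needed to hash msg_len bytes."""
    # Padding adds one 0x80 byte plus the encoded message length.
    padded = msg_len + 1 + length_field
    return -(-padded // block_size)  # ceiling division


# Assumed inner node: flags + empty extra + two 32-byte branch hashes.
node_len = 1 + 1 + 32 + 32

print(compression_calls(node_len, 64, 8))    # SHA-256: 2
print(compression_calls(node_len, 128, 16))  # SHA-512/256: 1
```

Under these padding rules a single SHA-512 block holds up to 111 message
bytes, against 55 for SHA-256.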

I'm concerned about speed but I did not go with a faster hash function
because users are more likely to have hardware acceleration for the
SHA-2 family.

> Re: using sha256 instead of sha256^2, we need to think carefully
> about the implications of Merkle-Damgard generic length extension
> attacks. It would be unfortunate to introduce them here, even
> though they're currently mostly theoretical for sha256.

The serialization format encodes lengths in such a way that you cannot
extend the data structure merely by appending bits. You would have to
change the prior, already hashed bits as well. I believe this makes it
immune to length extension attacks.

> WRT hash function performance, hash functions are so ludicrously
> fast (and will be more so as processors get SHA2 instructions) that
> the performance of the raw compression function would hardly ever
> be a performance consideration unless you're using a slow
> interpreted language (... and that sounds like a personal problem
> to me). So I don't think CPU performance should be a major
> consideration in this BIP.

Well.. the UTXO tree is big. Let's assume 5,000 transactions per
block, with an average of 3 inputs/outputs per transaction. This is
close to the worst-case scenario with the current block size. That's
15,000 insert, update, or delete operations.

The number of hashes required per operation when level-compression is
used is log2 of the number of items in the tree, which for bitcoin is
currently about 2.5 million transactions. So that's about 21 hashes per
input/output, or 315,000 hash operations. A CPU is able to do about
100,000 hashes per second per core, so that'll probably take about a
second on a modern 4- or 8-core machine.

For updatable proofs, the number of hash operations is equal to the
number of bits in the key, which for the validation index is always
256. That means 3.84 million hashes, or about 10 seconds on a 4-core
machine.
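
The estimates above can be reproduced with a few lines of arithmetic.
All inputs are the figures stated in this message; the variable names
are only illustrative.

```python
import math

tx_per_block = 5000              # assumed near worst case per block
ops_per_tx = 3                   # average inputs/outputs per transaction
tree_items = 2.5e6               # items currently in the tree
hashes_per_sec_per_core = 100000
cores = 4
key_bits = 256                   # validation index key length

ops = tx_per_block * ops_per_tx                    # tree operations per block
depth = int(round(math.log(tree_items, 2)))        # hashes per op, compressed
compressed_hashes = ops * depth
updatable_hashes = ops * key_bits

rate = float(hashes_per_sec_per_core * cores)
print(ops, depth)
print(compressed_hashes, compressed_hashes / rate)  # level-compressed case
print(updatable_hashes, updatable_hashes / rate)    # updatable-proof case
```

With these inputs the level-compressed case comes to 315,000 hashes
(roughly 0.8 seconds on four cores) and the updatable-proof case to 3.84
million hashes (roughly 9.6 seconds).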

The numbers for the wallet index are worse, as it scales with the
number of outputs, which is necessarily larger, and the keys are longer.

This is not an insignificant cost in the near term, although it is the
type of operation that could be easily offloaded to a GPU or FPGA.

> What I do think should be a consideration is the cost of
> validating the structure under a zero-knowledge proof. An example
> application is a blind proof for a SIN or a proof of how much coin
> you control... or even a proof that a block was a correctly
> validated one, and in these cases additional compression function
> calls are indeed pretty expensive. But they're not the only cost:
> any conditional logic in the hash tree evaluation is expensive, and
> in particular, I think that any place where data from children will
> be combined with a variable offset (especially if it's not word
> aligned) would potentially be rather expensive.

This is something I know less about, and I welcome constructive input.
There is *no* reason that the hash serialization needs to have fancy
space-saving features. You could even make the SIG_HASH node
serialization into fixed-size, word-aligned data structures.

But this is absolutely not my field, and I may need some hand-holding.
Do the fields need to be at fixed offsets? With fixed widths? Should I
put variable-length stuff like the level-compressed prefixes and value
data at the end (midstate be damned) to keep fixed offsets? What's
expected word alignment, 32-bit or 64-bit?

> I'm unconvinced about the prefix tree compressed applications,
> since they break compact update proofs.  If we used them in the
> Bitcoin network they could only be used for data where all
> verifying nodes had all their data under the tree. I think they add
> a lot of complexity to the BIP (esp from people reading the wrong
> section), so perhaps they should be split into another document?

I believe what you mean by "compact update proofs" is what I call
"updatable proofs", where level compression is only used in the disk
and network serialization. These are what I propose to use for the
validation and wallet indexes, if the computational costs can be
brought under control, because it allows composable proofs.

Unlike a time-ordered index, it does require that someone, somewhere
has random access to the entire UTXO set since you can't predict in
advance what your txid will be. But this is a matter of tradeoffs and
I don't believe at this time that there is a clear-cut advantage of one
approach over the other. I'm pursuing a txid-indexed UTXO set because
it most closely matches the current operational model.

That said, you still want level-compression within the serialization
format itself, if for no other reason than to keep proof sizes small.

> In any case, I want to thank you for taking the time to write
> this up. You've been working on this stuff for a while and I think
> it will lead to useful results, even if we don't end up using it
> how it was originally envisioned.

Thanks,
Mark
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.14 (GNU/Linux)
Comment: GPGTools - http://gpgtools.org
Comment: Using GnuPG with Thunderbird - http://www.enigmail.net/

iQIcBAEBAgAGBQJStL7nAAoJEAdzVfsmodw4DQcP/ilB2LTPnbK/UoU+y0d/0CUu
4PVo8VJt0KCUgWbEHIohm0rq4FUpb7FpjzAyQ171jzRykDEkUy7nDh/QsWGUvDvA
gOHKEsX3E+ei8iQMkwlw5D/Lpbb8GNr3SHrU3lvVbXOoaPua9I16778hv3wBWhiN
R70N8dQUwWD1IU0Dfmhi8v2P8OTn4OGTEwS5AQANGCroYyALF+U9EDHjWDMV+bYn
8qrX4v05xjik5YXOv8PNDDp0S9A+KxD72OKL5xlXiE7VbKrYXKt6xNfy1xYgHH8p
u9kWDFMkbis/HAiB5aiFTmxX5/k+yeJw8BfG+txj0xo7b7cWKB9cQLT8vUru2QuH
lHdurxkaBQ+6jqlxYRk7nh0h+obeAXA/CGMseaDYluBg7qTkeWnLORfm7T7fUnHw
fB5sXPUKEeYw48sfs58w/71NbCyl2yYNGlmmugk2SilD3QbUKU1xogNTHEGDuA8M
kPsWW7vRIdI3iy9adgh3LZAvySt7/a5VXXs1li7teDgV4QqH7e2hR0KR8n115N7f
r30LSctbc/MovE9VPb8I7ssQTB7So+1Ki6DbVeQO/8UlCSK5prM3n2sICmT/EVW7
2hNzwbHuEJEWYE7q89buzMRdqbUYSRdG1T1mFBeZ+/n4HH6cweMl6BH4d46LAfuq
BqzTmq5neoCKBwfMfoqg
=YmkZ
-----END PGP SIGNATURE-----

* Re: [Bitcoin-development] BIP proposal: Authenticated prefix trees
  2013-12-20  1:47 [Bitcoin-development] BIP proposal: Authenticated prefix trees Mark Friedenbach
                   ` (2 preceding siblings ...)
  2013-12-20 19:48 ` Gregory Maxwell
@ 2014-01-05 18:43 ` Thomas Voegtlin
  2014-01-06 18:13   ` Peter Todd
  3 siblings, 1 reply; 14+ messages in thread
From: Thomas Voegtlin @ 2014-01-05 18:43 UTC (permalink / raw)
  To: bitcoin-development

Hello and happy new year to this mailing list!


Thank you Mark for the incredible work you've been doing on this.
I am following this very closely, because it is of primary importance
for Electrum.

I have written a Python-LevelDB implementation of this UTXO hashtree,
which is currently being tested, and will be added to Electrum servers.

My implementation follows Alan Reiner's idea to store the tree as items
in a key-value database. I believe that a C++ implementation like yours
will be at least an order of magnitude faster, and I am looking forward 
to it.

I too believe that BIPs should define interoperability points, but probably
not implementation details. For the UTXO hashtree, this means that a BIP
should at least specify how the root hash is constructed. This might be the
only thing that needs to be specified.

However, I see no pressing issue with writing a BIP; it might be preferable
to implement and test different options first, and learn from that.

Thomas



On 20/12/2013 02:47, Mark Friedenbach wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> Hello fellow bitcoin developers. Included below is the first draft of
> a BIP for a new Merkle-compressed data structure. The need for this
> data structure arose out of the misnamed "Ultimate blockchain
> compression" project, but it has since been recognized to have many
> other applications.
>
> In addition to this BIP I am preparing three additional BIPs
> describing the use of this data structure in stateless validation &
> mining, the UBC address index for "SPV+" operating modes, document
> timestamping and merged mining.
>
> A Python implementation of this data structure is available here:
>
> https://github.com/monetizeio/python-bitcoin
>
> A C++ implementation is being worked on.
>
> As per the BIP-1 procedure, I am submitting this rough draft to the
> community for discussion. I welcome all comments and criticisms of
> both form and content.
>
> - -Mark
>
>
> ==Abstract==
>
> This BIP describes a [http://en.wikipedia.org/wiki/Hash_tree Merkle
> hash tree] variant of the [http://en.wikipedia.org/wiki/Trie
> prefix-tree data structure], ideally suited for encoding key-value
> indices which support memory-efficient proofs.
>
> ==Motivation==
>
> There are a number of applications which would benefit from having a
> data structure with the following properties:
>
> * '''Arbitrary mapping of keys to values.''' A ''key'' can be any
> bytestring, and its ''value'' any other bytestring.
> * '''Duplicate keys disallowed.''' Every key has one, and only one
> value associated with it. Some applications demand assurance that no
> key value is reused, and that this constraint can be checked without
> requiring access to the entire data structure.
> * '''Efficient look-up by key.''' The data structure should support
> sub-linear lookup operations with respect to the number of keys in the
> mapping. Logarithmic time or linear with respect to the length of the
> key should be achievable and would be sufficient for realistic
> applications.
> * '''Merkle compression of mapping structure.''' It should be possible
> to produce a reduced description of the tree consisting of a single
> root hash value which is deterministically calculated from the mapping
> structure.
> * '''Efficient proofs of inclusion.''' It should be possible to
> extract a proof of key/value mapping which is limited in size and
> verification time by the length of the key in the worst case.
> * '''Computation of updates using local information.''' Given a set of
> inclusion proofs, it should be possible to calculate adjustments to
> the local mapping structure (update or deletion of included mappings,
> or insertion between two included mappings which are adjacent in the
> global structure).
>
> Such applications include committed validation indices which enable
> stateless mining nodes, committed wallet indices which enable
> trust-less querying of the unspent transaction output set by
> <code>scriptPubKey</code>, efficient document time-stamping, and
> secure & efficient merged mining. This BIP describes an authenticated
> prefix tree which has the above properties, but leaves the myriad
> applications to be formalized in future BIPs.
>
> ==Data structure==
>
> This BIP defines a binary prefix tree. Such a structure provides a
> mapping of bitstrings (the ''keys'') to bytestrings (the ''values'').
> It is an acyclic binary tree which implicitly encodes keys within the
> traversal path -- a "left" branch is a 0, and a "right" branch is a 1.
> Each node is reachable by only one unique path, and reading off the
> branches taken (0 for each left, 1 for each right) as one follows the
> path from root to target yields the node's key.
>
> The particular binary prefix tree defined by this BIP is a hybrid
> PATRICIA / de la Briandais tree structure.
> [http://en.wikipedia.org/wiki/Radix_tree PATRICIA trees] compress a
> long sequence of non-branching nodes into a single interior node with
> a per-branch ''skip prefix''. This achieves significant savings in
> storage space, root hash calculation, and traversal time.
>
> A de la Briandais trie achieves compression by only storing branches
> actually taken in a node. The space savings are minimal for a binary
> tree, but place the serialized size of a non-branching interior node
> under the SHA-256 block size, thereby reducing the number of hash
> operations required to perform updates and validate proofs.
>
> This BIP describes the authenticated prefix tree and its many
> variations in terms of its serialized representation. Additional BIPs
> describe the application of authenticated prefix trees to such
> applications as committed indices, document time-stamping, and merged
> mining.
>
> ==Serialization format==
>
> As a hierarchical structure, the serialization of an entire tree is
> the serialization of its root node. A serialized node is the
> concatenation of five structures:
>
>      node := flags || VARCHAR(extra) || value || left || right
>
> The <code>flags</code> is a single byte field whose composite values
> determine the bytes that follow.
>
>      flags = (left_flags  << 0) |
>              (right_flags << 2) |
>              (has_value   << 4) |
>              (prune_left  << 5) |
>              (prune_right << 6) |
>              (prune_value << 7)
>
> The <code>left_flags</code> and <code>right_flags</code> are special
> 2-bit enumeration fields. A value of 0 indicates that the node does
> not branch in this direction, and the corresponding <code>left</code>
> or <code>right</code> branch is missing (replaced with the empty
> string in the node serialization). A value of 1 indicates a single bit
> key prefix for this branch, implicitly 0 for <code>left</code> and 1
> for <code>right</code>. A 2 indicates up to 7 bits of additional skip
> prefix (beyond the implicit first bit, making 8 bits total) are stored
> in a compact single-byte format. A 3 indicates a skip prefix with
> greater than 7 additional bits, stored length-prefix encoded.
>
> The single bit <code>has_value</code> indicates whether the node
> stores a data bytestring, the value associated with its key prefix.
> Since keys may be any value or length, including one key being a
> prefix of another, it is possible for interior nodes in addition to
> leaf nodes to have values associated with them, and therefore an
> explicit value-existence bit is required.
>
> The remaining three bits are used for proof extraction, and are masked
> away prior to hash operations. <code>prune_left</code> indicates that
> the entire left branch has been pruned. <code>prune_right</code> has
> similar meaning for the right branch. If <code>has_value</code> is
> set, <code>prune_value</code> may be set to exclude the node's value
> from the encoded proof. This is a necessary field for interior nodes,
> since it is possible that their values may be pruned while their
> children are not.
>
> The <code>value</code> field is only present if the bit
> <code>flags.has_value</code> is set, in which case it is a
> <code>VARCHAR</code> bytestring:
>
>      switch flags.has_value:
>        case 0:
>          value := ε
>        case 1:
>          value := VARCHAR(node.value)
>
> The <code>extra</code> field is always present, and takes on a
> bytestring value defined by the particular application. Use of the
> <code>extra</code> field is application dependent, and will not be
> covered in this specification. It can be set to the empty bytestring
> (serialized as a single zero byte) if the application has no use for
> the <code>extra</code> field.
>
>      extra := calculate_extra(node)
>
> The <code>left</code> and <code>right</code> non-terminals are only
> present if the corresponding <code>flags.left_flags</code> or
> <code>flags.right_flags</code> are non-zero. The format depends on the
> value of this flags setting:
>
>      switch branch_flags:
>        case 0:
>          branch := ε
>        case 1:
>          branch := branch_node_or_hash
>        case 2:
>          prefix  = prefix >> 1
>          branch := int_to_byte(1 << len(prefix) | bits_to_int(prefix)) ||
>                    branch_node_or_hash
>        case 3:
>          prefix  = prefix >> 1
>          branch := VARINT(len(prefix) - 9) ||
>                    bits_to_string(prefix) ||
>                    branch_node_or_hash
>
> <code>branch_flags</code> is a stand-in meant to describe either
> <code>left_flags</code> or <code>right_flags</code>, and likewise
> everywhere else in the above pseudocode <code>branch</code> can be
> replaced with either <code>left</code> or <code>right</code>.
>
> <code>prefix</code> is the key bits between the current node and the
> next branching, terminal, and/or leaf node, including the implicit
> leading bit for the branch (0 for the left branch, 1 for the right
> branch). In the above code, <code>len(prefix)</code> returns the
> number of bits in the bitstring, and <code>prefix >> 1</code> drops
> the first bit reducing the size of the bitstring by one and
> renumbering the indices accordingly.
>
> The function <code>int_to_byte</code> takes an integer in the range
> [0, 255] and returns the octet representing that value. This is a NOP
> in many languages, but present in this pseudocode so as to be explicit
> about what is going on.
>
> The function <code>bits_to_int</code> interprets a sequence of bits as
> a little-endian integer value. This is analogous to the following
> pseudocode:
>
>      def bits_to_int(bits):
>          result = 0
>          for idx in 0..len(bits)-1:
>              if bits[idx] == 1:
>                  result |= 1<<idx
>          return result
>
> The function <code>bits_to_string</code> serializes a sequence of bits
> into a binary string. It uses little-endian bit and byte order, as
> demonstrated by the following pseudocode:
>
>      def bits_to_string(bits):
>          bytes = [0] * ceil(len(bits) / 8)
>          for idx in 0..len(bits)-1:
>              if bits[idx] == 1:
>                  bytes[floor(idx / 8)] |= 1 << (idx % 8)
>          return map(int_to_byte, bytes)
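For anyone who wants to check the prefix encoding against the worked
example further down, here is a minimal runnable Python sketch of the two
helpers and of the case 2 / case 3 prefix encoding. The name encode_prefix
and the list-of-bits representation are illustrative assumptions, not taken
from the reference implementation. The sketch also assumes zero-based bit
indices, a single-byte VARINT for small lengths, and that the case 3 length
is the number of bits left after the implicit leading bit is dropped, minus
8. That last assumption is what reproduces the bytes in the serialized
example (0x11 for the 0b01000 node, 0x1abb3a599a02 for the 'Curie' branch).

```python
from math import ceil

def bits_to_int(bits):
    """Interpret a list of 0/1 bits as a little-endian integer (bits[0] is the LSB)."""
    result = 0
    for idx, bit in enumerate(bits):
        if bit:
            result |= 1 << idx
    return result

def bits_to_string(bits):
    """Pack bits into bytes, little-endian bit order within each byte."""
    out = [0] * ceil(len(bits) / 8)
    for idx, bit in enumerate(bits):
        if bit:
            out[idx // 8] |= 1 << (idx % 8)
    return bytes(out)

def encode_prefix(prefix):
    """Encode a branch prefix that still includes its implicit leading bit.

    Returns (branch_flags, encoded_bytes). The empty branch (flags 0) is
    not handled here.
    """
    rest = prefix[1:]  # drop the implicit left/right bit
    if not rest:
        return 1, b""
    if len(rest) < 8:
        return 2, bytes([(1 << len(rest)) | bits_to_int(rest)])
    length = len(rest) - 8
    assert length < 0xfd, "longer prefixes need a multi-byte VARINT"
    return 3, bytes([length]) + bits_to_string(rest)

def key_bits(key):
    """Expand a byte string into a list of bits, most significant bit first."""
    return [(byte >> shift) & 1 for byte in key for shift in range(7, -1, -1)]
```

Calling encode_prefix(key_bits(b"Curie")[5:]) encodes the 35 key bits that
remain below the shared 0b01000 node, including the implicit left bit.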
>
> <code>branch_node_or_hash</code> is either the serialized child node
> or its SHA-256 hash and associated meta-data. Context determines which
> value to use: during digest calculations, disk/database serialization,
> and when the branch is pruned the hash value is used and serialized in
> the same way as other SHA-256 values in the bitcoin protocol (note
> however that it is single-SHA-256, not the double-SHA-256 more
> commonly used in bitcoin). The number of terminal (value-containing)
> nodes and the serialized size in bytes of the fully unpruned branch
> are suffixed to the branch hash. When serializing a proof or
> snapshotting tree state and the branch is not pruned, the serialized
> child node is included directly and the count and size are omitted as
> they can be derived from the serialization.
>
>      if branch_pruned or SER_HASH:
>          branch_node_or_hash := SHA-256(branch) ||
>                                 count(branch) ||
>                                 size(branch)
>      else:
>          branch_node_or_hash := serialize(branch)
>
> As an example, here is the serialization of a prefix tree mapping the
> names of men and women of science to the year of their greatest
> publication:
>
>      >>> dict = AuthTree()
>      >>> dict['Curie'] = VARINT(1898)
>      >>> dict['Einstein'] = VARINT(1905)
>      >>> dict['Fleming'] = VARINT(1928)
>      >>> dict['中本'] = VARINT(2009)
>      >>> dict.serialize()
>      # A bytestring, broken out into parts:
>
>      # . Root node:
>      0x0e # left_flags: 2, right_flags: 3, has_value: 0
>      0x00 # extra: ε
>
>      # .l Inner node: 0b01000
>      0x11 # 0b01000
>      0x07 # left_flags: 3, right_flags: 1
>      0x00 # extra: ε
>
>      # .l.l Inner node: 0b01000011 0b01110101 0b01110010 0b01101001
>      #                  'C'        'u'        'r'        'i'
>      #                  0b01100101
>      #                  'e'
>      0x1abb3a599a02 # 0b01101110101011100100110100101100101
>      0x10           # has_value: 1
>      0x00           # extra: ε
>      0x03fd6a07     # value: VARINT(1898)
>
>      # .l.r Inner node: 0b010001
>      0x0f # left_flags: 3, right_flags: 3
>      0x00 # extra: ε
>
>      # .l.r.l Inner node: 0b01000101 0b01101001 0b01101110 0b01110011
>      #                    'E'        'i'        'n'        's'
>      #                    0b01110100 0b01100101 0b01101001 0b01101110
>      #                    't'        'e'        'i'        'n'
>      0x312ded9c5d4c2ded00 # 0b1011010010110111
>                           # 0b0011100110111010
>                           # 0b0011001010110100
>                           # 0b101101110
>      0x10                 # has_value: 1
>      0x00                 # extra: ε
>      0x03fd7107           # value: VARINT(1905)
>
>      # .l.r.r Inner node: 0b01000110 0b01101100 0b01100101 0b01101101
>      #                    'F'        'l'        'e'        'm'
>      #                    0b01101001 0b01101110 0b01100111
>      #                    'i'        'n'        'g'
>      0x296c4c6d2dedcc01 # 0b0011011000110010
>                         # 0b1011011010110100
>                         # 0b10110111001100111
>      0x10               # has_value: 1
>      0x00               # extra: ε
>      0x03fd8807         # value: VARINT(1928)
>
>      # .r Inner node: 0b11100100 0b10111000 0b10101101
>      #                '中'
>      #                0b11100110 0b10011100 0b10101100
>      #                '本'
>      0x27938edab39c1a # 0b1100100101110001
>                       # 0b0101101111001101
>                       # 0b001110010101100
>      0x10             # has_value: 1
>      0x00             # extra: ε
>      0x03fdd907       # value: VARINT(2009)
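The flag bytes and value fields in the listing above can be checked
mechanically. The sketch below assumes the flags layout implied by the
example bytes (left_flags in the two lowest bits, right_flags in the next
two, has_value in bit 4) and that VARINT is Bitcoin's usual variable-length
integer encoding. decode_flags, parse_varint and parse_value are
illustrative names, not part of the reference implementation.

```python
import struct

def decode_flags(byte):
    """Split a node flags byte into (left_flags, right_flags, has_value)."""
    return byte & 0x3, (byte >> 2) & 0x3, (byte >> 4) & 0x1

def parse_varint(data, offset=0):
    """Parse a Bitcoin-style variable-length integer.

    Returns (value, new_offset).
    """
    first = data[offset]
    if first < 0xfd:
        return first, offset + 1
    size, fmt = {0xfd: (2, "<H"), 0xfe: (4, "<I"), 0xff: (8, "<Q")}[first]
    (value,) = struct.unpack_from(fmt, data, offset + 1)
    return value, offset + 1 + size

def parse_value(field):
    """Parse a VARCHAR-wrapped VARINT such as 0x03fd6a07."""
    length, pos = parse_varint(field)
    payload = field[pos:pos + length]
    return parse_varint(payload)[0]
```

For instance, parse_value(bytes.fromhex("03fd6a07")) recovers the year
stored under 'Curie', and decode_flags(0x07) gives the flags of the shared
0b01000 node.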
>
> ==Hashing==
>
> There are two variations of the authenticated prefix tree presented in
> this draft BIP. They differ only in the way in which hash values of a
> node and its left/right branches are constructed. The variations,
> discussed below, trade off computational resources for the ability to
> compose operational proofs. Whether the performance hit is
> significant, and whether or not the added features are worth the
> tradeoff depends very much on the application.
>
> ===Variation 1: Level-compressed hashing===
>
> In this variation the referenced child node's hash is used in
> construction of an interior node's hash digest. The interior node is
> serialized just as described (using the child node's digest instead of
> inline serialization), the resulting bytestring is passed through one
> round of SHA-256, and the digest that comes out of that is the hash
> value of the node. This is very efficient to calculate, requiring the
> absolute minimum number of SHA-256 hash operations, and achieving
> level-compression of computational resources in addition to reduction
> of space usage.
>
> For example:
>
>      >>> dict = AuthTree()
>      >>> dict['a'] = 0xff
>      >>> dict.serialize()
>      0x0200c3100001ff
>      >>> dict.root
>      AuthTreeNode(
>          left_prefix = 0b01100001,
>          left_hash   =
> 0xbafa0e2bba3396c5e9804b6cbe61be82bc442c1121aed81f8d5de36e9b20dc2f,
>          left_count  = 1,
>          left_size   = 4)
>      >>> dict.hash
>      0xb4837376022a7c9ddaa7d685ad183bcbd5d16c362b81fa293a7b9e911766cf3c
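The seven serialized bytes in this example can be reassembled by hand,
which may help when checking another implementation. The sketch assumes the
field order used by the serialization examples earlier in this draft
(flags, extra, prefix, child). It also computes a plain single SHA-256 of
the leaf's serialization purely as an illustration of "one round of
SHA-256"; it makes no claim about the exact preimage or byte order behind
the left_hash shown above.

```python
import hashlib

# Leaf node holding the value 0xff: has_value flag, empty extra, VARCHAR(value).
leaf = bytes([0x10]) + bytes([0x00]) + bytes([0x01, 0xff])

# Key 'a' is 0b01100001. The leading 0 selects the left branch and is implicit;
# the remaining seven bits are packed little-endian under a marker bit.
remaining = [1, 1, 0, 0, 0, 0, 1]
prefix_byte = (1 << len(remaining)) | sum(bit << i for i, bit in enumerate(remaining))

# Root node: left_flags = 2 (short prefix), no right branch, no value.
root = bytes([0x02]) + bytes([0x00]) + bytes([prefix_byte]) + leaf

# One round of SHA-256 over the leaf serialization (illustrative only).
leaf_digest = hashlib.sha256(leaf).hexdigest()
```

The leaf is four bytes long, which agrees with left_size = 4 in the node
dump above.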
>
> Assuming uniform distribution of key values, level-compressed hashing
> has time-complexity logarithmic with respect to the number of keys in
> the prefix tree. The disadvantage is that it is not possible in
> general to "rebase" an operational proof on top of a sibling,
> particularly if that sibling deletes branches that result in
> reorganization and level compression of internal nodes used by the
> rebased proof.
>
> ===Variation 2: Proof-updatable hashing===
>
> In this variation, level-compressed branches are expanded into a
> series of chained single-branch internal nodes, each including the
> hash of its direct child. For a branch with a prefix N bits in length,
> this requires N chained hashes. Thanks to node-compression (excluding
> empty branches from the serialization), it is possible for each hash
> operation + padding to fit within a single SHA-256 block.
>
> Note that the serialization semantics are unchanged! The variation
> only changes the procedure for calculating the hash values of interior
> nodes. The serialization format remains the same (modulo differing
> hash values standing in for pruned branches).
>
> Using the above example, calling <code>dict.hash</code> causes the
> following internal nodes to be constructed:
>
>      >>> node1 = AuthTreeNode(
>          right_prefix = 0b1,
>          right_hash   =
> 0xbafa0e2bba3396c5e9804b6cbe61be82bc442c1121aed81f8d5de36e9b20dc2f,
>          right_count  = 1,
>          right_size   = 4)
>      >>> node2 = AuthTreeNode( left_prefix=0b0,  left_hash=node1.hash,
>   left_count=1,  left_size=4)
>      >>> node3 = AuthTreeNode( left_prefix=0b0,  left_hash=node2.hash,
>   left_count=1,  left_size=4)
>      >>> node4 = AuthTreeNode( left_prefix=0b0,  left_hash=node3.hash,
>   left_count=1,  left_size=4)
>      >>> node5 = AuthTreeNode( left_prefix=0b0,  left_hash=node4.hash,
>   left_count=1,  left_size=4)
>      >>> node6 = AuthTreeNode(right_prefix=0b1, right_hash=node5.hash,
> right_count=1, right_size=4)
>      >>> node7 = AuthTreeNode(right_prefix=0b1, right_hash=node6.hash,
> right_count=1, right_size=4)
>      >>> node8 = AuthTreeNode( left_prefix=0b0,  left_hash=node7.hash,
>   left_count=1,  left_size=4,
>                                value=0xff)
>      >>> dict.hash == node8.hash
>      True
>      >>> dict.hash
>      0xc3a9328eff06662ed9ff8e82aa9cc094d05f70f0953828ea8c643c4679213895
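The node chain above follows directly from the bits of the key. As a small
sketch (the function names are illustrative), this lists which side each
single-bit node branches on, from the deepest node up to the root:

```python
def key_bits(key):
    """Expand a byte string into a list of bits, most significant bit first."""
    return [(byte >> shift) & 1 for byte in key for shift in range(7, -1, -1)]

def chain_directions(prefix_bits):
    """Return the branch side of each chained node, deepest node first."""
    return ["right" if bit else "left" for bit in reversed(prefix_bits)]

# Key 'a' is 0b01100001, so the chain has eight single-bit nodes.
directions = chain_directions(key_bits(b"a"))
```

Each entry corresponds to one chained hash, so an N-bit prefix costs N hash
operations, as stated above.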
>
> The advantage of proof-updatable hashing is that any operational proof
> may be "rebased" onto the tree resulting from a sibling proof, using
> only the information locally available in the proofs, even in the
> presence of deletion operations that result in level-compression of
> the serialized form. The disadvantage is performance: validating an
> updatable proof requires a number of hash operations lower-bounded by
> the length of the key in bits.
>
> ==Inclusion proofs==
>
> An inclusion proof is a prefix tree pruned to contain a subset of its
> keys. The serialization of an inclusion proof takes the following form:
>
>      inclusion_proof := variant || root_hash || tree || checksum
>
> Where <code>variant</code> is a single-byte value indicating the
> presence of level-compression (0 for proof-updatable hashing, 1 for
> level-compressed hashing). <code>root_hash</code> is the Merkle
> compression hash of the tree, the 32-byte SHA-256 hash of the root
> node. <code>tree</code> is the possibly pruned, serialized
> representation of the tree. And finally, <code>checksum</code> is the
> first 4 bytes of the SHA-256 hash of <code>variant</code>,
> <code>root_hash</code>, and <code>tree</code>.
>
> For ease of transport, the standard envelope for display of an
> inclusion proof is internet-standard base64 encoding in the following
> format:
>
> - -----BEGIN INCLUSION PROOF-----
> ATzPZheRnns6KfqBKzZs0dXLOxithdan2p18KgJ2c4O0DgARBwAauzpZmgIQAAP9agcPADEt7Zxd
> TC3tABAAA/1xBylsTG0t7cwBEAAD/YgHJ5OO2rOcGhAAA/3ZByEg+2g=
> - -----END INCLUSION PROOF-----
>
> Decoded, it looks like this:
>
>      0x01 # Level-compressed hashing
>      # Merkle root:
>      0x3ccf6617919e7b3a29fa812b366cd1d5cb3b18ad85d6a7da9d7c2a02767383b4
>      # Serialized tree (unpruned):
>      0x0e001107001abb3a599a02100003fd6a070f00312ded9c5d4c2ded00100003fd
>      0x7107296c4c6d2dedcc01100003fd880727938edab39c1a100003fdd907
>      # Checksum:
>      0x2120fb68
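The envelope above can be split into its fields with a few lines of Python.
This is only a sketch: split_inclusion_proof and expected_checksum are
illustrative names, and treating the checksum as a single SHA-256 over all
bytes that precede it is this sketch's reading of the text. Compare
expected_checksum(raw) with the checksum field rather than assuming they
match.

```python
import base64
import hashlib

# The two base64 lines of the inclusion proof shown above, joined.
PROOF_B64 = (
    "ATzPZheRnns6KfqBKzZs0dXLOxithdan2p18KgJ2c4O0DgARBwAauzpZmgIQAAP9agcPADEt7Zxd"
    "TC3tABAAA/1xBylsTG0t7cwBEAAD/YgHJ5OO2rOcGhAAA/3ZByEg+2g="
)

def split_inclusion_proof(raw):
    """Split a decoded proof into (variant, root_hash, tree, checksum)."""
    if len(raw) < 1 + 32 + 4:
        raise ValueError("proof too short")
    return raw[0], raw[1:33], raw[33:-4], raw[-4:]

def expected_checksum(raw):
    """First 4 bytes of SHA-256 over everything before the checksum field."""
    return hashlib.sha256(raw[:-4]).digest()[:4]

raw = base64.b64decode(PROOF_B64)
variant, root_hash, tree, checksum = split_inclusion_proof(raw)
```

Splitting from both ends works because the variant, root hash and checksum
have fixed sizes; only the serialized tree in the middle varies in length.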
>
> ==Operational proofs==
>
> An operational proof is a list of insert/update and delete operations
> suffixed to an inclusion proof which contains the pathways necessary
> to perform the specified operations. The inclusion proof must contain
> the key values to be updated or deleted, and the nearest adjacent key
> values for each insertion. The serialization of an operational proof
> takes the following form:
>
>      operational_proof := variant || root_hash || tree ||
>                           VARLIST(delete*) || VARLIST(update*) ||
>                           new_hash || checksum
>
>      delete := VARCHAR(key)
>      update := VARCHAR(key) || VARCHAR(value)
>
> The first three fields, <code>variant</code>, <code>root_hash</code>,
> and <code>tree</code> are the inclusion proof, and take the same
> values described in the previous section. The <code>delete</code>
> list contains the key values to be deleted; each key value in this
> list must exist in the inclusion proof. The <code>update</code> list
> contains key, value mappings which are to be inserted into the tree,
> possibly replacing any mapping for the key which already exists;
> either the key itself if it exists (update), or the two
> lexicographically closest keys on either side if it does not (insert)
> must be present in the inclusion proof. <code>new_hash</code> is the
> resulting Merkle root after the insertions, updates, and deletes are
> performed, and <code>checksum</code> is the initial 4 bytes of the
> SHA-256 hash of the preceding fields.
>
> Just like inclusion proofs, an operational proof is encoded in base64
> for display and transport. Here is an operational proof against the
> same tree, deleting the key '中本':
>
> - -----BEGIN OPERATIONAL PROOF-----
> ATzPZheRnns6KfqBKzZs0dXLOxithdan2p18KgJ2c4O0LgARaIsVaQi/GdhOPOgA8p4Pu4PiEfEg
> lcmy3j7bOc7hXw0DLSeTjtqznBoQAAP92QcBMOS4reacrACzuZJbyP7fqIOf5VEk4iarG4+uPoZC
> oun8BztQMQBy0LHVeSY=
> - -----END OPERATIONAL PROOF-----
>
> Decoded and broken into its constituent fields:
>
>      0x01 # Level-compressed hashing
>      # Original Merkle root:
>      0x3ccf6617919e7b3a29fa812b366cd1d5cb3b18ad85d6a7da9d7c2a02767383b4
>      # Serialized tree (included keys: '中本'):
>      0x2e0011688b156908bf19d84e3ce800f29e0fbb83e211f12095c9b2de3edb39ce
>      0xe15f0d032d27938edab39c1a100003fdd907
>      # Deletion list ['中本']:
>      0x01
>      0x30e4b8ade69cac
>      # Insertion list []:
>      0x00
>      # New Merkle root:
>      0xb3b9925bc8fedfa8839fe55124e226ab1b8fae3e8642a2e9fc073b50310072d0
>      # Checksum:
>      0xb1d57926
>
> ~End of File~
> -----BEGIN PGP SIGNATURE-----
> Version: GnuPG v1.4.14 (GNU/Linux)
> Comment: GPGTools - http://gpgtools.org
> Comment: Using GnuPG with Thunderbird - http://www.enigmail.net/
>
> iQIcBAEBAgAGBQJSs6HIAAoJEAdzVfsmodw4gooQAJm7XNsZjgdeTSpKIvUIU38f
> tQx2FD08hQdLl48me5mDUbHJgGlYINsKAgoZ8Mqwi/kHEEYhuIlLIX1p6Ovigidb
> 21BiVoOLdG1egGOwxp17DuwYaDPTppFTlN9TBjZzW6WKc7+4aNvyc1KtrbHIhtj/
> 04ekFyAn4U5UH0ht7CI79j0u3Kp85p5D4PyYZB2m82mzti6OxpSM4tXlMkDW7ihg
> QJwiZSjzejqTd7WF0zr0SLeGVRSN2A0dzUCoVsI98eIa3hkw2N4ae6dRkibyStOT
> V8VEDvHArEDlvu8jiryajhsom5mvtOOclNDkVXWAf/Te4gj05iYdTIvNvDEJtqsP
> XDbmw6GgV1kBLlLo0mp//t/+wr+nIvy+sVAP+eqtM/0vjaVXBkXxkUMqqNkrtVpB
> f3whq7nFahssUMSoWE93jgob1ayAax2XUALVMAXYsJl7b2MqBGlhiTZ8FQZ+TW4A
> tIpKeUprPmDvA18rO3SCbmLMQryZqYiH0sRyvUc5kvn3qCRHrISZNkEuK591eS+x
> BO1eOluPzVqeXPPSK1jvGeY0FNJtwzbov4nI1mzOvzQHLCvkHn5PhUFCK5tL5tAe
> b0Z5qwDV+SvVs7W1R7ejYBzEj77U1zuzZ9AtikOuvy+bNGrkIlpI49EyXHijm7C3
> Q6JacTuI0PelYji2gaBJ
> =BbDs
> -----END PGP SIGNATURE-----
>




^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [Bitcoin-development] BIP proposal: Authenticated prefix trees
  2014-01-05 18:43 ` Thomas Voegtlin
@ 2014-01-06 18:13   ` Peter Todd
  2014-01-07  0:21     ` Mark Friedenbach
  0 siblings, 1 reply; 14+ messages in thread
From: Peter Todd @ 2014-01-06 18:13 UTC (permalink / raw)
  To: Thomas Voegtlin; +Cc: bitcoin-development

[-- Attachment #1: Type: text/plain, Size: 1558 bytes --]

On Sun, Jan 05, 2014 at 07:43:58PM +0100, Thomas Voegtlin wrote:
> Hello and happy new year to this mailing list!
> 
> 
> Thank you Mark for the incredible work you've been doing on this.
> I am following this very closely, because it is of primary importance
> for Electrum.
> 
> I have written a Python-levelDB implementation of this UTXO hashtree,
> which is currently being tested, and will be added to Electrum servers.

Along the lines of my recent post on blockchain data:

Is it going to be possible to do partial prefix queries on that tree?

Also have you considered creating per-block indexes of all
scriptPubKeys, spent or unspent, queryable via the same partial prefix
method?

> I too believe that BIPs should define interoperability points, but probably
> not implementation details. For the UTXO hashtree, this means that a BIP
> should at least specify how the root hash is constructed. This might be the
> only thing that needs to be specified.
> 
> However, I see no pressing issue with writing a BIP; it might be preferable
> to implement and test different options first, and learn from that.

It'd be very good to test this stuff thoroughly on Electrum first and
get a feel for the performance and usability before any soft-fork to
make it a miner commitment.

Similarly a C++ implementation should be simply added to Bitcoin Core as
a bloom filter replacement and made available over the P2P network.

-- 
'peter'[:-1]@petertodd.org
000000000000000009bc28e08b41a74801c5878bf87978c2486aee7ed8a85778

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 490 bytes --]

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [Bitcoin-development] BIP proposal: Authenticated prefix trees
  2014-01-06 18:13   ` Peter Todd
@ 2014-01-07  0:21     ` Mark Friedenbach
  2014-01-07  6:31       ` Thomas Voegtlin
  0 siblings, 1 reply; 14+ messages in thread
From: Mark Friedenbach @ 2014-01-07  0:21 UTC (permalink / raw)
  To: bitcoin-development

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 01/06/2014 10:13 AM, Peter Todd wrote:
> On Sun, Jan 05, 2014 at 07:43:58PM +0100, Thomas Voegtlin wrote:
>> I have written a Python-levelDB implementation of this UTXO
>> hashtree, which is currently being tested, and will be added to
>> Electrum servers.
> 
> Along the lines of my recent post on blockchain data:
> 
> Is it going to be possible to do partial prefix queries on that
> tree?

There's really two tree structures being talked about here. Correct me
if I'm wrong Thomas, but last time I looked at your code it was still
implementing a 256-way PATRICIA trie, correct? This structure lends
itself to indexing either scriptPubKey or H(scriptPubKey) with
approximately the same performance characteristics, and in the
"Ultimate blockchain compression" thread there is much debate about
which to use.

In the process of experimentation I've since moved from a 256-way
PATRICIA trie to a bitwise, non-level-compressed trie structure - what
I call proof-updatable trees in the BIP. These have the advantage of
allowing stateless application of one proof to another, and as a
consequence enable mining & mempool operations without access to the
UTXO set, so long as proofs are initially provided in the transaction
& block wire format.

The "disadvantage" is that performance is closely tied to key length,
making H(scriptPubKey) the much more desirable option. I'm sure you
see that as an advantage, however :)

> Also have you considered creating per-block indexes of all 
> scriptPubKeys, spent or unspent, queryable via the same partial
> prefix method?

This would be quite easy to do, separate from the UTXO structure but
using the same trie format.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.14 (GNU/Linux)
Comment: GPGTools - http://gpgtools.org
Comment: Using GnuPG with Thunderbird - http://www.enigmail.net/

iQIcBAEBAgAGBQJSy0iFAAoJEAdzVfsmodw434MQAIA/fDYT7SfMtfLEgDQKhXCn
slRqFEx/HXjvgHHSYnbr9V+8LrGzNvT2ImebbV9ge8VlziAFNGIUq2EYhFs4kHWu
GVm9aL8Jj/27SvM0tRwr9n2XIifKOh2sVINAjbv+UwPv/O+cULU95/b53DEF6aqI
OWxioOR50TPe4t9AevAGVypNLm1DsyDdymhO9xyBN92xGTNj5QKL5hHG3kcsLIl1
7KaxO0w4UC2sdSGj9FeyH1b0zYg8FlzjJHc1CUshHwUwyYo8LRJtRypL5lrayERg
Er/kIGEDovcenNBW8G79l+8VKPfB/lMTssT2pDiQL+1e1fg46CIQxHSyap2JSFTE
jgleRk/+1NK/ZjOQ8dEBPZK3TE1WY3qlm/ekjG/8W5kXqcxzFBoAkeBNXuJ/8UMi
mKe+DTmbp0xnvLO1p+hpugXKfrQSpcFL+ZvJHlFS1lz7O1N3WvuDCNP9El+L6ueM
nFzjr1NTnX0z4vYtscI7qBKVqUrB7Z84c3O/lSYpw4Jilxl4trzV4cn7+AF7KWGM
ktR9JJeIoNcJ2Zx4EpRp6OSwhtLkWZyLpPnidQ2p6ev2ytXpTpGsW/i5XS2w57UD
2IG5E0Q7Xzvd58lI/YollWQcagVOZdyzYXa+wVZoFQ6gLF47andpUmtUJOhI7gxv
T/rWhPhkTMUn8TdvUcV/
=N9zM
-----END PGP SIGNATURE-----

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [Bitcoin-development] BIP proposal: Authenticated prefix trees
  2014-01-07  0:21     ` Mark Friedenbach
@ 2014-01-07  6:31       ` Thomas Voegtlin
  2014-01-08  1:04         ` Mark Friedenbach
  0 siblings, 1 reply; 14+ messages in thread
From: Thomas Voegtlin @ 2014-01-07  6:31 UTC (permalink / raw)
  To: bitcoin-development


On 07/01/2014 01:21, Mark Friedenbach wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> On 01/06/2014 10:13 AM, Peter Todd wrote:
>> On Sun, Jan 05, 2014 at 07:43:58PM +0100, Thomas Voegtlin wrote:
>>> I have written a Python-levelDB implementation of this UTXO
>>> hashtree, which is currently being tested, and will be added to
>>> Electrum servers.
>> Along the lines of my recent post on blockchain data:
>>
>> Is it going to be possible to do partial prefix queries on that
>> tree?
> There's really two tree structures being talked about here. Correct me
> if I'm wrong Thomas, but last time I looked at your code it was still
> implementing a 256-way PATRICIA trie, correct? This structure lends
> itself to indexing either scriptPubKey or H(scriptPubKey) with
> approximately the same performance characteristics, and in the
> "Ultimate blockchain compression" thread there is much debate about
> which to use.

You are right. The 256-way branching follows from the fact that
the tree was implemented using a key-value database operating
with byte strings (leveldb). With this implementation constraint,
a different branching would probably be possible but wasteful.

My recent code creates one leaf per unspent, and uses 56-byte
keys built as:

   H(scriptPubKey) + txid + txpos

(This is not pushed yet, it needs cleanup. Previous code created one 
leaf per address)

Partial prefix queries are possible with database iterators.
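
As an illustration of such a prefix query, here is a sketch that uses a
sorted Python list in place of a LevelDB iterator. It assumes a 20-byte
script hash, a 32-byte txid and a 4-byte output index, which adds up to the
56 bytes mentioned above. The little-endian index encoding and the names
utxo_key and prefix_scan are assumptions made for the example, not taken
from the Electrum server code.

```python
import bisect
import struct

def utxo_key(script_hash, txid, txpos):
    """Build a 56-byte key: 20-byte script hash + 32-byte txid + 4-byte position."""
    if len(script_hash) != 20 or len(txid) != 32:
        raise ValueError("unexpected field length")
    return script_hash + txid + struct.pack("<I", txpos)

def prefix_scan(sorted_keys, prefix):
    """Return every key starting with prefix, as an iterator seek would."""
    start = bisect.bisect_left(sorted_keys, prefix)
    matches = []
    for key in sorted_keys[start:]:
        if not key.startswith(prefix):
            break  # keys are sorted, so no later key can match
        matches.append(key)
    return matches
```

Because the script hash comes first in the key, scanning with a full or
partial script hash returns all unspent outputs under that prefix in one
contiguous range.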

> In the process of experimentation I've since moved from a 256-way
> PATRICIA trie to a bitwise, non-level-compressed trie structure - what
> I call proof-updatable trees in the BIP. These have the advantage of
> allowing stateless application of one proof to another, and as a
> consequence enable mining & mempool operations without access to the
> UTXO set, so long as proofs are initially provided in the transaction
> & block wire format.

I see the advantage of doing that, but this looks really far-fetched.
My understanding is that it would require a complete change in the
way clients and miners work. Could such a change be brought iteratively?





^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [Bitcoin-development] BIP proposal: Authenticated prefix trees
  2014-01-07  6:31       ` Thomas Voegtlin
@ 2014-01-08  1:04         ` Mark Friedenbach
  0 siblings, 0 replies; 14+ messages in thread
From: Mark Friedenbach @ 2014-01-08  1:04 UTC (permalink / raw)
  To: Thomas Voegtlin, Bitcoin Dev

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 01/06/2014 10:31 PM, Thomas Voegtlin wrote:
> You are right. The 256-way branching follows from the fact that the
> tree was implemented using a key-value database operating with byte
> strings (leveldb). With this implementation constraint, a different
> branching would probably be possible but wasteful.

Not really. Just use a suffix to determine the number of bits used in
the final key byte. For example, the string "abc" would have the key

    0x61626308 // "abc\x08"

Dropping the final bit would mean masking it off and having a
different terminating value:

    0x61626207 // "abb\x07"

That way you keep the lexical ordering of keys necessary for database
iteration, and the efficient binary encoding.
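
A minimal Python sketch of that key encoding follows. The function name
bit_prefix_key is illustrative, and rejecting an empty prefix is an
assumption of the sketch, since the empty case is not covered above.

```python
def bit_prefix_key(data, nbits):
    """Encode the first nbits of data as a byte-string database key.

    The final byte is masked to the bits in use and followed by a suffix
    byte giving how many of its bits (1 to 8) are significant.
    """
    if nbits < 1 or nbits > 8 * len(data):
        raise ValueError("nbits out of range")
    nbytes = (nbits + 7) // 8          # bytes needed to hold nbits
    used = nbits - 8 * (nbytes - 1)    # significant bits in the final byte
    mask = (0xff << (8 - used)) & 0xff # keep only the high 'used' bits
    body = data[:nbytes - 1] + bytes([data[nbytes - 1] & mask])
    return body + bytes([used])
```

With this, bit_prefix_key(b"abc", 24) gives b"abc\x08" and
bit_prefix_key(b"abc", 23) gives b"abb\x07", matching the two keys above.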

> I see the advantage of doing that, but this looks really
> far-fetched.. My understanding is that it would require a complete
> change in the way clients and miners work. Could such a change be
> brought iteratively?

It is an iterative change, I believe. You might be confusing this idea
with Peter Todd's TXO commitment proposal using MMR trees, which is a
drastic change with its own set of tradeoffs. Just to be clear, here's
what I'm proposing:

1) Restructure the current UTXO index to be a Merkle tree, basically
by splitting coins into individual outputs and adding interior nodes
to the leveldb database.

2) Add hash commitments of this structure to the coinbase.

It's still mapping txids to unspent outputs, just as before - this
has nothing to do with the script keyed "wallet index." It's just that
nodes can now prefix optional proofs to block or transaction messages
which prove by reference to the current best block's hash the spend
status of the inputs of a transaction, or all the inputs of all the
transactions of a block.

If the more expensive proof-updatable hashing is used, then these
proofs can even be composed or "rebased" onto a new block by applying
the contents of an "operational proof" representing the diff between
two blocks / the application of a series of transactions.

This means that a node which does not have access to the UTXO set can
nevertheless receive transactions or entire blocks with prefixed
proofs and check the validity of the transaction with just the
information available (proof + transaction contents).

All that is required after the above soft-fork is a protocol version
update and/or a service bit to indicate the ability to send or receive
proof-prefixed messages. I'd call that an incremental update.

[Aside: adding the wallet index requires storing the entire UTXO set
in duplicated form, indexed this time by scriptPubKey or
H(scriptPubKey), and including proofs of this structure as well. It is
unlikely that any soft-fork would occur forcing consensus over the
wallet index, but it could be done as a meta-chain or as an index
covering just the contents of the block.]
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.14 (GNU/Linux)
Comment: GPGTools - http://gpgtools.org
Comment: Using GnuPG with Thunderbird - http://www.enigmail.net/

iQIcBAEBAgAGBQJSzKQ2AAoJEAdzVfsmodw4hyoQAJ0f6P3ijZCEw7IPd/RcrmkI
Viv4j17ZyAAcbNUplvjzhr/tIIKYPg51ltvfkp8cGRHgez88QsljzvM8B5n+nbPa
jaaI6eiJ3AU1bR8hWYKtlXFwMvRjyr3ofl8hhTvYptGv9x3/Tr+2FwxIRY0413m6
2h95vItsvBs8v7clqLoBEqx9uyUpsH3+J32V4oGubrNAFXh1oOHi4Ban+TOKYaQV
GHZaIZ3bVAvcMd5riaWSPUPLHwJnxQ8w6SlVRy2UNUPe+9yTuy4n1GW4vk4WHvop
FgZFrM3LBmh1MhlYHRdEUUtwk3mfDuGbfW5UJVMri0Nis1PsXr5VK4qQaMbd/9e6
M2uWKslY9QCnzMajnHen9OwotteAJy2I1KHVcxXb0tFqrvqZ6o/auIe0G4VdKYuI
XfNF3mokX93tiSflmphDba6qgB/W+Y6UD2gG2AeFuMGhFF/Hy62pVC6Zx7PKZ3vL
Kh27rKkO/0FJau2JCQm5xBiQgCnKghqOiHefY3o+l+Y9kJ8fXKWCuwJ0lJ3LxZ2u
8H6sp6Jm9Ct9L90wSn7VmmI5H3bRe8sa7sylH4BR2T6jP3/tKDYTEeNWj+F9FfO1
FxsjYrjAyv1HxYYKd/Y1svEVSsKMv3a2SR9pF36ynBABdFjvx+oEuCyCO4tspFe6
15eA1QoMKvEQe/Ww5kRC
=L9WT
-----END PGP SIGNATURE-----

^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2014-01-08  1:05 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-12-20  1:47 [Bitcoin-development] BIP proposal: Authenticated prefix trees Mark Friedenbach
2013-12-20  6:48 ` Jeremy Spilman
2013-12-20 11:21   ` Mark Friedenbach
2013-12-20 13:17     ` Peter Todd
2013-12-20 18:41       ` Mark Friedenbach
2013-12-20 10:48 ` Peter Todd
     [not found]   ` <52B425BA.6060304@monetize.io>
2013-12-20 12:47     ` Peter Todd
2013-12-20 19:48 ` Gregory Maxwell
2013-12-20 22:04   ` Mark Friedenbach
2014-01-05 18:43 ` Thomas Voegtlin
2014-01-06 18:13   ` Peter Todd
2014-01-07  0:21     ` Mark Friedenbach
2014-01-07  6:31       ` Thomas Voegtlin
2014-01-08  1:04         ` Mark Friedenbach

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox