On Wed, Nov 28, 2012 at 11:43:19AM +0100, Mike Hearn wrote:
> Peter is correct that there are a few degrees of freedom in protobuf
> serialization, though far fewer than with JSON.

FWIW I re-read the specs again and turns out my memory was wrong. (I
last looked at this about four months ago) Duplicated fields are handled
in a defined manner, with the last field seen in the serialization being
the one whose value is used. Again, repeated fields are treated as
elements of a list, preserving order.

It does raise the interesting question do the implementations that don't
preserve order of unknown fields, preserve the order of multiple unknown
fields, either repeated or not?

> I'd like to think upstream would be open to resolving these
> ambiguities.

I gotta admit, I suspect they won't be that open. Protocol buffers was
designed because Google needed a fast serialization method suitable for
many different internal projects. Needing round-trip idempotence seems
like a rare requirement to me, especially for internal use.

> Re-serialization of an Invoice message in the Payment message is a
> potential source of mistakes. There's no need to ever concatenate
> these messages and alternative implementations that don't order
> serialized fields by tag number are missing an important optimization,
> so they could be fixed. The main issue is treatment of unknown fields.
> If/when the Invoice message is extended with other fields that are
> round-tripped through an old client, the data may get lost. JSON
> doesn't help resolve that either, of course. There are a few
> solutions:

Well, actually you can take advantage of the message concatination
ability of protocol buffers to extend a message by simply appending the
new fields to the existing thus either defining new fields, or
overriding old values as required. If you want to de-duplicate though
you run into the problem all over again.

On the other hand JSON handles this case fine too provided that your
JSON implementation supports dictionary objects with arbitrary fields.
Just use the object as is and the unknown fields will be re-serialized
properly at the other end. Some implementations will have to be careful
to handle collisions with existing keys in the namespace. (consider in
Python what would happen if you mapped your object to a class instance,
and the serialization included the key "__init__")

That said, JSON is quite problematic with numbers. For instance, you
have to be careful to keep integers represented as pure integers below
what Javascript can handle, the maximum integer exactly representable in
a double float, or the JSON won't be parsable in Javascript even if many
other languages handle it fine. Protocol buffers is at least pretty
explicit about what size integers are.

> 1) Change the type of the Invoice field in Payment to be "bytes" and
> set it to be the hash of the originally received binary Invoice
> message. Downside, requires merchants to track all outstanding
> invoices.
> 2) Ask protobufs upstream to modify the spec/implementations so
> ordering of unknown fields is specified. The Python implementation
> could be extended to support them so Python implementors don't end up
> with accidental message downgrades.
> 3) Language of the spec could be changed to explicitly state that the
> received Invoice may not be binary-identical to the one that was sent,
> in the case of a client that incorrectly downgrades the message. Thus
> you'd be expected to check what the Invoice was using merchant_data
> which is opaque and could just be, eg, a database key on your own end.
> 4) Instead of submitting the entire Invoice back to the merchant, just
> the merchant_data could be in the Payment message.
> 
> Of the four options I prefer the last. What is the use case for
> resubmitting the entire invoice anyway? Even if protobufs are improved

Note that I think the SignedInvoice message itself is broken, because
protobuf implementations have no reason to guarantee that they can give
you the serialized bytes of the Invoice sub-message. It's a quite
specific use-case that isn't needed for pretty much anything but crypto.
FWIW I took a quick look at the official API's, C++, Java and Python,
and as far as I can tell none of them support accessing the binary
serialization of a message field other than by re-serializing the
message.

Really the invoice field should be declared as bytes serialized_invoice,
as inconvenient as that is to work with.

> so handling of round-tripping new messages through old [Python]
> clients is more rigorous, some implementors will probably convert the
> protobuf objects into some internal forms for whatever reason (or
> serialize them to a database, etc) and they're very likely to mess up
> the handling of unknown fields when they do it.

Since the Payment message includes an *untrusted* Invoice that the
vendor needs to authenticate the whole invoice no matter what on Payment
reception. In many cases that implies they have to keep some sort of
database of "quotes" or similar anyway as the client can change anything
they want otherwise. Again that leads back to the argument of why not
just stick with the merchant_dat as you suggest, which will usually be
some short invoice number attached to a database? A vendor that wants to
operation a stateless invoicing system can just stuff a HMAC-protected
serialized invoice into the merchant_data

I guess you could use a mutable invoice field as a way of achieving some
sort of negotiation protocol, but I think it's better to stick to the
original concept of just ensuring that the user is really paying the
right amount to the right address.

-- 
'peter'[:-1]@petertodd.org