On Mon, Nov 26, 2012 at 05:37:31PM -0500, Gavin Andresen wrote:
> Why not JSON?
> -------------
>
> Invoice, Payment and Receipt messages could all be JSON-encoded. And
> the Javascript Object Signing and Encryption (JOSE) working group at
> the IETF has a draft specification for signing JSON data.
>
> But the spec is non-trivial. Signing JSON data is troublesome because
> JSON can encode the same data in multiple ways (whitespace is
> insignificant, characters in strings can be represented escaped or
> un-escaped, etc.), and the standards committee identified at least one
> security-related issue that will require special JSON parsers for
> handling JSON-Web-Signed (JWS) data (duplicate keys must be rejected
> by the parser, which is more strict than the JSON spec requires).
>
> A binary message format has none of those complicating issues. Which
> encoding format to pick is largely a matter of taste, but Protocol
> Buffers is a simple, robust, multi-programming-language,
> well-documented, easy-to-work-with, extensible format.

I'm not sure this is actually as much of an advantage as you'd expect. I
looked into Google Protocol Buffers a while back for a timestamping
project, and unfortunately there are many ways in which the actual
binary encoding of a message can differ even though the meaning of the
message is the same - just like JSON.

First of all, while fields *should* be written sequentially in
field-number order, parsers are also required to accept fields in any
order. There is also a repeated-fields feature where the values can be
serialized either as a single packed key-list pair or as multiple
key-value pairs; in the latter case the payloads are concatenated. How
to handle the general case of a duplicated field that isn't declared as
repeated seems to be undefined by the standard - yet at the same time
the standard describes creating messages by concatenating two messages
together.
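To make the field-order point concrete, here's a minimal sketch of the varint portion of the protobuf wire format in plain Python - the field numbers and values are made up for illustration, and this hand-rolled encoder/decoder is mine, not part of any protobuf library:

```python
# Minimal sketch of the protobuf wire format, varint (wire type 0) fields
# only. It shows two byte-for-byte different serializations decoding to
# the same message, which is exactly the ambiguity described above.

def encode_varint(n):
    """Encode a non-negative integer as a base-128 varint."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)   # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)

def decode_varint(data, i):
    """Decode a varint starting at offset i; return (value, next offset)."""
    shift = result = 0
    while True:
        byte = data[i]
        i += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, i
        shift += 7

def encode_field(field_number, value):
    """Encode one varint field: key (field number + wire type), then value."""
    key = (field_number << 3) | 0     # wire type 0 = varint
    return encode_varint(key) + encode_varint(value)

def decode_message(data):
    """Decode a message of varint fields into {field_number: value},
    accepting the fields in any order, as the spec requires of parsers."""
    fields, i = {}, 0
    while i < len(data):
        key, i = decode_varint(data, i)
        field_number, wire_type = key >> 3, key & 7
        assert wire_type == 0         # this sketch handles varints only
        value, i = decode_varint(data, i)
        fields[field_number] = value  # note: a duplicate silently overwrites
    return fields

# The same two-field message, serialized in two different field orders:
a = encode_field(1, 150) + encode_field(2, 1)
b = encode_field(2, 1) + encode_field(1, 150)
assert a != b                                  # the bytes differ...
assert decode_message(a) == decode_message(b)  # ...but the meaning doesn't
```

So a signature over the serialized bytes can fail to verify after a perfectly legal re-serialization - the same problem signing JSON has.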
Presumably parsers treat that case as an error, but I wouldn't be
surprised if that isn't always true.

Implementations differ as well. The current Java and C++ implementations
write unknown fields in arbitrary order after the sequentially-ordered
known fields, while the Python implementation simply drops unknown
fields entirely. As far as I know, no implementation preserves the order
of unknown fields.

Finally, while not a problem specific to Protocol Buffers, UTF-8 encoded
text isn't guaranteed to survive a UTF-8 -> UTF-x -> UTF-8 round trip.
Multiple code-point sequences can be semantically identical, so you can
expect some software to convert one to another; similarly, lots of
languages internally store Unicode strings by converting to something
like UTF-16. One solution is to apply one of the normalization forms
such as NFKD - an idempotent transformation - although I wouldn't be
surprised if normalization itself is complex enough that implementation
bugs exist, not to mention that the normalization forms themselves have
gone through multiple versions.

I think the best way(1) to handle (most of) the above is simply to treat
the binary message as immutable and never re-serialize a deserialized
message - but if you're willing to do that, just using JSON isn't
unreasonable either.

1) Of course I went off and created Yet Another Binary Serialization
format for my project, but I'm young and foolish...

--
'peter'[:-1]@petertodd.org
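P.S. a quick illustration of the normalization point, using Python's standard unicodedata module (my choice of tooling, not something from the thread):

```python
# "é" can be one code point (U+00E9) or two (U+0065 U+0301). The strings
# are semantically identical, but their UTF-8 bytes differ, so any
# signature or hash over the raw bytes distinguishes them.
import unicodedata

composed = "\u00e9"      # é as a single precomposed code point
decomposed = "e\u0301"   # e followed by a combining acute accent

assert composed != decomposed
assert composed.encode("utf-8") != decomposed.encode("utf-8")

# Normalizing both to the same form (NFKD here) makes them compare
# equal, and the transformation is idempotent: applying it again is
# a no-op.
nfkd = unicodedata.normalize("NFKD", composed)
assert nfkd == unicodedata.normalize("NFKD", decomposed)
assert unicodedata.normalize("NFKD", nfkd) == nfkd
```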