Using multihashes as message IDs #5

whereswaldon · 2018-11-24T15:03:26Z

To move arbor from identifying Posts (messages) by UUID to identifying them by verifiable hashes of the message data (thereby making the chat history into a Merkle Tree), we need a procedure for deriving those hashes. This is a proposal for that procedure:

Firstly, let's use multibase-encoded multihashes so that we aren't locked into any specific hash function or text encoding in the long run.
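For anyone unfamiliar with the format, here is a minimal sketch of what a multibase-encoded multihash looks like. It assumes sha2-256 (multihash code 0x12) and multibase base16 (prefix "f") purely for illustration; the proposal doesn't pick either, and the helper names are made up.

```python
import hashlib

SHA2_256_CODE = 0x12  # multihash function code for sha2-256
SHA2_256_LEN = 0x20   # digest length in bytes (32)


def multihash_sha256(data):
    """Return <function code><digest length><digest> for sha2-256."""
    digest = hashlib.sha256(data).digest()
    return bytes([SHA2_256_CODE, SHA2_256_LEN]) + digest


def multibase_base16(raw):
    """Multibase base16 (lowercase hex) is identified by the 'f' prefix."""
    return "f" + raw.hex()


encoded = multibase_base16(multihash_sha256(b"hello"))
# The string starts with "f1220": base16, sha2-256, 32-byte digest.
```

Because the hash function and the text encoding are both named inside the value itself, either one could be swapped later without changing the Post schema.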

Secondly, let's assume that Posts have the following schema:

{
  "parent": "multibase-encoded-multihash",
  "content": "the message itself (UTF8 string)",
  "username": "who sent it (UTF8 string)",
  "timestamp": 64-bit number of nanoseconds since Epoch,
  "meta": {
    "nosig/pgp": "multibase-encoded-message-signature",
    "arbitrary": "some other client protocol extension data"
  }
}

Posts are not necessarily represented as JSON, but it serves to demonstrate their inherent structure.

To derive the ID of a Post, do the following:

Create an empty buffer of bytes
Write the binary data of the "parent" field into the buffer. This is likely to involve decoding the binary data from its current multibase (e.g. the data might be encoded as base58, so we must decode it into binary before writing it into the buffer).
Write the data of the "username" field (padded out to its max width of 64 bytes with NULL bytes) into the buffer, but without any terminating NULL byte
Write the 64-bit (8-byte) representation of the timestamp into the buffer
Write the data of the "content" field into the buffer, but without any terminating NULL byte
Write one NULL byte (delimit the end of the "content")
Sort the keys in "meta" by ascending lexicographic order. For each key in the sorted list of keys:
1. Write the key string into the buffer
2. Write the value associated with that key (decode it into binary from whatever multibase it is in)
3. Write a NULL byte at the end of each value
Hash the contents of the buffer using the same hash function that the "parent" multihash uses.
Encode the resulting multihash using the same multibase as the "parent".
This is the Post "id".
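The steps above could be sketched roughly like this. Several things here are assumptions rather than part of the proposal: the sketch only handles sha2-256 multihashes and multibase base16 ("f"), it writes the timestamp as 8 big-endian unsigned bytes (the byte order isn't specified above), it treats every "meta" value as multibase-encoded, and the function names and example data are hypothetical.

```python
import hashlib

USERNAME_WIDTH = 64            # bytes; fixed width taken from the proposal
SHA2_256_PREFIX = b"\x12\x20"  # multihash header: sha2-256, 32-byte digest


def mb_decode(text):
    """Decode a multibase string; this sketch only understands base16 ('f')."""
    if not text.startswith("f"):
        raise ValueError("only multibase base16 ('f') is handled in this sketch")
    return bytes.fromhex(text[1:])


def mb_encode(raw):
    """Encode bytes as multibase base16."""
    return "f" + raw.hex()


def derive_post_id(post):
    parent = mb_decode(post["parent"])
    if not parent.startswith(SHA2_256_PREFIX):
        raise ValueError("only sha2-256 multihashes are handled in this sketch")

    username = post["username"].encode("utf-8")
    if len(username) > USERNAME_WIDTH:
        raise ValueError("username exceeds 64 bytes")

    buf = bytearray()
    buf += parent                                    # decoded parent multihash
    buf += username.ljust(USERNAME_WIDTH, b"\x00")   # NULL-padded, no terminator
    buf += post["timestamp"].to_bytes(8, "big")      # assumption: big-endian
    buf += post["content"].encode("utf-8")
    buf += b"\x00"                                   # delimit end of content

    meta = post["meta"]
    for key in sorted(meta, key=lambda k: k.encode("utf-8")):
        buf += key.encode("utf-8")
        buf += mb_decode(meta[key])                  # assumption: multibase value
        buf += b"\x00"                               # delimit end of value

    digest = hashlib.sha256(bytes(buf)).digest()
    return mb_encode(SHA2_256_PREFIX + digest)       # same multihash and multibase as parent


# Hypothetical example data.
root_id = mb_encode(SHA2_256_PREFIX + hashlib.sha256(b"root").digest())
post = {
    "parent": root_id,
    "content": "hello, arbor",
    "username": "alice",
    "timestamp": 1543071806000000000,
    "meta": {"nosig/pgp": "f00ff", "arbitrary": "f0a0b"},
}
post_id = derive_post_id(post)
```

Since the keys are sorted before being written, the order in which a client happens to store "meta" doesn't affect the resulting ID, while any change to a metadata value does.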

The use of NULL bytes is intended to prevent collisions between messages with the same content, but different metadata. It would also work to length-prefix each piece of data, but that might be more complicated. Definitely open to suggestions there.
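To make the trade-off concrete, here is a small comparison of plain concatenation, NULL delimiting, and length prefixing. The 4-byte big-endian length prefix and the function names are assumptions chosen for the sketch, not part of the proposal.

```python
def concatenated(fields):
    """Write fields back to back with no separator."""
    return b"".join(fields)


def null_delimited(fields):
    """Write one NULL byte after each field."""
    return b"".join(field + b"\x00" for field in fields)


def length_prefixed(fields):
    """Write each field's length (4 bytes, big-endian) before the field."""
    return b"".join(len(field).to_bytes(4, "big") + field for field in fields)


# Two different field lists that plain concatenation cannot tell apart.
a = [b"ab", b"c"]
b = [b"a", b"bc"]
plain_collides = concatenated(a) == concatenated(b)        # True
null_collides = null_delimited(a) == null_delimited(b)     # False
prefix_collides = length_prefixed(a) == length_prefixed(b) # False

# A field that itself contains a NULL byte, as decoded binary data might.
c = [b"a\x00b"]
d = [b"a", b"b"]
null_collides_binary = null_delimited(c) == null_delimited(d)      # True
prefix_collides_binary = length_prefixed(c) == length_prefixed(d)  # False
```

Both schemes separate the first pair. One difference that may be worth weighing: NULL delimiting stays unambiguous only while the fields themselves contain no NULL bytes, whereas length prefixing doesn't depend on the field contents.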

Another issue is whether the metadata field is detachable. A similar approach would be to sort and hash just the metadata field, and then include the multihash of the metadata in the hash of the whole post. This would allow you to validate a Post's integrity without having the metadata, but I'm uncertain whether that's desirable.
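A rough sketch of the detachable variant, to show what it would buy us. It assumes sha2-256 multihashes, takes the already-serialized parent/username/timestamp/content bytes as a single `core` argument, and treats metadata values as already-decoded bytes; all names and data here are hypothetical.

```python
import hashlib


def multihash_sha256(data):
    """Return a sha2-256 multihash: code 0x12, length 0x20, then the digest."""
    return b"\x12\x20" + hashlib.sha256(data).digest()


def meta_multihash(meta):
    """Hash only the metadata: sorted keys, each value followed by a NULL byte."""
    buf = bytearray()
    for key in sorted(meta, key=lambda k: k.encode("utf-8")):
        buf += key.encode("utf-8")
        buf += meta[key]
        buf += b"\x00"
    return multihash_sha256(bytes(buf))


def post_multihash(core, meta_hash):
    """Hash the core fields followed by the metadata multihash."""
    return multihash_sha256(core + meta_hash)


# Hypothetical data: `core` stands in for the serialized non-meta fields.
core = b"serialized parent, username, timestamp and content"
meta = {"nosig/pgp": b"\x01\x02", "arbitrary": b"\x03"}

full_id = post_multihash(core, meta_multihash(meta))

# A peer that kept only the 34-byte metadata hash can still verify the Post.
kept_hash = meta_multihash(meta)
verified = post_multihash(core, kept_hash) == full_id
```

With this layout the metadata can be dropped or fetched later, as long as its multihash travels with the Post.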
