Using multihashes as message IDs #5

Open
whereswaldon opened this issue Nov 24, 2018 · 0 comments
To move arbor from identifying Posts (messages) by UUID to identifying them by verifiable hashes of their data (which turns the chat history into a Merkle tree), we need a procedure for computing those hashes. This is a proposal for that procedure:

Firstly, let's use multibase-encoded multihashes to avoid being locked in to any specific hash implementation in the long run.
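
As a rough illustration of what a multibase-encoded multihash looks like, here is a minimal Python sketch. It assumes sha2-256 (multihash function code 0x12, 32-byte digest) and the base16 multibase (prefix character "f"); those two choices and all function names are picked for the example and are not part of the proposal.

```python
import hashlib

SHA2_256_CODE = 0x12  # multihash function code for sha2-256
SHA2_256_LEN = 32     # sha2-256 digest length in bytes


def multihash_sha256(data: bytes) -> bytes:
    """Return <code><length><digest>; both prefixes fit in one byte here."""
    digest = hashlib.sha256(data).digest()
    return bytes([SHA2_256_CODE, SHA2_256_LEN]) + digest


def multibase_base16(raw: bytes) -> str:
    """Multibase-encode as lowercase hex, whose prefix character is 'f'."""
    return "f" + raw.hex()


def multibase_decode(text: str) -> bytes:
    """Decode multibase text; this sketch only handles the 'f' prefix."""
    if not text.startswith("f"):
        raise ValueError("unsupported multibase prefix: " + text[:1])
    return bytes.fromhex(text[1:])


encoded = multibase_base16(multihash_sha256(b"hello arbor"))
print(encoded[:5])  # f1220
```

The leading "f" names the text encoding and the "1220" names the hash function and digest length, so a reader can tell how to decode and verify the value without out-of-band agreement.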

Secondly, let's assume that Posts have the following schema:

{
  "parent": "multibase-encoded-multihash",
  "content": "the message itself (UTF8 string)",
  "username": "who sent it (UTF8 string)",
  "timestamp": 64-bit number of nanoseconds since Epoch,
  "meta": {
    "nosig/pgp": "multibase-encoded-message-signature",
    "arbitrary": "some other client protocol extension data"
  }
}

Posts are not necessarily represented as JSON, but it serves to demonstrate their inherent structure.

To derive the ID of a Post, do the following:

  1. Create an empty buffer of bytes
  2. Write the binary data of the "parent" field into the buffer. This requires first decoding the field from its multibase text form (e.g. if the data is encoded as base58, decode it into raw bytes before writing it into the buffer).
  3. Write the data of the "username" field into the buffer, padded with NULL bytes to its max width of 64 bytes, with no additional terminating NULL byte
  4. Write the 64-bit (8-byte) representation of the timestamp into the buffer
  5. Write the data of the "content" field into the buffer, but without any terminating NULL byte
  6. Write one NULL byte to delimit the end of the "content"
  7. Sort the keys in "meta" by ascending lexicographic order. For each key in the sorted list of keys:
    1. Write the key string into the buffer
    2. Write the value associated with that key (decode it into binary from whatever multibase it is in)
    3. Write a NULL byte at the end of each value
  8. Hash the contents of the buffer using the same hash function as the "parent" multihash.
  9. Encode this hash using the same multibase as the "parent".
  10. The resulting string is the Post "id".
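
The steps above can be sketched in Python roughly as follows. This is only an illustration under several assumptions that the proposal does not fix: the timestamp is written as an unsigned big-endian integer, only the base16 multibase ("f" prefix) is understood, only sha2-256 (0x12) and sha2-512 (0x13) are supported, and every "meta" value is itself multibase-encoded. The names `post_id`, `mb_decode` and `mb_encode` are hypothetical.

```python
import hashlib
import struct

# Multihash function codes -> hashlib constructors (only two covered here).
HASHES = {0x12: hashlib.sha256, 0x13: hashlib.sha512}
USERNAME_WIDTH = 64


def mb_decode(text: str) -> bytes:
    """Decode multibase text; this sketch only understands 'f' (lowercase hex)."""
    if text[:1] != "f":
        raise ValueError("unsupported multibase prefix")
    return bytes.fromhex(text[1:])


def mb_encode(raw: bytes) -> str:
    """Encode bytes with the 'f' (lowercase hex) multibase."""
    return "f" + raw.hex()


def post_id(post: dict) -> str:
    parent = mb_decode(post["parent"])
    buf = bytearray()                                      # step 1
    buf += parent                                          # step 2
    name = post["username"].encode("utf-8")
    if len(name) > USERNAME_WIDTH:
        raise ValueError("username exceeds 64 bytes")
    buf += name.ljust(USERNAME_WIDTH, b"\x00")             # step 3
    buf += struct.pack(">Q", post["timestamp"])            # step 4 (byte order assumed)
    buf += post["content"].encode("utf-8") + b"\x00"       # steps 5-6
    for key in sorted(post["meta"]):                       # step 7
        buf += key.encode("utf-8")
        buf += mb_decode(post["meta"][key]) + b"\x00"
    code = parent[0]                                       # hash function code of the parent
    digest = HASHES[code](bytes(buf)).digest()             # step 8
    return mb_encode(bytes([code, len(digest)]) + digest)  # step 9


# Example Post whose parent is a sha2-256 multihash; all values are made up.
post = {
    "parent": mb_encode(bytes([0x12, 32]) + hashlib.sha256(b"root").digest()),
    "content": "hello",
    "username": "waldon",
    "timestamp": 1543017600000000000,
    "meta": {"nosig/pgp": "f00ff"},
}
print(post_id(post)[:5])  # f1220
```

Because the keys of "meta" are sorted before being written, two clients that hold the same Post with its metadata in a different order should still arrive at the same id.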

The NULL bytes are intended to prevent collisions between Posts whose concatenated fields would otherwise be identical, e.g. Posts with the same content but different metadata. Length-prefixing each piece of data would also work, but that might be more complicated. Definitely open to suggestions there.
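
To make the trade-off concrete, here is a small Python sketch comparing the two framings on the metadata alone. The 4-byte big-endian length prefix is an assumption chosen for the example, and the function names are hypothetical. One thing the comparison suggests: since each key is written directly before its value and decoded binary values may contain any byte, the NULL-delimited form can serialize two different metadata maps identically, while the length-prefixed form keeps them apart.

```python
import struct


def nul_delimited(meta: dict) -> bytes:
    """Metadata framing as in step 7: key, value, then one NULL byte."""
    out = bytearray()
    for key in sorted(meta):
        out += key.encode("utf-8") + meta[key] + b"\x00"
    return bytes(out)


def length_prefixed(meta: dict) -> bytes:
    """Alternative: each key and value is preceded by its 4-byte length."""
    out = bytearray()
    for key in sorted(meta):
        for field in (key.encode("utf-8"), meta[key]):
            out += struct.pack(">I", len(field)) + field
    return bytes(out)


# Values are already-decoded binary here, so they may contain any byte.
a = {"ab": b"\x01"}
b = {"a": b"b\x01"}
print(nul_delimited(a) == nul_delimited(b))      # True
print(length_prefixed(a) == length_prefixed(b))  # False
```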

Another open question is whether the metadata field should be detachable. An alternative approach would be to sort and hash just the metadata field, and then include the resulting multihash of the metadata in the data hashed for the whole Post. This would allow you to validate a Post's integrity without having the metadata, but I'm uncertain whether that's desirable.
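
A minimal Python sketch of the detachable variant, for comparison. To stay short it hard-codes sha2-256 rather than following the parent's multihash, assumes an unsigned big-endian timestamp, and takes already-decoded binary values; `meta_hash` and `post_hash` are hypothetical names.

```python
import hashlib
import struct


def mh_sha256(data: bytes) -> bytes:
    """sha2-256 multihash: code 0x12, length 32, then the digest."""
    return bytes([0x12, 32]) + hashlib.sha256(data).digest()


def meta_hash(meta: dict) -> bytes:
    """Hash only the metadata, with keys in ascending order."""
    buf = bytearray()
    for key in sorted(meta):
        buf += key.encode("utf-8") + meta[key] + b"\x00"
    return mh_sha256(bytes(buf))


def post_hash(parent: bytes, username: str, timestamp: int,
              content: str, meta_mh: bytes) -> bytes:
    """Hash the Post using the metadata's multihash in place of the metadata."""
    buf = bytearray(parent)
    buf += username.encode("utf-8").ljust(64, b"\x00")
    buf += struct.pack(">Q", timestamp)  # byte order is an assumption
    buf += content.encode("utf-8") + b"\x00"
    buf += meta_mh
    return mh_sha256(bytes(buf))


meta = {"nosig/pgp": b"\x00\xff"}
parent = mh_sha256(b"root")
stored_meta_mh = meta_hash(meta)

# The sender hashes with the full metadata; a verifier that only kept
# the metadata's multihash can still recompute the same value.
sent = post_hash(parent, "waldon", 1543017600000000000, "hello", meta_hash(meta))
checked = post_hash(parent, "waldon", 1543017600000000000, "hello", stored_meta_mh)
print(sent == checked)  # True
```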
