Document MultiAddr codecs implemented by py-multiaddr #101

159 changes: 158 additions & 1 deletion README.md
@@ -104,6 +104,163 @@ TODO: specify the encoding (byte-array to string) procedure

TODO: specify the decoding (string to byte-array) procedure

### Codecs

Depending on the protocol of a Multiaddr component, different algorithms are used to
convert its value to and from the binary representation. The name of the codec to use
for each protocol is noted in [protocols.csv](protocols.csv).

In general, empty values in the string representation are always disallowed unless
explicitly noted otherwise. In case of conversion errors, implementations must
refuse to process the given string/binary value and report the error to the caller
instead.

Depending on the codec, values are encoded either using the standard variable-length
encoding style, or as a fixed-length binary value without the extra length
information if this is noted in the respective codec's description.

All code examples are written in Python-based pseudo code and are optimized for
legibility rather than speed. In general you should always use existing libraries
and functions for performing the below conversions rather than rolling your own.

#### `fspath`

Encodes a local file system path with unspecified binary encoding. On platforms
not using POSIX-style forward slashes (`/`) for delimiting individual path
labels, such as Windows, implementations should automatically convert such
paths from their POSIX representation as necessary.

Protocols using the `fspath` encoding are only valid for the system they were
created for and must not be shared between different hosts.

#### `domain`

Encodes the given Unicode representation to the UTF-8 character encoding ([RFC 3629 Section 3](https://tools.ietf.org/html/rfc3629#section-3)), while using the [UTS-46 / RFC 5890](https://tools.ietf.org/html/rfc5890) input normalization and processing rules for canonicalization.

* String → Binary:
  1. If feasible, normalize and validate the given input string according to [UTS-46 Section 4 (Processing)](https://www.unicode.org/reports/tr46/#Processing) and [UTS-46 Section 4.1 (Validity Criteria)](https://www.unicode.org/reports/tr46/#Validity_Criteria) with the following parameters:
     * UseSTD3ASCIIRules = true
     * CheckHyphens = true
     * CheckBidi = true
     * CheckJoiners = true
     * Transitional_Processing = false
  2. Convert the Unicode string to the UTF-8 character encoding as per [RFC 3629 Section 3 §4](https://tools.ietf.org/html/rfc3629#section-3).
* Binary → String:
  Convert the UTF-8 encoded binary string to Unicode according to the rules of [RFC 3629 Section 3 §6](https://tools.ietf.org/html/rfc3629#page-5).

Examples of libraries for performing the above normalization step include the `idna.uts46_remap` function of the [Python idna](https://pypi.org/project/idna/) library.
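
The `domain` conversion can be sketched roughly as follows. This is an illustration under assumptions, not a reference implementation: the function names `domain_to_bytes` and `domain_from_bytes` are hypothetical, the third-party `idna` package is treated as optional (matching the "if feasible" wording above), and only its UTS-46 mapping step is applied; the full validity criteria (CheckHyphens, CheckBidi, CheckJoiners) are not checked here.

```python
try:
    import idna  # third-party package; optional in this sketch
except ImportError:
    idna = None


def domain_to_bytes(string: str) -> bytes:
    """String -> binary for the `domain` codec (hypothetical helper name)."""
    if not string:
        raise ValueError("empty domain names are not allowed")
    if idna is not None:
        # UTS-46 mapping with UseSTD3ASCIIRules = true and
        # Transitional_Processing = false; raises a ValueError subclass
        # for disallowed code points.
        string = idna.uts46_remap(string, std3_rules=True, transitional=False)
    return string.encode("utf-8")


def domain_from_bytes(binary: bytes) -> str:
    """Binary -> string for the `domain` codec (hypothetical helper name)."""
    # Strict decoding: malformed UTF-8 raises UnicodeDecodeError,
    # which is a subclass of ValueError.
    string = binary.decode("utf-8")
    if not string:
        raise ValueError("empty domain names are not allowed")
    return string
```

Without the `idna` package installed, the sketch falls back to plain UTF-8 conversion and performs no normalization at all.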

#### `ip4`

Encodes an IPv4 address according to the conventional [dot-decimal notation](https://en.wikipedia.org/wiki/Dot-decimal_notation) first specified in [RFC 3986 section 3.2.2 page 20 § 2](https://tools.ietf.org/html/rfc3986#page-20).

Protocols using this codec must encode it as a binary value of exactly 4 bytes without
an extra length value.

* String → Binary:
  1. Split the input string into parts at each dot (U+002E FULL STOP):
     `parts = string.split(".")`
  2. Assert that exactly 4 string parts were created by the split operation:
     `assert len(parts) == 4`
  3. Convert each part from its ASCII base-10 number representation to an integer type, aborting if the conversion fails for any of the decimal string parts:
     `octets = [int(p) for p in parts]`
  4. Validate that each item of the resulting integer list is in the range 0 – 255:
     `assert all(i in range(0, 256) for i in octets)`
  5. Copy each of the resulting integers into a binary string of length 4 in network byte order:
     `return b"%c%c%c%c" % (octets[0], octets[1], octets[2], octets[3])`
* Binary → String:
  1. Take the four bytes of the binary input and convert each to its equivalent base-10 ASCII representation without any leading zeros:
     `octets = [str(binary[idx]) for idx in range(4)]`
  2. Concatenate the resulting list of stringified octets using dots (U+002E FULL STOP):
     `return ".".join(octets)`

Converting from string to binary addresses may be done using the POSIX
[`inet_addr`](https://pubs.opengroup.org/onlinepubs/9699919799/functions/inet_addr.html)
function or the similar common Unix [`inet_aton`](https://man.cx/inet_aton(3))
function and its equivalent bindings in many other languages. Note that these
functions typically also accept non-canonical forms (fewer than four parts, octal
or hexadecimal components) that the procedure above rejects. Similarly, the POSIX
[`inet_ntoa`](https://pubs.opengroup.org/onlinepubs/9699919799/functions/inet_ntoa.html)
function available in many languages implements the previously mentioned binary
to string address transformation.
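
Put together, the `ip4` steps could look like the following runnable sketch. The function names `ip4_to_bytes` and `ip4_from_bytes` are hypothetical, and the explicit ASCII-digit check is an assumption added here so that inputs such as `"+1"` or `" 1"`, which Python's `int()` would accept, are rejected.

```python
def ip4_to_bytes(string: str) -> bytes:
    """String -> binary for the `ip4` codec (hypothetical helper name)."""
    parts = string.split(".")
    if len(parts) != 4:
        raise ValueError("expected exactly 4 dot-separated parts")
    octets = []
    for part in parts:
        # Assumption: only plain ASCII digits are accepted, which also
        # rejects empty parts, signs, whitespace and underscores.
        if not (part.isascii() and part.isdigit()):
            raise ValueError("invalid decimal octet: %r" % part)
        octets.append(int(part, 10))
    if not all(0 <= octet <= 255 for octet in octets):
        raise ValueError("octet out of range 0-255")
    # bytes() of a list of small integers keeps them in the given order,
    # which is network byte order here.
    return bytes(octets)


def ip4_from_bytes(binary: bytes) -> str:
    """Binary -> string for the `ip4` codec (hypothetical helper name)."""
    if len(binary) != 4:
        raise ValueError("expected exactly 4 bytes")
    # Indexing or iterating bytes yields integers, so str() gives
    # base-10 text without leading zeros.
    return ".".join(str(octet) for octet in binary)
```

Raising `ValueError` instead of using `assert` keeps the checks active when Python runs with optimizations enabled.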

#### `ip6`

Encodes an IPv6 address according to the rules of [RFC 4291 section 2.2](https://tools.ietf.org/html/rfc4291#section-2.2) and [RFC 5952 section 4](https://tools.ietf.org/html/rfc5952#section-4).

Protocols using this codec must encode it as a binary value of exactly 16 bytes without
an extra length value.

* String → Binary:
  Parse the given input address string according to the rules of [RFC 4291 section 2.2](https://tools.ietf.org/html/rfc4291#section-2.2), creating a 16-byte binary string. All textual variations (upper-/lower-casing, IPv4-mapped addresses, zero-compression, stripping of leading zeros) must be supported by the parser. Note that [scoped IPv6 addresses containing a zone identifier](https://tools.ietf.org/html/draft-ietf-ipngwg-scopedaddr-format-02) must not appear in the input string; external mechanisms may be used to encode the zone identifier separately, though.
* Binary → String:
  Generate a canonical textual representation of the given binary input address according to the rules of [RFC 5952 section 4](https://tools.ietf.org/html/rfc5952#section-4). Implementations must not produce any of the variations allowed by RFC 4291 mentioned above, to ensure that all implementations produce a character-by-character identical string representation.

Converting between string and binary addresses should be done using the equivalent
of the POSIX [`inet_pton`](https://pubs.opengroup.org/onlinepubs/9699919799/functions/inet_pton.html)
and [`inet_ntop`](https://pubs.opengroup.org/onlinepubs/9699919799/functions/inet_ntop.html)
functions. Alternatively, the BSD
[`getaddrinfo`/`freeaddrinfo`](https://pubs.opengroup.org/onlinepubs/9699919799/functions/getaddrinfo.html)
and [`getnameinfo` with `NI_NUMERICHOST`](https://pubs.opengroup.org/onlinepubs/9699919799/functions/getnameinfo.html)
functions may be viable in some environments.
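
In Python, one possible way to follow the advice of using an existing library is the standard `ipaddress` module, as in the sketch below. The function names `ip6_to_bytes` and `ip6_from_bytes` are hypothetical. The explicit `%` check is an assumption added because newer Python versions accept zone identifiers in `IPv6Address`, and whether the module's string output matches RFC 5952 in every case (for example for IPv4-mapped addresses) should be verified for the Python version in use.

```python
import ipaddress


def ip6_to_bytes(string: str) -> bytes:
    """String -> binary for the `ip6` codec (hypothetical helper name)."""
    # Zone identifiers ("fe80::1%eth0") must not appear in the input.
    if "%" in string:
        raise ValueError("zone identifiers are not allowed")
    # AddressValueError, raised on malformed input, subclasses ValueError.
    return ipaddress.IPv6Address(string).packed


def ip6_from_bytes(binary: bytes) -> str:
    """Binary -> string for the `ip6` codec (hypothetical helper name)."""
    if len(binary) != 16:
        raise ValueError("expected exactly 16 bytes")
    # str() yields the lower-case, zero-compressed form.
    return str(ipaddress.IPv6Address(binary))
```

For example, the non-canonical input `2001:DB8:0:0:0:0:0:1` and the canonical `2001:db8::1` map to the same 16 bytes, and converting those bytes back yields only the latter form.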

#### `onion`

Encodes a [Tor rendezvous version 2 service pointer](https://gitweb.torproject.org/torspec.git/tree/rend-spec-v2.txt?id=471af27b55ff3894551109b45848f2ce1002441b#n525) (aka .onion-address) and the exposed service port on that system.

Protocols using this codec must encode it as a binary value of exactly 12 bytes without
an extra length value.

* String → Binary:
  1. Split the input string into 2 parts at the colon character (U+003A COLON):
     `(service_str, port_str) = string.split(":")`
  2. Decode the *service* part before the colon using base32 into binary (.onion addresses are conventionally written in lower case, so case must be folded):
     `service_bin = b32decode(service_str, casefold=True)`
  3. Convert the *port* part to a binary string as specified by the [`uint16be`](#uint16be) codec, yielding `port_bin`.
  4. Concatenate the service and port parts to obtain the final binary encoding:
     `return service_bin + port_bin`
* Binary → String:
  1. Split the binary value at the last two bytes into a service name and a port number:
     `(service_bin, port_bin) = (binary[:-2], binary[-2:])`
  2. Convert the service part into a lower-case base32 string:
     `service_str = b32encode(service_bin).decode("ascii").lower()`
  3. Convert the *port* part to text as specified by the [`uint16be`](#uint16be) codec, yielding `port_str`.
  4. Concatenate the resulting strings using a colon:
     `return service_str + ":" + port_str`
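
As a rough, runnable illustration of the `onion` steps, consider the sketch below. The function names are hypothetical, and two details are assumptions rather than requirements stated above: base32 text is handled case-insensitively on input and emitted in lower case, and the service part is checked to be exactly 10 bytes so that the total comes to the required 12 bytes.

```python
import base64


def _port_to_bytes(port_str: str) -> bytes:
    # Same rules as the `uint16be` codec: ASCII digits only, range 0-65535.
    if not (port_str.isascii() and port_str.isdigit()):
        raise ValueError("invalid port: %r" % port_str)
    port = int(port_str, 10)
    if port > 0xFFFF:
        raise ValueError("port out of range")
    return port.to_bytes(2, "big")


def onion_to_bytes(string: str) -> bytes:
    """String -> binary for the `onion` codec (hypothetical helper name)."""
    parts = string.split(":")
    if len(parts) != 2:
        raise ValueError("expected '<service>:<port>'")
    service_str, port_str = parts
    # binascii.Error, raised on invalid base32, subclasses ValueError.
    service_bin = base64.b32decode(service_str, casefold=True)
    if len(service_bin) != 10:
        raise ValueError("service part must decode to 10 bytes")
    return service_bin + _port_to_bytes(port_str)


def onion_from_bytes(binary: bytes) -> str:
    """Binary -> string for the `onion` codec (hypothetical helper name)."""
    if len(binary) != 12:
        raise ValueError("expected exactly 12 bytes")
    service_bin, port_bin = binary[:-2], binary[-2:]
    # b32encode returns upper-case bytes; convert to lower-case text.
    service_str = base64.b32encode(service_bin).decode("ascii").lower()
    return service_str + ":" + str(int.from_bytes(port_bin, "big"))
```

With these definitions, an address such as `timaq4ygg2iegci7:1234` survives a round trip through both functions unchanged.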

#### `p2p`

Encodes a libp2p node address.

TBD: Is this really always a base58btc encoded string of at least 5 characters in length!?
Member

It's anything that parses as a peer ID (the canonical encoding is now actually a libp2p key CID): https://github.com/libp2p/specs/blob/master/peer-ids/peer-ids.md.

Contributor Author

At least the go implementation does not follow that spec at all but just expects everything to be base58btc: https://github.com/multiformats/go-multiaddr/blob/master/transcoders.go#L293-L315

If you're saying that is just a bug to be fixed, I'll be happy to update this text and py-multiaddr accordingly. 🙂

That spec also fails to mention that some IDs are still encoded with 1 as first character, which is also CIDv0

Member

@Alexander255 correct (well, not a bug so much as something we need to implement).

@ShadowJonathan also correct. I'll take a stab at fixing the spec.

Contributor Author

@Stebalien: I just tried to implement that spec in Python and there is no guidance in there on what to do with binary CIDv0 IDs. I assume they are detected (similar to their string format) using `len(buf) == 34 and buf.startswith(b"\x12\x20")`?



#### `uint16be`

Encodes an unsigned 16-bit integer value (such as a port number) in network byte
order (big endian).

Protocols using this codec must encode it as a binary value of exactly 2 bytes without
an extra length value.

* String → Binary:
  1. Parse the input string as a base-10 integer:
     `integer = int(string, 10)`
  2. Verify that the integer is in the valid range for an unsigned 16-bit integer:
     `assert integer in range(65536)`
  3. Convert the integer to a 2-byte big-endian binary string:
     `return b"%c%c" % ((integer >> 8) & 0xFF, integer & 0xFF)`
* Binary → String:
  1. Convert the two input bytes to a native integer:
     `integer = binary[0] << 8 | binary[1]`
  2. Generate a base-10 string representation of this integer:
     `return str(integer)`

POSIX/BSD provides [`strtoul`](https://en.cppreference.com/w/c/string/byte/strtoul)
and [`htons`](https://pubs.opengroup.org/onlinepubs/9699919799/functions/htons.html)
for the string to binary conversion, and
[`ntohs`](https://pubs.opengroup.org/onlinepubs/9699919799/functions/ntohs.html)
and [`snprintf`](https://en.cppreference.com/w/c/io/snprintf) for performing
the inverse operation.
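
In Python, the same conversion can be sketched with the built-in `int.to_bytes` and `int.from_bytes` methods. The function names below are hypothetical, and the ASCII-digit check is an assumption added so that inputs like `"+80"`, `" 80"` or `"8_0"`, which `int()` would otherwise accept, are refused.

```python
def uint16be_to_bytes(string: str) -> bytes:
    """String -> binary for the `uint16be` codec (hypothetical helper name)."""
    # Assumption: only plain ASCII digits are valid; this also rejects
    # the empty string.
    if not (string.isascii() and string.isdigit()):
        raise ValueError("invalid base-10 integer: %r" % string)
    integer = int(string, 10)
    if integer not in range(65536):
        raise ValueError("value does not fit into 16 bits")
    # Two bytes, most significant byte first (network byte order).
    return integer.to_bytes(2, "big")


def uint16be_from_bytes(binary: bytes) -> str:
    """Binary -> string for the `uint16be` codec (hypothetical helper name)."""
    if len(binary) != 2:
        raise ValueError("expected exactly 2 bytes")
    return str(int.from_bytes(binary, "big"))
```

Using `to_bytes` and `from_bytes` avoids the manual shifting and masking while producing the same two bytes.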

## Protocols

@@ -156,4 +313,4 @@ Small note: If editing the README, please conform to the [standard-readme](https

## License

This repository is only for documents. All of these are licensed under the [CC-BY-SA 3.0](https://ipfs.io/ipfs/QmVreNvKsQmQZ83T86cWSjPu2vR3yZHGPm5jnxFuunEB9u) license, © 2016 Protocol Labs Inc. Any code is under a [MIT](LICENSE) © 2016 Protocol Labs Inc.
This repository is only for documents. All of these are licensed under the [CC-BY-SA 4.0](https://ipfs.io/ipfs/QmVreNvKsQmQZ83T86cWSjPu2vR3yZHGPm5jnxFuunEB9u) license, © 2016 Protocol Labs Inc, © 2019 Alexander Schlarb. Any code is under a [MIT](LICENSE) © 2016 Protocol Labs Inc, © 2019 Alexander Schlarb.
64 changes: 32 additions & 32 deletions protocols.csv
@@ -1,32 +1,32 @@
code, size, name, comment
4, 32, ip4,
6, 16, tcp,
273, 16, udp,
33, 16, dccp,
41, 128, ip6,
42, V, ip6zone, rfc4007 IPv6 zone
53, V, dns, domain name resolvable to both IPv6 and IPv4 addresses
54, V, dns4, domain name resolvable only to IPv4 addresses
55, V, dns6, domain name resolvable only to IPv6 addresses
56, V, dnsaddr,
132, 16, sctp,
301, 0, udt,
302, 0, utp,
400, V, unix,
421, V, p2p, preferred over /ipfs
421, V, ipfs, backwards compatibility; equivalent to /p2p
444, 96, onion,
445, 296, onion3,
446, V, garlic64,
447, V, garlic32,
460, 0, quic,
480, 0, http,
443, 0, https,
477, 0, ws,
478, 0, wss,
479, 0, p2p-websocket-star,
277, 0, p2p-stardust,
275, 0, p2p-webrtc-star,
276, 0, p2p-webrtc-direct,
290, 0, p2p-circuit,
777, V, memory, in memory transport for self-dialing and testing; arbitrary
code, size, name, codec, comment
4, 32, ip4, ip4,
6, 16, tcp, uint16be,
273, 16, udp, uint16be,
33, 16, dccp, uint16be,
41, 128, ip6, ip6,
42, V, ip6zone, ?, rfc4007 IPv6 zone
53, V, dns, domain, domain name resolvable to both IPv6 and IPv4 addresses
54, V, dns4, domain, domain name resolvable only to IPv4 addresses
55, V, dns6, domain, domain name resolvable only to IPv6 addresses
56, V, dnsaddr, domain,
132, 16, sctp, uint16be,
301, 0, udt, –,
302, 0, utp, –,
400, V, unix, fspath,
421, V, p2p, p2p, preferred over /ipfs
421, V, ipfs, p2p, backwards compatibility; equivalent to /p2p
444, 96, onion, onion,
445, 296, onion3, ?,
446, V, garlic64, ?,
447, V, garlic32, ?,
460, 0, quic, –,
480, 0, http, –,
443, 0, https, –,
477, 0, ws, –,
478, 0, wss, –,
479, 0, p2p-websocket-star, –,
277, 0, p2p-stardust, –,
275, 0, p2p-webrtc-star, –,
276, 0, p2p-webrtc-direct, –,
290, 0, p2p-circuit, –,
777, V, memory, –, in memory transport for self-dialing and testing; arbitrary