multiformats · ntninja · Sep 1, 2019 · Sep 1, 2019 · Jan 12, 2020 · Aug 14, 2020
diff --git a/README.md b/README.md
@@ -104,6 +104,163 @@ TODO: specify the encoding (byte-array to string) procedure
 
 TODO: specify the decoding (string to byte-array) procedure
 
+### Codecs
+
+Depending on the protocol of a Multiaddr component, a different algorithm is used to
+convert its value from/to its binary representation. The name of the codec to use
+for each protocol is noted in [protocols.csv](protocols.csv).
+
+In general, empty values in the string representation are always disallowed unless
+explicitly noted otherwise. In case of conversion errors, implementations must
+refuse to process the given string/binary value and report the error to the caller
+instead.
+
+Depending on the codec, values are either encoded using the standard variable
+length encoding style, or as a specific static-length binary value without the
+extra length information if this is noted in the respective codec's description.
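This section does not define the "standard variable length encoding style". As a rough sketch only, assuming it means prefixing the value with its byte length as an unsigned varint (the convention used elsewhere in multiformats), framing a variable-length value could look like this. The names `encode_uvarint` and `length_prefixed` are illustrative and not part of any specification.

```python
def encode_uvarint(value: int) -> bytes:
    """Encode a non-negative integer as an unsigned varint.

    Seven payload bits per byte, least significant group first; the high
    bit of each byte signals that more bytes follow.
    """
    if value < 0:
        raise ValueError("unsigned varint cannot encode negative numbers")
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # continuation bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def length_prefixed(payload: bytes) -> bytes:
    """Assumed framing for variable-length codecs: varint length, then payload."""
    return encode_uvarint(len(payload)) + payload
```

Static-length codecs such as `ip4` would skip `length_prefixed` and emit the raw bytes directly.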
+
+All code examples are written in Python-based pseudo code and are optimized for
+legibility rather than speed. In general you should always use existing libraries
+and functions for performing the below conversions rather than rolling your own.
+
+#### `fspath`
+
+Encodes a local file system path with unspecified binary encoding. On platforms
+not using POSIX-style forward slashes (`/`) for delimiting individual path
+labels, such as Windows, implementations should automatically convert such
+paths from their POSIX representation as necessary.
+
+Protocols using the `fspath` encoding are only valid for the system they were
+created for and must not be shared between different hosts.
+
+#### `domain`
+
+Encodes the given Unicode representation to the UTF-8 character encoding ([RFC 3629 Section 3](https://tools.ietf.org/html/rfc3629#section-3)), while using the [UTS-46 / RFC 5890](https://tools.ietf.org/html/rfc5890) input normalization and processing rules for canonicalization.
+
+* String → Binary:
+   1. If feasible, normalize and validate the given input string according to [UTS-46 Section 4 (Processing)](https://www.unicode.org/reports/tr46/#Processing) and [UTS-46 Section 4.1 (Validity Criteria)](https://www.unicode.org/reports/tr46/#Validity_Criteria) with the following parameters:
+       * UseSTD3ASCIIRules = true
+       * CheckHyphens = true
+       * CheckBidi = true
+       * CheckJoiners = true
+       * Transitional_Processing = false
+   2. Convert the Unicode string to the UTF-8 character encoding as per [RFC 3629 Section 3 §4](https://tools.ietf.org/html/rfc3629#section-3).
+* Binary → String:  
+  Convert the UTF-8 encoded binary string to Unicode according to the rules of [RFC 3629 Section 3 §6](https://tools.ietf.org/html/rfc3629#page-5).
+
+Examples of libraries for performing the above normalization step include the `idna.uts46_remap` function of the [Python idna](https://pypi.org/project/idna/) library.
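A minimal sketch of the two UTF-8 conversion steps, using only the Python standard library. It deliberately leaves out the UTS-46 normalization and validation step (which the text marks as "if feasible" and which needs a library such as `idna`), so it is not a complete implementation of the codec. The function names are illustrative.

```python
def domain_to_bytes(text: str) -> bytes:
    """String -> binary: UTF-8 encode an already normalized domain name."""
    if not text:
        raise ValueError("empty values are not allowed")
    # Strict encoding: lone surrogates raise UnicodeEncodeError,
    # which is a subclass of ValueError.
    return text.encode("utf-8")


def domain_to_str(binary: bytes) -> str:
    """Binary -> string: strictly decode UTF-8, rejecting malformed input."""
    if not binary:
        raise ValueError("empty values are not allowed")
    # Strict decoding: invalid byte sequences raise UnicodeDecodeError,
    # which is also a subclass of ValueError.
    return binary.decode("utf-8")
```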
+
+#### `ip4`
+
+Encodes an IPv4 address according to the conventional [dot-decimal notation](https://en.wikipedia.org/wiki/Dot-decimal_notation) first specified in [RFC 3986 section 3.2.2 page 20 § 2](https://tools.ietf.org/html/rfc3986#page-20).
+
+Protocols using this codec must encode it as a binary value of exactly 4 bytes without
+an extra length value.
+
+ * String → Binary:
+    1. Split the input string into parts at each dot (U+002E FULL STOP):  
+       `parts = str.split(".")`
+    2. Assert that exactly 4 string parts were created by the split operation:  
+       `assert len(parts) == 4`
+    3. Convert each part from its ASCII base-10 number representation to an integer type, aborting if the conversion fails for any of the decimal string parts:  
+       `octets = [int(p) for p in parts]`
+    4. Validate that each part of the resulting integer list is in range 0 – 255:  
+       `assert all(i in range(0, 256) for i in octets)`
+    5. Copy each of the resulting integers into a binary string of length 4 in network byte-order:  
+       `return b"%c%c%c%c" % (octets[0], octets[1], octets[2], octets[3])`
+ * Binary → String:
+    1. Take the four bytes of the binary input and convert each to its equivalent base-10 ASCII representation without any leading zeros:  
+       `octets = [str(binary[idx]) for idx in range(4)]`
+    2. Concatenate the resulting list of stringified octets using dots (U+002E FULL STOP):  
+       `return ".".join(octets)`
+
+Converting from string to binary addresses may be done using the POSIX
+[`inet_addr`](https://pubs.opengroup.org/onlinepubs/9699919799/functions/inet_addr.html)
+function or the similar common Unix [`inet_aton`](https://man.cx/inet_aton(3))
+function and its equivalent bindings in many other languages. Similarly, the POSIX
+[`inet_ntoa`](https://pubs.opengroup.org/onlinepubs/9699919799/functions/inet_ntoa.html)
+function available in many languages implements the previously mentioned binary
+to string address transformation.
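The `ip4` steps can be collected into two small functions. This is a sketch rather than a reference implementation: the function names are made up for illustration, and it raises `ValueError` instead of using `assert` so that the checks survive optimized runs. Leading zeros in the input (such as `01`) are accepted here because the steps above do not forbid them; a stricter implementation might reject them.

```python
def ip4_to_bytes(text: str) -> bytes:
    """String -> binary: parse dot-decimal notation into 4 bytes."""
    parts = text.split(".")
    if len(parts) != 4:
        raise ValueError("expected exactly 4 dot-separated parts")
    octets = []
    for part in parts:
        # Accept ASCII decimal digits only; int() on its own would also
        # accept signs, surrounding whitespace and underscores.
        if not (part.isascii() and part.isdigit()):
            raise ValueError(f"not a base-10 number: {part!r}")
        value = int(part)
        if value > 255:
            raise ValueError(f"octet out of range: {value}")
        octets.append(value)
    # bytes() keeps the octets in the order given, i.e. network byte order.
    return bytes(octets)


def ip4_to_str(binary: bytes) -> str:
    """Binary -> string: render 4 bytes as dot-decimal without leading zeros."""
    if len(binary) != 4:
        raise ValueError("expected exactly 4 bytes")
    return ".".join(str(octet) for octet in binary)
```

For example, `ip4_to_bytes("127.0.0.1")` yields `b"\x7f\x00\x00\x01"`, and `ip4_to_str` maps that value back to the same string.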
+
+#### `ip6`
+
+Encodes an IPv6 address according to the rules of [RFC 4291 section 2.2](https://tools.ietf.org/html/rfc4291#section-2.2) and [RFC 5952 section 4](https://tools.ietf.org/html/rfc5952#section-4).
+
+Protocols using this codec must encode it as a binary value of exactly 16 bytes without
+an extra length value.
+
+ * String → Binary:  
+   Parse the given input address string according to the rules of [RFC 4291 section 2.2](https://tools.ietf.org/html/rfc4291#section-2.2) creating a 16-byte binary string. All textual variations (upper-/lower-casing, IPv4-mapped addresses, zero-compression, stripping of leading zeros) must be supported by the parser. Note that [scoped IPv6 addresses containing a zone identifier](https://tools.ietf.org/html/draft-ietf-ipngwg-scopedaddr-format-02) must not appear in the input string; external mechanisms may be used to encode the zone identifier separately, though.
+ * Binary → String:  
+   Generate a canonical textual representation of the given binary input address according to the rules of [RFC 5952 section 4](https://tools.ietf.org/html/rfc5952#section-4). Implementations must not produce any of the variations allowed by RFC 4291 mentioned above, to ensure that all implementations produce a character-by-character identical string representation.
+
+Converting between string and binary addresses should be done using the equivalent
+of the POSIX [`inet_pton`](https://pubs.opengroup.org/onlinepubs/9699919799/functions/inet_pton.html)
+and [`inet_ntop`](https://pubs.opengroup.org/onlinepubs/9699919799/functions/inet_ntop.html)
+functions. Alternatively, the BSD
+[`getaddrinfo`/`freeaddrinfo`](https://pubs.opengroup.org/onlinepubs/9699919799/functions/getaddrinfo.html)
+and [`getnameinfo` with `NI_NUMERICHOST`](https://pubs.opengroup.org/onlinepubs/9699919799/functions/getnameinfo.html)
+functions may be viable for some environments.
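In Python, one possible way to follow the advice of relying on existing functions is the standard `ipaddress` module. The sketch below assumes that `str()` of an `IPv6Address` yields the lower-case, zero-compressed form required here; the rendering of IPv4-mapped addresses in particular has varied between Python versions and should be checked against RFC 5952 for the version in use. Zone identifiers are rejected explicitly because newer Python versions would otherwise accept them. The function names are illustrative.

```python
import ipaddress


def ip6_to_bytes(text: str) -> bytes:
    """String -> binary: parse any accepted textual form into 16 bytes."""
    if "%" in text:
        # Scoped addresses carry their zone after '%'; they are not allowed.
        raise ValueError("zone identifiers must not appear in the address")
    # AddressValueError raised on bad input is a subclass of ValueError.
    return ipaddress.IPv6Address(text).packed


def ip6_to_str(binary: bytes) -> str:
    """Binary -> string: render 16 bytes in compressed lower-case form."""
    if len(binary) != 16:
        raise ValueError("expected exactly 16 bytes")
    return str(ipaddress.IPv6Address(binary))
```

With these helpers, differently spelled inputs such as `2001:0DB8:0:0:0:0:0:1` and `2001:db8::1` map to the same 16 bytes and therefore to the same output string.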
+
+#### `onion`
+
+Encodes a [TOR rendezvous version 2 service pointer](https://gitweb.torproject.org/torspec.git/tree/rend-spec-v2.txt?id=471af27b55ff3894551109b45848f2ce1002441b#n525) (aka .onion-address) and exposed service port on that system.
+
+Protocols using this codec must encode it as a binary value of exactly 12 bytes without
+an extra length value.
+
+ * String → Binary:
+    1. Split the input string into 2 parts at the colon character (U+003A COLON):  
+       `(service_str, port_str) = str.split(":")`
+    2. Decode the *service* part before the colon using base32 into binary:  
+       `service_bin = b32decode(service_str)`
+    3. Convert the *port* part to a binary string as specified by the [`uint16be`](#uint16be) codec.
+    4. Concatenate the service and port parts to obtain the final binary encoding:  
+       `return service_bin + port_bin`
+ * Binary → String:
+    1. Split the binary value at the last two bytes into a service name and a port
+       number:  
+       `(service_bin, port_bin) = (binary[:-2], binary[-2:])`
+    2. Convert the service part into a base32 string:  
+       `service_str = b32encode(service_bin)`
+    3. Convert the *port* part to text as specified by the [`uint16be`](#uint16be) codec.
+    4. Concatenate the result strings using a colon:  
+       `return service_str + ":" + port_str`
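The `onion` steps can be sketched with Python's standard `base64` module. Several details are assumptions rather than requirements stated above: the service name is expected to be exactly 16 base32 characters (which follows from 12 bytes minus the 2-byte port), lower-case input is accepted via `casefold=True`, and the output is lower-cased because that is the usual spelling of onion names. The port conversion is inlined so that the block stands alone; the function names are illustrative.

```python
import base64


def onion_to_bytes(text: str) -> bytes:
    """String -> binary: '<service>:<port>' into 10 + 2 bytes."""
    parts = text.split(":")
    if len(parts) != 2:
        raise ValueError("expected '<service>:<port>'")
    service_str, port_str = parts
    if len(service_str) != 16:
        raise ValueError("expected a 16 character base32 service name")
    # binascii.Error raised on invalid base32 is a subclass of ValueError.
    service_bin = base64.b32decode(service_str, casefold=True)
    if not (port_str.isascii() and port_str.isdigit()):
        raise ValueError("port must be a base-10 number")
    port = int(port_str)
    if port > 0xFFFF:
        raise ValueError("port does not fit into 16 bits")
    return service_bin + port.to_bytes(2, "big")


def onion_to_str(binary: bytes) -> str:
    """Binary -> string: 12 bytes into '<service>:<port>'."""
    if len(binary) != 12:
        raise ValueError("expected exactly 12 bytes")
    service_bin, port_bin = binary[:-2], binary[-2:]
    # Lower-casing is an assumption; b32encode itself emits upper case.
    service_str = base64.b32encode(service_bin).decode("ascii").lower()
    return service_str + ":" + str(int.from_bytes(port_bin, "big"))
```

Because 16 base32 characters carry exactly 80 bits, no padding is involved and a lower-case address round-trips unchanged through both functions.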
+
+#### `p2p`
+
+Encodes a libp2p node address.
+
+TBD: Is this really always a base58btc encoded string of at least 5 characters in length!?
+
+
+#### `uint16be`
+
+Encodes an unsigned 16-bit integer value (such as a port number) in network byte
+order (big endian).
+
+Protocols using this codec must encode it as a binary value of exactly 2 bytes without
+an extra length value.
+
+ * String → Binary:
+    1. Parse the input string as base-10 integer:  
+       `integer = int(str, 10)`
+    2. Verify that the integer is in the valid range for an unsigned 16-bit integer:  
+       `assert integer in range(65536)`
+    3. Convert the integer to a 2-byte long big endian binary string:  
+       `return b"%c%c" % ((integer >> 8) & 0xFF, integer & 0xFF)`
+ * Binary → String:
+    1. Convert the two input bytes to a native integer:  
+       `integer = port_bin[0] << 8 | port_bin[1]`
+    2. Generate a base-10 string representation from this integer:  
+       `return str(integer)`
+
+POSIX/BSD provides [`strtoul`](https://en.cppreference.com/w/c/string/byte/strtoul)
+and [`htons`](https://pubs.opengroup.org/onlinepubs/9699919799/functions/htons.html)
+for the string to binary conversion and
+[`ntohs`](https://pubs.opengroup.org/onlinepubs/9699919799/functions/ntohs.html)
+and [`snprintf`](https://en.cppreference.com/w/c/io/snprintf) for performing
+the inverse operation.
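In Python, the same `uint16be` conversions can be written with `int.to_bytes` and `int.from_bytes` instead of manual shifting. This is a sketch with illustrative function names; it rejects anything other than plain ASCII digits, which is stricter than `int()` alone (that would also accept signs, whitespace and underscores).

```python
def uint16be_to_bytes(text: str) -> bytes:
    """String -> binary: base-10 text into 2 big-endian bytes."""
    if not (text.isascii() and text.isdigit()):
        raise ValueError("expected a base-10 number")
    value = int(text, 10)
    if value > 0xFFFF:
        raise ValueError("value does not fit into 16 bits")
    return value.to_bytes(2, "big")


def uint16be_to_str(binary: bytes) -> str:
    """Binary -> string: 2 big-endian bytes into base-10 text."""
    if len(binary) != 2:
        raise ValueError("expected exactly 2 bytes")
    return str(int.from_bytes(binary, "big"))
```

For instance, port `4001` encodes to `b"\x0f\xa1"`, and `b"\x1f\x90"` decodes to `"8080"`.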
 
 ## Protocols
 
@@ -156,4 +313,4 @@ Small note: If editing the README, please conform to the [standard-readme](https
 
 ## License
 
-This repository is only for documents. All of these are licensed under the [CC-BY-SA 3.0](https://ipfs.io/ipfs/QmVreNvKsQmQZ83T86cWSjPu2vR3yZHGPm5jnxFuunEB9u) license, © 2016 Protocol Labs Inc. Any code is under a [MIT](LICENSE) © 2016 Protocol Labs Inc.
+This repository is only for documents. All of these are licensed under the [CC-BY-SA 4.0](https://ipfs.io/ipfs/QmVreNvKsQmQZ83T86cWSjPu2vR3yZHGPm5jnxFuunEB9u) license, © 2016 Protocol Labs Inc, © 2019 Alexander Schlarb. Any code is under a [MIT](LICENSE) © 2016 Protocol Labs Inc, © 2019 Alexander Schlarb.
diff --git a/protocols.csv b/protocols.csv
@@ -1,32 +1,32 @@
-code,	size,	name,	comment
-4,	32,	ip4,
-6,	16,	tcp,
-273,	16,	udp,
-33,	16,	dccp,
-41,	128,	ip6,
-42,	V,	ip6zone,	rfc4007 IPv6 zone
-53,	V,	dns,  domain name resolvable to both IPv6 and IPv4 addresses
-54,	V,	dns4, domain name resolvable only to IPv4 addresses
-55,	V,	dns6, domain name resolvable only to IPv6 addresses
-56,	V,	dnsaddr,
-132,	16,	sctp,
-301,	0,	udt,
-302,	0,	utp,
-400,	V,	unix,
-421,	V,	p2p,	preferred over /ipfs
-421,	V,	ipfs,	backwards compatibility; equivalent to /p2p
-444,	96,	onion,
-445,	296,	onion3,
-446,	V,	garlic64,
-447,	V,	garlic32,
-460,	0,	quic,
-480,	0,	http,
-443,	0,	https,
-477,	0,	ws,
-478,	0,	wss,
-479,	0,	p2p-websocket-star,
-277,	0,	p2p-stardust,
-275,	0,	p2p-webrtc-star,
-276,	0,	p2p-webrtc-direct,
-290,	0,	p2p-circuit,
-777,	V, memory, in memory transport for self-dialing and testing; arbitrary 
+code,	size,	name,	codec,	comment
+4,	32,	ip4,	ip4,
+6,	16,	tcp,	uint16be,
+273,	16,	udp,	uint16be,
+33,	16,	dccp,	uint16be,
+41,	128,	ip6,	ip6,
+42,	V,	ip6zone,	?,	rfc4007 IPv6 zone
+53,	V,	dns,	domain,	domain name resolvable to both IPv6 and IPv4 addresses
+54,	V,	dns4,	domain,	domain name resolvable only to IPv4 addresses
+55,	V,	dns6,	domain,	domain name resolvable only to IPv6 addresses
+56,	V,	dnsaddr,	domain,
+132,	16,	sctp,	uint16be,
+301,	0,	udt,	–,
+302,	0,	utp,	–,
+400,	V,	unix,	fspath,
+421,	V,	p2p,	p2p,	preferred over /ipfs
+421,	V,	ipfs,	p2p,	backwards compatibility; equivalent to /p2p
+444,	96,	onion,	onion,
+445,	296,	onion3,	?,
+446,	V,	garlic64,	?,
+447,	V,	garlic32,	?,
+460,	0,	quic,	–,
+480,	0,	http,	–,
+443,	0,	https,	–,
+477,	0,	ws,	–,
+478,	0,	wss,	–,
+479,	0,	p2p-websocket-star,	–,
+277,	0,	p2p-stardust,	–,
+275,	0,	p2p-webrtc-star,	–,
+276,	0,	p2p-webrtc-direct,	–,
+290,	0,	p2p-circuit,	–,
+777,	V,	memory,	–,	in memory transport for self-dialing and testing; arbitrary