diff --git a/README.md b/README.md
index 00b0343..79abc87 100644
--- a/README.md
+++ b/README.md
@@ -104,6 +104,163 @@ TODO: specify the encoding (byte-array to string) procedure
 
 TODO: specify the decoding (string to byte-array) procedure
 
+### Codecs
+
+Depending on the protocol of a Multiaddr component, different algorithms are used to
+convert its value from/to the binary representation. The name of the codec to use
+for each protocol is noted in [protocols.csv](protocols.csv).
+
+In general, empty values in the string representation are disallowed unless
+explicitly noted otherwise. In case of conversion errors, implementations must
+refuse to process the given string/binary value and report the error to the caller
+instead.
+
+Depending on the codec, values are either encoded using the standard variable-length
+encoding style or, if this is noted in the respective codec's description, as a
+static-length binary value without the extra length information.
+
+All code examples are written in Python-based pseudo code and are optimized for
+legibility rather than speed. In general you should always use existing libraries
+and functions for performing the below conversions rather than rolling your own.
+
+#### `fspath`
+
+Encodes a local file system path with unspecified binary encoding. On platforms
+not using POSIX-style forward slashes (`/`) for delimiting individual path
+labels, such as Windows, implementations should automatically convert such
+paths from their POSIX representation as necessary.
+
+Protocols using the `fspath` encoding are only valid for the system they were
+created for and must not be shared between different hosts.
+
+#### `domain`
+
+Encodes the given Unicode representation to the UTF-8 character encoding ([RFC 3629 Section 3](https://tools.ietf.org/html/rfc3629#section-3)), while using the [UTS-46 / RFC 5890](https://tools.ietf.org/html/rfc5890) input normalization and processing rules for canonicalization.
+
+* String → Binary:
+   1. If feasible, normalize and validate the given input string according to [UTS-46 Section 4 (Processing)](https://www.unicode.org/reports/tr46/#Processing) and [UTS-46 Section 4.1 (Validity Criteria)](https://www.unicode.org/reports/tr46/#Validity_Criteria) with the following parameters:
+      * UseSTD3ASCIIRules = true
+      * CheckHyphens = true
+      * CheckBidi = true
+      * CheckJoiners = true
+      * Transitional_Processing = false
+   2. Convert the Unicode string to the UTF-8 character encoding as per [RFC 3629 Section 3 §4](https://tools.ietf.org/html/rfc3629#section-3).
+* Binary → String:
+   Convert the UTF-8 encoded binary string to Unicode according to the rules of [RFC 3629 Section 3 §6](https://tools.ietf.org/html/rfc3629#page-5).
+
+Examples of libraries for performing the above normalization step include the `idna.uts46_remap` function of the [Python idna](https://pypi.org/project/idna/) library.
+
+#### `ip4`
+
+Encodes an IPv4 address according to the conventional [dot-decimal notation](https://en.wikipedia.org/wiki/Dot-decimal_notation) first specified in [RFC 3986 section 3.2.2 page 20 § 2](https://tools.ietf.org/html/rfc3986#page-20).
+
+Protocols using this codec must encode it as a binary value of exactly 4 bytes without
+an extra length value.
+
+ * String → Binary:
+    1. Split the input string into parts at each dot (U+002E FULL STOP):
+       `parts = str.split(".")`
+    2. Assert that exactly 4 string parts were created by the split operation:
+       `assert len(parts) == 4`
+    3. Convert each part from its ASCII base-10 number representation to an integer type, aborting if the conversion fails for any of the decimal string parts:
+       `octets = [int(p) for p in parts]`
+    4. Validate that each part of the resulting integer list is in the range 0 – 255:
+       `assert all(i in range(0, 256) for i in octets)`
+    5. Copy each of the resulting integers into a binary string of length 4 in network byte-order:
+       `return b"%c%c%c%c" % (octets[0], octets[1], octets[2], octets[3])`
+ * Binary → String:
+    1. Take the four bytes of the binary input and convert each to its equivalent base-10 ASCII representation without any leading zeros:
+       `octets = [str(binary[idx]) for idx in range(4)]`
+    2. Concatenate the resulting list of stringified octets using dots (U+002E FULL STOP):
+       `return ".".join(octets)`
+
+Converting from string to binary addresses may be done using the POSIX
+[`inet_addr`](https://pubs.opengroup.org/onlinepubs/9699919799/functions/inet_addr.html)
+function or the similar common Unix [`inet_aton`](https://man.cx/inet_aton(3))
+function and its equivalent bindings in many other languages. Note that both
+functions also accept shorthand and octal/hexadecimal forms that the steps above
+reject, so their input may need to be validated first. Similarly, the POSIX
+[`inet_ntoa`](https://pubs.opengroup.org/onlinepubs/9699919799/functions/inet_ntoa.html)
+function, available in many languages, implements the previously mentioned binary
+to string address transformation.
+
+#### `ip6`
+
+Encodes an IPv6 address according to the rules of [RFC 4291 section 2.2](https://tools.ietf.org/html/rfc4291#section-2.2) and [RFC 5952 section 4](https://tools.ietf.org/html/rfc5952#section-4).
+
+Protocols using this codec must encode it as a binary value of exactly 16 bytes without
+an extra length value.
+
+ * String → Binary:
+   Parse the given input address string according to the rules of [RFC 4291 section 2.2](https://tools.ietf.org/html/rfc4291#section-2.2) creating a 16-byte binary string. All textual variations (upper-/lower-casing, IPv4-mapped addresses, zero-compression, stripping of leading zeros) must be supported by the parser. Note that [scoped IPv6 addresses containing a zone identifier](https://tools.ietf.org/html/draft-ietf-ipngwg-scopedaddr-format-02) must not appear in the input string; external mechanisms may be used to encode the zone identifier separately, though.
+ * Binary → String:
+   Generate a canonical textual representation of the given binary input address according to the rules of [RFC 5952 section 4](https://tools.ietf.org/html/rfc5952#section-4). Implementations must not produce any of the variations allowed by RFC 4291 mentioned above, to ensure that all implementations produce a character-by-character identical string representation.
+
+Converting between string and binary addresses should be done using the equivalent
+of the POSIX [`inet_pton`](https://pubs.opengroup.org/onlinepubs/9699919799/functions/inet_pton.html)
+and [`inet_ntop`](https://pubs.opengroup.org/onlinepubs/9699919799/functions/inet_ntop.html)
+functions. Alternatively, the BSD
+[`getaddrinfo`/`freeaddrinfo`](https://pubs.opengroup.org/onlinepubs/9699919799/functions/getaddrinfo.html)
+and [`getnameinfo` with `NI_NUMERICHOST`](https://pubs.opengroup.org/onlinepubs/9699919799/functions/getnameinfo.html)
+functions may be viable in some environments.
+
+#### `onion`
+
+Encodes a [TOR rendezvous version 2 service pointer](https://gitweb.torproject.org/torspec.git/tree/rend-spec-v2.txt?id=471af27b55ff3894551109b45848f2ce1002441b#n525) (aka .onion-address) and exposed service port on that system.
+
+Protocols using this codec must encode it as a binary value of exactly 12 bytes without
+an extra length value.
+
+ * String → Binary:
+    1. Split the input string into 2 parts at the colon character (U+003A COLON):
+       `(service_str, port_str) = str.split(":")`
+    2. Decode the *service* part before the colon using base32 into binary:
+       `service_bin = b32decode(service_str)`
+    3. Convert the *port* part to a binary string as specified by the [`uint16be`](#uint16be) codec.
+    4. Concatenate the service and port parts to obtain the final binary encoding:
+       `return service_bin + port_bin`
+ * Binary → String:
+    1. Split the binary value at the last two bytes into a service name and a port
+       number:
+       `(service_bin, port_bin) = (binary[:-2], binary[-2:])`
+    2. Convert the service part into a base32 string:
+       `service_str = b32encode(service_bin)`
+    3. Convert the *port* part to text as specified by the [`uint16be`](#uint16be) codec.
+    4. Concatenate the result strings using a colon:
+       `return service_str + ":" + port_str`
+
+#### `p2p`
+
+Encodes a libp2p node address.
+
+TBD: Is this really always a base58btc encoded string of at least 5 characters in length!?
+
+
+#### `uint16be`
+
+Encodes an unsigned 16-bit integer value (such as a port number) in network byte
+order (big endian).
+
+Protocols using this codec must encode it as a binary value of exactly 2 bytes without
+an extra length value.
+
+ * String → Binary:
+    1. Parse the input string as a base-10 integer:
+       `integer = int(str, 10)`
+    2. Verify that the integer is in the valid range for an unsigned 16-bit integer:
+       `assert integer in range(65536)`
+    3. Convert the integer to a 2-byte long big endian binary string:
+       `return b"%c%c" % ((integer >> 8) & 0xFF, integer & 0xFF)`
+ * Binary → String:
+    1. Convert the two input bytes to a native integer:
+       `integer = binary[0] << 8 | binary[1]`
+    2. Generate a base-10 string representation from this integer:
+       `return str(integer)`
+
+POSIX/BSD provides [`strtoul`](https://en.cppreference.com/w/c/string/byte/strtoul)
+and [`htons`](https://pubs.opengroup.org/onlinepubs/9699919799/functions/htons.html)
+for the string to binary conversion and
+[`ntohs`](https://pubs.opengroup.org/onlinepubs/9699919799/functions/ntohs.html)
+and [`snprintf`](https://en.cppreference.com/w/c/io/snprintf) for performing
+the inverse operation.
 
 ## Protocols
 
@@ -156,4 +313,4 @@ Small note: If editing the README, please conform to the [standard-readme](https
 
 ## License
 
-This repository is only for documents. All of these are licensed under the [CC-BY-SA 3.0](https://ipfs.io/ipfs/QmVreNvKsQmQZ83T86cWSjPu2vR3yZHGPm5jnxFuunEB9u) license, © 2016 Protocol Labs Inc. Any code is under a [MIT](LICENSE) © 2016 Protocol Labs Inc.
+This repository is only for documents. All of these are licensed under the [CC-BY-SA 4.0](https://ipfs.io/ipfs/QmVreNvKsQmQZ83T86cWSjPu2vR3yZHGPm5jnxFuunEB9u) license, © 2016 Protocol Labs Inc, © 2019 Alexander Schlarb. Any code is under a [MIT](LICENSE) © 2016 Protocol Labs Inc, © 2019 Alexander Schlarb.
diff --git a/protocols.csv b/protocols.csv
index b028881..4b8078f 100644
--- a/protocols.csv
+++ b/protocols.csv
@@ -1,32 +1,32 @@
-code, size, name, comment
-4, 32, ip4,
-6, 16, tcp,
-273, 16, udp,
-33, 16, dccp,
-41, 128, ip6,
-42, V, ip6zone, rfc4007 IPv6 zone
-53, V, dns, domain name resolvable to both IPv6 and IPv4 addresses
-54, V, dns4, domain name resolvable only to IPv4 addresses
-55, V, dns6, domain name resolvable only to IPv6 addresses
-56, V, dnsaddr,
-132, 16, sctp,
-301, 0, udt,
-302, 0, utp,
-400, V, unix,
-421, V, p2p, preferred over /ipfs
-421, V, ipfs, backwards compatibility; equivalent to /p2p
-444, 96, onion,
-445, 296, onion3,
-446, V, garlic64,
-447, V, garlic32,
-460, 0, quic,
-480, 0, http,
-443, 0, https,
-477, 0, ws,
-478, 0, wss,
-479, 0, p2p-websocket-star,
-277, 0, p2p-stardust,
-275, 0, p2p-webrtc-star,
-276, 0, p2p-webrtc-direct,
-290, 0, p2p-circuit,
-777, V, memory, in memory transport for self-dialing and testing; arbitrary
+code, size, name, codec, comment
+4, 32, ip4, ip4,
+6, 16, tcp, uint16be,
+273, 16, udp, uint16be,
+33, 16, dccp, uint16be,
+41, 128, ip6, ip6,
+42, V, ip6zone, ?, rfc4007 IPv6 zone
+53, V, dns, domain, domain name resolvable to both IPv6 and IPv4 addresses
+54, V, dns4, domain, domain name resolvable only to IPv4 addresses
+55, V, dns6, domain, domain name resolvable only to IPv6 addresses
+56, V, dnsaddr, domain,
+132, 16, sctp, uint16be,
+301, 0, udt, –,
+302, 0, utp, –,
+400, V, unix, fspath,
+421, V, p2p, p2p, preferred over /ipfs
+421, V, ipfs, p2p, backwards compatibility; equivalent to /p2p
+444, 96, onion, onion,
+445, 296, onion3, ?,
+446, V, garlic64, ?,
+447, V, garlic32, ?,
+460, 0, quic, –,
+480, 0, http, –,
+443, 0, https, –,
+477, 0, ws, –,
+478, 0, wss, –,
+479, 0, p2p-websocket-star, –,
+277, 0, p2p-stardust, –,
+275, 0, p2p-webrtc-star, –,
+276, 0, p2p-webrtc-direct, –,
+290, 0, p2p-circuit, –,
+777, V, memory, –, in memory transport for self-dialing and testing; arbitrary
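
As a rough, non-normative illustration of the `ip4`, `uint16be` and `onion` conversions described in the README changes above, the following Python sketch uses only the standard library. The function names, the strict ASCII-digit check, the 10-byte service length check and the lower-casing of the base32 output are assumptions made for this example; none of them are part of the patch.

```python
import base64


def _parse_decimal(text: str, limit: int) -> int:
    """Parse plain ASCII base-10 digits and require 0 <= value < limit."""
    # Assumption: stricter than int(), which also accepts signs and spaces.
    if not (text.isascii() and text.isdigit()):
        raise ValueError(f"not a base-10 integer: {text!r}")
    value = int(text, 10)
    if value >= limit:
        raise ValueError(f"value out of range: {text!r}")
    return value


def uint16be_to_bytes(text: str) -> bytes:
    """`uint16be` String -> Binary: 2 bytes in network byte order."""
    return _parse_decimal(text, 1 << 16).to_bytes(2, "big")


def uint16be_to_str(binary: bytes) -> str:
    """`uint16be` Binary -> String."""
    if len(binary) != 2:
        raise ValueError("uint16be values must be exactly 2 bytes")
    return str(int.from_bytes(binary, "big"))


def ip4_to_bytes(text: str) -> bytes:
    """`ip4` String -> Binary: four dot-separated decimal octets."""
    parts = text.split(".")
    if len(parts) != 4:
        raise ValueError(f"expected 4 octets: {text!r}")
    return bytes(_parse_decimal(part, 256) for part in parts)


def ip4_to_str(binary: bytes) -> str:
    """`ip4` Binary -> String."""
    if len(binary) != 4:
        raise ValueError("ip4 values must be exactly 4 bytes")
    return ".".join(str(octet) for octet in binary)


def onion_to_bytes(text: str) -> bytes:
    """`onion` String -> Binary: 10 service bytes followed by a 2-byte port."""
    service_str, sep, port_str = text.partition(":")
    if not sep:
        raise ValueError(f"missing port separator: {text!r}")
    # casefold=True accepts the lower-case form in which onion names are
    # usually written; invalid base32 raises binascii.Error (a ValueError).
    service_bin = base64.b32decode(service_str, casefold=True)
    if len(service_bin) != 10:
        raise ValueError("onion service part must decode to 10 bytes")
    return service_bin + uint16be_to_bytes(port_str)


def onion_to_str(binary: bytes) -> str:
    """`onion` Binary -> String."""
    if len(binary) != 12:
        raise ValueError("onion values must be exactly 12 bytes")
    service_bin, port_bin = binary[:-2], binary[-2:]
    # Assumption: emit lower case, since b32encode returns upper-case output.
    service_str = base64.b32encode(service_bin).decode("ascii").lower()
    return service_str + ":" + uint16be_to_str(port_bin)
```

Conversion failures surface as `ValueError`, which matches the requirement that an implementation refuses the value and reports the error to the caller.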