RFC: How should a digest_size > hash function's output be handled? #16
Comments
Thanks for this, great bug catch. I think this should probably be an error. Though we could define all extra bytes to always be zeros; that way one could compare sha1 and sha256 at the same lengths, or something.
I don't follow. What would be the use case for comparing digests from two different algorithms?
@moreati Oh, for example in a Kademlia DHT, for XOR distance between digests.
How niche (or otherwise) a use case are those? Would you really use a multihash-encoded digest directly in a DHT, given the biasing that occurs in the first two bytes? I'm worried that if padding bytes are introduced it will lead to some vulnerability or error based on a false sense of security. I would very much prefer that behaviour a (i.e. abort/signal an error) be the only conforming behaviour, or at least the strongly encouraged default behaviour. If padding is to be allowed by the standard it should be an explicit choice, such as passing a flag.
It's not uncommon. I'm not sure either way, yet. But I hear you re errors and not creating a misleading level of bit security; that's likely the right answer.
+1 to @moreati for returning an error when length > length_of_digest_algorithm, both when computing a Sum() (as the utility function is called in the go-multihash implementation) and when packing a digest into a multihash value. Padding should be handled by the application that requires comparison of hashes of different lengths, if indeed padding is what's needed. For instance, Kademlia implementations could decide to:
Another option is to actually produce that many bytes of hash. For example, the "111 byte (888 bit) sha1 hash of 'foobar'":

// assumes: import "crypto/sha1"
sum := func(s string) []byte { h := sha1.Sum([]byte(s)); return h[:] }

x := "foobar"
buf := make([]byte, 111) // 888 bits.
copy(buf[0:20], sum(x))        // 160 bits of sha1(x)
copy(buf[20:40], sum(x+"1"))   // 160 bits of sha1(x + "1")
copy(buf[40:60], sum(x+"2"))   // 160 bits of sha1(x + "2")
copy(buf[60:80], sum(x+"3"))   // 160 bits of sha1(x + "3")
copy(buf[80:100], sum(x+"4"))  // 160 bits of sha1(x + "4")
copy(buf[100:111], sum(x+"5")) // first 88 bits of sha1(x + "5")
@jbenet But you can only do that when you have the blob (in this case, "foobar"), so it's useless, for instance, for doing Kademlia XOR comparisons when looking for the content corresponding to a sha1 multihash (your original example, if I understand correctly). That's what I mean when I say that "padding (of a multihash) should be done by the application (that receives the multihash and has to operate on multihashes only, in absence of the corresponding blob)". Sorry for not being clearer earlier.
Well, usually one ends up with a hash by finding it: it was always generated by someone with the data. And most times the hash used for DHTs and so on is a hash of some data structure (the dag, not "the file"), which needs to be found as well. I'm not sure that use case precludes the thing I mentioned. But I'm not sure the thing I mentioned is useful in any case.
I feel like a fool for arguing with you about this. I'm going to put a pin in it as "maybe I need to understand it better".
Not at all, all of this is useful discussion.
cc @eternaleye for feedback |
Honestly, if you need variable-length output longer than the hash, you fundamentally don't want a hash. You want a KDF. @jbenet, your Nov 19 post was basically an attempt at constructing a KDF from scratch, which I strongly recommend against.

Once that's recognized, the sensible options for multihash collapse down to HKDF, the HMAC-based Key Derivation Function. It is efficient (does only one pass over the IKM), supports both "context" and "salt" parameters, can produce arbitrary-length output, and can be instantiated with any hash.

In addition, I consider it essential that multihash include the target length in the input to the KDF - the reason for this is that, without doing so, an attacker can silently truncate a multihash in order to permit incorrect data to pass undetected. Even so, a strict minimum hash length is necessary in order to prevent an attacker from cutting it really short and then just bruteforcing. Honestly, anything that is part of the multihash syntax (and thus out-of-band) should be included in the calculation (in HKDF, as part of the context) to avoid truncation attacks, algorithm substitution attacks, etc.

Another benefit of using HKDF: it uses a staged API, in which one calls two functions: Extract, which condenses the input keying material into a fixed-size pseudorandom key (PRK), and Expand, which stretches that PRK to the desired output length. This allows the expensive part (hashing the data) to be done once, and the cheap part (generating relatively short outputs) to reuse the result of that.
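A minimal sketch of that staged shape in Go, using golang.org/x/crypto/hkdf; the salt and context strings here are illustrative assumptions, not anything the thread settled on:

// assumes: import ("crypto/sha256"; "fmt"; "io"; "golang.org/x/crypto/hkdf")
data := []byte("foobar")

// Expensive stage: one pass over the input, yielding a fixed-size PRK.
prk := hkdf.Extract(sha256.New, data, []byte("example-salt"))

// Cheap stage: expand the PRK to an arbitrary length. Binding the target
// length into the context means a truncated output no longer verifies.
const outLen = 111
info := fmt.Sprintf("multihash-example|len=%d", outLen)
out := make([]byte, outLen)
if _, err := io.ReadFull(hkdf.Expand(sha256.New, prk, []byte(info)), out); err != nil {
	panic(err)
}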
@eternaleye is there a way to do what you described with the following constraints:
Sounds to me like no, given that you think we should add lengths to the input to avoid silent truncation. BTW, silent truncation still has to run up against the expected length value. If the attacker can both modify the expected length value and truncate the digest, then they could also modify the hash digest too, so I'm not sure what's won. (In practice there may be attacks that depend on being able to modify only one byte, i.e. make the len tiny, instead of many, i.e. replace the whole digest with an attacker-chosen one.)
Maybe TRTTD (the right thing to do) here is to define a different function, "Multihash HKDF", and give it a multihash code and all: given some other secure multihash function, it constructs an HKDF. The function can embed the multihash code of the sub-function it's using in its digest. For example:
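Purely as a hypothetical illustration of that idea (the 0x40 code and the exact field layout are made up, not assigned multihash values):

40 70 11 <111 bytes of HKDF output instantiated with sha1>

i.e. a "multihash-hkdf" code, a length of 0x70 = 112 bytes, then a digest whose first byte is the embedded sub-function code (0x11, sha1) followed by the derived bytes.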
As for the original question, "how should a digest_size > hash function's output be handled?", it looks like an error for now. |
I would agree with that assessment. However, I would not recommend using the expanded output as the multihash value. Specifically, I'd suggest the multihash value be the PRK, with the information of the underlying multihash passed in the salt. Then, use cases which require a specific length use HKDF-Expand on that PRK.
Currently the "hash function's output length" is not explicitly known, so implementations can't do anything about this case (see #114 for a fix). Given this information I'd say
How should a multihash implementation handle a call where the digest is longer than the specified output of that hash function? E.g. (pseudocode)
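A hypothetical sketch of the kind of call in question, assuming go-multihash's Sum() utility mentioned above (names and signature per that package; the specific values are illustrative):

// assuming the go-multihash package is imported as multihash
// SHA-1 produces a 20-byte digest, but 26 bytes are requested here.
mh, err := multihash.Sum([]byte("foobar"), multihash.SHA1, 26)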
Conversely, is the following multihash valid? (hex encoded, spaces added for legibility)
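Given the description that follows, such a value would have this shape (the 26 digest bytes are shown as xx placeholders, not real values):

11 1a xx xx xx xx xx xx xx xx xx xx xx xx xx xx xx xx xx xx xx xx xx xx xx xx xx xx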
The code field declares the hash function is SHA-1 (0x11). The length field declares the digest is 26 bytes long, and the received digest field is 26 bytes long. However, SHA-1 can't produce a 26-byte digest.
When a multihash library is called to encode/decode such an input it could:
a. Signal an error, and stop encoding/decoding?
b. Set the length field to len(digest), then continue with processing?
c. Truncate the digest field, before continuing with processing?
The behaviour is currently unspecified, i.e. implementations can do whatever they want. AFAICT go-multihash does b. I'd like to propose a. as a required behaviour for standards-conforming implementations; a minimal sketch of that check follows below.
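A minimal sketch of behaviour a in Go; the table and the validate function are illustrative, not go-multihash's actual API:

// Illustrative table: multihash code -> digest size in bytes.
var digestSizes = map[byte]int{
	0x11: 20, // sha1
	0x12: 32, // sha2-256
	0x13: 64, // sha2-512
}

// validate implements behaviour a: reject any multihash whose declared
// length exceeds what the named hash function can produce.
func validate(code byte, length int, digest []byte) error {
	max, ok := digestSizes[code]
	if !ok {
		return fmt.Errorf("unknown multihash code 0x%x", code)
	}
	if length > max {
		return fmt.Errorf("declared length %d exceeds %d-byte output of code 0x%x", length, max, code)
	}
	if len(digest) != length {
		return fmt.Errorf("digest is %d bytes, declared length is %d", len(digest), length)
	}
	return nil
}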
What do you think? Have I missed a specification document? Would you prefer I sent this to a mailing list (e.g. ipfs-users)?