Add 'jcs' and 'urdna2015' canonicalization values. #261
I only had a quick look at JCS and urdna2015. Do I understand it correctly |
That's it, exactly. Tagging |
Ok, on further conversation, it might be less confusing to people if this PR introduced a new tag (instead of overloading the use of |
In your original comment you mention hashlinking. Is the goal to use that multicodec code as part of a CID? I'm asking as I think this request poses an interesting question. If I think in terms of a CID, where we specify the encoding as well as the hash algorithm, the question is, should this be the encoding information or the hash algorithm information? To me a CID is self-describing on how to get from the bytes it points to, to some deserialized version of it and back. If the hash algorithm is always SHA-256, I can see two ways describing it:
In both cases you'd have all the information you need. |
@vmx, We would ideally like to design this in such a way that any hash algorithm from the multihash table could be used -- without having to create NxM combination codec values. So, we can express that some data was canonicalized with algorithm X ( |
@dlongley This means that |
Right, exactly. They're essentially a second parameter to the multihash (what pre-processing steps must be taken with the data before hashing).
JCS is always JSON. URDCA2015 is any sort of RDF-based linked data (which includes JSON, Turtle, RDF-XML, N-Quads, etc).
Right, so, this is the tricky part. I'd say the situation is closer to 1 -- the hash algorithm is "canonicalize things first, then do a SHA-256" hash. And the encoding (of the hash) is multibase. (I'm not sure it's necessary to specify the encoding of the pre-hash data, though. Since the hash is a one-way operation.) @vmx - would you be open to defining a new "canonized hash" tag? |
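For readers following along, the "canonicalize first, then SHA-256, then multibase" pipeline described above can be sketched roughly as follows. This is only an illustration under stated assumptions: `canonicalize_json` is a simplified stand-in for JCS (sorted keys, compact separators) and does not implement the full RFC 8785 rules for numbers or key ordering; the function names are hypothetical; base64url (multibase prefix `u`) is picked arbitrarily as the encoding.

```python
import base64
import hashlib
import json

SHA2_256_CODE = 0x12  # sha2-256 entry in the multihash table
SHA2_256_LEN = 0x20   # digest length in bytes (32)


def canonicalize_json(value) -> bytes:
    # ASSUMPTION: simplified stand-in for JCS. Real JCS (RFC 8785) also
    # fixes number serialization and sorts keys by UTF-16 code units.
    text = json.dumps(value, sort_keys=True, separators=(",", ":"),
                      ensure_ascii=False)
    return text.encode("utf-8")


def multihash_sha256(data: bytes) -> bytes:
    # multihash layout: <hash code><digest length><digest>
    digest = hashlib.sha256(data).digest()
    return bytes([SHA2_256_CODE, SHA2_256_LEN]) + digest


def multibase_base64url(data: bytes) -> str:
    # multibase prefix "u" means base64url without padding
    encoded = base64.urlsafe_b64encode(data).rstrip(b"=")
    return "u" + encoded.decode("ascii")


def canonical_digest(value) -> str:
    # Pre-processing (canonicalization) happens before hashing.
    canonical = canonicalize_json(value)
    return multibase_base64url(multihash_sha256(canonical))


if __name__ == "__main__":
    # Key order in the input does not affect the result.
    print(canonical_digest({"b": 1, "a": 2}))
    print(canonical_digest({"a": 2, "b": 1}))
```

Because the digest is computed over the canonical bytes, two inputs that differ only in key order produce the same string, which is the property the canonicalization step is meant to provide.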
Finally found time to look at this and give my 2c.
|
I'd like to check if I understood the current outcome correctly. The
This points to some data. Now I retrieve the data and I want to create a CID out of it. I would only know that I need to canonicalize the data before hashing, but I wouldn't know which hash algorithm to use. Is that correct? |
@dmitrizagidulin any changes to this you want to pursue so we can get this over the line in some form? |
Hi @rvagg, thanks for checking in. |
Force-pushed from 82f190f to 6e186dc.
Hi @rvagg -- after some discussion with @gobengo, I've updated the PR (and resolved merge conflicts) to hopefully address some of your concerns.
Totally understood wanting to make space -- I moved the JCS canonicalization entry to post-poseidon. We've also updated the tag for those two entries to re-use |
Force-pushed from 6e186dc to f2559ee.
Does that mean the existing implementations need to change? If not, why not? |
Hey @dlongley - no, no existing implementations need to change. The tag in the CSV file is conceptual / for organizing things into categories, it's not used in the code. |
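To illustrate why a tag change need not affect implementations, here is a minimal sketch of a hypothetical consumer of the table. It is not the code of any actual multicodec library; `load_codes` and the sample rows are assumptions made for illustration, with the rows copied from the diff in this PR. The tag column is parsed but never used for lookups.

```python
# Sample rows in the table's "name, tag, code, status, description" layout.
ROWS = """\
poseidon-bls12_381-a2-fc1, multihash, 0xb401, permanent, Poseidon using BLS12-381 and arity of 2 with Filecoin parameters
urdca-2015-canon, ipld, 0xb403, draft, The result of canonicalizing an input according to URDCA-2015 and then expressing its hash value as a multihash value.
"""


def load_codes(text: str) -> dict:
    """Map codec names to integer codes, ignoring the tag column."""
    codes = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        # Split into at most 5 fields so commas in the description survive.
        name, _tag, code, _status, _description = [
            field.strip() for field in line.split(",", 4)
        ]
        codes[name] = int(code, 16)
    return codes


if __name__ == "__main__":
    print(hex(load_codes(ROWS)["urdca-2015-canon"]))
```

A lookup by name or code gives the same answer whichever tag the row carries.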
yeah, OK, I think we can just merge these now, although I'll register two final comments:
- I'm still unsure if `ipld` is the right way to go, `serialization` might be better, we tend to use `ipld` for schemes that yield linked data .. maybe this does, maybe it's a scheme that yields a single link, but the canonicalisation is also something that we do more in `ipld` than generic `serialization` schemes so 🤷.
- The placement is pretty annoying, I'd really like to have reserved the `0xb4xx` block for poseidon*. I get that you've deployed this and that's certainly a strong consideration, but still pretty annoying. It's going to be an ugly duckling amongst additional poseidon entries.
Thanks for the merge @rvagg. To come back to If not, what if we introduced a new "transformed-multihash" namespace? It's not clear to me what constitutes a "namespace" vs. a "multiformat". |
@msporny the tags really don't matter that much so it's not worth getting too hung up about it - I imagine a future point where we refactor a bunch of the organisational stuff and they become more relevant at which point we take a more holistic view of what we have and do some adjustment. If something feels like it should be just "multiformat" then we should probably just invent a new tag for it - if you're making something that could be described in a new multiformat spec then make a tag as a new category. I'm not sure about "namespace", mostly I treat those as networking / libp2p related so usually not appropriate for hashing or encoding. I'd be happy for someone to come up with a new tag for this, but maybe something broad enough that can fit other things too? |
Can't believe I'm just seeing this now! Really glad that this has been put in place. IMO IPLD is absolutely something that we should look into here since we can use this as a component of IPLD based database systems at large. |
@@ -483,8 +483,10 @@ skein1024-1016, multihash, 0xb3df, draft,
skein1024-1024, multihash, 0xb3e0, draft,
poseidon-bls12_381-a2-fc1, multihash, 0xb401, permanent, Poseidon using BLS12-381 and arity of 2 with Filecoin parameters
poseidon-bls12_381-a2-fc1-sc, multihash, 0xb402, draft, Poseidon using BLS12-381 and arity of 2 with Filecoin parameters - high-security variant
urdca-2015-canon, ipld, 0xb403, draft, The result of canonicalizing an input according to URDCA-2015 and then expressing its hash value as a multihash value.
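One possible reading of the `urdca-2015-canon` row description is sketched below: the code `0xb403` is written as an unsigned varint, followed by an ordinary multihash of the canonicalized bytes. The byte layout, the function names, and the choice of sha2-256 are assumptions for illustration only. The canonicalization step itself is not implemented; the input is assumed to be already-canonical output.

```python
import hashlib

URDCA_2015_CANON = 0xB403  # draft code from the table row above
SHA2_256 = 0x12            # sha2-256 entry in the multihash table


def encode_varint(n: int) -> bytes:
    """Unsigned varint: 7 bits per byte, low bits first, MSB = 'more'."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def decode_varint(data: bytes):
    """Return (value, remaining bytes)."""
    value = 0
    shift = 0
    for i, byte in enumerate(data):
        value |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return value, data[i + 1:]
        shift += 7
    raise ValueError("truncated varint")


def tag_canonical_hash(canonical_bytes: bytes) -> bytes:
    # ASSUMPTION: canonical_bytes is already the canonicalized form
    # of the input; no RDF processing happens here.
    digest = hashlib.sha256(canonical_bytes).digest()
    multihash = encode_varint(SHA2_256) + encode_varint(len(digest)) + digest
    return encode_varint(URDCA_2015_CANON) + multihash


if __name__ == "__main__":
    tagged = tag_canonical_hash(b"<a> <b> <c> .\n")
    code, rest = decode_varint(tagged)
    hash_code, rest = decode_varint(rest)
    print(hex(code), hex(hash_code), len(rest) - 1)
```

Under this layout a reader can recover both the canonicalization code and the hash algorithm from the bytes alone, without needing one codec value per canonicalization and hash combination.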
Shouldn't this be `urdna-2015-canon` with an `n` not a `c`?
There is a debate raging over what we should call it. Traditionally, we used "n" to mean "normalization"... but it's generally accepted now that we should've said "canonicalization" since it's a more accurate description of what's happening. Thus, the "urdca" vs. "urdna" distinction. This is currently being discussed in the W3C RDF Dataset, Canonicalization, and Hashing Working Group (note that we didn't call it the "normalization" working group).
@msporny is there a uri for that issue or do I need to file one? I just earlier today noticed meetings are started and I need to get that on my cal.
Adds a new `canonhash` tag value that represents a combination canonicalization+hash operation (using RDF Dataset Canonicalization URDNA2015, soon to be renamed to URDCA2015).
Used for the hashlinking of Verifiable Credentials proposal to the W3C VC WG, in the implementation of `digestMultibase`.
`digestMultibase` example: