Add document-retrieval to Hub as a task #1097
@pcuenca can you leave a review? 👀💗
Isn't this already very close to image-feature-extraction? How many models would be covered by this? (I'm wondering if it isn't too specific.)
(No strong opinion, though.)
@@ -676,6 +676,11 @@ export const PIPELINE_DATA = {
    color: "red",
    hideInDatasets: true,
  },
  "document-retrieval": {
Sounds a bit NLP-y to me. Could we use something like "visual-document-retrieval", for symmetry with "visual-question-answering"?
I have the same question. Would, for example, all the ColPali models be included here?
We'd also have to tag multiple models before merging, as usual.
@pcuenca I wanted to wait for consensus on the name before opening PRs on those models. We can go with visual-document-retrieval, yes. @julien-c It's actually not: image-feature-extraction models are single image backbones used to train traditional vision models. These models, in contrast, are zero-shot models built on VLMs to do document retrieval in multimodal RAG pipelines. They're a bit like CLIP, but for documents, and they're not used for classification (they have a long context length and fine-grained image understanding). The number of models keeps increasing as the number of VLMs increases, and they're all wrongly tagged, hence this PR (there's ColPali, ColQwen, ColSmolVLM, DSE models, and more now). Here's a TL;DR explainer on how they're used: https://x.com/mervenoyann/status/1831409380040044762?s=46
ok, sounds good
Maybe image-document-retrieval, so that it's more in sync with the current vision- and audio-related task names?
  "document-retrieval": {
    name: "Document Retrieval",
    modality: "multimodal",
    color: "yellow",
-    color: "yellow",
+    color: "yellow",
+    hideInDatasets: true,
Don't think there are many datasets related to this task?
There's a whole benchmark called ViDoRe, plus more datasets outside of it, and similar datasets exist in multiple languages, so I would expect more datasets: https://huggingface.co/spaces/vidore/vidore-leaderboard
For instance, these are what I found for Turkish.
I'm not in favor of image-document-retrieval: it sounds a bit like images are involved separately, and it sounds off, imo. We actually have a similarly named task, visual-question-answering, which is distinguished from textual QA.
This PR adds document retrieval to the Hub as a task for the following models, which are heavily used and now numerous:
Icon looks like this:
Other names I thought of:
This name is very much to the point, so it's best if it stays as is.
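For reference, here is a minimal sketch of what the new entry could look like if the visual-document-retrieval name proposed in the review is adopted. The `PipelineEntrySketch` interface, the `PIPELINE_DATA_SKETCH` constant, and the `isHiddenInDatasets` helper are hypothetical stand-ins for illustration, not the repo's actual types. Only the field names and values shown in the diff (`name`, `modality`, `color`, `hideInDatasets`) come from this PR, and the display name "Visual Document Retrieval" is an assumption.

```typescript
// Hypothetical, trimmed-down shape of a pipeline entry.
// The real type in the repo may have more (or stricter) fields.
interface PipelineEntrySketch {
  name: string;
  modality: string;
  color: string;
  hideInDatasets?: boolean;
}

// Stand-in for the PIPELINE_DATA map touched by this PR.
const PIPELINE_DATA_SKETCH: Record<string, PipelineEntrySketch> = {
  // Key follows the "visual-*" naming suggested in review (assumption: not final).
  "visual-document-retrieval": {
    name: "Visual Document Retrieval", // assumed display name
    modality: "multimodal",
    color: "yellow",
    // hideInDatasets is left unset here, since datasets such as the
    // ViDoRe benchmark were mentioned for this task.
  },
};

// Hypothetical helper: treat a missing flag as "shown in datasets".
function isHiddenInDatasets(entry: PipelineEntrySketch): boolean {
  return entry.hideInDatasets ?? false;
}
```

Leaving `hideInDatasets` unset reflects the reply in the thread that datasets for this task already exist; setting it to `true` would match the review suggestion instead.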