Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add document-retrieval to Hub as a task #1097

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

merveenoyan
Copy link
Contributor

@merveenoyan merveenoyan commented Jan 13, 2025

This PR adds document retrieval to Hub for following models that are used heavily and now there's a lot of them:

Icon looks like this:
Screenshot 2025-01-13 at 11 10 29

Other names I thought of:

  • multimodal-feature-extraction (ambiguous, also overlaps with zero shot image classification)
  • document-feature-extraction (sounds like plain document embeddings for a document backbone, best if this doesn't happen and those models are covered under image-feature-extraction instead)
  • zero-shot-document-classification (too long, sounds odd but would be accurate, but we use these models often for retrieval and not this purpose)
    this name is very to the point so best if it stays like it

@merveenoyan
Copy link
Contributor Author

@NielsRogge

@merveenoyan
Copy link
Contributor Author

@pcuenca can you leave a review? 👀💗

Copy link
Member

@julien-c julien-c left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it not already very close to image-feature-extraction? How many models would be covered by this? (i'm wondering if it isn't too specific)

@julien-c
Copy link
Member

(no strong opinion though)

@@ -676,6 +676,11 @@ export const PIPELINE_DATA = {
color: "red",
hideInDatasets: true,
},
"document-retrieval": {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds a bit NLP-y to me. Could we use something like "visual-document-retrieval", for symmetry with "visual-question-answering"?

@pcuenca
Copy link
Member

pcuenca commented Jan 17, 2025

How many models would be covered by this? (i'm wondering if it isn't too specific)

I have the same question. Would, for example, all the ColPali models be included here?

@pcuenca
Copy link
Member

pcuenca commented Jan 17, 2025

We'd also have to tag multiple models before merging, as usual.

@merveenoyan
Copy link
Contributor Author

merveenoyan commented Jan 18, 2025

@pcuenca I wanted to wait for the naming consensus before opening PRs to them. we can do visual-document-retrieval yes.

@julien-c it's actually not. those are singular image backbones used to train traditional vision models. these models on the contrast are zero shot models built on VLMs to do document retrieval on multimodal RAG pipelines. they're a bit like CLIP, but for documents, and they're not used for classification (they have long context length and have fine grained image understanding). the number of models keep increasing as number of VLMs increase and they're all wrongly tagged hence this PR. (there's ColPali, ColQwen, ColSmolVLM, DSE models and more now)

here's an tldr explainer on how they're used https://x.com/mervenoyann/status/1831409380040044762?s=46

@julien-c
Copy link
Member

ok, sounds good

Copy link
Member

@Vaibhavs10 Vaibhavs10 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe image-document-retrieval so that it's more in sync with the current task vision/ audio related tasks names?

"document-retrieval": {
name: "Document Retrieval",
modality: "multimodal",
color: "yellow",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
color: "yellow",
color: "yellow",
hideInDatasets: true,

Don't think there are many datasets related to this task?

Copy link
Contributor Author

@merveenoyan merveenoyan Jan 22, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there's a whole benchmark called ViDoRe and more datasets outside of this benchmark, and similar datasets exist in multiple languages so I would expect more datasets -- https://huggingface.co/spaces/vidore/vidore-leaderboard

for instance these are what I found for Turkish

Copy link
Contributor Author

@merveenoyan merveenoyan Jan 22, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not in favor of image-document-retrieval, it sounds a bit like images are involved separately and sounds off imo. we have a similar task actually, visual-question-answering which separates from textual QA

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants