Skip to content

Add Omni-Fuse Tutorial#2069

Open
hk1510 wants to merge 12 commits into
NVIDIA-NeMo:mainfrom
hk1510:main
Open

Add Omni-Fuse Tutorial#2069
hk1510 wants to merge 12 commits into
NVIDIA-NeMo:mainfrom
hk1510:main

Conversation

@hk1510

@hk1510 hk1510 commented Jun 11, 2026

Copy link
Copy Markdown

Description

This PR adds a tutorial for Omni-Fuse. It follows a similar structure to existing tutorials and builds a curator pipeline to perform retrieval-based data curation for multimodal datasets.

Usage

Instructions to run the tutorial can be found in the README.md file. Open to feedback/adding additional instructions.

Checklist

  • I am familiar with the Contributing Guide.
  • New or Existing tests cover these changes.
  • The documentation is up to date with these changes.

hk1510 added 2 commits June 11, 2026 19:02
Signed-off-by: Harshil Kotamreddy <hkotamreddy@nvidia.com>
…modal_hybrid_real.yaml

Signed-off-by: Harshil Kotamreddy <hkotamreddy@nvidia.com>
@hk1510 hk1510 requested review from a team as code owners June 11, 2026 19:11
@hk1510 hk1510 requested review from abhinavg4 and removed request for a team June 11, 2026 19:11
@copy-pr-bot

copy-pr-bot Bot commented Jun 11, 2026

Copy link
Copy Markdown

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@arhamm1 arhamm1 requested review from suiyoubi and removed request for abhinavg4 June 11, 2026 19:13
@greptile-apps

greptile-apps Bot commented Jun 11, 2026

Copy link
Copy Markdown
Contributor

Greptile Summary

This PR adds a full Omni-Fuse multimodal data curation tutorial under tutorials/multimodal/omni-fuse-data-curation/. The tutorial implements a NeMo Curator pipeline covering SNS (Symmetric Nucleus Subsampling), EEE (Expert Embedding Engine), projection training, and query-ranked datablend export, with a hybrid API-first / local-model execution model.

  • All issues flagged in previous rounds have been resolved: task_id handling in make_document_batch now uses inspect.signature to conditionally pass the parameter; datablend_size metadata and records=selected are now consistent; image/video files have an NVCF_ASSET_UPLOAD_THRESHOLD_BYTES guard; the duplicate spacy dependency and incorrect uv sync directory in the README are fixed.
  • A minor residual gap remains: the input_audio branch in _describe_chat_completion_file reads the entire file inline without a size guard, which only affects users who configure a custom nvidia_audio_describer_model outside the default AUDIO_URL_CHAT_MODELS set.
  • The make_empty_task fallback via copy.deepcopy(EmptyTask) would silently return a class object rather than an instance if EmptyTask were somehow not callable — unreachable in practice but worth hardening.

Confidence Score: 5/5

The tutorial is safe to merge. All blocking issues from earlier review rounds have been fixed, and the remaining findings are non-default edge cases that do not affect the happy path.

The core pipeline flow — reader → SNS → EEE → projection → datablend — is logically sound end-to-end. The task_id, datablend_size/records, audio extension set, and README setup instructions have all been corrected. The two open observations (audio size guard for custom models, deepcopy fallback) only surface in non-default configurations or are practically unreachable, and neither corrupts data or blocks normal execution.

omnifuse_tutorial/eee/backends.py — the input_audio branch in _describe_chat_completion_file lacks a size guard for custom audio models. omnifuse_tutorial/compat/curator.py — the copy.deepcopy(EmptyTask) fallback is worth hardening.

Important Files Changed

Filename Overview
tutorials/multimodal/omni-fuse-data-curation/omnifuse_tutorial/compat/curator.py Compatibility shim around NeMo Curator APIs; uses inspect.signature to conditionally pass task_id, which handles the field(init=False) edge case; make_empty_task fallback with copy.deepcopy of a class is unreachable in practice.
tutorials/multimodal/omni-fuse-data-curation/omnifuse_tutorial/eee/backends.py EEE API backends; image/video now have NVCF asset-upload size guard; audio with models outside AUDIO_URL_CHAT_MODELS still reads entire file without size check in _describe_chat_completion_file.
tutorials/multimodal/omni-fuse-data-curation/utils.py Shared pipeline helpers; load/resume helpers pass task_id correctly to make_document_batch; effective_sns_backend proxy logic is consistent with sns/backends.py _sns_backend_name auto resolution.
tutorials/multimodal/omni-fuse-data-curation/omnifuse_tutorial/stages/datablend.py Datablend ranking stage; previously fixed mismatch (records=selected, datablend_size=len(selected)) is now consistent.
tutorials/multimodal/omni-fuse-data-curation/omnifuse_tutorial/projection/trainer.py Projection training with both linear (numpy) and torch MLP paths; _recall_at_k is O(N²) which is acceptable for a tutorial; torch.manual_seed(0) sets global seed but is fine in this context.
tutorials/multimodal/omni-fuse-data-curation/omnifuse_tutorial/data/loader.py Paired-pool loader with CSV/JSONL support; redundant raw_path.exists() check on line 81 (file existence already guaranteed by line 79) but harmless.
tutorials/multimodal/omni-fuse-data-curation/pyproject.toml Project manifest; duplicate spacy constraint and README setup directory issues from earlier review have been resolved.
tutorials/multimodal/omni-fuse-data-curation/omnifuse_tutorial/stages/eee.py EEE embedding stage; interleaves raw and annotation embeddings per expert correctly; EmbeddingBundle slicing methods (0::2, 1::2) are consistent with this ordering.
tutorials/multimodal/omni-fuse-data-curation/omnifuse_tutorial/stages/reader.py Pair manifest reader stage; correctly passes task_id and experiment_id as metadata; EmptyTask type annotation is preserved but handled via duck typing in compat shim.
tutorials/multimodal/omni-fuse-data-curation/omnifuse_tutorial/datablend/ranker.py Query-based ranking using cosine similarity; zip(records, projection.projected_raw) silently truncates if lengths differ, but lengths are guaranteed equal by the pipeline flow.

Sequence Diagram

%%{init: {'theme': 'neutral'}}%%
sequenceDiagram
    participant Script as Entry Script
    participant Utils as utils.py
    participant Reader as PairManifestReaderStage
    participant SNS as SNSStage
    participant EEE as EEEEmbeddingStage
    participant Proj as ProjectionTrainingStage
    participant DB as DatablendRankingStage
    participant Compat as compat/curator.py

    Script->>Utils: load_tutorial_config()
    Script->>Utils: run_reader() / load_sns_task()
    Utils->>Compat: make_empty_task()
    Compat-->>Utils: EmptyTask
    Utils->>Reader: process(EmptyTask)
    Reader->>Compat: make_document_batch(task_id, records)
    Compat-->>Reader: DocumentBatch
    Reader-->>Utils: DocumentBatch (with pair records)

    Utils->>SNS: process(DocumentBatch)
    SNS->>Compat: records_from_task(task)
    SNS->>Compat: make_document_batch(task_id, sns_records, metadata)
    Compat-->>SNS: DocumentBatch
    SNS-->>Utils: DocumentBatch (with SNS annotations)

    Utils->>EEE: process(DocumentBatch)
    EEE->>Compat: records_from_task(task)
    Note over EEE: embed_raw + embed_annotation per expert, writes interleaved .npy files
    EEE->>Compat: "make_document_batch(task_id, records, {embedding_bundle})"
    Compat-->>EEE: DocumentBatch
    EEE-->>Utils: DocumentBatch (with EmbeddingBundle in metadata)

    Utils->>Proj: process(DocumentBatch)
    Proj->>Proj: train_and_project(bundle)
    Note over Proj: Linear or Torch MLP projection, writes model.json, embeddings.npy
    Proj->>Compat: "make_document_batch(task_id, records, {projection_result})"
    Compat-->>Proj: DocumentBatch
    Proj-->>Utils: DocumentBatch (with ProjectionResult in metadata)

    Utils->>DB: process(DocumentBatch)
    DB->>DB: DatablendRanker.rank(records, projection)
    DB->>DB: select_top(ranked)
    Note over DB: writes datablend_ranked.jsonl, datablend_topk.jsonl
    DB->>Compat: make_document_batch(task_id, selected)
    Compat-->>DB: DocumentBatch
    DB-->>Utils: DocumentBatch (ranked + selected records)
    Utils-->>Script: final metadata dict
Loading
%%{init: {'theme': 'base', 'themeVariables': {"darkMode": true, "background": "#0d1117", "primaryColor": "#21262d", "primaryTextColor": "#e6edf3", "primaryBorderColor": "#8b949e", "lineColor": "#8b949e", "textColor": "#e6edf3", "edgeLabelBackground": "#161b22", "actorBkg": "#21262d", "actorBorder": "#8b949e", "actorTextColor": "#e6edf3", "actorLineColor": "#8b949e", "signalColor": "#8b949e", "signalTextColor": "#e6edf3", "noteBkgColor": "#373320", "noteBorderColor": "#d4a72c", "noteTextColor": "#f0e6c0", "labelBoxBkgColor": "#21262d", "labelBoxBorderColor": "#8b949e", "labelTextColor": "#e6edf3", "loopTextColor": "#e6edf3", "activationBkgColor": "#30363d", "activationBorderColor": "#8b949e"}}}%%
sequenceDiagram
    participant Script as Entry Script
    participant Utils as utils.py
    participant Reader as PairManifestReaderStage
    participant SNS as SNSStage
    participant EEE as EEEEmbeddingStage
    participant Proj as ProjectionTrainingStage
    participant DB as DatablendRankingStage
    participant Compat as compat/curator.py

    Script->>Utils: load_tutorial_config()
    Script->>Utils: run_reader() / load_sns_task()
    Utils->>Compat: make_empty_task()
    Compat-->>Utils: EmptyTask
    Utils->>Reader: process(EmptyTask)
    Reader->>Compat: make_document_batch(task_id, records)
    Compat-->>Reader: DocumentBatch
    Reader-->>Utils: DocumentBatch (with pair records)

    Utils->>SNS: process(DocumentBatch)
    SNS->>Compat: records_from_task(task)
    SNS->>Compat: make_document_batch(task_id, sns_records, metadata)
    Compat-->>SNS: DocumentBatch
    SNS-->>Utils: DocumentBatch (with SNS annotations)

    Utils->>EEE: process(DocumentBatch)
    EEE->>Compat: records_from_task(task)
    Note over EEE: embed_raw + embed_annotation per expert, writes interleaved .npy files
    EEE->>Compat: "make_document_batch(task_id, records, {embedding_bundle})"
    Compat-->>EEE: DocumentBatch
    EEE-->>Utils: DocumentBatch (with EmbeddingBundle in metadata)

    Utils->>Proj: process(DocumentBatch)
    Proj->>Proj: train_and_project(bundle)
    Note over Proj: Linear or Torch MLP projection, writes model.json, embeddings.npy
    Proj->>Compat: "make_document_batch(task_id, records, {projection_result})"
    Compat-->>Proj: DocumentBatch
    Proj-->>Utils: DocumentBatch (with ProjectionResult in metadata)

    Utils->>DB: process(DocumentBatch)
    DB->>DB: DatablendRanker.rank(records, projection)
    DB->>DB: select_top(ranked)
    Note over DB: writes datablend_ranked.jsonl, datablend_topk.jsonl
    DB->>Compat: make_document_batch(task_id, selected)
    Compat-->>DB: DocumentBatch
    DB-->>Utils: DocumentBatch (ranked + selected records)
    Utils-->>Script: final metadata dict
Loading

Reviews (9): Last reviewed commit: "nvcf upload for large videos" | Re-trigger Greptile

Comment on lines +1 to +69
# Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

"""Small compatibility layer around NeMo Curator APIs."""

from __future__ import annotations

import copy
from typing import Any

import pandas as pd
from nemo_curator.pipeline import Pipeline as CuratorPipeline
from nemo_curator.tasks import DocumentBatch, EmptyTask


def records_from_task(task: Any) -> list[dict[str, Any]]:
"""Return task data as records from a Curator task."""

data = task.data
if hasattr(data, "to_dict"):
try:
return [dict(row) for row in data.to_dict(orient="records")]
except TypeError:
return [dict(row) for row in data.to_dict("records")]
if isinstance(data, list):
return [dict(row) for row in data]
raise TypeError(f"Unsupported task data type: {type(data)!r}")


def make_document_batch(
task_id: str,
dataset_name: str,
records: list[dict[str, Any]],
metadata: dict[str, Any] | None = None,
stage_perf: list[Any] | None = None,
) -> DocumentBatch:
"""Construct a NeMo Curator DocumentBatch."""

return DocumentBatch(
task_id=task_id,
dataset_name=dataset_name,
data=pd.DataFrame.from_records(records),
_metadata=metadata or {},
_stage_perf=stage_perf or [],
)


def make_empty_task() -> EmptyTask:
if callable(EmptyTask):
try:
return EmptyTask(task_id="empty", dataset_name="omnifuse", data=None)
except TypeError:
return EmptyTask()
return copy.deepcopy(EmptyTask)


def make_curator_pipeline(name: str, stages: list[Any], description: str | None = None) -> CuratorPipeline:
return CuratorPipeline(name=name, description=description, stages=stages)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P0 Missing ProcessingStage, Resources, and re-export of EmptyTask

Every stage file (stages/eee.py, stages/sns.py, stages/projection.py, stages/datablend.py, stages/reader.py) imports ProcessingStage and Resources from this module. Neither is defined here or re-exported from it. stages/reader.py also imports EmptyTask directly. All five stage modules will fail immediately with ImportError when loaded, making the entire pipeline non-runnable.

These classes need to be either imported from nemo_curator and re-exported, or implemented as shim classes within this compat module.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed imports in b8fc71b


from omnifuse_tutorial.compat.curator import make_curator_pipeline, make_empty_task
from omnifuse_tutorial.config.models import ExperimentConfig
from omnifuse_tutorial.data.io import write_json

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P0 Missing omnifuse_tutorial/data/ package

omnifuse_tutorial/data/io.py and omnifuse_tutorial/data/loader.py are imported by at least eight files — pipeline.py, utils.py, stages/eee.py, stages/projection.py, stages/datablend.py, stages/reader.py, sns/backends.py, and projection/trainer.py — but the entire omnifuse_tutorial/data/ directory does not exist in the repository. Every entry point to the tutorial will fail at import time with ModuleNotFoundError: No module named 'omnifuse_tutorial.data'.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added in recent commit. Was .gitignored previously.

Comment on lines +492 to +498
def _audio_format(path: Path) -> str:
audio_format = path.suffix.lower().lstrip(".")
if audio_format == "mpeg":
audio_format = "mp3"
if audio_format not in {"wav", "mp3"}:
raise ValueError(f"NVIDIA audio descriptions support wav/mp3 only, got {path.suffix} for {path}")
return audio_format

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 _audio_format raises for extensions declared as supported

AUDIO_EXTENSIONS and MIME_TYPES both include .flac, .m4a, .aac, and .ogg, advertising them as valid inputs. But _audio_format raises ValueError for any extension other than .wav and .mp3. The API description path in NvidiaApiEEEBackend._describe_raw_describe_filedescribe_file_with_nvidia_api calls this function unconditionally for audio records, so a user who provides a .flac or .m4a file will receive a confusing ValueError rather than a clear unsupported-format message.

Reconcile by either removing the unsupported extensions from AUDIO_EXTENSIONS/MIME_TYPES, converting to a supported format before sending, or raising a clear error with guidance early in the audio code path.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in b8fc71b

"pandas>=2.0",
"pyyaml>=6.0",
"requests>=2.31",
"spacy>=3.8.14",

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Duplicate spacy constraint — the same package appears twice with conflicting lower bounds (>=3.8.14 on line 17 and >=3.8,<4 on line 35). Keep a single, unambiguous entry.

Suggested change
"spacy>=3.8.14",
"spacy>=3.8.14,<4",

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 367d7e9

"qwen-vl-utils>=0.0.8",
"sentence-transformers>=2.2,<4",
"soundfile>=0.12",
"spacy>=3.8,<4",

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Remove the second duplicate spacy entry.

Suggested change
"spacy>=3.8,<4",
# spacy consolidated above as spacy>=3.8.14,<4

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 367d7e9

path = Path(value)
except OSError:
return None
return path if path.exists() else path

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 _path_or_none always returns a Path, never Nonereturn path if path.exists() else path is a tautology; both branches return the same value. The | None return type annotation is misleading. Either return None when the path does not exist, or simplify to return path.

Suggested change
return path if path.exists() else path
return path

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 367d7e9

Comment on lines +20 to +22

```bash
uv sync --extra dev

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Incorrect setup directory in README

The instruction says to run uv sync --extra dev "from the Curator repository root", but uv sync reads pyproject.toml from the current directory. No uv workspace configuration ties this tutorial to the repository root, so running uv sync from the repo root installs the main Curator package rather than this tutorial's dependencies. Users should cd tutorials/multimodal/omni-fuse-data-curation/ first.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 367d7e9

hk1510 and others added 4 commits June 11, 2026 20:34
Signed-off-by: Harshil Kotamreddy <hkotamreddy@nvidia.com>
Signed-off-by: Harshil Kotamreddy <hkotamreddy@nvidia.com>
Signed-off-by: Harshil Kotamreddy <hkotamreddy@nvidia.com>
Comment on lines +50 to +56
return DocumentBatch(
task_id=task_id,
dataset_name=dataset_name,
data=pd.DataFrame.from_records(records),
_metadata=metadata or {},
_stage_perf=stage_perf or [],
)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 task_id is declared as field(init=False, default="") in the Task base class, so it is intentionally excluded from DocumentBatch.__init__. Passing task_id=task_id here raises TypeError: __init__() got an unexpected keyword argument 'task_id' at runtime. Because every stage calls make_document_batch, the entire pipeline fails before any data is processed. The task_id is managed by the framework (set via _set_task_id() at each stage boundary) and must not be supplied to the constructor.

Suggested change
return DocumentBatch(
task_id=task_id,
dataset_name=dataset_name,
data=pd.DataFrame.from_records(records),
_metadata=metadata or {},
_stage_perf=stage_perf or [],
)
return DocumentBatch(
dataset_name=dataset_name,
data=pd.DataFrame.from_records(records),
_metadata=metadata or {},
_stage_perf=stage_perf or [],
)

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 21b8091

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reverted 21b8091 as DocumentBatch requires task_id

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right to revert. My original comment was incorrect — task_id is defined as field(init=False, default="") on the Task base class, which means Python's @dataclass machinery excludes it from __init__ entirely. Passing task_id=task_id will indeed raise TypeError: __init__() got an unexpected keyword argument 'task_id' at runtime.

The current code in the PR still has this bug. The fix from commit 21b8091 (which you reverted) was correct — task_id should be omitted from the DocumentBatch(...) constructor call and set afterwards via _set_task_id() if needed, or simply left to the framework to assign.

Here's the correct version:

Suggested change
return DocumentBatch(
task_id=task_id,
dataset_name=dataset_name,
data=pd.DataFrame.from_records(records),
_metadata=metadata or {},
_stage_perf=stage_perf or [],
)
return DocumentBatch(
dataset_name=dataset_name,
data=pd.DataFrame.from_records(records),
_metadata=metadata or {},
_stage_perf=stage_perf or [],
)

If the caller needs to set a specific task_id, it should be done after construction:

batch = make_document_batch(...)
batch._set_task_id("", task_id)  # or however it fits the pipeline DAG

But in most cases, the framework handles task_id assignment automatically — make_document_batch doesn't need the task_id parameter at all.

Tip: You can customize Greptile's behavior for this repo with .greptile/rules.md and .greptile/config.json.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in e5d3594

Signed-off-by: Harshil Kotamreddy <hkotamreddy@nvidia.com>
Comment on lines +115 to +124
task = run_curator_step(
name=f"{config.experiment_id}-2-embed",
stages=[EEEEmbeddingStage(config=config)],
initial_task=load_sns_task(config),
)
records_path = config.run_dir / "embeddings" / "records.jsonl"
write_jsonl(records_path, records_from_task(task))
metadata = dict(getattr(task, "_metadata", {}) or {})
metadata["embedding_records_path"] = str(records_path)
task._metadata = metadata

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 task_id is not a parameter of make_document_batch

load_sns_task, load_embedding_task, and load_projection_task all call make_document_batch(task_id=..., dataset_name=..., records=..., metadata=...), but the current make_document_batch signature in compat/curator.py is (dataset_name, records, metadata=None, stage_perf=None) — it has no task_id parameter. Every one of these three loaders will raise TypeError: make_document_batch() got an unexpected keyword argument 'task_id' as soon as it is called, breaking all three resume paths (2_embed.py, 3_project.py, 4_datablend.py). Remove the task_id= keyword from all three call sites.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reverted 21b8091 in 59225f5 so this should no longer be an issue.

DocumentBatch requires task_id

This reverts commit 21b8091.

Signed-off-by: Harshil Kotamreddy <hkotamreddy@nvidia.com>
hk1510 added 2 commits June 16, 2026 01:52
Signed-off-by: Harshil Kotamreddy <hkotamreddy@nvidia.com>
Signed-off-by: Harshil Kotamreddy <hkotamreddy@nvidia.com>
Comment on lines +57 to +68
selected = ranker.select_top(ranked)

output_dir = self.config.run_dir / "datablend"
ranked_path = write_jsonl(output_dir / "datablend_ranked.jsonl", ranked)
topk_path = write_jsonl(output_dir / "datablend_topk.jsonl", selected)
metadata.update(
{
"datablend_ranked_path": str(ranked_path),
"datablend_topk_path": str(topk_path),
"datablend_size": len(selected),
}
)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 datablend_size metadata contradicts the batch record count

make_document_batch is called with records=ranked (all N ranked records), but datablend_size in metadata is set to len(selected) (the top-k count). Any downstream code that consumes the output task via records_from_task(task) receives all N records, not datablend_size records — the two values will silently disagree. If the intent is for the curated output batch to contain only the selected subset, records=selected should be passed instead. If the intent is to carry all rankings forward, datablend_size should reflect len(ranked) (or be renamed to datablend_topk_size) so the contract is unambiguous.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in eabec7d

Signed-off-by: Harshil Kotamreddy <hkotamreddy@nvidia.com>
Comment on lines +267 to +292
def _describe_chat_completion_file(
path: Path,
model: str,
content_type: str,
prompt: str,
api_base_url: str,
headers: dict[str, str],
timeout: int,
) -> str:
encoded = base64.b64encode(path.read_bytes()).decode("utf-8")
mime = MIME_TYPES.get(path.suffix.lower(), "application/octet-stream")
if content_type == "input_audio":
media_content = {"type": "input_audio", "input_audio": {"data": encoded, "format": _audio_format(path)}}
else:
media_content = {content_type: {"url": f"data:{mime};base64,{encoded}"}, "type": content_type}
url = f"{api_base_url}/chat/completions"
payload = {
"model": model,
"messages": [{"role": "user", "content": [{"type": "text", "text": prompt}, media_content]}],
"max_tokens": 512,
"temperature": 0.2,
"stream": False,
}
response = _post_nvidia_json_with_retries(url, headers, payload, timeout)
response = _resolve_nvidia_response(response, api_base_url, headers, timeout, model, url)
return _response_text(response.json(), model, url)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Video/image files base64-encoded inline with no size guard

_describe_chat_completion_file reads the entire file with path.read_bytes() and embeds it as a base64 data-URL in the JSON body for both image_url and video_url content types. The audio code path has an explicit NVCF_ASSET_UPLOAD_THRESHOLD_BYTES = 180 * 1024 check and falls back to NVCF asset upload for large files. No equivalent guard exists here. A typical 10-second video file is easily 10–50 MB before encoding; base64 adds ~33 %, producing a JSON body the NVIDIA API will reject with HTTP 413 or a generic error — the same limit that drove the audio threshold. This silently breaks the video modality for any non-trivial input. Consider adding a size check and routing large image/video files through _upload_nvcf_asset (or raising an early ValueError with a clear message) just as audio already does.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 2b13f6f

Signed-off-by: Harshil Kotamreddy <hkotamreddy@nvidia.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant