Add Omni-Fuse Tutorial by hk1510 · Pull Request #2069 · NVIDIA-NeMo/Curator

hk1510 · 2026-06-11T19:11:12Z

Description

This PR adds a tutorial for Omni-Fuse. It follows a similar structure to existing tutorials and builds a curator pipeline to perform retrieval-based data curation for multimodal datasets.

Usage

Instructions to run the tutorial can be found in the README.md file. Open to feedback/adding additional instructions.

Checklist

I am familiar with the Contributing Guide.
New or Existing tests cover these changes.
The documentation is up to date with these changes.

Signed-off-by: Harshil Kotamreddy <hkotamreddy@nvidia.com>

…modal_hybrid_real.yaml Signed-off-by: Harshil Kotamreddy <hkotamreddy@nvidia.com>

copy-pr-bot · 2026-06-11T19:11:16Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

greptile-apps · 2026-06-11T19:16:40Z

Greptile Summary

This PR adds a full Omni-Fuse multimodal data curation tutorial under tutorials/multimodal/omni-fuse-data-curation/. The tutorial implements a NeMo Curator pipeline covering SNS (Symmetric Nucleus Subsampling), EEE (Expert Embedding Engine), projection training, and query-ranked datablend export, with a hybrid API-first / local-model execution model.

All issues flagged in previous rounds have been resolved: task_id handling in make_document_batch now uses inspect.signature to conditionally pass the parameter; datablend_size metadata and records=selected are now consistent; image/video files have an NVCF_ASSET_UPLOAD_THRESHOLD_BYTES guard; the duplicate spacy dependency and incorrect uv sync directory in the README are fixed.
A minor residual gap remains: the input_audio branch in _describe_chat_completion_file reads the entire file inline without a size guard, which only affects users who configure a custom nvidia_audio_describer_model outside the default AUDIO_URL_CHAT_MODELS set.
The make_empty_task fallback via copy.deepcopy(EmptyTask) would silently return a class object rather than an instance if EmptyTask were somehow not callable — unreachable in practice but worth hardening.

Confidence Score: 5/5

The tutorial is safe to merge. All blocking issues from earlier review rounds have been fixed, and the remaining findings are non-default edge cases that do not affect the happy path.

The core pipeline flow — reader → SNS → EEE → projection → datablend — is logically sound end-to-end. The task_id, datablend_size/records, audio extension set, and README setup instructions have all been corrected. The two open observations (audio size guard for custom models, deepcopy fallback) only surface in non-default configurations or are practically unreachable, and neither corrupts data or blocks normal execution.

omnifuse_tutorial/eee/backends.py — the input_audio branch in _describe_chat_completion_file lacks a size guard for custom audio models. omnifuse_tutorial/compat/curator.py — the copy.deepcopy(EmptyTask) fallback is worth hardening.

Important Files Changed

Filename	Overview
tutorials/multimodal/omni-fuse-data-curation/omnifuse_tutorial/compat/curator.py	Compatibility shim around NeMo Curator APIs; uses inspect.signature to conditionally pass task_id, which handles the field(init=False) edge case; make_empty_task fallback with copy.deepcopy of a class is unreachable in practice.
tutorials/multimodal/omni-fuse-data-curation/omnifuse_tutorial/eee/backends.py	EEE API backends; image/video now have NVCF asset-upload size guard; audio with models outside AUDIO_URL_CHAT_MODELS still reads entire file without size check in _describe_chat_completion_file.
tutorials/multimodal/omni-fuse-data-curation/utils.py	Shared pipeline helpers; load/resume helpers pass task_id correctly to make_document_batch; effective_sns_backend proxy logic is consistent with sns/backends.py _sns_backend_name auto resolution.
tutorials/multimodal/omni-fuse-data-curation/omnifuse_tutorial/stages/datablend.py	Datablend ranking stage; previously fixed mismatch (records=selected, datablend_size=len(selected)) is now consistent.
tutorials/multimodal/omni-fuse-data-curation/omnifuse_tutorial/projection/trainer.py	Projection training with both linear (numpy) and torch MLP paths; _recall_at_k is O(N²) which is acceptable for a tutorial; torch.manual_seed(0) sets global seed but is fine in this context.
tutorials/multimodal/omni-fuse-data-curation/omnifuse_tutorial/data/loader.py	Paired-pool loader with CSV/JSONL support; redundant raw_path.exists() check on line 81 (file existence already guaranteed by line 79) but harmless.
tutorials/multimodal/omni-fuse-data-curation/pyproject.toml	Project manifest; duplicate spacy constraint and README setup directory issues from earlier review have been resolved.
tutorials/multimodal/omni-fuse-data-curation/omnifuse_tutorial/stages/eee.py	EEE embedding stage; interleaves raw and annotation embeddings per expert correctly; EmbeddingBundle slicing methods (0::2, 1::2) are consistent with this ordering.
tutorials/multimodal/omni-fuse-data-curation/omnifuse_tutorial/stages/reader.py	Pair manifest reader stage; correctly passes task_id and experiment_id as metadata; EmptyTask type annotation is preserved but handled via duck typing in compat shim.
tutorials/multimodal/omni-fuse-data-curation/omnifuse_tutorial/datablend/ranker.py	Query-based ranking using cosine similarity; zip(records, projection.projected_raw) silently truncates if lengths differ, but lengths are guaranteed equal by the pipeline flow.

Sequence Diagram

%%{init: {'theme': 'neutral'}}%%
sequenceDiagram
    participant Script as Entry Script
    participant Utils as utils.py
    participant Reader as PairManifestReaderStage
    participant SNS as SNSStage
    participant EEE as EEEEmbeddingStage
    participant Proj as ProjectionTrainingStage
    participant DB as DatablendRankingStage
    participant Compat as compat/curator.py

    Script->>Utils: load_tutorial_config()
    Script->>Utils: run_reader() / load_sns_task()
    Utils->>Compat: make_empty_task()
    Compat-->>Utils: EmptyTask
    Utils->>Reader: process(EmptyTask)
    Reader->>Compat: make_document_batch(task_id, records)
    Compat-->>Reader: DocumentBatch
    Reader-->>Utils: DocumentBatch (with pair records)

    Utils->>SNS: process(DocumentBatch)
    SNS->>Compat: records_from_task(task)
    SNS->>Compat: make_document_batch(task_id, sns_records, metadata)
    Compat-->>SNS: DocumentBatch
    SNS-->>Utils: DocumentBatch (with SNS annotations)

    Utils->>EEE: process(DocumentBatch)
    EEE->>Compat: records_from_task(task)
    Note over EEE: embed_raw + embed_annotation per expert, writes interleaved .npy files
    EEE->>Compat: "make_document_batch(task_id, records, {embedding_bundle})"
    Compat-->>EEE: DocumentBatch
    EEE-->>Utils: DocumentBatch (with EmbeddingBundle in metadata)

    Utils->>Proj: process(DocumentBatch)
    Proj->>Proj: train_and_project(bundle)
    Note over Proj: Linear or Torch MLP projection, writes model.json, embeddings.npy
    Proj->>Compat: "make_document_batch(task_id, records, {projection_result})"
    Compat-->>Proj: DocumentBatch
    Proj-->>Utils: DocumentBatch (with ProjectionResult in metadata)

    Utils->>DB: process(DocumentBatch)
    DB->>DB: DatablendRanker.rank(records, projection)
    DB->>DB: select_top(ranked)
    Note over DB: writes datablend_ranked.jsonl, datablend_topk.jsonl
    DB->>Compat: make_document_batch(task_id, selected)
    Compat-->>DB: DocumentBatch
    DB-->>Utils: DocumentBatch (ranked + selected records)
    Utils-->>Script: final metadata dict

%%{init: {'theme': 'base', 'themeVariables': {"darkMode": true, "background": "#0d1117", "primaryColor": "#21262d", "primaryTextColor": "#e6edf3", "primaryBorderColor": "#8b949e", "lineColor": "#8b949e", "textColor": "#e6edf3", "edgeLabelBackground": "#161b22", "actorBkg": "#21262d", "actorBorder": "#8b949e", "actorTextColor": "#e6edf3", "actorLineColor": "#8b949e", "signalColor": "#8b949e", "signalTextColor": "#e6edf3", "noteBkgColor": "#373320", "noteBorderColor": "#d4a72c", "noteTextColor": "#f0e6c0", "labelBoxBkgColor": "#21262d", "labelBoxBorderColor": "#8b949e", "labelTextColor": "#e6edf3", "loopTextColor": "#e6edf3", "activationBkgColor": "#30363d", "activationBorderColor": "#8b949e"}}}%%
sequenceDiagram
    participant Script as Entry Script
    participant Utils as utils.py
    participant Reader as PairManifestReaderStage
    participant SNS as SNSStage
    participant EEE as EEEEmbeddingStage
    participant Proj as ProjectionTrainingStage
    participant DB as DatablendRankingStage
    participant Compat as compat/curator.py

    Script->>Utils: load_tutorial_config()
    Script->>Utils: run_reader() / load_sns_task()
    Utils->>Compat: make_empty_task()
    Compat-->>Utils: EmptyTask
    Utils->>Reader: process(EmptyTask)
    Reader->>Compat: make_document_batch(task_id, records)
    Compat-->>Reader: DocumentBatch
    Reader-->>Utils: DocumentBatch (with pair records)

    Utils->>SNS: process(DocumentBatch)
    SNS->>Compat: records_from_task(task)
    SNS->>Compat: make_document_batch(task_id, sns_records, metadata)
    Compat-->>SNS: DocumentBatch
    SNS-->>Utils: DocumentBatch (with SNS annotations)

    Utils->>EEE: process(DocumentBatch)
    EEE->>Compat: records_from_task(task)
    Note over EEE: embed_raw + embed_annotation per expert, writes interleaved .npy files
    EEE->>Compat: "make_document_batch(task_id, records, {embedding_bundle})"
    Compat-->>EEE: DocumentBatch
    EEE-->>Utils: DocumentBatch (with EmbeddingBundle in metadata)

    Utils->>Proj: process(DocumentBatch)
    Proj->>Proj: train_and_project(bundle)
    Note over Proj: Linear or Torch MLP projection, writes model.json, embeddings.npy
    Proj->>Compat: "make_document_batch(task_id, records, {projection_result})"
    Compat-->>Proj: DocumentBatch
    Proj-->>Utils: DocumentBatch (with ProjectionResult in metadata)

    Utils->>DB: process(DocumentBatch)
    DB->>DB: DatablendRanker.rank(records, projection)
    DB->>DB: select_top(ranked)
    Note over DB: writes datablend_ranked.jsonl, datablend_topk.jsonl
    DB->>Compat: make_document_batch(task_id, selected)
    Compat-->>DB: DocumentBatch
    DB-->>Utils: DocumentBatch (ranked + selected records)
    Utils-->>Script: final metadata dict

_{Reviews (9): Last reviewed commit: "nvcf upload for large videos" | Re-trigger Greptile}

greptile-apps · 2026-06-11T19:16:44Z

+# Copyright (c) 2026, NVIDIA CORPORATION.  All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+"""Small compatibility layer around NeMo Curator APIs."""
+
+from __future__ import annotations
+
+import copy
+from typing import Any
+
+import pandas as pd
+from nemo_curator.pipeline import Pipeline as CuratorPipeline
+from nemo_curator.tasks import DocumentBatch, EmptyTask
+
+
+def records_from_task(task: Any) -> list[dict[str, Any]]:
+    """Return task data as records from a Curator task."""
+
+    data = task.data
+    if hasattr(data, "to_dict"):
+        try:
+            return [dict(row) for row in data.to_dict(orient="records")]
+        except TypeError:
+            return [dict(row) for row in data.to_dict("records")]
+    if isinstance(data, list):
+        return [dict(row) for row in data]
+    raise TypeError(f"Unsupported task data type: {type(data)!r}")
+
+
+def make_document_batch(
+    task_id: str,
+    dataset_name: str,
+    records: list[dict[str, Any]],
+    metadata: dict[str, Any] | None = None,
+    stage_perf: list[Any] | None = None,
+) -> DocumentBatch:
+    """Construct a NeMo Curator DocumentBatch."""
+
+    return DocumentBatch(
+        task_id=task_id,
+        dataset_name=dataset_name,
+        data=pd.DataFrame.from_records(records),
+        _metadata=metadata or {},
+        _stage_perf=stage_perf or [],
+    )
+
+
+def make_empty_task() -> EmptyTask:
+    if callable(EmptyTask):
+        try:
+            return EmptyTask(task_id="empty", dataset_name="omnifuse", data=None)
+        except TypeError:
+            return EmptyTask()
+    return copy.deepcopy(EmptyTask)
+
+
+def make_curator_pipeline(name: str, stages: list[Any], description: str | None = None) -> CuratorPipeline:
+    return CuratorPipeline(name=name, description=description, stages=stages)


Missing ProcessingStage, Resources, and re-export of EmptyTask

Every stage file (stages/eee.py, stages/sns.py, stages/projection.py, stages/datablend.py, stages/reader.py) imports ProcessingStage and Resources from this module. Neither is defined here or re-exported from it. stages/reader.py also imports EmptyTask directly. All five stage modules will fail immediately with ImportError when loaded, making the entire pipeline non-runnable.

These classes need to be either imported from nemo_curator and re-exported, or implemented as shim classes within this compat module.

Fixed imports in b8fc71b

greptile-apps · 2026-06-11T19:16:45Z

+
+from omnifuse_tutorial.compat.curator import make_curator_pipeline, make_empty_task
+from omnifuse_tutorial.config.models import ExperimentConfig
+from omnifuse_tutorial.data.io import write_json


Missing omnifuse_tutorial/data/ package

omnifuse_tutorial/data/io.py and omnifuse_tutorial/data/loader.py are imported by at least eight files — pipeline.py, utils.py, stages/eee.py, stages/projection.py, stages/datablend.py, stages/reader.py, sns/backends.py, and projection/trainer.py — but the entire omnifuse_tutorial/data/ directory does not exist in the repository. Every entry point to the tutorial will fail at import time with ModuleNotFoundError: No module named 'omnifuse_tutorial.data'.

Added in recent commit. Was .gitignored previously.

greptile-apps · 2026-06-11T19:16:46Z

+def _audio_format(path: Path) -> str:
+    audio_format = path.suffix.lower().lstrip(".")
+    if audio_format == "mpeg":
+        audio_format = "mp3"
+    if audio_format not in {"wav", "mp3"}:
+        raise ValueError(f"NVIDIA audio descriptions support wav/mp3 only, got {path.suffix} for {path}")
+    return audio_format


_audio_format raises for extensions declared as supported

AUDIO_EXTENSIONS and MIME_TYPES both include .flac, .m4a, .aac, and .ogg, advertising them as valid inputs. But _audio_format raises ValueError for any extension other than .wav and .mp3. The API description path in NvidiaApiEEEBackend._describe_raw → _describe_file → describe_file_with_nvidia_api calls this function unconditionally for audio records, so a user who provides a .flac or .m4a file will receive a confusing ValueError rather than a clear unsupported-format message.

Reconcile by either removing the unsupported extensions from AUDIO_EXTENSIONS/MIME_TYPES, converting to a supported format before sending, or raising a clear error with guidance early in the audio code path.

Fixed in b8fc71b

greptile-apps · 2026-06-11T19:16:47Z

+    "pandas>=2.0",
+    "pyyaml>=6.0",
+    "requests>=2.31",
+    "spacy>=3.8.14",


Duplicate spacy constraint — the same package appears twice with conflicting lower bounds (>=3.8.14 on line 17 and >=3.8,<4 on line 35). Keep a single, unambiguous entry.

Suggested change

"spacy>=3.8.14",

"spacy>=3.8.14,<4",

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Fixed in 367d7e9

greptile-apps · 2026-06-11T19:16:48Z

+    "qwen-vl-utils>=0.0.8",
+    "sentence-transformers>=2.2,<4",
+    "soundfile>=0.12",
+    "spacy>=3.8,<4",


Remove the second duplicate spacy entry.

Suggested change

"spacy>=3.8,<4",

# spacy consolidated above as spacy>=3.8.14,<4

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Fixed in 367d7e9

greptile-apps · 2026-06-11T19:16:49Z

+        path = Path(value)
+    except OSError:
+        return None
+    return path if path.exists() else path


_path_or_none always returns a Path, never None — return path if path.exists() else path is a tautology; both branches return the same value. The | None return type annotation is misleading. Either return None when the path does not exist, or simplify to return path.

Suggested change

return path if path.exists() else path

return path

Fixed in 367d7e9

greptile-apps · 2026-06-11T19:16:50Z

+
+```bash
+uv sync --extra dev


Incorrect setup directory in README

The instruction says to run uv sync --extra dev "from the Curator repository root", but uv sync reads pyproject.toml from the current directory. No uv workspace configuration ties this tutorial to the repository root, so running uv sync from the repo root installs the main Curator package rather than this tutorial's dependencies. Users should cd tutorials/multimodal/omni-fuse-data-curation/ first.

Fixed in 367d7e9

Signed-off-by: Harshil Kotamreddy <hkotamreddy@nvidia.com>

greptile-apps · 2026-06-11T21:07:14Z

+    return DocumentBatch(
+        task_id=task_id,
+        dataset_name=dataset_name,
+        data=pd.DataFrame.from_records(records),
+        _metadata=metadata or {},
+        _stage_perf=stage_perf or [],
+    )


task_id is declared as field(init=False, default="") in the Task base class, so it is intentionally excluded from DocumentBatch.__init__. Passing task_id=task_id here raises TypeError: __init__() got an unexpected keyword argument 'task_id' at runtime. Because every stage calls make_document_batch, the entire pipeline fails before any data is processed. The task_id is managed by the framework (set via _set_task_id() at each stage boundary) and must not be supplied to the constructor.

Suggested change

return DocumentBatch(

task_id=task_id,

dataset_name=dataset_name,

data=pd.DataFrame.from_records(records),

_metadata=metadata or {},

_stage_perf=stage_perf or [],

)

return DocumentBatch(

dataset_name=dataset_name,

data=pd.DataFrame.from_records(records),

_metadata=metadata or {},

_stage_perf=stage_perf or [],

)

Fixed in 21b8091

Reverted 21b8091 as DocumentBatch requires task_id

You're right to revert. My original comment was incorrect — task_id is defined as field(init=False, default="") on the Task base class, which means Python's @dataclass machinery excludes it from __init__ entirely. Passing task_id=task_id will indeed raise TypeError: __init__() got an unexpected keyword argument 'task_id' at runtime.

The current code in the PR still has this bug. The fix from commit 21b8091 (which you reverted) was correct — task_id should be omitted from the DocumentBatch(...) constructor call and set afterwards via _set_task_id() if needed, or simply left to the framework to assign.

Here's the correct version:

Suggested change

return DocumentBatch(

task_id=task_id,

dataset_name=dataset_name,

data=pd.DataFrame.from_records(records),

_metadata=metadata or {},

_stage_perf=stage_perf or [],

)

return DocumentBatch(

dataset_name=dataset_name,

data=pd.DataFrame.from_records(records),

_metadata=metadata or {},

_stage_perf=stage_perf or [],

)

If the caller needs to set a specific task_id, it should be done after construction:

batch = make_document_batch(...) batch._set_task_id("", task_id) # or however it fits the pipeline DAG

But in most cases, the framework handles task_id assignment automatically — make_document_batch doesn't need the task_id parameter at all.

Tip: You can customize Greptile's behavior for this repo with .greptile/rules.md and .greptile/config.json.

Fixed in e5d3594

Signed-off-by: Harshil Kotamreddy <hkotamreddy@nvidia.com>

greptile-apps · 2026-06-11T21:21:42Z

+    task = run_curator_step(
+        name=f"{config.experiment_id}-2-embed",
+        stages=[EEEEmbeddingStage(config=config)],
+        initial_task=load_sns_task(config),
+    )
+    records_path = config.run_dir / "embeddings" / "records.jsonl"
+    write_jsonl(records_path, records_from_task(task))
+    metadata = dict(getattr(task, "_metadata", {}) or {})
+    metadata["embedding_records_path"] = str(records_path)
+    task._metadata = metadata


task_id is not a parameter of make_document_batch

load_sns_task, load_embedding_task, and load_projection_task all call make_document_batch(task_id=..., dataset_name=..., records=..., metadata=...), but the current make_document_batch signature in compat/curator.py is (dataset_name, records, metadata=None, stage_perf=None) — it has no task_id parameter. Every one of these three loaders will raise TypeError: make_document_batch() got an unexpected keyword argument 'task_id' as soon as it is called, breaking all three resume paths (2_embed.py, 3_project.py, 4_datablend.py). Remove the task_id= keyword from all three call sites.

Reverted 21b8091 in 59225f5 so this should no longer be an issue.

DocumentBatch requires task_id This reverts commit 21b8091. Signed-off-by: Harshil Kotamreddy <hkotamreddy@nvidia.com>

Signed-off-by: Harshil Kotamreddy <hkotamreddy@nvidia.com>

greptile-apps · 2026-06-16T02:30:32Z

+        selected = ranker.select_top(ranked)
+
+        output_dir = self.config.run_dir / "datablend"
+        ranked_path = write_jsonl(output_dir / "datablend_ranked.jsonl", ranked)
+        topk_path = write_jsonl(output_dir / "datablend_topk.jsonl", selected)
+        metadata.update(
+            {
+                "datablend_ranked_path": str(ranked_path),
+                "datablend_topk_path": str(topk_path),
+                "datablend_size": len(selected),
+            }
+        )


datablend_size metadata contradicts the batch record count

make_document_batch is called with records=ranked (all N ranked records), but datablend_size in metadata is set to len(selected) (the top-k count). Any downstream code that consumes the output task via records_from_task(task) receives all N records, not datablend_size records — the two values will silently disagree. If the intent is for the curated output batch to contain only the selected subset, records=selected should be passed instead. If the intent is to carry all rankings forward, datablend_size should reflect len(ranked) (or be renamed to datablend_topk_size) so the contract is unambiguous.

Fixed in eabec7d

Signed-off-by: Harshil Kotamreddy <hkotamreddy@nvidia.com>

greptile-apps · 2026-06-16T02:56:44Z

+def _describe_chat_completion_file(
+    path: Path,
+    model: str,
+    content_type: str,
+    prompt: str,
+    api_base_url: str,
+    headers: dict[str, str],
+    timeout: int,
+) -> str:
+    encoded = base64.b64encode(path.read_bytes()).decode("utf-8")
+    mime = MIME_TYPES.get(path.suffix.lower(), "application/octet-stream")
+    if content_type == "input_audio":
+        media_content = {"type": "input_audio", "input_audio": {"data": encoded, "format": _audio_format(path)}}
+    else:
+        media_content = {content_type: {"url": f"data:{mime};base64,{encoded}"}, "type": content_type}
+    url = f"{api_base_url}/chat/completions"
+    payload = {
+        "model": model,
+        "messages": [{"role": "user", "content": [{"type": "text", "text": prompt}, media_content]}],
+        "max_tokens": 512,
+        "temperature": 0.2,
+        "stream": False,
+    }
+    response = _post_nvidia_json_with_retries(url, headers, payload, timeout)
+    response = _resolve_nvidia_response(response, api_base_url, headers, timeout, model, url)
+    return _response_text(response.json(), model, url)


Video/image files base64-encoded inline with no size guard

_describe_chat_completion_file reads the entire file with path.read_bytes() and embeds it as a base64 data-URL in the JSON body for both image_url and video_url content types. The audio code path has an explicit NVCF_ASSET_UPLOAD_THRESHOLD_BYTES = 180 * 1024 check and falls back to NVCF asset upload for large files. No equivalent guard exists here. A typical 10-second video file is easily 10–50 MB before encoding; base64 adds ~33 %, producing a JSON body the NVIDIA API will reject with HTTP 413 or a generic error — the same limit that drove the audio threshold. This silently breaks the video modality for any non-trivial input. Consider adding a size check and routing large image/video files through _upload_nvcf_asset (or raising an early ValueError with a clear message) just as audio already does.

Fixed in 2b13f6f

Signed-off-by: Harshil Kotamreddy <hkotamreddy@nvidia.com>

hk1510 added 2 commits June 11, 2026 19:02

Add omni-fuse tutorial

5fb1146

Signed-off-by: Harshil Kotamreddy <hkotamreddy@nvidia.com>

Delete tutorials/multimodal/omni-fuse-data-curation/configs/e2e_multi…

cfda623

…modal_hybrid_real.yaml Signed-off-by: Harshil Kotamreddy <hkotamreddy@nvidia.com>

hk1510 requested review from a team as code owners June 11, 2026 19:11

hk1510 requested review from abhinavg4 and removed request for a team June 11, 2026 19:11

arhamm1 requested review from suiyoubi and removed request for abhinavg4 June 11, 2026 19:13

greptile-apps Bot reviewed Jun 11, 2026

View reviewed changes

hk1510 and others added 4 commits June 11, 2026 20:34

add data folder (previously .gitignored)

f17b24e

Signed-off-by: Harshil Kotamreddy <hkotamreddy@nvidia.com>

fix imports and audio extensions

b8fc71b

Signed-off-by: Harshil Kotamreddy <hkotamreddy@nvidia.com>

P2 greptile fixes

367d7e9

Signed-off-by: Harshil Kotamreddy <hkotamreddy@nvidia.com>

Merge branch 'main' into main

d47730a

greptile-apps Bot reviewed Jun 11, 2026

View reviewed changes

remove task_id from DocumentBatch init

21b8091

Signed-off-by: Harshil Kotamreddy <hkotamreddy@nvidia.com>

greptile-apps Bot reviewed Jun 11, 2026

View reviewed changes

Revert "remove task_id from DocumentBatch init"

59225f5

DocumentBatch requires task_id This reverts commit 21b8091. Signed-off-by: Harshil Kotamreddy <hkotamreddy@nvidia.com>

hk1510 force-pushed the main branch from ac94ef2 to 59225f5 Compare June 11, 2026 21:37

hk1510 added 2 commits June 16, 2026 01:52

add compatibilty with older curator versions

e5d3594

Signed-off-by: Harshil Kotamreddy <hkotamreddy@nvidia.com>

remove default sns backend factory and use numpy in write_npy

dc27a6c

Signed-off-by: Harshil Kotamreddy <hkotamreddy@nvidia.com>

greptile-apps Bot reviewed Jun 16, 2026

View reviewed changes

fix metadata mismatch

eabec7d

Signed-off-by: Harshil Kotamreddy <hkotamreddy@nvidia.com>

greptile-apps Bot reviewed Jun 16, 2026

View reviewed changes

nvcf upload for large videos

2b13f6f

Signed-off-by: Harshil Kotamreddy <hkotamreddy@nvidia.com>

	"spacy>=3.8,<4",
	# spacy consolidated above as spacy>=3.8.14,<4

Uh oh!

Conversation

hk1510 commented Jun 11, 2026

Description

Usage

Checklist

Uh oh!

copy-pr-bot Bot commented Jun 11, 2026

Uh oh!

greptile-apps Bot commented Jun 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Greptile Summary

Confidence Score: 5/5

Important Files Changed

Sequence Diagram

Uh oh!

greptile-apps Bot Jun 11, 2026

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot Jun 11, 2026

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot Jun 11, 2026

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot Jun 11, 2026

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot Jun 11, 2026

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot Jun 11, 2026

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot Jun 11, 2026

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot Jun 11, 2026

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot Jun 11, 2026

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot Jun 11, 2026

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot Jun 16, 2026

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot Jun 16, 2026

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

greptile-apps Bot commented Jun 11, 2026 •

edited

Loading