Add model-based HTML extraction stage #1768

Open
zeel2104 wants to merge 2 commits into NVIDIA-NeMo:main from zeel2104:zeel/model-based-html-extraction

@zeel2104 zeel2104 commented Apr 8, 2026

Description

Adds a model-based HTML extraction algorithm for Common Crawl extraction.

This introduces ModelBasedHTMLExtractionStage, which classifies candidate HTML elements, preserves structured content such as fenced code blocks, math formulas, and Markdown tables, and falls back to Trafilatura when model confidence is low. It is wired into CommonCrawlHTMLExtractor via algorithm="model" and algorithm="model_based".

Closes #1723.

Usage

from nemo_curator.stages.text.download.common_crawl.extract import CommonCrawlHTMLExtractor

html_extractor = CommonCrawlHTMLExtractor(
    algorithm="model",
    algorithm_kwargs={
        "model_identifier": "opendatalab/MinerU-HTML-0.6B",
        "output_format": "markdown",
        "fallback_threshold": 0.65,
        "device": "cuda",
        "batch_size": 64,
    },
)

## Checklist
<!--
Note: All commits need to be signed and signed off. This can be done via the `-sS` flags while committing:
`git commit -sS -m "...."`
-->
- [x] I am familiar with the [Contributing Guide](https://github.com/NVIDIA-NeMo/Curator/blob/main/CONTRIBUTING.md).
- [x] New or Existing tests cover these changes.
- [ ] The documentation is up to date with these changes.

## Tests

uv run pytest tests/stages/text/download/test_model_based_html_extractor.py -q
uv run ruff check nemo_curator/stages/text/download/html_extractors/model_based.py tests/stages/text/download/test_model_based_html_extractor.py nemo_curator/stages/text/download/common_crawl/extract.py

@zeel2104 zeel2104 requested a review from a team as a code owner April 8, 2026 17:14
@zeel2104 zeel2104 requested review from huvunvidia and removed request for a team April 8, 2026 17:14
@copy-pr-bot

copy-pr-bot Bot commented Apr 8, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@greptile-apps

greptile-apps Bot commented Apr 8, 2026

Contributor

Greptile Summary

This PR adds a ModelBasedHTMLExtractionStage that classifies HTML candidate elements using a Hugging Face sequence classifier, preserves structured content (code blocks, math, tables) as Markdown, and falls back to Trafilatura when confidence is low. The building blocks — candidate extraction, inference, and assembly stages — are well-implemented, but the integration with the pipeline entry point is incomplete.

  • CommonCrawlDownloadExtractStage(html_extraction="model") raises a ValueError at construction, because stage.py was not updated to route model-based requests through the new multi-stage pipeline; the error message incorrectly directs users to the same broken path.
  • CommonCrawlModelBasedCandidateExtractor.extract returns list[dict] while the DocumentExtractor base class declares dict | None, which would cause a TypeError in DocumentIterateExtractStage when the extractor is eventually wired in.

Confidence Score: 3/5

The new model-based stages are architecturally sound but the end-to-end feature is non-functional — the pipeline entry point raises an error and the candidate extractor has a type contract violation that would crash the iterator stage when wired up.

Two defects affect the core advertised feature: CommonCrawlDownloadExtractStage(html_extraction='model') fails at construction with a self-referential error (stage.py was not updated), and CommonCrawlModelBasedCandidateExtractor.extract returns a list where the base class and DocumentIterateExtractStage expect a dict. Both must be resolved before the feature is usable.

nemo_curator/stages/text/download/common_crawl/extract.py and nemo_curator/stages/text/download/common_crawl/stage.py need attention — the wiring between the new multi-stage pipeline and the existing CommonCrawlDownloadExtractStage is absent.

Important Files Changed

| Filename | Overview |
| --- | --- |
| nemo_curator/stages/text/download/common_crawl/extract.py | Adds CommonCrawlModelBasedCandidateExtractor and model/model_based algorithm detection, but the ValueError for those strings is self-referential (stage.py still routes through this class), making the feature's pipeline entry point unreachable; also CommonCrawlModelBasedCandidateExtractor.extract returns list[dict] while the base class declares `dict \| None`. |
| nemo_curator/stages/text/download/html_extractors/model_based.py | New file implementing element extraction, classification, and assembly stages with code/formula/table rendering; logic is sound but AssembleModelBasedHTMLExtractionStage uses full HTML content as a groupby key unnecessarily. |
| nemo_curator/stages/text/download/html_extractors/`__init__.py` | Exports new model-based types from the package's public API; no issues. |
| tests/stages/text/download/test_model_based_html_extractor.py | Tests cover helper functions, rendering, lifecycle delegation, and the ValueError path; no test exercises CommonCrawlDownloadExtractStage(html_extraction='model'), which would have caught the broken wiring. |

Sequence Diagram

sequenceDiagram
    participant U as User
    participant CCDES as CommonCrawlDownloadExtractStage
    participant CCMCE as CommonCrawlModelBasedCandidateExtractor
    participant TS as TokenizerStage
    participant MBHIS as ModelBasedHTMLInferenceStage
    participant AMBHES as AssembleModelBasedHTMLExtractionStage

    note over CCDES,AMBHES: Intended model-based pipeline (NOT yet wired in stage.py)

    U->>CCDES: "html_extraction=model"
    CCDES-->>U: ValueError (self-referential error)

    note over CCDES,AMBHES: Correct flow once stage.py is updated

    CCDES->>CCMCE: extract HTML candidates per WARC record
    CCMCE->>CCMCE: extract_candidate_elements(BeautifulSoup)
    CCMCE-->>CCDES: list of candidate rows (1 row per HTML element)

    CCDES->>TS: tokenize MODEL_INPUT_FIELD column
    TS-->>CCDES: input_ids + attention_mask columns

    CCDES->>MBHIS: classify elements (batched GPU inference)
    MBHIS-->>CCDES: candidate_label + candidate_confidence columns

    CCDES->>AMBHES: assemble per-document text
    AMBHES->>AMBHES: group by (url, warc_id, source_id, language)
    AMBHES->>AMBHES: render accepted elements to markdown/plain
    alt "mean_confidence >= fallback_threshold"
        AMBHES-->>CCDES: text from model predictions
    else low confidence
        AMBHES->>AMBHES: TrafilaturaExtractor.extract_text (fallback)
        AMBHES-->>CCDES: text from fallback
    end

Reviews (9): Last reviewed commit: "Remove inline model execution from HTML ..."
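The fallback branch in the diagram above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the PR's code: `Candidate`, `assemble_text`, and the `accepted_label` value are hypothetical names, the mean is taken over all candidate confidences (the PR may aggregate differently), and `fallback` stands in for something like Trafilatura extraction.

```python
from __future__ import annotations

from dataclasses import dataclass
from statistics import fmean
from typing import Callable


@dataclass
class Candidate:
    """One classified HTML element (hypothetical shape, not the PR's schema)."""

    text: str
    label: str
    confidence: float


def assemble_text(
    candidates: list[Candidate],
    html: str,
    fallback: Callable[[str], str | None],
    fallback_threshold: float = 0.65,
    accepted_label: str = "main",  # assumed label name, not taken from the PR
) -> str | None:
    """Join accepted elements, or defer to `fallback` when confidence is low."""
    if not candidates:
        return fallback(html)
    # Assumption: document-level confidence is the mean over all candidates.
    mean_confidence = fmean(c.confidence for c in candidates)
    if mean_confidence < fallback_threshold:
        return fallback(html)
    blocks = [c.text for c in candidates if c.label == accepted_label and c.text]
    return "\n\n".join(blocks) if blocks else None
```

Passing the fallback as a callable keeps the threshold logic testable without loading either the model or Trafilatura.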

classifier: HTMLElementClassifier | None = None,
fallback_extractor: HTMLExtractorAlgorithm | None = None,
transformers_init_kwargs: dict[str, Any] | None = None,
):

Contributor

P1 local_files_only=True blocks model download out of the box

The default local_files_only=True means AutoTokenizer.from_pretrained and AutoModelForSequenceClassification.from_pretrained will refuse to hit the Hugging Face Hub and will raise an OSError (e.g., "Can't load tokenizer for 'opendatalab/MinerU-HTML-0.6B'") unless the model is already in the local cache. Since the default model_identifier is a public Hub model and the README usage example omits this parameter, any first-time user who follows the example will get an immediate failure with a confusing error message. The default should be False so the model is downloaded automatically.

Suggested change
):
local_files_only: bool = False,

Contributor

+1. local_files_only=False should be used in setup_on_node, and local_files_only=True should be used in setup.

Comment on lines +396 to +399
for line in block.splitlines():
    cells = [cell.strip() for cell in line.strip("|").split("|")]
    if cells and not all(cell == "---" for cell in cells):
        rows.append("\t".join(cells))

Contributor

P2 Aligned separator rows not filtered in plain-text table conversion

The separator filter checks all(cell == "---" for cell in cells), but GFM alignment indicators (":---", "---:", ":---:") are also valid separator row values. A table rendered with column alignment will have separator rows like | :--- | ---: |, which won't match "---" exactly and will be emitted as a data row in plain-text output. Consider using cell.strip(":") == "---" or a regex like r":?-{3,}:?" for the check.
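A minimal sketch of the regex-based check suggested above, assuming a hypothetical helper name (`markdown_table_to_tsv`) rather than the PR's actual function. It uses the pattern from the comment; note that GFM itself also accepts delimiter cells with fewer than three dashes, so a looser `-+` may be preferable.

```python
import re

# A delimiter cell: optional leading/trailing colon around three or more dashes.
_SEPARATOR_CELL = re.compile(r":?-{3,}:?")


def markdown_table_to_tsv(block: str) -> str:
    """Convert a Markdown pipe table to tab-separated rows, dropping the delimiter row."""
    rows = []
    for line in block.splitlines():
        cells = [cell.strip() for cell in line.strip().strip("|").split("|")]
        if not any(cells):
            continue  # skip blank lines
        if all(_SEPARATOR_CELL.fullmatch(cell) for cell in cells):
            continue  # skip "---", ":---", "---:", and ":---:" delimiter rows
        rows.append("\t".join(cells))
    return "\n".join(rows)
```

Using `fullmatch` rather than `match` avoids dropping a data row whose cells merely start with dashes.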

Comment on lines +349 to +352
fence = "```"
if fence in code:
    fence = "````"
return f"{fence}{language}\n{code.strip()}\n{fence}"

Contributor

P2 Fence escalation only covers one level

If the code content contains four consecutive backticks, neither the default ``` fence nor the escalated ```` fence will correctly delimit the block, producing broken Markdown. A more robust approach finds the longest run of backticks and uses one more:

import re
max_run = max((len(m.group()) for m in re.finditer(r"`+", code)), default=0)
fence = "`" * max(3, max_run + 1)
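Wrapped into a complete helper, the suggestion above might look like the sketch below. The function name `fence_code` is hypothetical; only the backtick-run logic comes from the comment.

```python
import re


def fence_code(code: str, language: str = "") -> str:
    """Wrap `code` in a backtick fence longer than any backtick run it contains."""
    longest_run = max((len(m.group()) for m in re.finditer(r"`+", code)), default=0)
    # CommonMark requires at least three backticks; go one past the longest run.
    fence = "`" * max(3, longest_run + 1)
    return f"{fence}{language}\n{code.strip()}\n{fence}"
```

Because the fence is always strictly longer than any run inside the content, no nested sequence can close the block early.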

@sarahyurick sarahyurick self-requested a review April 8, 2026 18:24
Signed-off-by: Zeel <desaizeel2128@gmail.com>
@zeel2104 zeel2104 force-pushed the zeel/model-based-html-extraction branch from 30944cf to 6b187e7 Compare April 8, 2026 20:15
@zeel2104

zeel2104 commented Apr 8, 2026

Author

Fixed (after the comments from greptile-apps):

  • changed local_files_only default to False so first-time use can download the model
  • made code fence generation robust to arbitrary backtick runs
  • filtered GFM alignment separator rows in plain-text table conversion
  • added test coverage for the plain-text table output path

@sarahyurick sarahyurick left a comment

Contributor

Thank you @zeel2104 ! I left a few high-level comments about common Curator patterns and suggestions for changes. Please let me know if you have any questions.

self,
model_identifier: str,
cache_dir: str | None,
device: Literal["cuda", "cpu"],

Contributor

Maybe it should be GPU only.

self._model: Any | None = None
self._tokenizer: Any | None = None

def _setup(self) -> None:

Contributor

There should be a setup_on_node function which uses huggingface_hub's snapshot_download function to make sure we only download the model once.

Then, when the model is loaded with setup, it can do local_files_only=True.
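The download-once, load-locally split described above could be sketched as follows. This is an illustration under assumptions: `PrefetchingModelStage` is a hypothetical stand-in rather than a Curator class, the real lifecycle hooks may take extra arguments (node or worker metadata), and the library imports are deferred into the helpers only so the sketch runs without those packages installed.

```python
from __future__ import annotations

from typing import Any, Callable


def hub_snapshot(model_identifier: str, cache_dir: str | None) -> str:
    """Download (or reuse) the model repo and return the local snapshot path."""
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=model_identifier, cache_dir=cache_dir)


def load_cached(model_identifier: str, cache_dir: str | None) -> tuple[Any, Any]:
    """Load tokenizer and model strictly from the local cache."""
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(
        model_identifier, cache_dir=cache_dir, local_files_only=True
    )
    model = AutoModelForSequenceClassification.from_pretrained(
        model_identifier, cache_dir=cache_dir, local_files_only=True
    )
    return tokenizer, model


class PrefetchingModelStage:
    """Hypothetical stage: network access in setup_on_node, cache-only in setup."""

    def __init__(
        self,
        model_identifier: str,
        cache_dir: str | None = None,
        download: Callable[[str, str | None], Any] = hub_snapshot,
        load: Callable[[str, str | None], tuple[Any, Any]] = load_cached,
    ) -> None:
        self.model_identifier = model_identifier
        self.cache_dir = cache_dir
        self._download = download
        self._load = load
        self._tokenizer: Any | None = None
        self._model: Any | None = None

    def setup_on_node(self) -> None:
        """Meant to run once per node: fetch the weights into the shared cache."""
        self._download(self.model_identifier, self.cache_dir)

    def setup(self) -> None:
        """Meant to run once per worker: load from the cache, never the network."""
        self._tokenizer, self._model = self._load(self.model_identifier, self.cache_dir)
```

With this split, several workers on one node share a single download instead of racing to fetch the same files.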

Comment on lines +114 to +115
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

Contributor

These can be top-level imports.

Comment on lines +129 to +131
if self.device == "cuda" and not torch.cuda.is_available():
    logger.warning("CUDA requested for model-based HTML extraction, but CUDA is unavailable. Using CPU.")
    self.device = "cpu"

Contributor

It should probably error in this case instead of warning.

self._model.eval()

def predict(self, elements: list[HTMLElement]) -> list[HTMLElementPrediction]:
self._setup()

Contributor

The setup function should never be called directly by predict or any other function within the stage. When the stage runs as part of a Pipeline, the executor will call it.

def predict(self, elements: list[HTMLElement]) -> list[HTMLElementPrediction]:
self._setup()

import torch

Contributor

This can be a top-level import.

Comment on lines +148 to +161
for start in range(0, len(elements), self.batch_size):
    batch = elements[start : start + self.batch_size]
    model_inputs = [self._format_element(element) for element in batch]
    encoded = tokenizer(
        model_inputs,
        padding=True,
        truncation=True,
        max_length=self.max_length,
        return_tensors="pt",
    ).to(self.device)
    with torch.inference_mode():
        logits = model(**encoded).logits
        probabilities = torch.softmax(logits, dim=-1)
        confidences, label_ids = torch.max(probabilities, dim=-1)

Contributor

This can be made into a CompositeStage with a TokenizerStage (should be able to just import from https://github.com/NVIDIA-NeMo/Curator/blob/main/nemo_curator/stages/text/models/tokenizer.py) and a GPU-based model stage.

Contributor

Bumping this. Tokenization and model inference should each be their own stage. We do this to maximize GPU usage at any given moment. If CPU tokenization and GPU inference are in the same stage, the GPU sits idle while it waits for tokenization to finish. Let me know if I can help more with the refactor here.
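The idle-GPU argument above can be illustrated with a toy producer/consumer sketch. It is not how Curator schedules stages (the executor handles that), no GPU is involved, and `run_split_stages` is a hypothetical name. It only shows that once tokenization is a separate step, the next batch can be prepared while the current one is in inference.

```python
import queue
import threading
from typing import Any, Callable, Iterable

_DONE = object()  # sentinel that marks the end of the batch stream


def run_split_stages(
    batches: Iterable[Any],
    tokenize: Callable[[Any], Any],
    infer: Callable[[Any], Any],
    max_pending: int = 4,
) -> list[Any]:
    """Tokenize in a background thread while the calling thread runs inference."""
    pending: queue.Queue = queue.Queue(maxsize=max_pending)

    def producer() -> None:
        try:
            for batch in batches:
                pending.put(tokenize(batch))
        finally:
            pending.put(_DONE)  # always unblock the consumer, even on error

    worker = threading.Thread(target=producer, daemon=True)
    worker.start()

    results = []
    while (item := pending.get()) is not _DONE:
        results.append(infer(item))  # stand-in for the GPU model call
    worker.join()
    return results
```

The bounded queue keeps the tokenizer from running arbitrarily far ahead of inference, which caps memory use.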

Comment thread nemo_curator/stages/text/download/html_extractors/model_based.py
model_identifier: str = "opendatalab/MinerU-HTML-0.6B",
output_format: ModelBasedOutputFormat = "markdown",
fallback_threshold: float = 0.65,
device: Literal["cuda", "cpu"] = "cuda",

Contributor

Should use Resources instead.

@zeel2104

Author

@sarahyurick
Thanks, Sarah. I addressed the smaller follow-ups in the latest push:

  • moved torch / transformers imports to top-level
  • changed the CUDA-unavailable path to raise instead of silently falling back
  • fixed the plain-text fenced-code edge case and added regression coverage

For the larger comments around setup_on_node, Resources, and using a CompositeStage with the tokenizer/model stage pattern: I dug into the current text download path, and HTMLExtractorAlgorithm / DocumentExtractor here are not ProcessingStages, so they don’t currently participate in executor lifecycle hooks the way the model/tokenizer stages do. I can work on refactoring this toward that pattern, but it seems like a broader architectural change than a small follow-up to this PR. Would you prefer that larger refactor in this PR?

Comment on lines +112 to +132
def _setup(self) -> None:
    if self._model is not None and self._tokenizer is not None:
        return

    self._tokenizer = AutoTokenizer.from_pretrained(
        self.model_identifier,
        cache_dir=self.cache_dir,
        local_files_only=self.local_files_only,
        **self.transformers_init_kwargs,
    )
    self._model = AutoModelForSequenceClassification.from_pretrained(
        self.model_identifier,
        cache_dir=self.cache_dir,
        local_files_only=self.local_files_only,
        **self.transformers_init_kwargs,
    )
    if self.device == "cuda" and not torch.cuda.is_available():
        msg = "CUDA requested for model-based HTML extraction, but CUDA is unavailable."
        raise RuntimeError(msg)
    self._model.to(self.device)
    self._model.eval()

Contributor

P1 CUDA check after model load leaves object in broken state

The CUDA availability check at line 128 runs after both self._tokenizer and self._model have already been assigned. When CUDA is unavailable the RuntimeError is raised, but at that point self._model is non-None (a CPU-loaded model). On the very next predict() call, _setup() sees both attributes non-None and returns early — the model is never moved to the intended device — then encoded.to(self.device) fails with a generic "No CUDA GPUs available" error instead of the helpful message, permanently.

Move the check to the top of _setup(), before any loading:

def _setup(self) -> None:
    if self._model is not None and self._tokenizer is not None:
        return
    if self.device == "cuda" and not torch.cuda.is_available():
        msg = "CUDA requested for model-based HTML extraction, but CUDA is unavailable."
        raise RuntimeError(msg)

    self._tokenizer = AutoTokenizer.from_pretrained(
        self.model_identifier,
        cache_dir=self.cache_dir,
        local_files_only=self.local_files_only,
        **self.transformers_init_kwargs,
    )
    self._model = AutoModelForSequenceClassification.from_pretrained(
        self.model_identifier,
        cache_dir=self.cache_dir,
        local_files_only=self.local_files_only,
        **self.transformers_init_kwargs,
    )
    self._model.to(self.device)
    self._model.eval()

Author

Fixed. I moved the CUDA availability check to the start of _setup() so we fail before assigning tokenizer/model state.

@sarahyurick

Contributor

Thanks @zeel2104 ! Yes, I would prefer the larger changes to be in this PR too.

@zeel2104

Author

Thanks, @sarahyurick . I updated the PR to align this path more closely with Curator’s stage lifecycle pattern.

Changes in the latest push:

  • added lifecycle/resource delegation through the document extraction path so the HTML extractor can participate in setup/setup_on_node/teardown
  • added setup_on_node() for model prefetch via snapshot_download
  • moved model initialization into setup() and removed the implicit setup call from predict()
  • added resource and Ray stage hints for the model-based extractor
  • kept the smaller follow-ups from before as well (top-level imports, explicit CUDA failure, plain-text fenced-code fix)

I also added coverage for the lifecycle/resource delegation path and reran:

  • uv run pytest tests/stages/text/download/test_model_based_html_extractor.py tests/stages/text/download/common_crawl/test_extract.py -q
  • uv run ruff check nemo_curator/stages/text/download/base/extract.py nemo_curator/stages/text/download/base/iterator.py nemo_curator/stages/text/download/common_crawl/extract.py nemo_curator/stages/text/download/html_extractors/model_based.py tests/stages/text/download/test_model_based_html_extractor.py

Let me know if any changes are required.

@sarahyurick sarahyurick left a comment

Contributor

Hi @zeel2104 thank you for the quick updates! I have left more comments. I think there is still some refactoring needed, let me know what you think.

confidence: float


class HTMLElementClassifier(Protocol):

Contributor

I was wondering, why is this a Protocol?

model_identifier: str,
cache_dir: str | None,
device: Literal["cuda", "cpu"],
batch_size: int,

Contributor

This can be renamed to what we use for other model-based stages:

Suggested change
batch_size: int,
model_inference_batch_size: int,

Comment on lines +143 to +145
if self._model is None or self._tokenizer is None:
    msg = "Model-based HTML classifier was not initialized. Call setup() before inference."
    raise RuntimeError(msg)

Contributor

We probably should not have this. The Pipeline should always call setup.

Comment on lines +148 to +161
for start in range(0, len(elements), self.batch_size):
    batch = elements[start : start + self.batch_size]
    model_inputs = [self._format_element(element) for element in batch]
    encoded = tokenizer(
        model_inputs,
        padding=True,
        truncation=True,
        max_length=self.max_length,
        return_tensors="pt",
    ).to(self.device)
    with torch.inference_mode():
        logits = model(**encoded).logits
        probabilities = torch.softmax(logits, dim=-1)
        confidences, label_ids = torch.max(probabilities, dim=-1)

Contributor

Bumping this. Tokenization and model inference should each be their own stage. We do this to maximize GPU usage at any given moment. If CPU tokenization and GPU inference are in the same stage, the GPU sits idle while it waits for tokenization to finish. Let me know if I can help more with the refactor here.

@zeel2104

Author

@sarahyurick
Thanks, Sarah. I agree the remaining feedback is now pointing at a larger refactor rather than incremental cleanup.

My current change threads lifecycle/resource hooks through the existing HTMLExtractorAlgorithm / document extraction path, but it still keeps tokenization and model inference inside the same classifier flow. After looking again at the existing Curator tokenizer/model stages, I agree that this does not fully match the intended CPU-tokenization / GPU-inference split.

I think the right next step is to refactor this so that:

  • tokenization and model inference are separate stages
  • the batch size is renamed to model_inference_batch_size
  • the protocol/implicit classifier abstraction is simplified to better match the concrete pipeline structure

I’ll work on that refactor.

@zeel2104

Author

@sarahyurick I pushed a larger refactor for the model-based Common Crawl path.

The main change is that the pipeline path for html_extraction="model" / "model_based" now decomposes into separate stages rather than keeping tokenization and inference inside a single extractor flow:

  • candidate extraction from HTML
  • tokenizer stage
  • GPU model inference stage
  • final assembly stage

I also renamed the model batch parameter to model_inference_batch_size, replaced the protocol-style classifier abstraction with a concrete base class, and kept the direct extractor path working for compatibility/tests.

Let me know if this works for you.

@chtruong814 chtruong814 added the needs-follow-up Issue needs follow-up label Apr 17, 2026
@svcnvidia-nemo-ci svcnvidia-nemo-ci added waiting-on-maintainers Waiting on maintainers to respond and removed needs-follow-up Issue needs follow-up labels Apr 21, 2026

@sarahyurick sarahyurick left a comment

Contributor

Hi @zeel2104, sorry for the delay in getting back to you here. I left some comments; my main confusion and concerns are around some of the classes in model_based.py. Let me know what you think.

fallback_extractor=algorithm.fallback_extractor,
filename_column=filename_column,
)
return [base_stage.decompose()[0], base_stage.decompose()[1], iterate_stage, tokenizer_stage, inference_stage, assemble_stage]

Contributor

This logic looks okay, but I was a bit confused at first by base_stage.decompose()[0], base_stage.decompose()[1], iterate_stage. It seems like it could be base_stage.decompose()[0], base_stage.decompose()[1], base_stage.decompose()[2], with some type of check to make sure that base_stage decomposes into exactly 3 stages? Alternatively, it could be url_generation_stage, download_stage, iterate_stage.

Some checks that the stages are indeed BaseCommonCrawlUrlGenerator, CommonCrawlWARCDownloader, and CommonCrawlWarcIterator would probably be good too.
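One way to express the count and type checks suggested above is a small unpacking helper. This is a sketch under assumptions: `unpack_base_stages` is a hypothetical name, and the caller supplies the expected classes.

```python
from typing import Any, Sequence


def unpack_base_stages(stages: Sequence[Any], expected_types: Sequence[type]) -> tuple[Any, ...]:
    """Return `stages` as a tuple after verifying their count and types."""
    if len(stages) != len(expected_types):
        msg = f"Expected {len(expected_types)} stages, got {len(stages)}."
        raise ValueError(msg)
    for index, (stage, expected) in enumerate(zip(stages, expected_types)):
        if not isinstance(stage, expected):
            msg = f"Stage {index} is {type(stage).__name__}, expected {expected.__name__}."
            raise TypeError(msg)
    return tuple(stages)
```

In the PR's context, the call would presumably be along the lines of `url_generation_stage, download_stage, iterate_stage = unpack_base_stages(base_stage.decompose(), (BaseCommonCrawlUrlGenerator, CommonCrawlWARCDownloader, CommonCrawlWarcIterator))`, which also avoids calling `decompose()` more than once.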

Contributor

Maybe the file should just be called model.py, WDYT?

self._model: Any | None = None
self._tokenizer: Any | None = None

def _setup(self, local_files_only: bool = True) -> None:

Contributor

Can _setup_on_node be used to download the tokenizer and model files (e.g., with hf_hub_download), then _setup will just load them?

Contributor

Hmm, reading through the rest of the script, will ask more questions below.

Comment on lines +199 to +200
if self._model is None or self._tokenizer is None:
    self._setup(local_files_only=self.local_files_only)

Contributor

This can be removed; we want to make sure setup is always called by the pipeline.

model_identifier: str = "opendatalab/MinerU-HTML-0.6B",
output_format: ModelBasedOutputFormat = "markdown",
fallback_threshold: float = 0.65,
device: Literal["cuda"] = "cuda",

Contributor

This should either support Literal["cpu", "cuda"] or be removed and always use cuda.

for start in range(0, len(elements), self.model_inference_batch_size):
    batch = elements[start : start + self.model_inference_batch_size]
    model_inputs = [format_html_element_for_model(element) for element in batch]
    encoded = tokenizer(

Contributor

Why does it still tokenize here if a TokenizerStage is already being used in stage.py?


return rendered_blocks

def _get_classifier(self) -> HTMLElementClassifier:

Contributor

I don't think this function should be needed right? Like setup can just set self.classifier and then that's all.

confidence: float


class HTMLElementClassifier(ABC):

Contributor

Do we need to keep this class if _TransformersHTMLElementClassifier is the only one?

Comment on lines +398 to +404
rendered_blocks = [ModelBasedHTMLExtractionStage._render_element(element, label) for element, label in accepted]
rendered_blocks = [block for block in rendered_blocks if block]
if not rendered_blocks:
    return None

if self.output_format in {"plain", "plain_text"}:
    rendered_blocks = [ModelBasedHTMLExtractionStage._markdown_block_to_plain_text(block) for block in rendered_blocks]

Contributor

Could it be problematic to have ModelBasedHTMLExtractionStage hardcoded here? It might not match the algorithm being used by CommonCrawlModelBasedCandidateExtractor.

return {RayStageSpecKeys.IS_ACTOR_STAGE: True}

@staticmethod
def _extract_candidate_elements(soup: BeautifulSoup) -> list[HTMLElement]:

Contributor

Maybe I am misunderstanding, there seems to be some circular and/or dead logic here. CommonCrawlModelBasedCandidateExtractor.extract calls ModelBasedHTMLExtractionStage._extract_candidate_elements, and ModelBasedHTMLExtractionStage.extract_text calls ModelBasedHTMLExtractionStage._extract_candidate_elements and ModelBasedHTMLExtractionStage._select_and_render_blocks. I guess I am confused where ModelBasedHTMLExtractionStage.extract_text is being used? It looks like nowhere but I could be missing something.

@svcnvidia-nemo-ci svcnvidia-nemo-ci added waiting-on-customer Waiting on the original author to respond and removed waiting-on-maintainers Waiting on maintainers to respond labels May 8, 2026
@zeel2104

Author

@sarahyurick
Thanks, Sarah. I pushed another simplification pass.

Main changes:

  • made the Common Crawl model-based stage construction explicit (url_generation_stage, download_stage, iterate_stage) and added checks around the expected stage types/components
  • moved the shared HTML candidate extraction and rendering logic to module-level helpers instead of routing through ModelBasedHTMLExtractionStage statics
  • removed the extra classifier abstraction / helper path and simplified ModelBasedHTMLExtractionStage into a thinner direct-use wrapper
  • kept the actual Common Crawl pipeline path as separate candidate-extraction, tokenizer, inference, and assembly stages

@zeel2104 zeel2104 force-pushed the zeel/model-based-html-extraction branch from 764ca28 to f7b3cfb Compare May 11, 2026 14:19
@svcnvidia-nemo-ci svcnvidia-nemo-ci removed the waiting-on-customer Waiting on the original author to respond label May 11, 2026
Comment on lines +551 to +578
model = self._model
tokenizer = self._tokenizer
predictions: list[HTMLElementPrediction] = []
for start in range(0, len(elements), self.model_inference_batch_size):
    batch = elements[start : start + self.model_inference_batch_size]
    model_inputs = [format_html_element_for_model(element) for element in batch]
    encoded = tokenizer(
        model_inputs,
        padding=True,
        truncation=True,
        max_length=self.max_length,
        return_tensors="pt",
    ).to(self.device)
    with torch.inference_mode():
        logits = model(**encoded).logits
        probabilities = torch.softmax(logits, dim=-1)
        confidences, label_ids = torch.max(probabilities, dim=-1)

    id2label = getattr(model.config, "id2label", {})
    for label_id, confidence in zip(label_ids.cpu().tolist(), confidences.cpu().tolist(), strict=True):
        predictions.append(
            HTMLElementPrediction(
                label=str(id2label.get(label_id, label_id)).lower(),
                confidence=float(confidence),
            )
        )

return predictions

Contributor

Same comment as before: is there a reason we can't split this into a CPU-only tokenizer stage and a GPU-based model inference stage?
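
For illustration only, the split being asked for could look roughly like the sketch below. It is a dependency-free toy, not NeMo Curator code: `TokenizeStage`, `InferenceStage`, `TokenizedBatch`, `whitespace_tokenize`, and `toy_forward` are all hypothetical names, and the stub tokenizer and stub model stand in for the real Hugging Face objects. The point is only that the first stage needs no model or GPU, and the second consumes token ids rather than raw strings.

```python
import math
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class TokenizedBatch:
    """Output of the CPU stage: plain token ids, no model required."""

    input_ids: list[list[int]]


class TokenizeStage:
    """Hypothetical CPU-only stage: element strings in, token ids out."""

    def __init__(self, tokenize: Callable[[str], list[int]], max_length: int) -> None:
        self.tokenize = tokenize
        self.max_length = max_length

    def process(self, texts: Sequence[str]) -> TokenizedBatch:
        # Truncate each sequence to max_length, as the inline code did.
        return TokenizedBatch([self.tokenize(t)[: self.max_length] for t in texts])


class InferenceStage:
    """Hypothetical GPU stage: token ids in, (label, confidence) pairs out."""

    def __init__(self, forward: Callable[[list[int]], list[float]], id2label: dict[int, str]) -> None:
        self.forward = forward
        self.id2label = id2label

    def process(self, batch: TokenizedBatch) -> list[tuple[str, float]]:
        results = []
        for ids in batch.input_ids:
            logits = self.forward(ids)
            # Numerically stable softmax over the logits.
            peak = max(logits)
            exps = [math.exp(x - peak) for x in logits]
            total = sum(exps)
            probs = [e / total for e in exps]
            label_id = max(range(len(probs)), key=probs.__getitem__)
            label = str(self.id2label.get(label_id, label_id)).lower()
            results.append((label, probs[label_id]))
        return results


def whitespace_tokenize(text: str) -> list[int]:
    """Stub tokenizer (assumption): one 'token id' per whitespace-separated word."""
    return [len(word) for word in text.split()]


def toy_forward(ids: list[int]) -> list[float]:
    """Stub model (assumption): longer inputs score higher for label 0."""
    return [float(len(ids)), 2.0]


tokenized = TokenizeStage(whitespace_tokenize, max_length=4).process(
    ["<p>some long main text here</p>", "<nav>menu</nav>"]
)
predictions = InferenceStage(toy_forward, {0: "MAIN", 1: "OTHER"}).process(tokenized)
```

Because `TokenizedBatch` is the only thing passed between the two stages, the tokenizer side could be scheduled on CPU workers while only `InferenceStage` reserves a GPU.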

@svcnvidia-nemo-ci svcnvidia-nemo-ci added the waiting-on-customer Waiting on the original author to respond label May 12, 2026
Signed-off-by: Zeel <desaizeel2128@gmail.com>
@zeel2104 zeel2104 force-pushed the zeel/model-based-html-extraction branch from f7b3cfb to b627347 Compare May 21, 2026 01:38
@zeel2104

Author

@sarahyurick
Thanks, Sarah. I made another cleanup pass to address that point directly.

The remaining inline tokenization/inference path has been removed from ModelBasedHTMLExtractionStage, so the model-based Common Crawl flow now only exists through the staged pipeline:

  • candidate extraction
  • tokenizer stage
  • GPU inference stage
  • assembly stage

I also made ModelBasedHTMLExtractionStage a configuration wrapper for that staged path, and updated CommonCrawlHTMLExtractor to reject algorithm="model" directly so we don’t leave a misleading partially-supported path around.

Comment on lines +99 to +104

if html is not None:
    # Language detection and HTML extraction
    lang = lang_detect(html)

    text = None

Contributor

P1 algorithm="model" entry point is unreachable from CommonCrawlDownloadExtractStage

The error message tells users to call CommonCrawlDownloadExtractStage(html_extraction='model'), but stage.py was not updated in this PR. It still passes html_extraction directly to CommonCrawlHTMLExtractor.__init__, where it hits this same ValueError. Any call to CommonCrawlDownloadExtractStage(html_extraction="model") will fail at construction with a self-referential error message. The stage.py file needs to detect html_extraction in {"model", "model_based"} and compose the new CommonCrawlModelBasedCandidateExtractor → TokenizerStage → ModelBasedHTMLInferenceStage → AssembleModelBasedHTMLExtractionStage pipeline instead of delegating to CommonCrawlHTMLExtractor.
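
A minimal sketch of the dispatch described above, assuming nothing about the real stage.py beyond what this comment says. `plan_extraction_stages` and the returned stage names are hypothetical placeholders, not the PR's classes; the sketch only shows the branch on `html_extraction` that would keep the model-based values away from `CommonCrawlHTMLExtractor`.

```python
# Values that should route to the staged model-based path (taken from the comment above).
MODEL_ALGORITHMS = frozenset({"model", "model_based"})


def plan_extraction_stages(html_extraction: str) -> list[str]:
    """Return placeholder stage names for the requested extraction algorithm."""
    if html_extraction in MODEL_ALGORITHMS:
        # Compose the multi-stage pipeline instead of building a single extractor.
        return ["candidate_extraction", "tokenizer", "gpu_inference", "assembly"]
    # Every other algorithm keeps the existing single-extractor path.
    return [f"html_extractor[{html_extraction}]"]


model_plan = plan_extraction_stages("model")
legacy_plan = plan_extraction_stages("justext")
```

With a branch like this in the composite stage, the ValueError raised by the extractor would only be reachable by callers who construct the extractor directly.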

Contributor

+1, I think the PR is not usable as is.

FYI, I think PR #2075 is going to take over this work. Thank you!

CANDIDATE_TEXT_FIELD,
CANDIDATE_HTML_FIELD,
CANDIDATE_ATTRIBUTES_FIELD,
MODEL_INPUT_FIELD,

Contributor

P1 Return type incompatible with DocumentExtractor base class

CommonCrawlModelBasedCandidateExtractor.extract returns list[dict[str, Any]] | None, but DocumentExtractor.extract is declared dict[str, Any] | None. When this extractor is eventually plugged into DocumentIterateExtractStage, the stage calls extracted[self.filename_col] = record_dict[self.filename_col] — a string-keyed assignment on what is a list, raising a TypeError. The interface contract needs to be resolved: either DocumentIterateExtractStage must be taught to flatten list returns, or CommonCrawlModelBasedCandidateExtractor should not inherit DocumentExtractor and instead be used only in the custom multi-stage pipeline path.

@svcnvidia-nemo-ci svcnvidia-nemo-ci added waiting-on-maintainers Waiting on maintainers to respond and removed waiting-on-customer Waiting on the original author to respond labels May 21, 2026
@sarahyurick sarahyurick self-requested a review June 11, 2026 20:57
@svcnvidia-nemo-ci svcnvidia-nemo-ci added waiting-on-customer Waiting on the original author to respond and removed waiting-on-maintainers Waiting on maintainers to respond labels Jun 16, 2026

Labels

community-request waiting-on-customer Waiting on the original author to respond

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Model-Based HTML Extraction

4 participants