Add model-based HTML extraction stage by zeel2104 · Pull Request #1768 · NVIDIA-NeMo/Curator

zeel2104 · 2026-04-08T17:14:57Z

Description

Adds a model-based HTML extraction algorithm for Common Crawl extraction.

This introduces ModelBasedHTMLExtractionStage, which classifies candidate HTML elements, preserves structured content such as fenced code blocks, math formulas, and Markdown tables, and falls back to Trafilatura when model confidence is low. It is wired into CommonCrawlHTMLExtractor via algorithm="model" and algorithm="model_based".

Closes #1723.

Usage

from nemo_curator.stages.text.download.common_crawl.extract import CommonCrawlHTMLExtractor

html_extractor = CommonCrawlHTMLExtractor(
    algorithm="model",
    algorithm_kwargs={
        "model_identifier": "opendatalab/MinerU-HTML-0.6B",
        "output_format": "markdown",
        "fallback_threshold": 0.65,
        "device": "cuda",
        "batch_size": 64,
    },
)

## Checklist
<!--
Note: All commits need to be signed and signed off. This can be done via the `-sS` flags while committing:
`git commit -sS -m "...."`
-->
- [x] I am familiar with the [Contributing Guide](https://github.com/NVIDIA-NeMo/Curator/blob/main/CONTRIBUTING.md).
- [x] New or Existing tests cover these changes.
- [ ] The documentation is up to date with these changes.

## Tests

uv run pytest tests/stages/text/download/test_model_based_html_extractor.py -q
uv run ruff check nemo_curator/stages/text/download/html_extractors/model_based.py tests/stages/text/download/test_model_based_html_extractor.py nemo_curator/stages/text/download/common_crawl/extract.py

copy-pr-bot · 2026-04-08T17:15:01Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

greptile-apps · 2026-04-08T17:18:25Z

Greptile Summary

This PR adds a ModelBasedHTMLExtractionStage that classifies HTML candidate elements using a Hugging Face sequence classifier, preserves structured content (code blocks, math, tables) as Markdown, and falls back to Trafilatura when confidence is low. The building blocks — candidate extraction, inference, and assembly stages — are well-implemented, but the integration with the pipeline entry point is incomplete.

CommonCrawlDownloadExtractStage(html_extraction="model") raises a ValueError at construction, because stage.py was not updated to route model-based requests through the new multi-stage pipeline; the error message incorrectly directs users to the same broken path.
CommonCrawlModelBasedCandidateExtractor.extract returns list[dict] while the DocumentExtractor base class declares dict | None, which would cause a TypeError in DocumentIterateExtractStage when the extractor is eventually wired in.

Confidence Score: 3/5

The new model-based stages are architecturally sound but the end-to-end feature is non-functional — the pipeline entry point raises an error and the candidate extractor has a type contract violation that would crash the iterator stage when wired up.

Two defects affect the core advertised feature: CommonCrawlDownloadExtractStage(html_extraction='model') fails at construction with a self-referential error (stage.py was not updated), and CommonCrawlModelBasedCandidateExtractor.extract returns a list where the base class and DocumentIterateExtractStage expect a dict. Both must be resolved before the feature is usable.

nemo_curator/stages/text/download/common_crawl/extract.py and nemo_curator/stages/text/download/common_crawl/stage.py need attention — the wiring between the new multi-stage pipeline and the existing CommonCrawlDownloadExtractStage is absent.

Important Files Changed

Filename	Overview
nemo_curator/stages/text/download/common_crawl/extract.py	Adds CommonCrawlModelBasedCandidateExtractor and model/model_based algorithm detection, but the ValueError for those strings is self-referential (stage.py still routes through this class), making the feature's pipeline entry point unreachable; also CommonCrawlModelBasedCandidateExtractor.extract returns list[dict] while the base class declares dict | None.
nemo_curator/stages/text/download/html_extractors/model_based.py	New file implementing element extraction, classification, and assembly stages with code/formula/table rendering; logic is sound but AssembleModelBasedHTMLExtractionStage uses full HTML content as a groupby key unnecessarily.
nemo_curator/stages/text/download/html_extractors/__init__.py	Exports new model-based types from the package's public API; no issues.
tests/stages/text/download/test_model_based_html_extractor.py	Tests cover helper functions, rendering, lifecycle delegation, and the ValueError path; no test exercises CommonCrawlDownloadExtractStage(html_extraction='model') which would have caught the broken wiring.

Sequence Diagram

sequenceDiagram
    participant U as User
    participant CCDES as CommonCrawlDownloadExtractStage
    participant CCMCE as CommonCrawlModelBasedCandidateExtractor
    participant TS as TokenizerStage
    participant MBHIS as ModelBasedHTMLInferenceStage
    participant AMBHES as AssembleModelBasedHTMLExtractionStage

    note over CCDES,AMBHES: Intended model-based pipeline (NOT yet wired in stage.py)

    U->>CCDES: "html_extraction=model"
    CCDES-->>U: ValueError (self-referential error)

    note over CCDES,AMBHES: Correct flow once stage.py is updated

    CCDES->>CCMCE: extract HTML candidates per WARC record
    CCMCE->>CCMCE: extract_candidate_elements(BeautifulSoup)
    CCMCE-->>CCDES: list of candidate rows (1 row per HTML element)

    CCDES->>TS: tokenize MODEL_INPUT_FIELD column
    TS-->>CCDES: input_ids + attention_mask columns

    CCDES->>MBHIS: classify elements (batched GPU inference)
    MBHIS-->>CCDES: candidate_label + candidate_confidence columns

    CCDES->>AMBHES: assemble per-document text
    AMBHES->>AMBHES: group by (url, warc_id, source_id, language)
    AMBHES->>AMBHES: render accepted elements to markdown/plain
    alt "mean_confidence >= fallback_threshold"
        AMBHES-->>CCDES: text from model predictions
    else low confidence
        AMBHES->>AMBHES: TrafilaturaExtractor.extract_text (fallback)
        AMBHES-->>CCDES: text from fallback
    end
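The fallback branch at the end of the diagram can be sketched in a few lines. This is a minimal illustration, assuming the document-level score is the plain mean of the accepted elements' confidences, as the diagram's `mean_confidence` label suggests. `choose_text`, `model_text`, and `fallback_text` are hypothetical names, not identifiers from the PR.

```python
from __future__ import annotations

from statistics import fmean


def choose_text(
    confidences: list[float],
    model_text: str | None,
    fallback_text: str | None,
    fallback_threshold: float = 0.65,
) -> str | None:
    """Return the model-rendered text unless the mean confidence is too low."""
    # No predictions or no rendered text: nothing to trust, so fall back.
    if not confidences or not model_text:
        return fallback_text
    mean_confidence = fmean(confidences)
    # Mirrors the diagram's `mean_confidence >= fallback_threshold` branch.
    if mean_confidence >= fallback_threshold:
        return model_text
    return fallback_text
```

In this sketch a document with no accepted elements is treated the same as a low-confidence one; whether the PR makes that same choice is an assumption.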

Reviews (9): Last reviewed commit: "Remove inline model execution from HTML ..."

greptile-apps · 2026-04-08T17:18:30Z

+        classifier: HTMLElementClassifier | None = None,
+        fallback_extractor: HTMLExtractorAlgorithm | None = None,
+        transformers_init_kwargs: dict[str, Any] | None = None,
+    ):


local_files_only=True blocks model download out of the box

The default local_files_only=True means AutoTokenizer.from_pretrained and AutoModelForSequenceClassification.from_pretrained will refuse to hit the Hugging Face Hub and will raise an OSError (e.g., "Can't load tokenizer for 'opendatalab/MinerU-HTML-0.6B'") unless the model is already in the local cache. Since the default model_identifier is a public Hub model and the README usage example omits this parameter, any first-time user who follows the example will get an immediate failure with a confusing error message. The default should be False so the model is downloaded automatically.

Suggested change

):

local_files_only: bool = False,

+1 local_files_only=False should be used in setup_on_node and local_files_only=True should be used in setup.

greptile-apps · 2026-04-08T17:18:33Z

+            for line in block.splitlines():
+                cells = [cell.strip() for cell in line.strip("|").split("|")]
+                if cells and not all(cell == "---" for cell in cells):
+                    rows.append("\t".join(cells))


Aligned separator rows not filtered in plain-text table conversion

The separator filter checks all(cell == "---" for cell in cells), but GFM alignment indicators (":---", "---:", ":---:") are also valid separator row values. A table rendered with column alignment will have separator rows like | :--- | ---: |, which won't match "---" exactly and will be emitted as a data row in plain-text output. Consider using cell.strip(":") == "---" or a regex like r":?-{3,}:?" for the check.
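A minimal sketch of the regex-based check suggested above, using the reviewer's pattern `:?-{3,}:?`. The helper names `is_separator_row` and `markdown_table_to_tsv` are hypothetical and are not taken from the PR; the row-splitting logic follows the snippet under review.

```python
import re

# One delimiter cell: at least three hyphens, optionally wrapped in colons
# for left (":---"), right ("---:"), or center (":---:") alignment.
_SEPARATOR_CELL = re.compile(r":?-{3,}:?")


def _split_cells(line: str) -> list[str]:
    """Split a pipe-delimited table line into stripped cell strings."""
    return [cell.strip() for cell in line.strip().strip("|").split("|")]


def is_separator_row(line: str) -> bool:
    """True when every cell of the line is a (possibly aligned) delimiter."""
    cells = _split_cells(line)
    return all(_SEPARATOR_CELL.fullmatch(cell) for cell in cells)


def markdown_table_to_tsv(block: str) -> str:
    """Convert a Markdown table to tab-separated rows, dropping delimiter rows."""
    rows = []
    for line in block.splitlines():
        if is_separator_row(line):
            continue
        rows.append("\t".join(_split_cells(line)))
    return "\n".join(rows)
```

Using `fullmatch` rather than `match` matters here: a data cell such as `---x` would otherwise be mistaken for a delimiter.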

greptile-apps · 2026-04-08T17:18:35Z

+        fence = "```"
+        if fence in code:
+            fence = "````"
+        return f"{fence}{language}\n{code.strip()}\n{fence}"


Fence escalation only covers one level

If the code content contains four consecutive backticks, neither the default ``` fence nor the escalated ```` fence will correctly delimit the block, producing broken Markdown. A more robust approach finds the longest run of backticks and uses one more:

import re

max_run = max((len(m.group()) for m in re.finditer(r"`+", code)), default=0)
fence = "`" * max(3, max_run + 1)
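Wrapped into a helper, the longest-run rule can be exercised directly. This is a sketch only: `fence_code` is a hypothetical name, and the return format copies the f-string from the snippet under review.

```python
import re


def fence_code(code: str, language: str = "") -> str:
    """Wrap code in a backtick fence longer than any backtick run inside it."""
    # Longest consecutive run of backticks in the content (0 if there are none).
    longest_run = max((len(m.group()) for m in re.finditer(r"`+", code)), default=0)
    # A fence needs at least three backticks and must exceed the longest run.
    fence = "`" * max(3, longest_run + 1)
    return f"{fence}{language}\n{code.strip()}\n{fence}"
```

Content with no backticks still gets the usual three-backtick fence, while content containing a run of four gets a five-backtick fence.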

Signed-off-by: Zeel <desaizeel2128@gmail.com>

zeel2104 · 2026-04-08T20:18:23Z

Fixed: (after comments by greptile-bot)

changed local_files_only default to False so first-time use can download the model
made code fence generation robust to arbitrary backtick runs
filtered GFM alignment separator rows in plain-text table conversion
added test coverage for the plain-text table output path

sarahyurick

Thank you @zeel2104 ! I left a few high-level comments about common Curator patterns and suggestions for changes. Please let me know if you have any questions.

sarahyurick · 2026-04-13T17:55:40Z

+        self,
+        model_identifier: str,
+        cache_dir: str | None,
+        device: Literal["cuda", "cpu"],


Maybe it should be GPU only.

sarahyurick · 2026-04-13T17:58:04Z

+        self._model: Any | None = None
+        self._tokenizer: Any | None = None
+
+    def _setup(self) -> None:


There should be a setup_on_node function which uses huggingface_hub's snapshot_download function to make sure we only download the model once.

Then, when the model is loaded with setup, it can do local_files_only=True.
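The download-once, load-locally split described above might look roughly like the following. This is a sketch under assumptions: `ClassifierLoaderSketch` is a hypothetical class, and Curator's real `setup_on_node` and `setup` hooks may take executor-provided arguments that are omitted here. The imports are deferred into the methods only so the sketch can be imported without the ML dependencies installed; the review asks for top-level imports in the actual stage.

```python
from __future__ import annotations

from typing import Any


class ClassifierLoaderSketch:
    """Hypothetical holder showing the two-phase download/load split."""

    def __init__(self, model_identifier: str, cache_dir: str | None = None) -> None:
        self.model_identifier = model_identifier
        self.cache_dir = cache_dir
        self._model: Any | None = None
        self._tokenizer: Any | None = None

    def setup_on_node(self) -> None:
        """Intended to run once per node: fetch all model files into the cache."""
        from huggingface_hub import snapshot_download

        snapshot_download(repo_id=self.model_identifier, cache_dir=self.cache_dir)

    def setup(self) -> None:
        """Intended to run per worker: load from the cache without network access."""
        from transformers import AutoModelForSequenceClassification, AutoTokenizer

        self._tokenizer = AutoTokenizer.from_pretrained(
            self.model_identifier, cache_dir=self.cache_dir, local_files_only=True
        )
        self._model = AutoModelForSequenceClassification.from_pretrained(
            self.model_identifier, cache_dir=self.cache_dir, local_files_only=True
        )
        self._model.eval()
```

With this split, `local_files_only=True` in `setup` turns a missing prefetch into an immediate load error instead of a silent per-worker download.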

sarahyurick · 2026-04-13T17:58:13Z

+        import torch
+        from transformers import AutoModelForSequenceClassification, AutoTokenizer


These can be top-level imports.

sarahyurick · 2026-04-13T17:58:41Z

+        if self.device == "cuda" and not torch.cuda.is_available():
+            logger.warning("CUDA requested for model-based HTML extraction, but CUDA is unavailable. Using CPU.")
+            self.device = "cpu"


It should probably error in this case instead of warning.

sarahyurick · 2026-04-13T18:00:31Z

+        self._model.eval()
+
+    def predict(self, elements: list[HTMLElement]) -> list[HTMLElementPrediction]:
+        self._setup()


The setup function should never be called directly by predict or any other function within the stage. When running it as part of a Pipeline, the executor will call it.

sarahyurick · 2026-04-13T18:00:41Z

+    def predict(self, elements: list[HTMLElement]) -> list[HTMLElementPrediction]:
+        self._setup()
+
+        import torch


This can be a top-level import.

sarahyurick · 2026-04-13T18:01:56Z

+        for start in range(0, len(elements), self.batch_size):
+            batch = elements[start : start + self.batch_size]
+            model_inputs = [self._format_element(element) for element in batch]
+            encoded = tokenizer(
+                model_inputs,
+                padding=True,
+                truncation=True,
+                max_length=self.max_length,
+                return_tensors="pt",
+            ).to(self.device)
+            with torch.inference_mode():
+                logits = model(**encoded).logits
+                probabilities = torch.softmax(logits, dim=-1)
+                confidences, label_ids = torch.max(probabilities, dim=-1)


This can be made into a CompositeStage with a TokenizerStage (should be able to just import from https://github.com/NVIDIA-NeMo/Curator/blob/main/nemo_curator/stages/text/models/tokenizer.py) and a GPU-based model stage.

Bumping this. Each of tokenization and model inference should be its own stage. The reason we do this is so that we maximize GPU usage at any given moment. If CPU tokenization and GPU inference are in the same stage, then the GPU sits idle while waiting for tokenization to finish. Let me know if I can help more with the refactor here.
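The stage boundary being requested can be illustrated without any ML dependencies. In this toy sketch the tokenizer and classifier are plain callables supplied by the caller, and the column name `model_input` is an assumption; only `input_ids`, `attention_mask`, `candidate_label`, and `candidate_confidence` come from the sequence diagram earlier in the thread. It is not Curator's `TokenizerStage` API, just the data contract between a CPU step and a model step.

```python
from __future__ import annotations

from typing import Callable

Batch = dict[str, list]


def tokenizer_stage(
    batch: Batch, encode: Callable[[str], list[int]], max_length: int
) -> Batch:
    """CPU-only step: turn the text column into truncated token ids."""
    input_ids = [encode(text)[:max_length] for text in batch["model_input"]]
    attention_mask = [[1] * len(ids) for ids in input_ids]
    return {**batch, "input_ids": input_ids, "attention_mask": attention_mask}


def inference_stage(
    batch: Batch, score: Callable[[list[int]], tuple[str, float]]
) -> Batch:
    """Model step: reads only pre-tokenized columns, never raw text."""
    results = [score(ids) for ids in batch["input_ids"]]
    return {
        **batch,
        "candidate_label": [label for label, _ in results],
        "candidate_confidence": [confidence for _, confidence in results],
    }
```

Because `inference_stage` never touches the text column, an executor is free to run many tokenizer workers on CPU while a smaller number of GPU workers consume their output.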

sarahyurick · 2026-04-13T18:03:21Z

+        model_identifier: str = "opendatalab/MinerU-HTML-0.6B",
+        output_format: ModelBasedOutputFormat = "markdown",
+        fallback_threshold: float = 0.65,
+        device: Literal["cuda", "cpu"] = "cuda",


Should use Resources instead.

zeel2104 · 2026-04-13T19:49:11Z

@sarahyurick
Thanks, Sarah. I addressed the smaller follow-ups in the latest push:

moved torch / transformers imports to top-level
changed the CUDA-unavailable path to raise instead of silently falling back
fixed the plain-text fenced-code edge case and added regression coverage

For the larger comments around setup_on_node, Resources, and using a CompositeStage with the tokenizer/model stage pattern: I dug into the current text download path, and HTMLExtractorAlgorithm / DocumentExtractor here are not ProcessingStages, so they don’t currently participate in executor lifecycle hooks the way the model/tokenizer stages do. I can work on refactoring this toward that pattern, but it seems like a broader architectural change than a small follow-up to this PR. Would you prefer that larger refactor in this PR?

greptile-apps · 2026-04-13T19:50:56Z

+    def _setup(self) -> None:
+        if self._model is not None and self._tokenizer is not None:
+            return
+
+        self._tokenizer = AutoTokenizer.from_pretrained(
+            self.model_identifier,
+            cache_dir=self.cache_dir,
+            local_files_only=self.local_files_only,
+            **self.transformers_init_kwargs,
+        )
+        self._model = AutoModelForSequenceClassification.from_pretrained(
+            self.model_identifier,
+            cache_dir=self.cache_dir,
+            local_files_only=self.local_files_only,
+            **self.transformers_init_kwargs,
+        )
+        if self.device == "cuda" and not torch.cuda.is_available():
+            msg = "CUDA requested for model-based HTML extraction, but CUDA is unavailable."
+            raise RuntimeError(msg)
+        self._model.to(self.device)
+        self._model.eval()


CUDA check after model load leaves object in broken state

The CUDA availability check at line 128 runs after both self._tokenizer and self._model have already been assigned. When CUDA is unavailable the RuntimeError is raised, but at that point self._model is non-None (a CPU-loaded model). On the very next predict() call, _setup() sees both attributes non-None and returns early — the model is never moved to the intended device — then encoded.to(self.device) fails with a generic "No CUDA GPUs available" error instead of the helpful message, permanently.

Move the check to the top of _setup(), before any loading:

def _setup(self) -> None:
    if self._model is not None and self._tokenizer is not None:
        return
    if self.device == "cuda" and not torch.cuda.is_available():
        msg = "CUDA requested for model-based HTML extraction, but CUDA is unavailable."
        raise RuntimeError(msg)
    self._tokenizer = AutoTokenizer.from_pretrained(
        self.model_identifier,
        cache_dir=self.cache_dir,
        local_files_only=self.local_files_only,
        **self.transformers_init_kwargs,
    )
    self._model = AutoModelForSequenceClassification.from_pretrained(
        self.model_identifier,
        cache_dir=self.cache_dir,
        local_files_only=self.local_files_only,
        **self.transformers_init_kwargs,
    )
    self._model.to(self.device)
    self._model.eval()

Fixed. I moved the CUDA availability check to the start of _setup() so we fail before assigning tokenizer/model state.

sarahyurick · 2026-04-13T20:10:28Z

Thanks @zeel2104 ! Yes, I would prefer the larger changes to be in this PR too.

zeel2104 · 2026-04-14T00:44:53Z

Thanks, @sarahyurick . I updated the PR to align this path more closely with Curator’s stage lifecycle pattern.

Changes in the latest push:

added lifecycle/resource delegation through the document extraction path so the HTML extractor can participate in setup/setup_on_node/teardown
added setup_on_node() for model prefetch via snapshot_download
moved model initialization into setup() and removed the implicit setup call from predict()
added resource and Ray stage hints for the model-based extractor
kept the smaller follow-ups from before as well (top-level imports, explicit CUDA failure, plain-text fenced-code fix)

I also added coverage for the lifecycle/resource delegation path and reran:

uv run pytest tests/stages/text/download/test_model_based_html_extractor.py tests/stages/text/download/common_crawl/test_extract.py -q
uv run ruff check nemo_curator/stages/text/download/base/extract.py nemo_curator/stages/text/download/base/iterator.py nemo_curator/stages/text/download/common_crawl/extract.py nemo_curator/stages/text/download/html_extractors/model_based.py tests/stages/text/download/test_model_based_html_extractor.py

Let me know if any changes are required.

sarahyurick

Hi @zeel2104 thank you for the quick updates! I have left more comments. I think there is still some refactoring needed, let me know what you think.

sarahyurick · 2026-04-14T17:28:39Z

+    confidence: float
+
+
+class HTMLElementClassifier(Protocol):


I was wondering, why is this a Protocol?

sarahyurick · 2026-04-14T17:29:34Z

+        model_identifier: str,
+        cache_dir: str | None,
+        device: Literal["cuda", "cpu"],
+        batch_size: int,


This can be renamed to what we use for other model-based stages:

Suggested change

batch_size: int,

model_inference_batch_size: int,

sarahyurick · 2026-04-14T17:30:48Z

+        if self._model is None or self._tokenizer is None:
+            msg = "Model-based HTML classifier was not initialized. Call setup() before inference."
+            raise RuntimeError(msg)


We probably should not have this. The Pipeline should always call setup.

sarahyurick · 2026-04-14T17:32:20Z

+        for start in range(0, len(elements), self.batch_size):
+            batch = elements[start : start + self.batch_size]
+            model_inputs = [self._format_element(element) for element in batch]
+            encoded = tokenizer(
+                model_inputs,
+                padding=True,
+                truncation=True,
+                max_length=self.max_length,
+                return_tensors="pt",
+            ).to(self.device)
+            with torch.inference_mode():
+                logits = model(**encoded).logits
+                probabilities = torch.softmax(logits, dim=-1)
+                confidences, label_ids = torch.max(probabilities, dim=-1)


Bumping this. Each of tokenization and model inference should be its own stage. The reason we do this is so that we maximize GPU usage at any given moment. If CPU tokenization and GPU inference are in the same stage, then the GPU sits idle while waiting for tokenization to finish. Let me know if I can help more with the refactor here.

zeel2104 · 2026-04-14T22:13:54Z

@sarahyurick
Thanks, Sarah. I agree the remaining feedback is now pointing at a larger refactor rather than incremental cleanup.

My current change threads lifecycle/resource hooks through the existing HTMLExtractorAlgorithm / document extraction path, but it still keeps tokenization and model inference inside the same classifier flow. After looking again at the existing Curator tokenizer/model stages, I agree that this does not fully match the intended CPU-tokenization / GPU-inference split.

I think the right next step is to refactor this so that:

tokenization and model inference are separate stages
the batch size is renamed to model_inference_batch_size
the protocol/implicit classifier abstraction is simplified to better match the concrete pipeline structure

I’ll work on that refactor.

zeel2104 · 2026-04-15T13:39:17Z

@sarahyurick I pushed a larger refactor for the model-based Common Crawl path.

The main change is that the pipeline path for html_extraction="model" / "model_based" now decomposes into separate stages rather than keeping tokenization and inference inside a single extractor flow:

candidate extraction from HTML
tokenizer stage
GPU model inference stage
final assembly stage

I also renamed the model batch parameter to model_inference_batch_size, replaced the protocol-style classifier abstraction with a concrete base class, and kept the direct extractor path working for compatibility/tests.

Let me know if this works for you

sarahyurick

Hi @zeel2104 sorry for the delay in getting back to you here. I left some comments, my main confusion and concerns are around some of the classes in model_based.py. Let me know what you think.

sarahyurick · 2026-05-08T18:30:34Z

+            fallback_extractor=algorithm.fallback_extractor,
+            filename_column=filename_column,
+        )
+        return [base_stage.decompose()[0], base_stage.decompose()[1], iterate_stage, tokenizer_stage, inference_stage, assemble_stage]


This logic looks okay, but I was a bit confused at first by base_stage.decompose()[0], base_stage.decompose()[1], iterate_stage. It seems like it could be base_stage.decompose()[0], base_stage.decompose()[1], base_stage.decompose()[2] with some type of check to make sure that base_stage is only 3 stages? Or alternatively, it could be url_generation_stage, download_stage, iterate_stage.

Some checks that the stages are indeed BaseCommonCrawlUrlGenerator, CommonCrawlWARCDownloader, and CommonCrawlWarcIterator would probably be good too.
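One way to express both suggestions, named unpacking plus type checks, is sketched below. The three classes are empty stand-ins so the example runs on its own, and `unpack_base_stages` is a hypothetical helper. Whether `base_stage.decompose()` returns instances of these classes directly or wrapper stages around them is an assumption that would need checking against Curator.

```python
# Empty stand-ins so the sketch is self-contained; in Curator these names
# refer to the real Common Crawl components mentioned in the review.
class BaseCommonCrawlUrlGenerator:
    pass


class CommonCrawlWARCDownloader:
    pass


class CommonCrawlWarcIterator:
    pass


EXPECTED_BASE_STAGES = (
    BaseCommonCrawlUrlGenerator,
    CommonCrawlWARCDownloader,
    CommonCrawlWarcIterator,
)


def unpack_base_stages(stages: list) -> tuple:
    """Validate the decomposed base stages and return them in order."""
    if len(stages) != len(EXPECTED_BASE_STAGES):
        msg = f"Expected {len(EXPECTED_BASE_STAGES)} base stages, got {len(stages)}."
        raise ValueError(msg)
    for stage, expected_type in zip(stages, EXPECTED_BASE_STAGES):
        if not isinstance(stage, expected_type):
            msg = f"Expected {expected_type.__name__}, got {type(stage).__name__}."
            raise TypeError(msg)
    url_generation_stage, download_stage, iterate_stage = stages
    return url_generation_stage, download_stage, iterate_stage
```

Calling `decompose()` once and unpacking the result also avoids invoking it separately for each index.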

sarahyurick · 2026-05-08T18:30:56Z

Maybe the file should just be called model.py, WDYT?

sarahyurick · 2026-05-08T18:32:44Z

+        self._model: Any | None = None
+        self._tokenizer: Any | None = None
+
+    def _setup(self, local_files_only: bool = True) -> None:


Can _setup_on_node be used to download the tokenizer and model files (e.g., with hf_hub_download), then _setup will just load them?

Hmm, reading through the rest of the script, will ask more questions below.

sarahyurick · 2026-05-08T18:33:30Z

+        if self._model is None or self._tokenizer is None:
+            self._setup(local_files_only=self.local_files_only)


This can be removed, we want to make sure setup is always being called by the pipeline.

sarahyurick · 2026-05-08T18:35:53Z

+        model_identifier: str = "opendatalab/MinerU-HTML-0.6B",
+        output_format: ModelBasedOutputFormat = "markdown",
+        fallback_threshold: float = 0.65,
+        device: Literal["cuda"] = "cuda",


This should either support Literal["cpu", "cuda"] or be removed and always use cuda.

sarahyurick · 2026-05-08T18:42:54Z

+        for start in range(0, len(elements), self.model_inference_batch_size):
+            batch = elements[start : start + self.model_inference_batch_size]
+            model_inputs = [format_html_element_for_model(element) for element in batch]
+            encoded = tokenizer(


Why does it still tokenize here if a TokenizerStage is already being used in stage.py?

sarahyurick · 2026-05-08T18:47:29Z

+
+        return rendered_blocks
+
+    def _get_classifier(self) -> HTMLElementClassifier:


I don't think this function should be needed right? Like setup can just set self.classifier and then that's all.

sarahyurick · 2026-05-08T18:50:33Z

+    confidence: float
+
+
+class HTMLElementClassifier(ABC):


Do we need to keep this class if _TransformersHTMLElementClassifier is the only one?

sarahyurick · 2026-05-08T19:09:22Z

+        rendered_blocks = [ModelBasedHTMLExtractionStage._render_element(element, label) for element, label in accepted]
+        rendered_blocks = [block for block in rendered_blocks if block]
+        if not rendered_blocks:
+            return None
+
+        if self.output_format in {"plain", "plain_text"}:
+            rendered_blocks = [ModelBasedHTMLExtractionStage._markdown_block_to_plain_text(block) for block in rendered_blocks]


Could it be problematic to have ModelBasedHTMLExtractionStage hardcoded here? Like couldn't it not necessarily match the algorithm being used by CommonCrawlModelBasedCandidateExtractor?

sarahyurick · 2026-05-08T19:19:01Z

+        return {RayStageSpecKeys.IS_ACTOR_STAGE: True}
+
+    @staticmethod
+    def _extract_candidate_elements(soup: BeautifulSoup) -> list[HTMLElement]:


Maybe I am misunderstanding, there seems to be some circular and/or dead logic here. CommonCrawlModelBasedCandidateExtractor.extract calls ModelBasedHTMLExtractionStage._extract_candidate_elements, and ModelBasedHTMLExtractionStage.extract_text calls ModelBasedHTMLExtractionStage._extract_candidate_elements and ModelBasedHTMLExtractionStage._select_and_render_blocks. I guess I am confused where ModelBasedHTMLExtractionStage.extract_text is being used? It looks like nowhere but I could be missing something.

zeel2104 · 2026-05-11T13:35:37Z

@sarahyurick
Thanks, Sarah. I pushed another simplification pass.

Main changes:

made the Common Crawl model-based stage construction explicit (url_generation_stage, download_stage, iterate_stage) and added checks around the expected stage types/components
moved the shared HTML candidate extraction and rendering logic to module-level helpers instead of routing through ModelBasedHTMLExtractionStage statics
removed the extra classifier abstraction / helper path and simplified ModelBasedHTMLExtractionStage into a thinner direct-use wrapper
kept the actual Common Crawl pipeline path as separate candidate-extraction, tokenizer, inference, and assembly stages

sarahyurick · 2026-05-11T21:25:08Z

+        model = self._model
+        tokenizer = self._tokenizer
+        predictions: list[HTMLElementPrediction] = []
+        for start in range(0, len(elements), self.model_inference_batch_size):
+            batch = elements[start : start + self.model_inference_batch_size]
+            model_inputs = [format_html_element_for_model(element) for element in batch]
+            encoded = tokenizer(
+                model_inputs,
+                padding=True,
+                truncation=True,
+                max_length=self.max_length,
+                return_tensors="pt",
+            ).to(self.device)
+            with torch.inference_mode():
+                logits = model(**encoded).logits
+                probabilities = torch.softmax(logits, dim=-1)
+                confidences, label_ids = torch.max(probabilities, dim=-1)
+
+            id2label = getattr(model.config, "id2label", {})
+            for label_id, confidence in zip(label_ids.cpu().tolist(), confidences.cpu().tolist(), strict=True):
+                predictions.append(
+                    HTMLElementPrediction(
+                        label=str(id2label.get(label_id, label_id)).lower(),
+                        confidence=float(confidence),
+                    )
+                )
+
+        return predictions


Same comment as before, is there a reason we aren't able to break this into a CPU-only tokenizer stage and a GPU-based model inference stage?

Signed-off-by: Zeel <desaizeel2128@gmail.com>

zeel2104 · 2026-05-21T01:39:18Z

@sarahyurick
Thanks, Sarah. I made another cleanup pass to address that point directly.

The remaining inline tokenization/inference path has been removed from ModelBasedHTMLExtractionStage, so the model-based Common Crawl flow now only exists through the staged pipeline:

candidate extraction
tokenizer stage
GPU inference stage
assembly stage

I also made ModelBasedHTMLExtractionStage a configuration wrapper for that staged path, and updated CommonCrawlHTMLExtractor to reject algorithm="model" directly so we don’t leave a misleading partially-supported path around.

greptile-apps · 2026-05-21T01:45:53Z

+
+        if html is not None:
+            # Language detection and HTML extraction
+            lang = lang_detect(html)
+
+            text = None


algorithm="model" entry point is unreachable from CommonCrawlDownloadExtractStage

The error message tells users to call CommonCrawlDownloadExtractStage(html_extraction='model'), but stage.py was not updated in this PR — it still passes html_extraction directly to CommonCrawlHTMLExtractor.__init__, where it hits this same ValueError. Any call to CommonCrawlDownloadExtractStage(html_extraction="model") will fail at construction with a self-referential error message. The stage.py file needs to detect html_extraction in {"model", "model_based"} and compose the new CommonCrawlModelBasedCandidateExtractor → TokenizerStage → ModelBasedHTMLInferenceStage → AssembleModelBasedHTMLExtractionStage pipeline instead of delegating to CommonCrawlHTMLExtractor.

+1 I think the PR is not usable as is.

fyi the PR #2075 is going to take over this work I think, thank you!

greptile-apps · 2026-05-21T01:45:54Z

+            CANDIDATE_TEXT_FIELD,
+            CANDIDATE_HTML_FIELD,
+            CANDIDATE_ATTRIBUTES_FIELD,
+            MODEL_INPUT_FIELD,


Return type incompatible with DocumentExtractor base class

CommonCrawlModelBasedCandidateExtractor.extract returns list[dict[str, Any]] | None, but DocumentExtractor.extract is declared dict[str, Any] | None. When this extractor is eventually plugged into DocumentIterateExtractStage, the stage calls extracted[self.filename_col] = record_dict[self.filename_col] — a string-keyed assignment on what is a list, raising a TypeError. The interface contract needs to be resolved: either DocumentIterateExtractStage must be taught to flatten list returns, or CommonCrawlModelBasedCandidateExtractor should not inherit DocumentExtractor and instead be used only in the custom multi-stage pipeline path.
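The first option, teaching the iterate stage to accept list returns, could be sketched as a small normalization step. `normalize_extracted` and `attach_filename` are hypothetical helpers, not code from Curator or this PR; the sketch only shows how a `dict`, a `list[dict]`, or `None` can be handled uniformly before the filename column is attached.

```python
from __future__ import annotations

from typing import Any

Row = dict[str, Any]


def normalize_extracted(extracted: Row | list[Row] | None) -> list[Row]:
    """Coerce an extractor result into a (possibly empty) list of rows."""
    if extracted is None:
        return []
    if isinstance(extracted, list):
        return extracted
    return [extracted]


def attach_filename(
    extracted: Row | list[Row] | None, filename_col: str, filename: str
) -> list[Row]:
    """Copy each row and add the filename column, whatever shape came in."""
    return [{**row, filename_col: filename} for row in normalize_extracted(extracted)]
```

Building new row dicts instead of assigning into `extracted` sidesteps the string-key-on-a-list `TypeError` described above and leaves the extractor's output unmodified.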

zeel2104 requested a review from a team as a code owner April 8, 2026 17:14

zeel2104 requested review from huvunvidia and removed request for a team April 8, 2026 17:14

greptile-apps Bot reviewed Apr 8, 2026

sarahyurick self-requested a review April 8, 2026 18:24

Add model-based HTML extraction stage

6b187e7

Signed-off-by: Zeel <desaizeel2128@gmail.com>

zeel2104 force-pushed the zeel/model-based-html-extraction branch from 30944cf to 6b187e7 April 8, 2026 20:15

sarahyurick reviewed Apr 13, 2026

github-actions Bot added the community-request label Apr 13, 2026

greptile-apps Bot reviewed Apr 13, 2026

sarahyurick reviewed Apr 14, 2026

chtruong814 added waiting-for-customer and removed waiting-for-customer labels Apr 14, 2026

chtruong814 added the needs-follow-up Issue needs follow-up label Apr 17, 2026

svcnvidia-nemo-ci added waiting-on-maintainers Waiting on maintainers to respond and removed needs-follow-up Issue needs follow-up labels Apr 21, 2026

sarahyurick reviewed May 8, 2026

svcnvidia-nemo-ci added waiting-on-customer Waiting on the original author to respond and removed waiting-on-maintainers Waiting on maintainers to respond labels May 8, 2026

zeel2104 force-pushed the zeel/model-based-html-extraction branch from 764ca28 to f7b3cfb May 11, 2026 14:19

svcnvidia-nemo-ci removed the waiting-on-customer Waiting on the original author to respond label May 11, 2026

sarahyurick reviewed May 11, 2026

svcnvidia-nemo-ci added the waiting-on-customer Waiting on the original author to respond label May 12, 2026

Remove inline model execution from HTML extractor

b627347

Signed-off-by: Zeel <desaizeel2128@gmail.com>

zeel2104 force-pushed the zeel/model-based-html-extraction branch from f7b3cfb to b627347 May 21, 2026 01:38

greptile-apps Bot reviewed May 21, 2026

svcnvidia-nemo-ci added waiting-on-maintainers Waiting on maintainers to respond and removed waiting-on-customer Waiting on the original author to respond labels May 21, 2026

sarahyurick self-requested a review June 11, 2026 20:57

svcnvidia-nemo-ci added waiting-on-customer Waiting on the original author to respond and removed waiting-on-maintainers Waiting on maintainers to respond labels Jun 16, 2026
