Add model-based HTML extraction stage#1768
Conversation
Greptile SummaryThis PR adds a
Confidence Score: 3/5The new model-based stages are architecturally sound but the end-to-end feature is non-functional — the pipeline entry point raises an error and the candidate extractor has a type contract violation that would crash the iterator stage when wired up. Two defects affect the core advertised feature: CommonCrawlDownloadExtractStage(html_extraction='model') fails at construction with a self-referential error (stage.py was not updated), and CommonCrawlModelBasedCandidateExtractor.extract returns a list where the base class and DocumentIterateExtractStage expect a dict. Both must be resolved before the feature is usable. nemo_curator/stages/text/download/common_crawl/extract.py and nemo_curator/stages/text/download/common_crawl/stage.py need attention — the wiring between the new multi-stage pipeline and the existing CommonCrawlDownloadExtractStage is absent. Important Files Changed
Sequence DiagramsequenceDiagram
participant U as User
participant CCDES as CommonCrawlDownloadExtractStage
participant CCMCE as CommonCrawlModelBasedCandidateExtractor
participant TS as TokenizerStage
participant MBHIS as ModelBasedHTMLInferenceStage
participant AMBHES as AssembleModelBasedHTMLExtractionStage
note over CCDES,AMBHES: Intended model-based pipeline (NOT yet wired in stage.py)
U->>CCDES: "html_extraction=model"
CCDES-->>U: ValueError (self-referential error)
note over CCDES,AMBHES: Correct flow once stage.py is updated
CCDES->>CCMCE: extract HTML candidates per WARC record
CCMCE->>CCMCE: extract_candidate_elements(BeautifulSoup)
CCMCE-->>CCDES: list of candidate rows (1 row per HTML element)
CCDES->>TS: tokenize MODEL_INPUT_FIELD column
TS-->>CCDES: input_ids + attention_mask columns
CCDES->>MBHIS: classify elements (batched GPU inference)
MBHIS-->>CCDES: candidate_label + candidate_confidence columns
CCDES->>AMBHES: assemble per-document text
AMBHES->>AMBHES: group by (url, warc_id, source_id, language)
AMBHES->>AMBHES: render accepted elements to markdown/plain
alt "mean_confidence >= fallback_threshold"
AMBHES-->>CCDES: text from model predictions
else low confidence
AMBHES->>AMBHES: TrafilaturaExtractor.extract_text (fallback)
AMBHES-->>CCDES: text from fallback
end
Reviews (9): Last reviewed commit: "Remove inline model execution from HTML ..." | Re-trigger Greptile |
| classifier: HTMLElementClassifier | None = None, | ||
| fallback_extractor: HTMLExtractorAlgorithm | None = None, | ||
| transformers_init_kwargs: dict[str, Any] | None = None, | ||
| ): |
There was a problem hiding this comment.
local_files_only=True blocks model download out of the box
The default local_files_only=True means AutoTokenizer.from_pretrained and AutoModelForSequenceClassification.from_pretrained will refuse to hit the Hugging Face Hub and will raise an OSError (e.g., "Can't load tokenizer for 'opendatalab/MinerU-HTML-0.6B'") unless the model is already in the local cache. Since the default model_identifier is a public Hub model and the README usage example omits this parameter, any first-time user who follows the example will get an immediate failure with a confusing error message. The default should be False so the model is downloaded automatically.
| ): | |
| local_files_only: bool = False, |
There was a problem hiding this comment.
+1 local_files_only=False should be used in setup_on_node and local_files_only=True should be used in setup.
| for line in block.splitlines(): | ||
| cells = [cell.strip() for cell in line.strip("|").split("|")] | ||
| if cells and not all(cell == "---" for cell in cells): | ||
| rows.append("\t".join(cells)) |
There was a problem hiding this comment.
Aligned separator rows not filtered in plain-text table conversion
The separator filter checks all(cell == "---" for cell in cells), but GFM alignment indicators (":---", "---:", ":---:") are also valid separator row values. A table rendered with column alignment will have separator rows like | :--- | ---: |, which won't match "---" exactly and will be emitted as a data row in plain-text output. Consider using cell.strip(":") == "---" or a regex like r":?-{3,}:?" for the check.
| fence = "```" | ||
| if fence in code: | ||
| fence = "````" | ||
| return f"{fence}{language}\n{code.strip()}\n{fence}" |
There was a problem hiding this comment.
Fence escalation only covers one level
If the code content contains four consecutive backticks, neither the default ``` fence nor the escalated ```` fence will correctly delimit the block, producing broken Markdown. A more robust approach finds the longest run of backticks and uses one more:
import re
max_run = max((len(m.group()) for m in re.finditer(r"`+", code)), default=0)
fence = "`" * max(3, max_run + 1)Signed-off-by: Zeel <desaizeel2128@gmail.com>
30944cf to
6b187e7
Compare
|
Fixed: (after comments by greptile-bot)
|
|
Tip: Greploops — Automatically fix all review issues by running Use the Greptile plugin for Claude Code to query reviews, search comments, and manage custom context directly from your terminal. |
sarahyurick
left a comment
There was a problem hiding this comment.
Thank you @zeel2104 ! I left a few high-level comments about common Curator patterns and suggestions for changes. Please let me know if you have any questions.
| self, | ||
| model_identifier: str, | ||
| cache_dir: str | None, | ||
| device: Literal["cuda", "cpu"], |
There was a problem hiding this comment.
Maybe it should be GPU only.
| self._model: Any | None = None | ||
| self._tokenizer: Any | None = None | ||
|
|
||
| def _setup(self) -> None: |
There was a problem hiding this comment.
There should be a setup_on_node function which uses huggingface_hub's snapshot_download function to make sure we only download the model once.
Then, when the model is loaded with setup, it can do local_files_only=True.
| import torch | ||
| from transformers import AutoModelForSequenceClassification, AutoTokenizer |
There was a problem hiding this comment.
These can be top-level imports.
| if self.device == "cuda" and not torch.cuda.is_available(): | ||
| logger.warning("CUDA requested for model-based HTML extraction, but CUDA is unavailable. Using CPU.") | ||
| self.device = "cpu" |
There was a problem hiding this comment.
It should probably error in this case instead of warning.
| self._model.eval() | ||
|
|
||
| def predict(self, elements: list[HTMLElement]) -> list[HTMLElementPrediction]: | ||
| self._setup() |
There was a problem hiding this comment.
The setup function should never be called directly by predict or any other function within the stage. When running it as part as a Pipeline, the executor will call it.
| def predict(self, elements: list[HTMLElement]) -> list[HTMLElementPrediction]: | ||
| self._setup() | ||
|
|
||
| import torch |
There was a problem hiding this comment.
This can be a top-level import.
| for start in range(0, len(elements), self.batch_size): | ||
| batch = elements[start : start + self.batch_size] | ||
| model_inputs = [self._format_element(element) for element in batch] | ||
| encoded = tokenizer( | ||
| model_inputs, | ||
| padding=True, | ||
| truncation=True, | ||
| max_length=self.max_length, | ||
| return_tensors="pt", | ||
| ).to(self.device) | ||
| with torch.inference_mode(): | ||
| logits = model(**encoded).logits | ||
| probabilities = torch.softmax(logits, dim=-1) | ||
| confidences, label_ids = torch.max(probabilities, dim=-1) |
There was a problem hiding this comment.
This can be made into a CompositeStage with a TokenizerStage (should be able to just import from https://github.com/NVIDIA-NeMo/Curator/blob/main/nemo_curator/stages/text/models/tokenizer.py) and a GPU-based model stage.
There was a problem hiding this comment.
Bumping this. Each of tokenization and model inference should be its own stage. The reason we do this is so that we maximize GPU usage at any given moment. If CPU tokenization and GPU inference are in the same stage, then the GPU sits idle while waiting for tokenization happens. Let me know if I can help more with the refactor here.
| model_identifier: str = "opendatalab/MinerU-HTML-0.6B", | ||
| output_format: ModelBasedOutputFormat = "markdown", | ||
| fallback_threshold: float = 0.65, | ||
| device: Literal["cuda", "cpu"] = "cuda", |
There was a problem hiding this comment.
Should use Resources instead.
|
@sarahyurick
For the larger comments around |
| def _setup(self) -> None: | ||
| if self._model is not None and self._tokenizer is not None: | ||
| return | ||
|
|
||
| self._tokenizer = AutoTokenizer.from_pretrained( | ||
| self.model_identifier, | ||
| cache_dir=self.cache_dir, | ||
| local_files_only=self.local_files_only, | ||
| **self.transformers_init_kwargs, | ||
| ) | ||
| self._model = AutoModelForSequenceClassification.from_pretrained( | ||
| self.model_identifier, | ||
| cache_dir=self.cache_dir, | ||
| local_files_only=self.local_files_only, | ||
| **self.transformers_init_kwargs, | ||
| ) | ||
| if self.device == "cuda" and not torch.cuda.is_available(): | ||
| msg = "CUDA requested for model-based HTML extraction, but CUDA is unavailable." | ||
| raise RuntimeError(msg) | ||
| self._model.to(self.device) | ||
| self._model.eval() |
There was a problem hiding this comment.
CUDA check after model load leaves object in broken state
The CUDA availability check at line 128 runs after both self._tokenizer and self._model have already been assigned. When CUDA is unavailable the RuntimeError is raised, but at that point self._model is non-None (a CPU-loaded model). On the very next predict() call, _setup() sees both attributes non-None and returns early — the model is never moved to the intended device — then encoded.to(self.device) fails with a generic "No CUDA GPUs available" error instead of the helpful message, permanently.
Move the check to the top of _setup(), before any loading:
def _setup(self) -> None:
if self._model is not None and self._tokenizer is not None:
return
if self.device == "cuda" and not torch.cuda.is_available():
msg = "CUDA requested for model-based HTML extraction, but CUDA is unavailable."
raise RuntimeError(msg)
self._tokenizer = AutoTokenizer.from_pretrained(
self.model_identifier,
cache_dir=self.cache_dir,
local_files_only=self.local_files_only,
**self.transformers_init_kwargs,
)
self._model = AutoModelForSequenceClassification.from_pretrained(
self.model_identifier,
cache_dir=self.cache_dir,
local_files_only=self.local_files_only,
**self.transformers_init_kwargs,
)
self._model.to(self.device)
self._model.eval()There was a problem hiding this comment.
Fixed. I moved the CUDA availability check to the start of _setup() so we fail before assigning tokenizer/model state.
|
Thanks @zeel2104 ! Yes, I would prefer the larger changes to be in this PR too. |
|
Thanks, @sarahyurick . I updated the PR to align this path more closely with Curator’s stage lifecycle pattern. Changes in the latest push:
I also added coverage for the lifecycle/resource delegation path and reran:
Let me know if any changes required |
sarahyurick
left a comment
There was a problem hiding this comment.
Hi @zeel2104 thank you for the quick updates! I have left more comments. I think there is still some refactoring needed, let me know what you think.
| confidence: float | ||
|
|
||
|
|
||
| class HTMLElementClassifier(Protocol): |
There was a problem hiding this comment.
I was wondering why is this a Protocol?
| model_identifier: str, | ||
| cache_dir: str | None, | ||
| device: Literal["cuda", "cpu"], | ||
| batch_size: int, |
There was a problem hiding this comment.
This can be renamed to what we use for other model-based stages:
| batch_size: int, | |
| model_inference_batch_size: int, |
| if self._model is None or self._tokenizer is None: | ||
| msg = "Model-based HTML classifier was not initialized. Call setup() before inference." | ||
| raise RuntimeError(msg) |
There was a problem hiding this comment.
We probably should not have this. The Pipeline should always called setup.
| for start in range(0, len(elements), self.batch_size): | ||
| batch = elements[start : start + self.batch_size] | ||
| model_inputs = [self._format_element(element) for element in batch] | ||
| encoded = tokenizer( | ||
| model_inputs, | ||
| padding=True, | ||
| truncation=True, | ||
| max_length=self.max_length, | ||
| return_tensors="pt", | ||
| ).to(self.device) | ||
| with torch.inference_mode(): | ||
| logits = model(**encoded).logits | ||
| probabilities = torch.softmax(logits, dim=-1) | ||
| confidences, label_ids = torch.max(probabilities, dim=-1) |
There was a problem hiding this comment.
Bumping this. Each of tokenization and model inference should be its own stage. The reason we do this is so that we maximize GPU usage at any given moment. If CPU tokenization and GPU inference are in the same stage, then the GPU sits idle while waiting for tokenization happens. Let me know if I can help more with the refactor here.
|
@sarahyurick My current change threads lifecycle/resource hooks through the existing I think the right next step is to refactor this so that:
I’ll work on that refactor |
|
@sarahyurick I pushed a larger refactor for the model-based Common Crawl path. The main change is that the pipeline path for
I also renamed the model batch parameter to Let me know if this works for you |
sarahyurick
left a comment
There was a problem hiding this comment.
Hi @zeel2104 sorry for the delay in getting back to you here. I left some comments, my main confusion and concerns are around some of the classes in model_based.py. Let me know what you think.
| fallback_extractor=algorithm.fallback_extractor, | ||
| filename_column=filename_column, | ||
| ) | ||
| return [base_stage.decompose()[0], base_stage.decompose()[1], iterate_stage, tokenizer_stage, inference_stage, assemble_stage] |
There was a problem hiding this comment.
This logic looks okay, but I was a bit confused at first by base_stage.decompose()[0], base_stage.decompose()[1], iterate_stage. It seems like it could be base_stage.decompose()[0], base_stage.decompose()[1], base_stage.decompose()[2] with some type of check to make sure that base_stage is only 3 stages? Or alternatively, it could be url_generation_stage, download_stage, iterate_stage.
Some checks that the stages are indeed BaseCommonCrawlUrlGenerator, CommonCrawlWARCDownloader, and CommonCrawlWarcIterator would probably be good too.
There was a problem hiding this comment.
Maybe the file should just be called model.py, WDYT?
| self._model: Any | None = None | ||
| self._tokenizer: Any | None = None | ||
|
|
||
| def _setup(self, local_files_only: bool = True) -> None: |
There was a problem hiding this comment.
Can _setup_on_node be used to download the tokenizer and model files (e.g., with hf_hub_download), then _setup will just load them?
There was a problem hiding this comment.
Hmm, reading through the rest of the script, will ask more questions below.
| if self._model is None or self._tokenizer is None: | ||
| self._setup(local_files_only=self.local_files_only) |
There was a problem hiding this comment.
This can be removed, we want to make sure setup is always being called by the pipeline.
| model_identifier: str = "opendatalab/MinerU-HTML-0.6B", | ||
| output_format: ModelBasedOutputFormat = "markdown", | ||
| fallback_threshold: float = 0.65, | ||
| device: Literal["cuda"] = "cuda", |
There was a problem hiding this comment.
This should either support Literal["cpu, cuda"] or be removed and always use cuda.
| for start in range(0, len(elements), self.model_inference_batch_size): | ||
| batch = elements[start : start + self.model_inference_batch_size] | ||
| model_inputs = [format_html_element_for_model(element) for element in batch] | ||
| encoded = tokenizer( |
There was a problem hiding this comment.
Why does it still tokenize here if a TokenizerStage is already being used in stage.py?
|
|
||
| return rendered_blocks | ||
|
|
||
| def _get_classifier(self) -> HTMLElementClassifier: |
There was a problem hiding this comment.
I don't think this function should be needed right? Like setup can just set self.classifier and then that's all.
| confidence: float | ||
|
|
||
|
|
||
| class HTMLElementClassifier(ABC): |
There was a problem hiding this comment.
Do we need to keep this class if _TransformersHTMLElementClassifier is the only one?
| rendered_blocks = [ModelBasedHTMLExtractionStage._render_element(element, label) for element, label in accepted] | ||
| rendered_blocks = [block for block in rendered_blocks if block] | ||
| if not rendered_blocks: | ||
| return None | ||
|
|
||
| if self.output_format in {"plain", "plain_text"}: | ||
| rendered_blocks = [ModelBasedHTMLExtractionStage._markdown_block_to_plain_text(block) for block in rendered_blocks] |
There was a problem hiding this comment.
Could it be problematic to have ModelBasedHTMLExtractionStage hardcoded here? Like couldn't it not necessarily match the algorithm being used by CommonCrawlModelBasedCandidateExtractor?
| return {RayStageSpecKeys.IS_ACTOR_STAGE: True} | ||
|
|
||
| @staticmethod | ||
| def _extract_candidate_elements(soup: BeautifulSoup) -> list[HTMLElement]: |
There was a problem hiding this comment.
Maybe I am misunderstanding, there seems to be some circular and/or dead logic here. CommonCrawlModelBasedCandidateExtractor.extract calls ModelBasedHTMLExtractionStage._extract_candidate_elements, and ModelBasedHTMLExtractionStage.extract_text calls ModelBasedHTMLExtractionStage._extract_candidate_elements and ModelBasedHTMLExtractionStage._select_and_render_blocks. I guess I am confused where ModelBasedHTMLExtractionStage.extract_text is being used? It looks like nowhere but I could be missing something.
|
@sarahyurick Main changes:
|
764ca28 to
f7b3cfb
Compare
| model = self._model | ||
| tokenizer = self._tokenizer | ||
| predictions: list[HTMLElementPrediction] = [] | ||
| for start in range(0, len(elements), self.model_inference_batch_size): | ||
| batch = elements[start : start + self.model_inference_batch_size] | ||
| model_inputs = [format_html_element_for_model(element) for element in batch] | ||
| encoded = tokenizer( | ||
| model_inputs, | ||
| padding=True, | ||
| truncation=True, | ||
| max_length=self.max_length, | ||
| return_tensors="pt", | ||
| ).to(self.device) | ||
| with torch.inference_mode(): | ||
| logits = model(**encoded).logits | ||
| probabilities = torch.softmax(logits, dim=-1) | ||
| confidences, label_ids = torch.max(probabilities, dim=-1) | ||
|
|
||
| id2label = getattr(model.config, "id2label", {}) | ||
| for label_id, confidence in zip(label_ids.cpu().tolist(), confidences.cpu().tolist(), strict=True): | ||
| predictions.append( | ||
| HTMLElementPrediction( | ||
| label=str(id2label.get(label_id, label_id)).lower(), | ||
| confidence=float(confidence), | ||
| ) | ||
| ) | ||
|
|
||
| return predictions |
There was a problem hiding this comment.
Same comment as before, is there a reason we aren't able to break this into a CPU-only tokenizer stage and a GPU-based model inference stage?
Signed-off-by: Zeel <desaizeel2128@gmail.com>
f7b3cfb to
b627347
Compare
|
@sarahyurick The remaining inline tokenization/inference path has been removed from
I also made |
|
|
||
| if html is not None: | ||
| # Language detection and HTML extraction | ||
| lang = lang_detect(html) | ||
|
|
||
| text = None |
There was a problem hiding this comment.
algorithm="model" entry point is unreachable from CommonCrawlDownloadExtractStage
The error message tells users to call CommonCrawlDownloadExtractStage(html_extraction='model'), but stage.py was not updated in this PR — it still passes html_extraction directly to CommonCrawlHTMLExtractor.__init__, where it hits this same ValueError. Any call to CommonCrawlDownloadExtractStage(html_extraction="model") will fail at construction with a self-referential error message. The stage.py file needs to detect html_extraction in {"model", "model_based"} and compose the new CommonCrawlModelBasedCandidateExtractor → TokenizerStage → ModelBasedHTMLInferenceStage → AssembleModelBasedHTMLExtractionStage pipeline instead of delegating to CommonCrawlHTMLExtractor.
There was a problem hiding this comment.
+1 I think the PR is not usable as is.
fyi the PR #2075 is going to take over this work I think, thank you!
| CANDIDATE_TEXT_FIELD, | ||
| CANDIDATE_HTML_FIELD, | ||
| CANDIDATE_ATTRIBUTES_FIELD, | ||
| MODEL_INPUT_FIELD, |
There was a problem hiding this comment.
Return type incompatible with
DocumentExtractor base class
CommonCrawlModelBasedCandidateExtractor.extract returns list[dict[str, Any]] | None, but DocumentExtractor.extract is declared dict[str, Any] | None. When this extractor is eventually plugged into DocumentIterateExtractStage, the stage calls extracted[self.filename_col] = record_dict[self.filename_col] — a string-keyed assignment on what is a list, raising a TypeError. The interface contract needs to be resolved: either DocumentIterateExtractStage must be taught to flatten list returns, or CommonCrawlModelBasedCandidateExtractor should not inherit DocumentExtractor and instead be used only in the custom multi-stage pipeline path.
Description
Adds a model-based HTML extraction algorithm for Common Crawl extraction.
This introduces
ModelBasedHTMLExtractionStage, which classifies candidate HTML elements, preserves structured content such as fenced code blocks, math formulas, and Markdown tables, and falls back to Trafilatura when model confidence is low. It is wired intoCommonCrawlHTMLExtractorviaalgorithm="model"andalgorithm="model_based".Closes #1723.
Usage