Portuguese Wikipedia Dump Extraction

This project downloads the latest Portuguese Wikipedia XML dump and extracts filtered, cleaned article text into JSONL and Parquet for downstream NLP/LLM workflows.

Source dump:

  • https://dumps.wikimedia.org/ptwiki/latest/ptwiki-latest-pages-articles.xml.bz2
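A minimal sketch of the download step, using only the standard library. The function name `download_dump`, the `data/raw` target directory, the `.part` temporary suffix, and the choice to skip the download when the file already exists are assumptions made for illustration; the project's own downloader may behave differently.

```python
import shutil
import urllib.request
from pathlib import Path
from urllib.parse import urlparse

DUMP_URL = "https://dumps.wikimedia.org/ptwiki/latest/ptwiki-latest-pages-articles.xml.bz2"


def download_dump(url=DUMP_URL, raw_dir="data/raw"):
    """Stream the dump to raw_dir and return the local path (hypothetical helper)."""
    raw_dir = Path(raw_dir)
    raw_dir.mkdir(parents=True, exist_ok=True)

    # Name the local file after the last path component of the URL.
    target = raw_dir / Path(urlparse(url).path).name
    if target.exists():
        return target  # assumption: reuse an existing download

    # Write to a temporary name first so an interrupted download
    # never leaves a truncated file under the final name.
    partial = target.with_name(target.name + ".part")
    with urllib.request.urlopen(url) as response, open(partial, "wb") as out:
        shutil.copyfileobj(response, out, length=1024 * 1024)
    partial.replace(target)
    return target
```

Calling `download_dump()` with no arguments would fetch the full dump, which is several gigabytes, so a smaller URL is preferable when experimenting.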

Setup

python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

Run

Quick subset:

python -m wiki_pt_extract.cli --max-pages 2000 --skip-redirects

Full extraction:

python -m wiki_pt_extract.cli --skip-redirects

The program creates data/ and data/raw/ automatically.

Filtering Pipeline (Execution Order)

Filtering happens in build_filtered_rows() and is applied in this order:

  1. Read each <page> from the compressed XML dump (iter_pages_from_bz2), extracting:
    • title
    • page_id
    • ns (namespace)
    • latest revision text
  2. Stop early if --max-pages is reached.
  3. Namespace filter:
    • default: keep only main namespace (ns == 0)
    • use --include-non-main to keep other namespaces
  4. Text normalization:
    • normalize line breaks (\r\n / \r -> \n)
    • trim leading/trailing whitespace
  5. Wikitext cleaning (_clean_wikitext):
    • remove list-like lines (lines starting with *, #, ;, or :) unless --keep-lists is set
    • remove media/file wikilinks (File:, Ficheiro:, Imagem:, Image:)
    • remove all templates
    • remove noisy tags: ref, references, table, gallery, math, code, syntaxhighlight, timeline, pre, source
    • split into sections (lead included), strip markup to plain text, and apply regex cleanup for media/options leftovers, empty brackets, and excess whitespace
    • keep only sections with at least --min-section-chars characters (default: 1)
    • join kept sections into final text, and also store per-section output in section_texts
  6. Empty-content removal:
    • if cleaned text is empty, the page is dropped
  7. Redirect filtering:
    • applied only when --skip-redirects is set
    • drops pages whose cleaned text starts with #redirect or #redirecionamento (case-insensitive)
  8. Disambiguation filtering:
    • default: drop disambiguation pages
    • the current implementation checks for a title pattern such as (desambiguação) and also applies a {{desambigua...}} regex to the text value as it stands at this stage
    • use --include-disambiguation to keep them
  9. Remaining pages are written to JSONL and Parquet.
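The steps above can be sketched roughly as follows. This is not the project's code: `iter_pages` and `filter_rows` are hypothetical stand-ins for iter_pages_from_bz2 and build_filtered_rows(), the wikitext cleaning of step 5 is omitted, the disambiguation check covers only the title pattern, and the counter names are illustrative.

```python
import bz2
import xml.etree.ElementTree as ET
from collections import Counter

REDIRECT_PREFIXES = ("#redirect", "#redirecionamento")


def local_name(tag):
    """Drop the XML namespace prefix that MediaWiki exports put on every tag."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(path):
    """Yield (title, page_id, ns, text) per <page>, streaming from the .bz2 file."""
    with bz2.open(path, "rb") as fh:
        for _event, elem in ET.iterparse(fh, events=("end",)):
            if local_name(elem.tag) != "page":
                continue
            page = {"title": "", "id": None, "ns": "0", "text": ""}
            for node in elem.iter():
                name = local_name(node.tag)
                # The first <id> is the page ID; later ones belong to the
                # revision and the contributor.
                if name == "id" and page["id"] is not None:
                    continue
                if name in page:
                    page[name] = node.text or ""
            yield page["title"], int(page["id"]), int(page["ns"]), page["text"]
            elem.clear()  # release the page's children once it has been consumed


def filter_rows(pages, max_pages=None, skip_redirects=False,
                include_non_main=False, include_disambiguation=False):
    """Apply steps 2-4 and 6-8 in order; return (rows, counts)."""
    counts, rows = Counter(), []
    for title, page_id, ns, text in pages:
        if max_pages is not None and counts["pages_seen"] >= max_pages:
            break
        counts["pages_seen"] += 1
        if ns != 0 and not include_non_main:
            counts["non_main_skipped"] += 1
            continue
        text = text.replace("\r\n", "\n").replace("\r", "\n").strip()
        # Step 5 (wikitext cleaning) would transform `text` here.
        if not text:
            counts["empty_texts_skipped"] += 1
            continue
        if skip_redirects and text.lower().startswith(REDIRECT_PREFIXES):
            counts["redirects_skipped"] += 1
            continue
        if not include_disambiguation and "(desambiguação)" in title.lower():
            counts["disambiguation_skipped"] += 1
            continue
        counts["pages_written"] += 1
        rows.append({"text": text, "title": title, "page_id": page_id, "ns": ns})
    return rows, counts
```

Something like `rows, counts = filter_rows(iter_pages(dump_path), skip_redirects=True)` would then mirror the full-extraction command, with `counts` holding the tallies.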

At the end, the CLI prints counters for:

  • pages seen
  • pages written
  • redirects skipped
  • empty texts skipped
  • non-main namespace skipped
  • disambiguation skipped

Deduplication Behavior

There is currently no explicit post-cleaning deduplication step in the pipeline (for example, no dedup by page_id, title, or text hash).

In practice, uniqueness relies on the structure of the dump itself:

  • pages-articles already provides one current revision per page entry
  • if two different pages have identical cleaned content, both are kept
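If exact-duplicate removal is needed downstream, one option is a text-hash pass over the rows. The sketch below is not part of the pipeline; `dedup_by_text` is a hypothetical helper that keeps the first row for each distinct `text` value and drops later ones.

```python
import hashlib


def dedup_by_text(rows):
    """Yield rows whose cleaned text has not been seen before (first one wins)."""
    seen = set()
    for row in rows:
        # Store a fixed-size digest rather than the full text to bound memory.
        digest = hashlib.sha256(row["text"].encode("utf-8")).digest()
        if digest in seen:
            continue
        seen.add(digest)
        yield row
```

This only catches byte-identical texts; near-duplicates would need a different technique.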

Output Files

  • data/ptwiki_articles1.jsonl
  • data/ptwiki_articles1.parquet
  • shard files during parquet generation:
    • data/ptwiki_articles1_part_00001.parquet, data/ptwiki_articles1_part_00002.parquet, ...

Parquet writing is batched, then shards are merged into data/ptwiki_articles1.parquet.

Schema

Each row contains:

  • text: cleaned plain text
  • title: page title
  • page_id: page ID from XML
  • ns: namespace ID
  • section_texts: list of cleaned section texts (lead included)
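To make the relationship between text and section_texts concrete, here is a simplified sketch of how a row could be assembled. It uses plain regexes instead of the project's _clean_wikitext, and the blank-line joiner between sections is an assumption; `split_sections` and `build_row` are hypothetical names.

```python
import re

# A heading line such as "== História ==" or "=== Notas ===".
HEADING_RE = re.compile(r"^=+[^=\n]+=+[ \t]*$", re.MULTILINE)
# A list-like line: starts with *, #, ; or : (the trailing newline is consumed).
LIST_LINE_RE = re.compile(r"^[*#;:].*$\n?", re.MULTILINE)


def split_sections(wikitext, min_section_chars=1, keep_lists=False):
    """Return the lead and each section body that is long enough to keep."""
    if not keep_lists:
        wikitext = LIST_LINE_RE.sub("", wikitext)
    parts = (part.strip() for part in HEADING_RE.split(wikitext))
    return [part for part in parts if len(part) >= min_section_chars]


def build_row(title, page_id, ns, wikitext, **options):
    """Assemble one output row with the five schema fields."""
    section_texts = split_sections(wikitext, **options)
    return {
        "text": "\n\n".join(section_texts),  # assumed joiner
        "title": title,
        "page_id": page_id,
        "ns": ns,
        "section_texts": section_texts,
    }
```

With the default threshold of 1, only sections that end up empty after cleaning are dropped; raising `min_section_chars` filters out short stubs as well.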

Filters and Flags

| Behavior | Default | Flag to change |
| --- | --- | --- |
| Keep only main namespace (ns == 0) | Enabled | --include-non-main |
| Remove disambiguation pages | Enabled | --include-disambiguation |
| Remove redirect pages | Disabled | --skip-redirects |
| Remove list-like lines from text | Enabled | --keep-lists |
| Minimum section length | 1 char | --min-section-chars <int> |
| Max processed pages | Unlimited | --max-pages <int> |
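For orientation, the flags and defaults listed here could be declared with argparse roughly as follows. This is an illustrative sketch, not the actual wiki_pt_extract.cli parser; the help strings and the `build_parser` name are assumptions.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the documented flags and defaults."""
    parser = argparse.ArgumentParser(prog="wiki_pt_extract.cli")
    parser.add_argument("--include-non-main", action="store_true",
                        help="keep pages outside the main namespace (ns != 0)")
    parser.add_argument("--include-disambiguation", action="store_true",
                        help="keep disambiguation pages")
    parser.add_argument("--skip-redirects", action="store_true",
                        help="drop redirect pages")
    parser.add_argument("--keep-lists", action="store_true",
                        help="keep list-like lines in the text")
    parser.add_argument("--min-section-chars", type=int, default=1,
                        help="minimum characters for a section to be kept")
    parser.add_argument("--max-pages", type=int, default=None,
                        help="stop after this many pages (default: unlimited)")
    return parser


if __name__ == "__main__":
    print(build_parser().parse_args())
```

Every boolean flag defaults to off, so running with no arguments keeps only main-namespace pages, drops disambiguation pages and list-like lines, and leaves redirects in place.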

Publish to Hugging Face

Dataset name: wikipedia-pt-br-extract

  1. Generate outputs:
python -m wiki_pt_extract.cli --skip-redirects
  2. Upload:
pip install -r requirements.txt
python scripts/publish_hf_dataset.py --repo wikipedia-pt-br-extract

Upload target: https://huggingface.co/datasets/<your-username>/wikipedia-pt-br-extract

Token Counting (Qwen3 tokenizer)

JSONL:

python scripts/count_tokens.py --input data/ptwiki_articles1.jsonl --format jsonl

Parquet:

python scripts/count_tokens.py --input data/ptwiki_articles1.parquet --format parquet
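
The internals of scripts/count_tokens.py are not shown here. As a rough idea of what counting over the JSONL output involves, the sketch below sums token counts across the text field of each line. The function name and the `encode` parameter are assumptions; the tokenizer is passed in as a callable so the example runs without downloading a model.

```python
import json


def count_tokens_jsonl(path, encode, text_field="text"):
    """Sum len(encode(text)) over every JSON line in the file."""
    total = 0
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            if not line.strip():
                continue  # tolerate blank lines
            total += len(encode(json.loads(line)[text_field]))
    return total
```

Passing `str.split` gives a whitespace word count for a quick sanity check; the `encode` method of a loaded Qwen3 tokenizer could be passed instead to obtain real token counts.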