1 change: 1 addition & 0 deletions examples/aquila/conf/train/7b.yaml
@@ -57,6 +57,7 @@ data:
  data_path: ${data_path:??}
  split: 1
  tokenizer:
+   legacy_tokenizer: true
Contributor

Severity: medium

While adding legacy_tokenizer: true correctly resolves the immediate issue, it introduces technical debt by relying on a legacy implementation. For long-term maintainability, it would be beneficial to create a follow-up task to migrate all affected example configurations to the new tokenizer implementation. This would ensure the examples stay aligned with the latest framework features.

    tokenizer_type: AquilaTokenizerFS
    vocab_file: ./examples/aquila/tokenizer/vocab.json
    merge_file: ./examples/aquila/tokenizer/merges.txt
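For context on the review comment above: a flag like `legacy_tokenizer` normally just selects which tokenizer code path the framework builds. The sketch below is illustrative only; `TokenizerConfig`, `build_tokenizer`, `LegacyTokenizer`, and `ModernTokenizer` are hypothetical stand-ins, not FlagScale's actual classes or API.

```python
# Illustrative sketch only: how a `legacy_tokenizer` config flag could gate which
# tokenizer implementation is built. None of these names are FlagScale's real API.
from dataclasses import dataclass
from typing import Optional


class LegacyTokenizer:
    """Stand-in for the older implementation kept for backward compatibility."""

    def __init__(self, tokenizer_type: str, **files: Optional[str]) -> None:
        self.tokenizer_type = tokenizer_type
        self.files = files


class ModernTokenizer:
    """Stand-in for the newer implementation the reviewer suggests migrating to."""

    def __init__(self, tokenizer_type: str, **files: Optional[str]) -> None:
        self.tokenizer_type = tokenizer_type
        self.files = files


@dataclass
class TokenizerConfig:
    tokenizer_type: str
    legacy_tokenizer: bool = False  # new implementation unless explicitly opted out
    vocab_file: Optional[str] = None
    merge_file: Optional[str] = None


def build_tokenizer(cfg: TokenizerConfig):
    """Select the code path that the YAML flag toggles."""
    cls = LegacyTokenizer if cfg.legacy_tokenizer else ModernTokenizer
    return cls(cfg.tokenizer_type, vocab_file=cfg.vocab_file, merge_file=cfg.merge_file)


# Mirrors the aquila hunk above: the legacy path is selected explicitly.
tok = build_tokenizer(
    TokenizerConfig(
        tokenizer_type="AquilaTokenizerFS",
        legacy_tokenizer=True,
        vocab_file="./examples/aquila/tokenizer/vocab.json",
        merge_file="./examples/aquila/tokenizer/merges.txt",
    )
)
print(type(tok).__name__)  # -> LegacyTokenizer
```

Read this way, the reviewer's follow-up task amounts to removing `legacy_tokenizer: true` from each example once the newer implementation supports the tokenizer types these configs rely on.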
1 change: 1 addition & 0 deletions examples/deepseek_v3/conf/train/16b_a3b.yaml
@@ -111,6 +111,7 @@ data:
  split: 1
  no_mmap_bin_files: true
  tokenizer:
+   legacy_tokenizer: true
    tokenizer_type: QwenTokenizerFS
    tokenizer_path: examples/aquila/qwentokenizer
    vocab_size: 151851
1 change: 1 addition & 0 deletions examples/llama2/conf/train/7b.yaml
@@ -62,6 +62,7 @@ data:
  data_path: ${data_path:??}
  split: 1
  tokenizer:
+   legacy_tokenizer: true
    tokenizer_type: Llama2Tokenizer
    tokenizer_model: examples/llama/tokenizer.model
    vocab_size: 32000
1 change: 1 addition & 0 deletions examples/llama3/conf/train/70b.yaml
@@ -67,6 +67,7 @@ data:
  data_path: ${data_path:??}
  split: 1
  tokenizer:
+   legacy_tokenizer: true
    tokenizer_type: Llama3TokenizerFS
    tokenizer_path: ${tokenizer_path:??}
    vocab_size: 128256
1 change: 1 addition & 0 deletions examples/llava1_5/conf/train/7b.yaml
@@ -78,6 +78,7 @@ data:
  dataloader_type: external
  split: 100,0,0
  tokenizer:
+   legacy_tokenizer: true
    tokenizer_type: Llama2Tokenizer
    tokenizer_model: ${tokenizer_model_path:??}
    vocab_size: 32000
1 change: 1 addition & 0 deletions examples/llava_onevision/conf/train/1_5b.yaml
@@ -88,6 +88,7 @@ data:
  dataloader_type: external
  split: 100,0,0
  tokenizer:
+   legacy_tokenizer: true
    tokenizer_type: Qwen2TokenizerFS
    tokenizer_path: xxxx
    # vocab_size: 152064 # 7b
1 change: 1 addition & 0 deletions examples/mixtral/conf/train/8x7b.yaml
@@ -68,6 +68,7 @@ data:
  data_path: <xxxx>
  split: 1
  tokenizer:
+   legacy_tokenizer: true
    tokenizer_type: QwenTokenizerFS
    tokenizer_path: <xxxx>
    make_vocab_size_divisible_by: 64
1 change: 1 addition & 0 deletions examples/qwen2_5/conf/train/1_5b.yaml
@@ -75,6 +75,7 @@ data:
  split: 1
  apply_sft_dataset_separated_loss_mask_if_existed: true
  tokenizer:
+   legacy_tokenizer: true
    tokenizer_type: HFTokenizerFS
    tokenizer_path: ${HF_model_path:??}
    vocab_size: 151665
1 change: 1 addition & 0 deletions examples/qwen2_5_vl/conf/train/7b.yaml
@@ -100,6 +100,7 @@ data:
  dataloader_type: external
  split: 100,0,0
  tokenizer:
+   legacy_tokenizer: true
    tokenizer_type: Qwen2VLTokenizer
    tokenizer_path: xxxx
    vocab_size: 152064 # 7b
1 change: 1 addition & 0 deletions examples/qwen3/conf/train/32b.yaml
@@ -82,6 +82,7 @@ data:
  split: 1
  no_mmap_bin_files: true
  tokenizer:
+   legacy_tokenizer: true
    tokenizer_type: QwenTokenizerFS
    tokenizer_path: examples/aquila/qwentokenizer
    vocab_size: 151851
1 change: 1 addition & 0 deletions examples/qwq/conf/train/32b.yaml
@@ -80,6 +80,7 @@ data:
  split: 1
  no_mmap_bin_files: true
  tokenizer:
+   legacy_tokenizer: true
    tokenizer_type: Qwen2TokenizerFS
    tokenizer_path: /tokenizer_path
    vocab_size: 151851
1 change: 1 addition & 0 deletions examples/rwkv/conf/train/3b.yaml
@@ -45,4 +45,5 @@ data:
  data_path: ${data_path:??}
  split: "1"
  tokenizer:
+   legacy_tokenizer: true
    tokenizer_path: ${tokenizer_path:??} # The vocab file can be found at https://github.com/RWKV-Vibe/RWKV-LM-V7/tree/main/data/tokenizer
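Since the same one-line addition is repeated across twelve example configs, it is easy to miss a file or to nest the key at the wrong level. Below is a minimal consistency check, assuming PyYAML is installed, the script runs from the repository root, and `data:` is a top-level key in each train config (as the hunk headers suggest); the file list simply mirrors the diff above.

```python
# Quick consistency check: every example config touched in this PR should expose
# data.tokenizer.legacy_tokenizer == True. Assumes PyYAML and a repo-root working dir.
import yaml

CONFIGS = [
    "examples/aquila/conf/train/7b.yaml",
    "examples/deepseek_v3/conf/train/16b_a3b.yaml",
    "examples/llama2/conf/train/7b.yaml",
    "examples/llama3/conf/train/70b.yaml",
    "examples/llava1_5/conf/train/7b.yaml",
    "examples/llava_onevision/conf/train/1_5b.yaml",
    "examples/mixtral/conf/train/8x7b.yaml",
    "examples/qwen2_5/conf/train/1_5b.yaml",
    "examples/qwen2_5_vl/conf/train/7b.yaml",
    "examples/qwen3/conf/train/32b.yaml",
    "examples/qwq/conf/train/32b.yaml",
    "examples/rwkv/conf/train/3b.yaml",
]

for path in CONFIGS:
    with open(path) as f:
        cfg = yaml.safe_load(f)
    # Walk data -> tokenizer -> legacy_tokenizer, tolerating missing sections.
    flag = (cfg or {}).get("data", {}).get("tokenizer", {}).get("legacy_tokenizer")
    status = "ok" if flag is True else f"missing or wrong value: {flag!r}"
    print(f"{path}: {status}")
```

The same list can double as a checklist for the follow-up migration proposed in the review comment.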