examples/aquila/conf/train/7b.yaml (1 addition, 0 deletions)

@@ -57,6 +57,7 @@ data:
   data_path: ${data_path:??}
   split: 1
   tokenizer:
+    legacy_tokenizer: true
Contributor review comment on the added line (severity: medium):

While adding legacy_tokenizer: true correctly resolves the immediate issue, it introduces technical debt by relying on a legacy implementation. For long-term maintainability, it would be beneficial to create a follow-up task to migrate all affected example configurations to the new tokenizer implementation. This would ensure the examples stay aligned with the latest framework features.

     tokenizer_type: AquilaTokenizerFS
     vocab_file: ./examples/aquila/tokenizer/vocab.json
     merge_file: ./examples/aquila/tokenizer/merges.txt
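The reviewer's note above concerns a flag that keeps these examples on an older tokenizer code path. As a rough, hypothetical sketch only (FlagScale's actual tokenizer factory is not part of this diff, and every function name below is invented for illustration), a boolean `legacy_tokenizer` flag under `data.tokenizer` would typically be read from the parsed config and used to dispatch between a legacy and a new implementation, with the new one as the default:

```python
"""Hypothetical sketch: how a boolean `legacy_tokenizer` flag might be consumed.

None of these names come from FlagScale; they are placeholders illustrating the
legacy-vs-new dispatch described in the review comment above.
"""
from typing import Any, Dict

import yaml


def build_legacy_tokenizer(tok_cfg: Dict[str, Any]) -> str:
    # Stand-in for the older tokenizer construction path.
    return f"legacy:{tok_cfg.get('tokenizer_type', 'unknown')}"


def build_new_tokenizer(tok_cfg: Dict[str, Any]) -> str:
    # Stand-in for the newer tokenizer construction path.
    return f"new:{tok_cfg.get('tokenizer_type', 'unknown')}"


def build_tokenizer(cfg: Dict[str, Any]) -> str:
    tok_cfg = cfg["data"]["tokenizer"]
    # Default to the new implementation; only configs that explicitly set
    # `legacy_tokenizer: true` (as these examples now do) keep the legacy behavior.
    if tok_cfg.get("legacy_tokenizer", False):
        return build_legacy_tokenizer(tok_cfg)
    return build_new_tokenizer(tok_cfg)


if __name__ == "__main__":
    # Inline YAML mirroring the shape of the edited example configs.
    example = """
data:
  tokenizer:
    legacy_tokenizer: true
    tokenizer_type: AquilaTokenizerFS
"""
    print(build_tokenizer(yaml.safe_load(example)))  # -> legacy:AquilaTokenizerFS
```

Under that reading, the follow-up migration the reviewer suggests would amount to removing the flag from these examples once the new implementation supports the tokenizer types they use.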
examples/deepseek_v3/conf/train/16b_a3b.yaml (1 addition, 0 deletions)

@@ -111,6 +111,7 @@ data:
   split: 1
   no_mmap_bin_files: true
   tokenizer:
+    legacy_tokenizer: true
     tokenizer_type: QwenTokenizerFS
     tokenizer_path: examples/aquila/qwentokenizer
     vocab_size: 151851
examples/llama2/conf/train/7b.yaml (1 addition, 0 deletions)

@@ -62,6 +62,7 @@ data:
   data_path: ${data_path:??}
   split: 1
   tokenizer:
+    legacy_tokenizer: true
     tokenizer_type: Llama2Tokenizer
     tokenizer_model: examples/llama/tokenizer.model
     vocab_size: 32000
examples/llama3/conf/train/70b.yaml (1 addition, 0 deletions)

@@ -67,6 +67,7 @@ data:
   data_path: ${data_path:??}
   split: 1
   tokenizer:
+    legacy_tokenizer: true
     tokenizer_type: Llama3TokenizerFS
     tokenizer_path: ${tokenizer_path:??}
     vocab_size: 128256
examples/llava1_5/conf/train/7b.yaml (1 addition, 0 deletions)

@@ -78,6 +78,7 @@ data:
   dataloader_type: external
   split: 100,0,0
   tokenizer:
+    legacy_tokenizer: true
     tokenizer_type: Llama2Tokenizer
     tokenizer_model: ${tokenizer_model_path:??}
     vocab_size: 32000
examples/llava_onevision/conf/train/1_5b.yaml (1 addition, 0 deletions)

@@ -88,6 +88,7 @@ data:
   dataloader_type: external
   split: 100,0,0
   tokenizer:
+    legacy_tokenizer: true
     tokenizer_type: Qwen2TokenizerFS
     tokenizer_path: xxxx
     # vocab_size: 152064 # 7b
examples/mixtral/conf/train/8x7b.yaml (1 addition, 0 deletions)

@@ -68,6 +68,7 @@ data:
   data_path: <xxxx>
   split: 1
   tokenizer:
+    legacy_tokenizer: true
     tokenizer_type: QwenTokenizerFS
     tokenizer_path: <xxxx>
     make_vocab_size_divisible_by: 64
examples/qwen2_5/conf/train/1_5b.yaml (1 addition, 0 deletions)

@@ -75,6 +75,7 @@ data:
   split: 1
   apply_sft_dataset_separated_loss_mask_if_existed: true
   tokenizer:
+    legacy_tokenizer: true
     tokenizer_type: HFTokenizerFS
     tokenizer_path: ${HF_model_path:??}
     vocab_size: 151665
examples/qwen2_5_vl/conf/train/7b.yaml (1 addition, 0 deletions)

@@ -100,6 +100,7 @@ data:
   dataloader_type: external
   split: 100,0,0
   tokenizer:
+    legacy_tokenizer: true
     tokenizer_type: Qwen2VLTokenizer
     tokenizer_path: xxxx
     vocab_size: 152064 # 7b
examples/qwen3/conf/train/32b.yaml (1 addition, 0 deletions)

@@ -82,6 +82,7 @@ data:
   split: 1
   no_mmap_bin_files: true
   tokenizer:
+    legacy_tokenizer: true
     tokenizer_type: QwenTokenizerFS
     tokenizer_path: examples/aquila/qwentokenizer
     vocab_size: 151851
examples/qwq/conf/train/32b.yaml (1 addition, 0 deletions)

@@ -80,6 +80,7 @@ data:
   split: 1
   no_mmap_bin_files: true
   tokenizer:
+    legacy_tokenizer: true
     tokenizer_type: Qwen2TokenizerFS
     tokenizer_path: /tokenizer_path
     vocab_size: 151851
examples/rwkv/conf/train/3b.yaml (1 addition, 0 deletions)

@@ -45,4 +45,5 @@ data:
   data_path: ${data_path:??}
   split: "1"
   tokenizer:
+    legacy_tokenizer: true
     tokenizer_path: ${tokenizer_path:??} # The vocab file can be found at https://github.com/RWKV-Vibe/RWKV-LM-V7/tree/main/data/tokenizer