CS 613: NLP - Assignment 2

Tokenizer & Model Training

November 17, 2024

Report Link

Prompts and Generated Text Link

Problem Statement

Task 1: Tokenizer Training

1.1 Tokenizer Training

Dataset Samples Description:
The tokenizers were trained on the curated data scraped by all team members, collected in a dataset called all_final_dataset. Random samples of this dataset, drawn with a fixed seed, were used to build datasets of fixed sizes. In addition, a dataset containing only news articles, all_news_scrapped, was compiled, and Sangrah and Nepali Deva were used as standalone datasets because each was substantial in size on its own.

Dataset Sample 1: all_final_dataset (200 MB)
Dataset Sample 2: all_final_dataset (350 MB)
Dataset Sample 3: all_final_dataset (500 MB)
Dataset Sample 4: all_final_dataset (750 MB)
Dataset Sample 5: all_final_dataset (1 GB)
Dataset Sample 6: all_news_scrapped (350 MB)
Dataset Sample 7: Sangrah (750 MB)
Dataset Sample 8: Nepali_Deva (1.2 GB)
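
The fixed-size samples described above could be drawn along the following lines. This is a minimal sketch, not the script used for this assignment: the function name `sample_to_size`, the one-document-per-line layout, and the seed value are all assumptions made for illustration.

```python
import random


def sample_to_size(lines, target_bytes, seed=42):
    """Return a reproducible random subset of `lines` of at most `target_bytes` bytes.

    Hypothetical helper: assumes one document per line and measures size as
    UTF-8 bytes, counting one extra byte per line for the newline.
    """
    rng = random.Random(seed)  # fixed seed, so the same sample is drawn every run
    order = list(range(len(lines)))
    rng.shuffle(order)

    picked, total = [], 0
    for i in order:
        size = len(lines[i].encode("utf-8")) + 1
        if total + size > target_bytes:
            break  # stop at the first line that would exceed the budget
        picked.append(i)
        total += size

    picked.sort()  # keep the sampled lines in their original order
    return [lines[i] for i in picked]
```

Calling it with the same seed and different byte budgets (for example 200 MB and 350 MB) gives nested samples, since both follow the same shuffled order and only the cut-off point differs.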

Tokenizer Details:

Tokenizer	Algorithm	Vocab Size	Dataset Information
Tokenizer 1	SentencePiece	49,152	`all_final_dataset` (200 MB)
Tokenizer 2	SentencePiece	49,152	`all_final_dataset` (350 MB)
Tokenizer 3	SentencePiece	49,152	`all_final_dataset` (500 MB)
Tokenizer 4	SentencePiece	49,152	`all_final_dataset` (750 MB)
Tokenizer 5	SentencePiece	49,152	`all_final_dataset` (1 GB)
Tokenizer 6	SentencePiece	49,152	`all_news_scrapped` (350 MB)
Tokenizer 7	SentencePiece	50,000	`Sangrah` (750 MB)
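
A tokenizer like those in the table could be trained with the `sentencepiece` Python package roughly as follows. This is a sketch under assumptions: the corpus path and model prefix are hypothetical, and `model_type="unigram"` is simply the library default, since the table does not say which SentencePiece algorithm was used.

```python
def build_trainer_args(corpus_path, model_prefix, vocab_size=49_152, model_type="unigram"):
    """Collect keyword arguments for sentencepiece.SentencePieceTrainer.train.

    The default vocab_size matches Tokenizers 1-6 in the table above;
    model_type is an assumption, not something the report states.
    """
    return {
        "input": corpus_path,          # plain-text corpus, one sentence per line
        "model_prefix": model_prefix,  # writes <prefix>.model and <prefix>.vocab
        "vocab_size": vocab_size,
        "model_type": model_type,
    }


def train_tokenizer(corpus_path, model_prefix, **overrides):
    """Train a SentencePiece model and return the path of the .model file."""
    import sentencepiece as spm  # imported lazily; only needed when training

    args = build_trainer_args(corpus_path, model_prefix, **overrides)
    spm.SentencePieceTrainer.train(**args)
    return f"{model_prefix}.model"
```

For example, `train_tokenizer("sangrah_750mb.txt", "tokenizer_7", vocab_size=50_000)` would correspond to the last row of the table, with the file name again being a placeholder.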

1.2 Fertility Score Calculation

Methodology:
Fertility scores were calculated for the trained tokenizers on both their respective training datasets and a held-out dataset (all_news_scrapped).
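
Fertility is commonly defined as the average number of tokens produced per word, so a score close to 1 means most words are kept whole. The sketch below assumes that definition with whitespace-delimited words; the exact script used for the scores in this report is not shown here, and `toy_tokenize` is a stand-in tokenizer invented purely for the demonstration.

```python
def fertility(texts, tokenize):
    """Average number of tokens per whitespace-delimited word over `texts`.

    `tokenize` is any callable mapping a string to a list of tokens.
    """
    total_tokens = 0
    total_words = 0
    for text in texts:
        total_words += len(text.split())
        total_tokens += len(tokenize(text))
    if total_words == 0:
        raise ValueError("corpus contains no words")
    return total_tokens / total_words


def toy_tokenize(text, piece_len=4):
    """Stand-in tokenizer: cuts each word into pieces of at most `piece_len` characters."""
    pieces = []
    for word in text.split():
        pieces.extend(word[i:i + piece_len] for i in range(0, len(word), piece_len))
    return pieces


# "abcd" stays whole and "efghij" splits in two: 3 tokens over 2 words.
score = fertility(["abcd efghij"], toy_tokenize)
```

With a trained SentencePiece model loaded into a `SentencePieceProcessor` named `sp`, the same function could be called as `fertility(texts, lambda t: sp.encode(t, out_type=str))`.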

Results when each tokenizer was tested on the dataset it was trained on:

Tokenizer	Fertility Score	Dataset Info (Train)	Dataset Info (Test)
Tokenizer 1	1.2911565101	`all_final_dataset` (200 MB)	`all_final_dataset` (200 MB)
Tokenizer 2	1.2952214366	`all_final_dataset` (350 MB)	`all_final_dataset` (350 MB)
Tokenizer 3	1.2952510265	`all_final_dataset` (500 MB)	`all_final_dataset` (500 MB)
Tokenizer 4	1.2964268691	`all_final_dataset` (750 MB)	`all_final_dataset` (750 MB)
Tokenizer 5	1.2960094995	`all_final_dataset` (1 GB)	`all_final_dataset` (1 GB)
Tokenizer 7	1.2469413133	`Sangrah` (200 MB)	`Sangrah` (200 MB)

Results when tested on a fixed held-out dataset (all_news_scrapped):

Tokenizer	Fertility Score	Dataset Info (Train)	Dataset Info (Test)
Tokenizer 1	1.2529094399	`all_final_dataset` (200 MB)	`all_news_scrapped` (350 MB)
Tokenizer 2	1.2515642463	`all_final_dataset` (350 MB)	`all_news_scrapped` (350 MB)
Tokenizer 3	1.2509406335	`all_final_dataset` (500 MB)	`all_news_scrapped` (350 MB)
Tokenizer 4	1.2504782866	`all_final_dataset` (750 MB)	`all_news_scrapped` (350 MB)
Tokenizer 5	1.2515644638	`all_final_dataset` (1 GB)	`all_news_scrapped` (350 MB)

Task 2: Model Training

2.1 Model Selection

Model Architecture Chosen:
The architecture chosen is LlamaForCausalLM, a transformer architecture built for causal language modeling. It uses multi-head self-attention with rotary position embeddings, RMS normalization, and a gated feed-forward network.

Adjustments Made:
To reduce the total number of parameters to below 100M, the following modifications were made:

- Reduced hidden_size to 512 (larger models use higher defaults).
- Set the number of attention heads to 8, with a head dimension of 64.
- Kept 16 layers in the model (num_hidden_layers).
- Set intermediate_size to 1024 to balance the computational load in the feed-forward layers.
- Used a vocabulary size of 50,000, chosen to suit the dataset size.
- Excluded additional biases in the feed-forward layers (mlp_bias: false) to reduce parameters.

Total Parameters:
The final model has approximately 86M parameters, which is within the 100M constraint.
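
A parameter total for a LLaMA-style model can be estimated directly from the configuration values listed above. The sketch below is a rough cross-check only: the function name is hypothetical, and the result depends on settings this section does not state (whether input and output embeddings are tied, and how many key/value heads are used), so it need not reproduce the reported figure.

```python
def llama_param_count(vocab_size, hidden_size, num_layers, intermediate_size,
                      num_heads, num_kv_heads=None, tie_embeddings=False):
    """Estimate trainable parameters of a bias-free LLaMA-style decoder."""
    if num_kv_heads is None:
        num_kv_heads = num_heads  # assumption: plain multi-head attention
    head_dim = hidden_size // num_heads

    embed = vocab_size * hidden_size
    # q and o projections are hidden x hidden; k and v scale with the KV head count
    attn = 2 * hidden_size * hidden_size + 2 * hidden_size * num_kv_heads * head_dim
    mlp = 3 * hidden_size * intermediate_size  # gate, up and down projections
    norms = 2 * hidden_size                    # two RMSNorm weight vectors per layer
    per_layer = attn + mlp + norms

    final_norm = hidden_size
    lm_head = 0 if tie_embeddings else vocab_size * hidden_size
    return embed + num_layers * per_layer + final_norm + lm_head


untied = llama_param_count(50_000, 512, 16, 1024, num_heads=8)
tied = llama_param_count(50_000, 512, 16, 1024, num_heads=8, tie_embeddings=True)
```

Either way the estimate stays under the 100M limit. The embedding table alone accounts for 25.6M parameters at this vocabulary size, which is why the tying choice shifts the total so much.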

2.2 Tokenization of Dataset

Best Tokenizer Chosen: Tokenizer 7 (trained on Sangrah (750 MB)) was chosen for model training.

Reason for Choice:
It had a fertility score of 1.2469413133, the lowest among all the tokenizers trained on the different datasets, so we chose it as the best tokenizer for model training.

2.3 Model Training

Training Details:

Number of Epochs: 10
Batch Size: 8

Perplexity Results per Epoch:

The following table shows the training loss, validation loss, and perplexity at various steps during model training. Every 500 steps corresponds to 0.1 epoch.

Steps	Training Loss	Validation Loss	Perplexity
500	7.918729	7.703134	2102.312241
1000	7.529400	7.505957	1818.844430
1500	7.207300	7.229811	1437.476760
2000	6.884200	6.893766	986.108044
2500	6.688900	6.741419	838.426617
3000	6.493600	6.508142	670.579000
3500	6.371900	6.391015	591.100484
4000	6.250200	6.223870	504.652362
4500	6.111200	6.113407	451.556748
5000	5.971700	6.002983	404.633863
5500	5.877500	5.910859	366.748795
6000	5.783300	5.816735	335.873564
6500	5.687900	5.723983	310.632056
7000	5.630000	5.672997	290.905210
7500	5.577500	5.599084	274.989125
8000	5.524400	5.547181	256.513335
8500	5.469400	5.487976	241.756643
9000	5.444300	5.444356	231.448171
9500	5.385400	5.398842	222.632577
10000	5.327800	5.347515	210.085508
10500	5.276100	5.306705	202.228561
11000	5.227900	5.265094	193.464456
11500	5.157300	5.214261	188.233344
12000	5.184900	5.194471	180.272732
12500	5.113100	5.126363	173.604526
13000	5.041700	5.132655	169.466479
13500	5.028200	5.092561	163.415618
14000	5.021400	5.080809	160.904193
14500	5.015000	5.068379	157.665767
15000	5.016100	5.031531	153.167311
15500	4.993000	5.009063	149.152026
16000	4.924900	4.986255	146.387131
16500	4.890000	4.967847	142.557062
17000	4.855600	4.947319	140.796985
17500	4.850000	4.915397	137.633544
18000	4.834600	4.908449	135.429160
18500	4.832000	4.885120	133.049802
19000	4.840000	4.873579	130.788174
19500	4.774500	4.853282	128.125492
20000	4.688000	4.849658	127.696651
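
Perplexity is conventionally the exponential of the mean cross-entropy loss. This section does not say how the perplexity column was produced, so the sketch below only shows that conventional conversion, assuming the loss is measured in nats (natural logarithm).

```python
import math


def perplexity(mean_loss_nats):
    """Convert a mean cross-entropy loss (in nats) into perplexity."""
    return math.exp(mean_loss_nats)


# Validation loss logged at step 10000 in the table above.
ppl_step_10000 = perplexity(5.347515)
```

For the step-10000 row this gives a value of about 210, in line with the perplexity logged for that step.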

Testing Prompts & Outputs

As Markdown does not support Devanagari characters, all the prompts and generated texts are documented in the Google Doc file linked here: Link to Google Doc

2.5 Other Trained Models

The best-trained model was chosen for the final submission. Below are the details of the other models that were trained:

Nepali LLM Version 1:
- Trained on: all_final_dataset (500 MB)
- Model Architecture: LlamaForCausalLM
- Parameters: 86.000128 Million
- Tokenizer Used: Tokenizer 4 (trained on all_final_dataset (750 MB))
- Perplexity: 4000
- Epochs: 10
- Training Split: 80:20

Nepali LLM Version 2:
- Trained on: all_final_dataset (200 MB)
- Model Architecture: LlamaForCausalLM
- Parameters: 86.000128 Million
- Tokenizer Used: Tokenizer 4 (trained on all_final_dataset (750 MB))
- Perplexity: 171
- Epochs: 10
- Training Split: 90:10

References

Thakur, Aamod. NLP Pre-Training. GitHub

Work Distribution

Aditya Mehta (Roll Number: 22110017)
- Tokenized the all_news_scrapped dataset and identified duplicates with other datasets.
- Worked on the tokenization process of the Nepali_Deva text corpus.
- Prepared the entire report.
Daksh Jain (Roll Number: 22110066)
- Tokenized a Nepali Wikipedia-scraped dataset and evaluated fertility scores to determine the best tokenization strategy.
- Tested scalability by tokenizing datasets of various sizes.
- Ran prompt tests on the final model to validate its performance.
Hrriday Ruparel (Roll Number: 22110099)
- Wrote scripts for dataset merging, tokenization, and model training.
- Trained tokenizers on the scraped dataset.
- Trained an 80M model on the scraped dataset.
Kishan Ved (Roll Number: 22110122)
- Trained five different tokenizers and identified the one with the best fertility score.
- Trained an 86M parameter model on the scraped dataset.
- Trained a tokenizer on the Sangrah dataset and tokenized it.
- Trained an 86M parameter model on the Sangrah dataset.
Summet Sawale (Roll Number: 22110234)
- Assisted in writing code for tokenizing text into tensors of context length.
- Tested models using prompts.
- Documented testing results.
- Downloaded the scraped data from the server.
