A relatively simple question that I couldn't quite answer by looking through the tech report...
During pretraining (report section 3.1) or instruction tuning (report section 3.2), whenever samples are "packed together" into one sequence, does your pipeline allow attention to cross document boundaries, or is the attention mask restricted so that each token only attends to tokens from its own document?
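
To make the question concrete, here is a minimal sketch of what I mean by a mask that does not cross document boundaries. It is only an illustration under my own assumptions, not something taken from the report: `packed_causal_mask` and `doc_lengths` are hypothetical names, and I'm assuming plain causal attention.

```python
import numpy as np

def packed_causal_mask(doc_lengths):
    """Return a (T, T) boolean mask; True where query i may attend to key j.

    Hypothetical helper for illustration only: attention is causal and
    confined to the document each token came from.
    """
    # Document index for every token position in the packed sequence.
    doc_ids = np.repeat(np.arange(len(doc_lengths)), doc_lengths)
    total = doc_ids.shape[0]
    # True where query and key belong to the same document.
    same_doc = doc_ids[:, None] == doc_ids[None, :]
    # Standard lower-triangular causal mask over the whole packed sequence.
    causal = np.tril(np.ones((total, total), dtype=bool))
    return same_doc & causal

# Two documents of length 2 and 3 packed into one sequence of 5 tokens.
mask = packed_causal_mask([2, 3])
```

With this kind of mask the first token of the second document cannot see any token of the first one. The alternative I'm asking about is using only the `causal` part, so that later documents attend to earlier ones in the same packed sequence.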