These tasks can be found in ./tasks.py and are invoked from the eval.py harness with the --tasks parameter.
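As a minimal sketch of a run, assuming task identifiers match how they are registered in ./tasks.py (the names below are hypothetical and should be replaced with the actual registered identifiers):

```bash
# Hypothetical invocation; check ./tasks.py for the exact task names.
python eval.py --tasks niah,musique,truthfulqa
```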
RULER defines a set of synthetic tasks designed to test a model’s long-context understanding.
Tasks include needle in a haystack (NIAH), variable tracking (VT), question answering (QA), and common word extraction (CWE).
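As a rough illustration of the NIAH family, a synthetic example can be built by burying a key-value "needle" at a random depth in filler text and asking the model to retrieve it. The sketch below is an assumption about the general shape of such tasks, not RULER's actual implementation, and all names in it are hypothetical:

```python
import random

# Illustrative NIAH-style prompt builder (not RULER's code; RULER's tasks
# are more varied, e.g. multiple needles and multi-key variants).
def build_niah_prompt(key: str, value: str, num_filler_sentences: int) -> str:
    filler = ["The grass is green and the sky is blue."] * num_filler_sentences
    needle = f"The special magic number for {key} is {value}."
    position = random.randint(0, len(filler))  # bury the needle at a random depth
    filler.insert(position, needle)
    question = f"\n\nWhat is the special magic number for {key}? Answer:"
    return " ".join(filler) + question

# Example: a haystack of 2,000 filler sentences with one hidden needle.
prompt = build_niah_prompt("alpha-7", "42817", 2000)
```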
Evaluates the model’s ability to perform domain-specific methodical writing tasks, such as writing a differential diagnosis for a patient or a lesson plan for students.
This task tests the model’s ability to understand code repositories and make correct code-completion predictions.
MuSiQue is a question-answering dataset that tests the model’s ability to perform multihop reasoning over a long input context.
TruthfulQA tests the model’s ability to answer questions truthfully across a broad set of categories such as health, law, finance, and politics.
This task tests the model’s ability to generate long-form text (~8K tokens) given the title and the first few words of a book.
A meeting summarization dataset that evaluates the model’s ability to select and summarize content relevant to a given query.
SQuALITY is a question-focused summarization dataset that tests the model’s ability to understand long narratives and select and summarize content relevant to the provided question.
QuALITY tests the model’s ability to understand and answer questions about long narratives.