These tasks can be found in ./tasks.py and are invoked from the eval.py harness with the --tasks parameter.
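As a minimal sketch of a run, assuming task identifiers match how they are registered in ./tasks.py (the names below are hypothetical and should be replaced with the actual registered identifiers):

```bash
# Hypothetical invocation; check ./tasks.py for the exact task names.
python eval.py --tasks niah,musique,truthfulqa
```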
RULER defines a set of synthetic tasks designed to test a model’s long-context understanding.
Tasks include needle in a haystack (NIAH), variable tracking (VT), question answering (QA), and common word extraction (CWE).
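As a rough illustration of the NIAH family, a synthetic example can be built by burying a key-value "needle" at a random depth in filler text and asking the model to retrieve it. The sketch below is an assumption about the general shape of such tasks, not RULER's actual implementation, and all names in it are hypothetical:

```python
import random

# Illustrative NIAH-style prompt builder (not RULER's code; RULER's tasks
# are more varied, e.g. multiple needles and multi-key variants).
def build_niah_prompt(key: str, value: str, num_filler_sentences: int) -> str:
    filler = ["The grass is green and the sky is blue."] * num_filler_sentences
    needle = f"The special magic number for {key} is {value}."
    position = random.randint(0, len(filler))  # bury the needle at a random depth
    filler.insert(position, needle)
    question = f"\n\nWhat is the special magic number for {key}? Answer:"
    return " ".join(filler) + question

# Example: a haystack of 2,000 filler sentences with one hidden needle.
prompt = build_niah_prompt("alpha-7", "42817", 2000)
```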
Evaluates the model’s ability to perform domain-specific methodical writing tasks, such as writing a differential diagnosis for a patient or a lesson plan for students.
This task tests the model’s ability to understand code repositories and make correct code-completion predictions.
MuSiQue is a question-answering dataset that tests the model’s ability to perform multihop reasoning over a long input context.
TruthfulQA tests the model’s ability to answer questions truthfully across a broad set of categories such as health, law, finance, and politics.
This task tests the model’s ability to generate long-form text (~8K tokens) given the title and the first few words of a book.
A meeting summarization dataset that evaluates the model’s ability to select and summarize content relevant to a given query.
SQuALITY is a question-focused summarization dataset that tests the model’s ability to understand long narratives and select and summarize content relevant to the provided question.
QuALITY tests the model’s ability to understand and answer questions about long narratives.