Benchmarks
General Language Understanding Evaluation (GLUE)
A multi-task benchmark of nine sentence- and sentence-pair language understanding tasks
Homepage
Paper
GitHub
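Quick-start sketch (my addition, not from the GLUE docs): the tasks and their leaderboard metrics are mirrored on the Hugging Face Hub via the `datasets` and `evaluate` packages.

```python
from datasets import load_dataset
from evaluate import load as load_metric

# Load one GLUE task (MRPC: paraphrase detection) from the Hugging Face Hub.
mrpc = load_dataset("glue", "mrpc")
print(mrpc["train"][0])  # fields: sentence1, sentence2, label, idx

# The matching metric bundle scores predictions the way the benchmark does
# (accuracy and F1 for MRPC).
metric = load_metric("glue", "mrpc")
print(metric.compute(predictions=[1, 0, 1], references=[1, 1, 1]))
```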
Stanford Question Answering Dataset (SQuAD)
Reading comprehension over Wikipedia paragraphs; SQuAD 2.0 adds unanswerable questions
Homepage
SQuAD 1 paper (arxiv.org)
SQuAD 2 paper (arxiv.org)
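Loading sketch, assuming the Hugging Face Hub mirrors (`squad`, `squad_v2`) rather than the official site's JSON downloads:

```python
from datasets import load_dataset

squad_v1 = load_dataset("squad")     # every question is answerable
squad_v2 = load_dataset("squad_v2")  # adds unanswerable questions (empty answer lists)

ex = squad_v1["train"][0]
# Each record pairs a Wikipedia paragraph with a question and gold answer spans.
print(ex["question"])
print(ex["answers"])  # {'text': [...], 'answer_start': [...]}
```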
GLGE: A New General Language Generation Evaluation Benchmark
Natural language generation; 24 tasks at 3 difficulty levels, with MASS, BART, and ProphetNet baselines
Paper (arxiv.org, 2021)
Dayiheng Liu et al., Microsoft Research and College of Computer Science, Sichuan University
Public repository and guide: https://microsoft.github.io/glge/
BERTScore: Evaluating Text Generation with BERT
An automatic evaluation metric for text generation
Link
Paper
GitHub
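Usage sketch with the authors' `bert-score` package from PyPI (default model choice is whatever the package picks for the given language):

```python
from bert_score import score

candidates = ["The cat sat on the mat."]
references = ["A cat was sitting on the mat."]

# Precision/recall/F1 per pair, computed by greedily matching contextual
# token embeddings between candidate and reference.
P, R, F1 = score(candidates, references, lang="en")
print(f"BERTScore F1: {F1.mean().item():.4f}")
```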
SWAG: A Large-Scale Adversarial Dataset for Grounded Commonsense Inference
Situations With Adversarial Generations; a multiple-choice benchmark for grounded commonsense inference
Homepage
Paper
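Loading sketch, assuming the Hugging Face Hub mirror of SWAG (`swag`, config `regular`; field names per that mirror):

```python
from datasets import load_dataset

swag = load_dataset("swag", "regular")
ex = swag["train"][0]

# Four-way multiple choice: pick the ending that plausibly follows the scene.
print(ex["startphrase"])
for i in range(4):
    print(f"  ({i}) {ex[f'ending{i}']}")
print("gold:", ex["label"])
```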
HellaSwag: Can a Machine Really Finish Your Sentence?
Like SWAG, but harder: its examples survive adversarial filtering against stronger models
Homepage
Paper (2019)
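Same idea for HellaSwag, again assuming the Hugging Face Hub mirror (`hellaswag`); a model would score each context-plus-ending candidate and pick the most likely:

```python
from datasets import load_dataset

hellaswag = load_dataset("hellaswag")
ex = hellaswag["validation"][0]

# Same completion format as SWAG: four endings per context.
candidates = [ex["ctx"] + " " + end for end in ex["endings"]]
print(candidates[int(ex["label"])])  # the gold completion (label is a string index)
```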