Is your feature request related to a problem? Please describe.
The eval configuration for Nano3 currently includes only the following tasks: adlr_mmlu, adlr_arc_challenge_llama_25_shot, adlr_winogrande_5_shot, hellaswag, and openbookqa. I don’t see configurations for the benchmarks reported in Table 3 of the Nemotron 3 Nano paper.
I manually configured LiveCodeBench, SciCode, IFBench, RULER, MMLU-Pro, and MMLU-ProX and reran the evaluation on nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 with a vllm deployment, but my results didn’t match the reported numbers. I suspect this is due to differences in the evaluation setup.
Could you point me to the full evaluation configuration used for the reported benchmarks? I wasn’t able to find detailed information about the eval setup in the paper.
Any guidance would be greatly appreciated!