Comparison with MAUVE?

Interesting approach. Is there a reason T5Score is not compared to MAUVE (https://arxiv.org/abs/2102.01454)? For a recent comparison of similar metrics, see https://arxiv.org/abs/2212.10020. Do you consider T5Score relevant for assessing generative multi-hop QA?