
Problem with the story generation evaluation task #6


Description

@PkuRec4Dis

Hello author, I am very interested in your work. I ran into a problem when testing story generation evaluation: using your open-source model and the prompt from the open-source code, my test on 1000 stories from OpenMEVA yields 0.22, while the metric reported in the paper is approximately 0.5. Could there be an operational error on my part?

Additionally, I have another question. The fields below do not actually exist in the OpenMEVA dataset. How were these parameters designed, and how much impact does the example content have on the actual results?
{source_des}:\n
{source}\n
\n
{target_des}:\n
{target}\n

Task: Story Generation
Prompt:
PROMPT = (
    "###Instruction###\n"
    "Please act as an impartial and helpful evaluator for natural language generation (NLG), and the audience is an expert in the field.\n"
    "Your task is to evaluate the quality of {task} strictly based on the given evaluation criterion.\n"
    'Begin the evaluation by providing your analysis concisely and accurately, and then on the next line, start with "Rating:" followed by your rating on a Likert scale from 1 to 5 (higher means better).\n'
    "You MUST keep to the strict boundaries of the evaluation criterion and focus solely on the issues and errors involved; otherwise, you will be penalized.\n"
    "Make sure you read and understand these instructions, as well as the following evaluation criterion and example content, carefully.\n"
    "\n"
    "###Evaluation Criterion###\n"
    "{aspect}\n"
    "\n"
    "###Example###\n"
    "{source_des}:\n"
    "{source}\n"
    "\n"
    "{target_des}:\n"
    "{target}\n"
    "\n"
    "###Your Evaluation###\n"
)

Model:https://huggingface.co/PKU-ONELab/Themis
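
And this is roughly how I query the model and collect the ratings. Loading Themis as a plain causal LM and parsing the "Rating:" line are my own choices, and I compare the parsed ratings against the OpenMEVA human annotations with Spearman correlation (my guess at the metric). If the intended setup uses a chat template, different decoding settings, or a different correlation/aggregation, that may explain the gap:

import re
import torch
from scipy.stats import spearmanr
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumption: Themis loads as a standard causal LM via transformers;
# the decoding settings below are my guesses, not necessarily what the paper used.
tokenizer = AutoTokenizer.from_pretrained("PKU-ONELab/Themis")
model = AutoModelForCausalLM.from_pretrained(
    "PKU-ONELab/Themis", torch_dtype=torch.bfloat16, device_map="auto"
)

def rate(input_text):
    """Generate an evaluation and parse the 1-5 score from the "Rating:" line."""
    inputs = tokenizer(input_text, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=256, do_sample=False)
    completion = tokenizer.decode(
        output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )
    match = re.search(r"Rating:\s*([1-5])", completion)
    return int(match.group(1)) if match else None

# model_ratings and human_scores are lists over the 1000 OpenMEVA stories:
# correlation = spearmanr(model_ratings, human_scores).correlation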
