Rating methods comparison #33
@amirrr @markwhiting Here's a comparison between humans and LLaMA-3.1 in labeling 6 dimensions for statements in our previous corpus. Note that …
Thanks, do these look right to you? I note that we did have reversed classes for some features, which we reverse at the start of the R script from the prior experiment, and I just want to ensure that the ratings you're comparing against reflect the correct state.
I think they should be correct. In the original data file (and R script), only …
OK, thanks. Qualitatively, do you have any qualms with this? Also, how much will this change the design points we can cover with our current corpus?
The numbers are lower than I expected, although upon manual inspection some human-given labels may not be reasonable. For example:
I'm not sure how much the design points will have to change, but I think we should do a manual evaluation of LLaMA-3.1's labels in the new dataset. Perhaps we can sample 200 statements, label them ourselves independently (at least by me and Amir), and then check whether we agree with each other and with the LLM.
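The "agree with each other and with the LLM" check could be quantified with pairwise Cohen's kappa; a minimal sketch in Python (the `human_a`/`human_b`/`llm` label lists are hypothetical stand-ins, not our actual ratings):

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters over the same items (categorical labels)."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    # Observed agreement: fraction of items the two raters label identically.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement under independence, from each rater's marginal frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[c] * freq_b.get(c, 0) for c in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical binary labels for a handful of sampled statements.
human_a = [1, 0, 1, 1, 0, 1]
human_b = [1, 0, 1, 0, 0, 1]
llm     = [1, 1, 1, 0, 0, 1]

print(round(cohen_kappa(human_a, human_b), 3))  # → 0.667
print(round(cohen_kappa(human_a, llm), 3))      # → 0.25
```

Computing kappa for each of the three rater pairs, per dimension, would separate "humans disagree with each other" from "humans disagree with the LLM".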
Thanks — yeah a manual pass seems worth it. At worst we will then have some ground truth to work with.
If we want to scale that up, we could also put a more sophisticated version of the task in front of Turkers.
TODO:
I have prompted LLaMA-3-8B to classify all N = 10,110 statements here. The same features previously obtained by Amir are in this file. Note that those are only available for the first N = 8,814 statements. For comparison, I have put them together. For feature …
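Putting the two label sets together amounts to an inner join on statement index, restricted to the first 8,814 statements where both exist; a small sketch under assumed data shapes (the `llm_labels`/`prior_labels` maps are placeholders for the actual files):

```python
# Hypothetical label maps: statement index -> label for one feature.
llm_labels   = {i: i % 2 for i in range(10)}        # stands in for N = 10,110 LLM labels
prior_labels = {i: (i // 2) % 2 for i in range(8)}  # stands in for the first N = 8,814 prior labels

# Inner join on statement index: comparison is only possible on the overlap.
paired = {i: (llm_labels[i], prior_labels[i])
          for i in llm_labels if i in prior_labels}

# Raw agreement rate for this feature over the overlapping statements.
agreement = sum(a == b for a, b in paired.values()) / len(paired)
```

In practice the join key should be an explicit statement ID rather than row order, so that any dropped or reordered rows in either file don't silently misalign the comparison.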
Here are 200 statements randomly sampled from those for which …
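For reproducibility, a sample like this is easiest to draw with a fixed seed; a sketch assuming statements are indexed 0..N-1 (the eligibility filter and seed value are placeholders for whatever condition was actually used):

```python
import random

N = 10_110
# Placeholder filter standing in for the truncated "for which ..." condition.
eligible = [i for i in range(N) if i % 3 == 0]

rng = random.Random(42)          # fixed seed so the exact same 200 statements can be re-drawn
sample = rng.sample(eligible, 200)  # sampling without replacement
```

Saving the sampled indices alongside the seed means both raters label the identical set even if the corpus file is regenerated.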
Will start working on them now. |