
Rating methods comparison #33

Open
amirrr opened this issue Nov 19, 2024 · 11 comments

@amirrr
Collaborator

amirrr commented Nov 19, 2024

(image attachment)

@amirrr
Collaborator Author

amirrr commented Nov 19, 2024

(three image attachments: gpt vs llama3, llama3 vs llama2, gpt vs llama2)

@joshnguyen99

(screenshot attachment: SCR-20241203-gjq)

@amirrr @markwhiting Here's a comparison between humans and LLaMA-3.1 in labeling 6 dimensions for statements in our previous corpus.

Note that `figure_of_speech` is the negative class for literal language. If it were the positive class instead, the F1 score would only be 0.48.
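The F1 score is not symmetric in which class you treat as positive, which is why flipping `figure_of_speech` to the positive class changes the number. A minimal sketch with toy labels (the label values here are hypothetical, not the corpus annotations):

```python
def f1(y_true, y_pred, pos):
    # Precision, recall, and F1 with `pos` as the positive class.
    tp = sum(t == pos and p == pos for t, p in zip(y_true, y_pred))
    fp = sum(t != pos and p == pos for t, p in zip(y_true, y_pred))
    fn = sum(t == pos and p != pos for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Toy labels: 1 = literal language, 0 = figure of speech.
human = [1, 1, 1, 0, 1, 0, 1, 1]
llm   = [1, 1, 0, 0, 1, 1, 1, 1]

f1_literal = f1(human, llm, pos=1)  # literal as the positive class
f1_figure  = f1(human, llm, pos=0)  # figure_of_speech as the positive class
```

On these toy labels the two scores differ substantially, even though the underlying predictions are identical.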

@markwhiting
Member

Thanks, do these look right to you? I note that we had reversed classes for some features, which we reverse at the start of the R script from the prior experiment, and I just want to ensure that the ratings you're comparing against reflect the correct state.

@joshnguyen99

joshnguyen99 commented Dec 3, 2024

I think they should be correct. In the original data file (and R script), only `everyday` was coded as 1; every other variable should be flipped. I flipped them because the value coded as 1 should correspond to the direction with higher commonsensicality for humans.
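A minimal sketch of that flip, assuming the features are 0/1 columns in a data frame (the column values below are hypothetical):

```python
import pandas as pd

# Hypothetical 0/1 feature columns; per the original file, only
# `everyday` is already coded so that 1 points in the
# higher-commonsensicality direction.
df = pd.DataFrame({
    "everyday": [1, 0, 1],
    "figure_of_speech": [0, 1, 1],
    "fact": [1, 1, 0],
})

# Flip every variable except `everyday` so that 1 consistently means
# the direction with higher commonsensicality for humans.
for col in df.columns:
    if col != "everyday":
        df[col] = 1 - df[col]
```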

@markwhiting
Member

OK, thanks. Qualitatively do you have any qualms with this?

Also, how much will this change the design points we can cover with our current corpus?

@joshnguyen99

The numbers are lower than I expected, although upon manual inspection some of the human-given labels may not be reasonable.

For example:

  • "We should ensure police have body cameras on them." -> Humans said fact, LLM said opinion
  • "The first thing you do when you have an asthma attack is use the inhaler." -> Humans said social, LLM said physical

I'm not sure how much the design points will have to change, but I think we should do a manual evaluation of LLaMA-3.1's labels in the new dataset. Perhaps we can sample 200 statements, label them ourselves independently (at least by me and Amir), and then check whether we agree with each other and with the LLM.
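For that agreement check, a chance-corrected statistic such as Cohen's kappa is a natural choice for two labelers. A self-contained sketch (the rating lists are hypothetical):

```python
def cohens_kappa(a, b):
    # Cohen's kappa for two raters labeling the same items.
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement computed from each rater's label marginals.
    labels = set(a) | set(b)
    chance = sum((a.count(l) / n) * (b.count(l) / n) for l in labels)
    return (observed - chance) / (1 - chance)

# Hypothetical binary ratings from two labelers on four statements.
rater_1 = [1, 1, 0, 0]
rater_2 = [1, 0, 0, 0]
kappa = cohens_kappa(rater_1, rater_2)
```

The same function applies to rater-vs-LLM comparisons, one dimension at a time.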

@markwhiting
Member

markwhiting commented Dec 3, 2024 via email

@joshnguyen99

TODO:

  • Dataset for statements is here.
  • Select only statements for which published is True. These are statements that are on the platform already.
  • Randomly select 200 statements. Try multiple ways.
  • Coordinate labeling.
  • Prompt different LLMs for classification.
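The selection steps above can be sketched with pandas, assuming the dataset has a `statement` column and a boolean `published` flag (the frame below is a hypothetical stand-in):

```python
import pandas as pd

# Hypothetical stand-in for the statements dataset.
df = pd.DataFrame({
    "statement": [f"statement {i}" for i in range(1000)],
    "published": [i % 2 == 0 for i in range(1000)],
})

# Keep only statements already on the platform.
published = df[df["published"]]

# A fixed seed keeps the 200-statement sample reproducible across labelers.
sample = published.sample(n=200, random_state=0)
sample.to_csv("to_label_200.csv", index=False)
```

Trying multiple seeds (or stratified sampling) covers the "try multiple ways" item.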

@joshnguyen99

joshnguyen99 commented Dec 20, 2024

@markwhiting @amirrr

I have prompted LLaMA-3-8B to classify all N = 10,110 statements here.

The same features previously obtained by Amir are in this file. Note that they are only available for the first N = 8,814 statements.

For comparison, I have put them together. For feature `X`, `X_old` is Amir's version, while `X` is mine. Note that `X_old` is only available for the first 8,814 rows.

statements_features_comparison.csv
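One way to compare the two versions in that file is a per-feature agreement rate over the rows where both labels exist. A sketch, using a small hypothetical stand-in for the CSV:

```python
import pandas as pd

# Hypothetical stand-in for statements_features_comparison.csv:
# `fact_old` is the earlier label, `fact` is the new one, and
# `fact_old` is missing (NaN) beyond the first labeled rows.
df = pd.DataFrame({
    "fact": [1, 0, 1, 1],
    "fact_old": [1, 0, 0, None],
})

# Restrict to rows where both versions are labeled, then compare.
both = df.dropna(subset=["fact", "fact_old"])
agreement = (both["fact"] == both["fact_old"]).mean()
```

Repeating this per feature gives a quick table of where the old and new labelings diverge.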

@joshnguyen99

Here are 200 statements randomly sampled from those for which published is True. I will label them myself and hope to get @amirrr's ratings soon as well.

to_label_200.csv

@amirrr
Collaborator Author

amirrr commented Dec 20, 2024

> Here are 200 statements randomly sampled from those for which published is True. I will label them myself and hope to get @amirrr's ratings soon as well.
>
> to_label_200.csv

Will start working on them now.
