GPT Statement cleaning #9

Open · 1 of 7 tasks
markwhiting opened this issue Oct 3, 2023 · 3 comments
Comments

@markwhiting (Member) commented Oct 3, 2023

Run a multi-stage pipeline to keep only very clean statements.

They should be clear and make sense.

Pipeline:

  • Ask GPT to filter generally (a rough sketch of this step follows the list)
  • Filter for strange proper nouns
  • Filter for normal sentences
  • Convert all names to gender-nonspecific names, e.g., Max, Alex, Sam
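
For the GPT filtering step, here is a minimal sketch of what it could look like with the OpenAI Python client. The prompt, model name, and KEEP/DROP protocol are illustrative assumptions, not the pipeline's actual implementation:

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical filtering prompt; the real pipeline's prompt may differ.
FILTER_PROMPT = (
    "You will be given a statement. Answer KEEP if it is clear, grammatical, "
    "and makes sense on its own; answer DROP otherwise. Reply with one word."
)

def gpt_filter(statement: str, model: str = "gpt-4o-mini") -> bool:
    """Return True if the model judges the statement clean enough to keep."""
    resp = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": FILTER_PROMPT},
            {"role": "user", "content": statement},
        ],
        temperature=0,
    )
    return resp.choices[0].message.content.strip().upper().startswith("KEEP")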

Meta tasks:

  • Do some testing to compare the commonsensicality and labels of statements before and after cleaning
  • Commit the pipeline as an action in the statements repo (anything is fine as long as it lets us update continuously and is low-effort to maintain)
  • Update the statements repo
markwhiting changed the title from "Statement cleaning" to "GPT Statement cleaning" on Oct 5, 2023
@amirrr (Collaborator) commented Jan 8, 2024

Let's have a table with the cleanest statements from the GPT pipeline (a rough sketch follows the list):

  1. statement (text - id)
  2. design point
  3. quantile rank within design point (i.e., the statement's commonsensicality quantile within that design point)
  4. source
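
A hypothetical sketch of how the quantile-rank column could be computed with pandas; the column names are illustrative and may not match the actual statements table:

import pandas as pd

# Hypothetical input: one row per statement with a commonsensicality score.
df = pd.DataFrame({
    "statement_id": [1, 2, 3, 4],
    "statement": ["...", "...", "...", "..."],
    "design_point": ["A", "A", "B", "B"],
    "commonsensicality": [0.91, 0.42, 0.77, 0.88],
    "source": ["gpt", "gpt", "human", "human"],
})

# Quantile rank of commonsensicality within each design point (values in (0, 1]).
df["design_point_quantile"] = (
    df.groupby("design_point")["commonsensicality"].rank(pct=True)
)

clean_table = df[["statement_id", "statement", "design_point",
                  "design_point_quantile", "source"]]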

@joshnguyen99

Here's what I think will work. Proper nouns and names can be detected by spaCy, even when they are lowercased.

For example, given the statement with ID = 1361, namely "if jake considers john's example he would become the strongest and fittest person ever", if I do

import spacy

# Transformer-based English pipeline; it tags proper nouns (PROPN) even when lowercased.
nlp = spacy.load("en_core_web_trf")

doc = nlp("if jake considers john's example he would become the strongest and fittest person ever")
for tok in doc:
    if tok.pos_ == "PROPN":  # print only the proper-noun tokens
        print(tok)

The output would be

jake
john

From this we can do two things:

  1. Apply a simple (heuristic) rule to capitalize these proper nouns (a rough sketch follows this list).
  2. Do it ourselves. I'm guessing the number of statements won't be too large for us to handle manually for this 4k corpus.
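
A minimal sketch of option 1, assuming the PROPN tags from en_core_web_trf are reliable enough; capitalize_proper_nouns is a hypothetical helper, not something already in the repo:

import spacy

nlp = spacy.load("en_core_web_trf")

def capitalize_proper_nouns(text: str) -> str:
    """Capitalize every token spaCy tags as a proper noun, preserving spacing."""
    doc = nlp(text)
    pieces = []
    for tok in doc:
        word = tok.text.capitalize() if tok.pos_ == "PROPN" else tok.text
        pieces.append(word + tok.whitespace_)
    return "".join(pieces)

capitalize_proper_nouns(
    "if jake considers john's example he would become the strongest and fittest person ever"
)
# -> "if Jake considers John's example he would become the strongest and fittest person ever"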

@markwhiting (Member, Author)

Cool, I think removing them with GPT is mostly OK, except when it removes too much, in which case it tends to make the statement meaningless; but in most cases those are statements we should drop anyway. E.g., "Florida is a nice place" becomes "this state is a nice place", and neither of those is a particularly useful statement. So I think what Amir is doing is sufficient, but we may need to continue thinking about refining our filtering.

markwhiting transferred this issue from Watts-Lab/commonsense-platform on Jun 6, 2024