GPT Statement cleaning #9

Open · 1 of 7 tasks
markwhiting opened this issue Oct 3, 2023 · 3 comments
Comments

@markwhiting (Member) commented Oct 3, 2023

Run a multi-stage pipeline to keep only very clean statements.

They should be clear and make sense.

Pipeline:

  • Ask GPT to filter generally (a rough sketch of this step follows the list)
  • Filter for strange proper nouns
  • Filter for normal sentences
  • Convert all names to gender-nonspecific names, e.g., Max, Alex, Sam
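
For the GPT filtering step, here is a minimal sketch of what it could look like with the OpenAI Python client. The prompt, model name, and KEEP/DROP protocol are illustrative assumptions, not the pipeline's actual implementation:

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical filtering prompt; the real pipeline's prompt may differ.
FILTER_PROMPT = (
    "You will be given a statement. Answer KEEP if it is clear, grammatical, "
    "and makes sense on its own; answer DROP otherwise. Reply with one word."
)

def gpt_filter(statement: str, model: str = "gpt-4o-mini") -> bool:
    """Return True if the model judges the statement clean enough to keep."""
    resp = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": FILTER_PROMPT},
            {"role": "user", "content": statement},
        ],
        temperature=0,
    )
    return resp.choices[0].message.content.strip().upper().startswith("KEEP")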

Meta tasks:

  • Do some testing to compare the commonsensicality and labels of statements before and after cleaning
  • Commit the pipeline as an action in the statements repo (anything is fine as long as it lets us update continuously and is low-effort to maintain)
  • Update the statements repo
markwhiting changed the title from "Statement cleaning" to "GPT Statement cleaning" on Oct 5, 2023
@amirrr (Collaborator) commented Jan 8, 2024

Let's have a table with the cleanest statements from the GPT pipeline (a rough sketch follows the list):

  1. statement (text - id)
  2. design point
  3. quantile rank within design point (i.e., the statement's commonsensicality quantile within that design point)
  4. source
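
A hypothetical sketch of how the quantile-rank column could be computed with pandas; the column names are illustrative and may not match the actual statements table:

import pandas as pd

# Hypothetical input: one row per statement with a commonsensicality score.
df = pd.DataFrame({
    "statement_id": [1, 2, 3, 4],
    "statement": ["...", "...", "...", "..."],
    "design_point": ["A", "A", "B", "B"],
    "commonsensicality": [0.91, 0.42, 0.77, 0.88],
    "source": ["gpt", "gpt", "human", "human"],
})

# Quantile rank of commonsensicality within each design point (values in (0, 1]).
df["design_point_quantile"] = (
    df.groupby("design_point")["commonsensicality"].rank(pct=True)
)

clean_table = df[["statement_id", "statement", "design_point",
                  "design_point_quantile", "source"]]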

@joshnguyen99

Here's what I think will work. Proper nouns and names can be detected by spaCy, even when they are lowercased.

For example, given the statement with ID = 1361, namely "if jake considers john's example he would become the strongest and fittest person ever", if I do

import spacy

# Transformer-based English pipeline; it tags proper nouns (PROPN) even when lowercased.
nlp = spacy.load("en_core_web_trf")

doc = nlp("if jake considers john's example he would become the strongest and fittest person ever")
for tok in doc:
    if tok.pos_ == "PROPN":  # print only the proper-noun tokens
        print(tok)

The output would be

jake
john

From this we can do two things:

  1. Apply a simple (heuristic) rule to capitalize these proper nouns (a rough sketch follows this list).
  2. Do it ourselves. I'm guessing the number of statements won't be too large for us to handle manually for this 4k corpus.
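
A minimal sketch of option 1, assuming the PROPN tags from en_core_web_trf are reliable enough; capitalize_proper_nouns is a hypothetical helper, not something already in the repo:

import spacy

nlp = spacy.load("en_core_web_trf")

def capitalize_proper_nouns(text: str) -> str:
    """Capitalize every token spaCy tags as a proper noun, preserving spacing."""
    doc = nlp(text)
    pieces = []
    for tok in doc:
        word = tok.text.capitalize() if tok.pos_ == "PROPN" else tok.text
        pieces.append(word + tok.whitespace_)
    return "".join(pieces)

capitalize_proper_nouns(
    "if jake considers john's example he would become the strongest and fittest person ever"
)
# -> "if Jake considers John's example he would become the strongest and fittest person ever"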

@markwhiting (Member, Author)

Cool, I think removing them with GPT is mostly OK, except when it removes too much, in which case it tends to make the statement meaningless; but in most cases those are statements we should drop anyway. E.g., "Florida is a nice place" becomes "this state is a nice place", and neither of those is a particularly useful statement. So I think what Amir is doing is sufficient, but we may need to continue thinking about refining our filtering.

markwhiting transferred this issue from Watts-Lab/commonsense-platform on Jun 6, 2024