Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Statement Normalization Pipeline using OpenAI #24

Open
wants to merge 24 commits into
base: main
Choose a base branch
from

Conversation

dankim444
Copy link
Contributor

Description

This PR introduces a new statement normalization pipeline, cleans the remaining original statements in the raw_statements directory, and introduces minor changes to different files to streamline text extraction (specifically extracting the language code) from filenames. The criteria for normalization is as follows:

  • The first letter of the statement must be capitalized (if applicable to the language).
  • Leading and trailing punctuation is removed.
  • The statement ends in the appropriate full-stop punctuation native to the language.

The normalization pipeline leverages OpenAI, and the news_statements and observable statements were cleaned using gpt-4o while email_statements (due to the size of the files) files were cleaned with gpt-3.5-turbo. During this process, I noticed several differences in performance between the two models. Specifically, gpt-4o was more consistent in not changing the original capitalization of proper nouns, altering the original vocabulary, and not introducing any additional punctuation; whereas gpt-3.5-turbo would make changes despite being explicitly instructed not to in the system prompt. When merged, this PR will close Watts-Lab/commonsense-platform#150, ensuring consistent rendering of statements on the commonsense platform's UI.

New files

  • normalize_statements_openai.py: script that cleans statements files that have yet to be cleaned in the raw_statements directory.
  • remove_duplicates_after_normalization.py: script that handles duplicates caused by running the normalize_statements_openai.py script.

Changes

  • email_statements, news_statements, observable statements
  • Translate Statements and Remove Any Duplicates workflow: Added a third job 'normalize-statements' that cleans the statement files after they have been translated and removes potential duplicates from translations.
  • calculate_translation_cost.py: updated the way the language code is extracted from the filename and how filenames are processed.
  • remove_duplicates.py: minor change to documentation.
  • show_groups_of_duplicates: removed 'lng' as a column to avoid redundancy.
  • translate_statements_aws.py: changed how filenames are processed and how language code is extracted.
  • README.md: included instructions on naming convention of files and translation of files.

Testing

I acted as a "human-in-the-loop" to verify OpenAI's outputs. I used an online Diffchecker tool (https://www.diffchecker.com/) to compare changes made from the original file to the new file. I also used OpenAI playground to verify the system prompt.

Important note

To ensure more consistent output from OpenAI, I recommend using gpt-4o or possibly gpt-4o-mini to normalize the statements. In particular, gpt-3.5-turbo would sometimes remove the capitalization of proper nouns, alter some vocabulary and thereby change the nuanced meaning of some statements, and introduce unintended punctuation. I directly address all these in the system prompt; however, it is open to improvement.

dankim444 and others added 22 commits July 18, 2024 14:53
Files changed:
M	raw_statements/email_statements.csv
M	raw_statements/email_statements_ar.csv
M	raw_statements/email_statements_bn.csv
M	raw_statements/email_statements_es.csv
M	raw_statements/email_statements_fr.csv
M	raw_statements/email_statements_hi.csv
M	raw_statements/email_statements_ja.csv
M	raw_statements/email_statements_pt.csv
M	raw_statements/email_statements_ru.csv
M	raw_statements/email_statements_zh.csv
M	raw_statements/news_statements_amir.csv
M	raw_statements/news_statements_amir_ar.csv
M	raw_statements/news_statements_amir_hi.csv
M	raw_statements/observable_gpt4o_ar.csv
@dankim444 dankim444 linked an issue Aug 17, 2024 that may be closed by this pull request
@dankim444 dankim444 requested a review from amirrr August 17, 2024 18:28
@markwhiting
Copy link
Member

markwhiting commented Aug 19, 2024

Great. Can we switch to 4o for everything? (or have you already)

Copy link

github-actions bot commented Oct 3, 2024

Translation Cost Calculation

cleaned_statements_en.csv still needs to be translated into 9 new languages. This would require translating 12141 characters.
It will cost approximately $0.18 to complete these translations.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Treatment of statements is inconsistent
3 participants