🚨 implement first preprocessing steps with further testing needed
soeren227 committed Jan 22, 2024
1 parent a419f05 commit d349b8c
Showing 3 changed files with 69 additions and 45 deletions.
4 changes: 3 additions & 1 deletion test.py
@@ -13,7 +13,9 @@
 from tracex.extraction.prototype import preprocessing as pre
 
 text = open(u.input_path / "journey_test_preprocessing_2.txt").read()
-preprocessed_text = pre.refactor_input_journey_time(text)
+preprocessed_text = pre.preprocessing_spellcheck(text)
+preprocessed_text = pre.preprocessing_condense(preprocessed_text)
+preprocessed_text = pre.preprocessing_identify_time_specification(preprocessed_text)
 # df = ih.convert_text_to_bulletpoints(text)
 # print(df)
 # df = ih.add_start_dates(text, df)
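
The three preprocessing calls in test.py run in sequence, each consuming the previous step's output. A minimal sketch of that chaining as a reusable pipeline, assuming every step is a plain string-to-string callable. `run_pipeline` and the `fake_*` functions are hypothetical names, not part of the commit; the fakes stand in for the GPT-backed steps so the sketch runs offline.

```python
from functools import reduce
from typing import Callable, Iterable

Step = Callable[[str], str]


def run_pipeline(text: str, steps: Iterable[Step]) -> str:
    """Feed the output of each step into the next one, in order."""
    return reduce(lambda current, step: step(current), steps, text)


# Hypothetical offline stand-ins for the GPT-backed preprocessing steps.
def fake_spellcheck(text: str) -> str:
    return text.replace("feever", "fever")


def fake_condense(text: str) -> str:
    # Collapse runs of whitespace into single spaces.
    return " ".join(text.split())


def fake_identify(text: str) -> str:
    return text.replace("two weeks", "$$$two weeks$$$")


if __name__ == "__main__":
    raw = "I had a  feever for two weeks."
    print(run_pipeline(raw, [fake_spellcheck, fake_condense, fake_identify]))
```

Keeping the steps in a list makes the order explicit and lets a single step be swapped or skipped while the prompts are still being tested.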
32 changes: 28 additions & 4 deletions tracex/extraction/prototype/preprocessing.py
@@ -5,12 +5,36 @@
 from . import prompts as p
 
 
-def refactor_input_journey_time(text):
+def preprocessing_identify_time_specification(text):
     """Preprocesses the input so that mentioned durations and times are clearly displayed in the output."""
     messages = [
-        {"role": "system", "content": p.REFACTOR_INPUT_JOURNEY_TIME_CONTEXT_3},
-        {"role": "user", "content": p.REFACTOR_INPUT_JOURNEY_TIME_PROMPT + text},
-        {"role": "assistant", "content": p.REFACTOR_INPUT_JOURNEY_TIME_ANSWER},
+        {"role": "system", "content": p.SYSTEM_ROLE_PROMPT_IDENTIFY},
+        {"role": "user", "content": p.USER_ROLE_PROMPT_IDENTIFY + text},
+        {"role": "assistant", "content": p.ASSISTANT_ROLE_PROMPT_IDENTIFY},
+    ]
+    preprocessed_text = u.query_gpt(messages)
+
+    return preprocessed_text
+
+
+def preprocessing_spellcheck(text):
+    """Preprocesses the input so that the text is spellchecked and grammatically correct."""
+    messages = [
+        {"role": "system", "content": p.SYSTEM_ROLE_PROMPT_SPELLCHECK},
+        {"role": "user", "content": p.USER_ROLE_PROMPT_SPELLCHECK + text},
+        {"role": "assistant", "content": p.ASSISTANT_ROLE_PROMPT_SPELLCHECK},
+    ]
+    preprocessed_text = u.query_gpt(messages)
+
+    return preprocessed_text
+
+
+def preprocessing_condense(text):
+    """Preprocesses the input so that the text is condensed and shortened."""
+    messages = [
+        {"role": "system", "content": p.SYSTEM_ROLE_PROMPT_CONDENSE},
+        {"role": "user", "content": p.USER_ROLE_PROMPT_CONDENSE + text},
+        {"role": "assistant", "content": p.ASSISTANT_ROLE_PROMPT_CONDENSE},
     ]
     preprocessed_text = u.query_gpt(messages)
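
The three functions in preprocessing.py differ only in which prompt constants they pass. One possible way to factor out the shared part, sketched under assumptions: `build_messages` and `preprocess` are hypothetical helpers that do not appear in the commit, and `query` stands for any callable shaped like `u.query_gpt` (a list of message dicts in, a string out).

```python
from typing import Callable, Dict, List, Tuple

Message = Dict[str, str]


def build_messages(system: str, user: str, assistant: str, text: str) -> List[Message]:
    """Build the system/user/assistant message list shared by every step."""
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": user + text},
        {"role": "assistant", "content": assistant},
    ]


def preprocess(
    text: str,
    prompts: Tuple[str, str, str],
    query: Callable[[List[Message]], str],
) -> str:
    """Run one preprocessing step; `query` is assumed to behave like u.query_gpt."""
    system, user, assistant = prompts
    return query(build_messages(system, user, assistant, text))


if __name__ == "__main__":
    # Stand-in for the model call: echoes the user message back unchanged.
    def echo_query(messages: List[Message]) -> str:
        return messages[1]["content"]

    print(preprocess("I had a fever.", ("system prompt", "This is the text: ", "ok"), echo_query))
```

Passing the query function as a parameter also makes each step testable without a network call.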

78 changes: 38 additions & 40 deletions tracex/extraction/prototype/prompts.py
@@ -260,54 +260,52 @@ def life_circumstances_prompt(sex):
 Please extract the following location of the text without changing the given date format:
 """
 
-REFACTOR_INPUT_JOURNEY_TIME_CONTEXT_1 = """
-You are an expert text editor tasked with identifying vague time
-specifications in the text and converting them into specific dates. Your edits should focus solely on time
-references, without altering any other part of the text. Also, do not add any commentary or the like that we did not
-ask for. Always conclude your edits with 'Goodbye'.
-For example, the text 'At the end of April, I started
-experiencing mild symptoms.' should be converted to 'On April 30, 2021, I started experiencing mild symptoms.'
-Another example is the text 'In the next days, I waited for the symptoms to fade away.' should be converted to 'On
-May 1, 2021, I waited for the symptoms to fade away.' One more example is the text 'I was then hospitalized for two
-weeks.' should be converted to 'I was then hospitalized from May 1, 2021, to May 15, 2021.' Ensure that the dates are
-contextually appropriate, maintain chronological consistency, and consider cultural and regional date formats,
-if relevant.
+SYSTEM_ROLE_PROMPT_IDENTIFY = """
+You are an AI system that assists in preprocessing text data. Your task is to identify time specifications and durations in the text. Your focus should be solely on highlighting time references, without altering any other part of the
+text. Highlight time specifications and durations you find with $$$time specification$$$, where "time specification" would be something like "...on June 2nd I found out that..." or "...12 weeks..." or "nine months".
+You must consider all parts of the text, where there is a quantifier like a number (either written out or in digits) or less specific e.g "many" accompanied by a time unit (e.g. days, weeks, months, years, hours, minutes, seconds and so on).
+Always conclude your edits with 'Goodbye'.
 """
 
-REFACTOR_INPUT_JOURNEY_TIME_CONTEXT_2 = """You are an expert text editor specialized in converting vague date
-references into specific dates, ensuring chronological accuracy. Your task is to identify dates mentioned in the
-text, clarify them, and adjust any related time references accordingly. Your edits should focus only on the dates and
-time references without altering the rest of the text.
-Example:
-Original: "The program started in June 2022, and nine months into the program, I had a fever."
-Refactored: "The program started in June 2022, and in March 2023, I had a fever."
-Remember to calculate the time intervals accurately and to express the dates in a specific and clear format. Ensure
-that the chronological order is maintained, and the dates are consistent with the narrative of the text."""
+USER_ROLE_PROMPT_IDENTIFY = """
+As a user, I provide the raw text data that needs preprocessing. The text contains various time
+specifications that need to be highlighted by framing them with $$$ $$$. I expect
+the AI system to handle this task efficiently and accurately, without altering any other part of the text. This is
+the text: \n\n
+"""
 
-REFACTOR_INPUT_JOURNEY_TIME_CONTEXT_3 = """
-You are an expert text editor specialized in identifying time-related specifications in the text. It is your job to
-find every mention of time and list them in chronological order as bulletpoints below the text. Other then that, you
-should return the original text unaltered.
+ASSISTANT_ROLE_PROMPT_IDENTIFY = """
+As an AI assistant, I take the raw text data provided by the user and preprocess it
+according to the task defined by the system. I identify all time specifications in the text and highlight them.
+Once the task is completed, I signal the end of my edits with 'Goodbye'.
 """
 
-REFACTOR_INPUT_JOURNEY_TIME_PROMPT = """Remember to ensure time consistency so that there are no contradictions in
-the dates or unexpected jumps in time. Consider the context of the events when assigning specific dates. Here is the
-text where you should identify any mentions of time-related specifications and formulate specific dates:"""
+SYSTEM_ROLE_PROMPT_SPELLCHECK = """
+You are an AI system that assists in preprocessing text data. Your task is to spellcheck the text and correct any grammatical errors.
+Take any necessary steps to ensure that the text is grammatically correct and spelled correctly. You must not alter the meaning of the text.
+"""
 
-REFACTOR_INPUT_JOURNEY_TIME_ANSWER = """
-Here are the examples of how the text should be refactored:
+USER_ROLE_PROMPT_SPELLCHECK = """
+As a user, I provide the raw text data that needs preprocessing. The text contains various grammatical errors and spelling errors that need to be corrected. I expect
+the AI system to handle this task efficiently and accurately, without altering the meaning of the text. This is the text: \n\n
+"""
 
-Original: 'At the end of April, I started experiencing mild symptoms.'
-Refactored: 'On April 30, 2021, I started experiencing mild symptoms.'
+ASSISTANT_ROLE_PROMPT_SPELLCHECK = """
+As an AI assistant, I take the raw text data provided by the user and preprocess it.
+"""
 
-Original: 'In the next days, I waited for the symptoms to fade away.'
-Refactored: 'On May 1, 2021, I waited for the symptoms to fade away.'
+SYSTEM_ROLE_PROMPT_CONDENSE = """
+You are an AI system that assists in preprocessing text data. Your task is to condense the text and shorten it.
+Take any necessary steps to ensure that the text is condensed and shortened. You must not alter the meaning of the text or lose any important information.
+Especially important information would be information about the disease, the course of the disease, the symptoms, the treatment, the hospitalization and the recovery.
+"""
 
-Original: 'I was then hospitalized for two weeks.'
-Refactored: 'I was then hospitalized from May 1, 2021, to May 15, 2021.'
+USER_ROLE_PROMPT_CONDENSE = """
+As a user, I provide the raw text data that needs preprocessing. The text contains various parts that need to be condensed and shortened. I expect
+every part of the output to have relevant information about the disease, the course of the disease, the symptoms, the treatment, the hospitalization and the recovery.
+This is the text: \n\n
+"""
 
-All dates provided must be contextually appropriate, maintain chronological consistency, and consider cultural and regional date formats when relevant.
+ASSISTANT_ROLE_PROMPT_CONDENSE = """
+As an AI assistant, I take the raw text data provided by the user and preprocess it.
 """
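
The identify prompts ask the model to wrap every time specification in `$$$ ... $$$` and to finish with 'Goodbye'. A minimal sketch of how a caller might consume such a response, assuming the model follows that format exactly. `strip_terminator` and `extract_time_specifications` are hypothetical helpers that do not appear in the commit.

```python
import re
from typing import List

# Non-greedy match so adjacent highlights are captured separately.
TIME_SPEC = re.compile(r"\$\$\$(.+?)\$\$\$", re.DOTALL)


def strip_terminator(response: str, terminator: str = "Goodbye") -> str:
    """Remove a trailing terminator word, if present, plus surrounding whitespace."""
    stripped = response.rstrip()
    if stripped.endswith(terminator):
        stripped = stripped[: -len(terminator)].rstrip()
    return stripped


def extract_time_specifications(response: str) -> List[str]:
    """Return the text of every $$$...$$$ highlight, in order of appearance."""
    return [match.strip() for match in TIME_SPEC.findall(response)]


if __name__ == "__main__":
    reply = "On $$$June 2nd$$$ I was sick for $$$12 weeks$$$.\nGoodbye"
    print(strip_terminator(reply))
    print(extract_time_specifications(reply))
```

The sketch only handles a bare trailing 'Goodbye'. A response that ends the word with punctuation or quotes would need a looser check.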
