<feat WIP>: augmenting mmlu #29
Conversation
if field_response not in df.columns:
    df[field_response] = ""
if field_response not in df.columns:
if field_ans not in df.columns:
Great catch!
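Judging from the hunk above, the second guard presumably should check field_ans rather than repeating field_response; a minimal sketch of the presumed fix:

```python
# Presumed fix (assumption): each column gets its own existence guard.
if field_response not in df.columns:
    df[field_response] = ""
if field_ans not in df.columns:
    df[field_ans] = ""
```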
Return JSON only with:
{{"answer":"{letters}","rationale":"1-3 sentences (concise)","key_steps":["fact1","fact2","fact3"]}}
Answer the MCQ briefly and factually (no step-by-step reasoning).
Why? I thought we wanted to elicit step-by-step reasoning
If we use step-by-step, it makes sense to enable thinking, which will be very expensive on MMLU-Pro (try changing the prompt and setting the -1 flag to experiment).
I mean:
thinking_config=types.ThinkingConfig(thinking_budget=0)
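For context, a minimal sketch of where that would be passed, assuming the google-genai SDK that the earlier Gemini-based version presumably used (model name and prompt are placeholders):

```python
from google import genai
from google.genai import types

client = genai.Client()  # reads GEMINI_API_KEY from the environment

mcq_prompt = "Question: ...\nOptions:\nA. ...\nB. ..."  # placeholder MCQ prompt

response = client.models.generate_content(
    model="gemini-2.5-flash",  # placeholder model name
    contents=mcq_prompt,
    config=types.GenerateContentConfig(
        # 0 disables thinking entirely; -1 enables dynamic thinking
        thinking_config=types.ThinkingConfig(thinking_budget=0)
    ),
)
print(response.text)
```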
I guess we need to better align on the goal of the experiment first, then. Could you add a design doc to docs with: hypothesis, execution plan, expected results?
},
}

def process_tsv(tsv_path, out_jsonl, limit=None):
Unify it with the existing distill_on_dataset?
Yes, we can, but I think the CLI call is more useful.
Shall we use it as a function call from a script in experiments instead of a plain CLI call, then?
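For illustration, a sketch of the suggested pattern, with the module path and argument values as assumptions:

```python
# experiments/aug_mmlu/run_aug_mmlu.py -- hypothetical experiment script
from core.synth_mmlu import synth_on_dataset  # assumed import path

if __name__ == "__main__":
    synth_on_dataset(
        in_filename="data/mmlu_pro.tsv",     # placeholder paths
        out_jsonl="out/mmlu_pro_aug.jsonl",
        limit=100,
    )
```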
I accidentally renamed the file and made changes in one commit (sorry about that). In general: I changed Gemini to OpenRouter, changed the logic to step-by-step reasoning, and left key_steps as the summary. In the future, I want to merge branch A into branch C, since the first part of C duplicates A.
@@ -0,0 +1,17 @@
1) **Main point**

Obtain a synthetic dataset (answers + brief explanations + analysis of erroneous answers + CoT tokens) for training subsequent models.
We actually want to elicit the full reasoning chain, don't we?
Could you also add why we want to do it? AFAIU, we want to fine-tune small models on different versions of the distilled CoT and compare the performance. Right?
Do we still need brief explanations?
from core.prompts.mmlu_branches_aug import *

# defaults
DEFAULT_MODEL = os.getenv("OPENROUTER_MODEL", "deepseek/deepseek-r1:free")
Shall we pass it as a config? As discussed before, we want reproducible results, and it is extremely easy to forget which options we used if we pass them as env vars or CLI args.
It's just a default argument in the file; the function itself accepts and works with explicit arguments. So using a config is the responsibility of whoever calls the code.
def synth_on_dataset(
    in_filename: str,
    out_jsonl: str,
    model: str = DEFAULT_MODEL,
    max_tokens: int = DEFAULT_MAX_TOKENS,
    dump_every: int = DUMP_EVERY,
    limit: int | None = None,
    branches: tuple[str, ...] = DEFAULT_BRANCHES
):
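If we do go the config route, a minimal sketch of a reproducible caller, assuming a JSON config file and the import path (both hypothetical):

```python
# experiments/aug_mmlu/run_from_config.py -- hypothetical caller
import json

from core.synth_mmlu import synth_on_dataset  # assumed import path

# The config pins every option, so a run can be reproduced from the file alone.
with open("configs/aug_mmlu.json") as f:      # assumed config file
    cfg = json.load(f)

synth_on_dataset(
    in_filename=cfg["in_filename"],
    out_jsonl=cfg["out_jsonl"],
    model=cfg["model"],
    max_tokens=cfg["max_tokens"],
    limit=cfg.get("limit"),
)
```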
CHUNK_SIZE = int(os.getenv("SYNTH_CHUNK_SIZE", "16"))
DUMP_EVERY = int(os.getenv("SYNTH_DUMP_EVERY", "10"))

ALL_LETTERS = [chr(c) for c in range(ord("A"), ord("Z")+1)]
Re-use what we have in https://github.com/LabARSS/reasoning-fine-tune/blob/85cc151cdfcac6a5ec409a9f2583486318fe7ed0/src/reasoning_fine_tune/prompts/mmlu_single_token_answer.py#L34?
Extract it into a separate file for better readability?
Let's discuss this in a conference call.
@SemyonEpanov we already have option_ids
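A sketch of the suggested reuse; the import path follows the linked file, and treating option_ids as an importable sequence is an assumption:

```python
# Assumed reuse: take the shared option ids instead of a local A..Z list.
from reasoning_fine_tune.prompts.mmlu_single_token_answer import option_ids  # assumed symbol

letters = option_ids[: len(options)]  # one letter per provided option
```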
pass
j = j or {}

if reasoning_text and "thinking" not in j:
Could you help me understand what we are doing here?
Validation of the JSON received from the LLM.
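For clarity, a sketch of what that validation appears to do; raw_text and reasoning_text follow the snippet above, and the back-fill of "thinking" is an assumption:

```python
import json

try:
    j = json.loads(raw_text)      # the model is asked to return strict JSON
except json.JSONDecodeError:
    j = None
j = j or {}                       # fall back to an empty dict on parse failure

# If the provider returned a separate reasoning trace and the JSON has no
# "thinking" field, back-fill it so the trace is not lost (assumption).
if reasoning_text and "thinking" not in j:
    j["thinking"] = reasoning_text
```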
record_in = _build_record_in(row_dict, question, choices, letters, gold, model)
jobs.append((row.Index, "A", {"question": question, "choices": choices, "gold": gold, "record_in": record_in, "letters": letters}))
jobs.append((row.Index, "B", {"question": question, "choices": choices, "gold": gold, "record_in": record_in, "letters": letters}))
jobs.append((row.Index, "C", {"question": question, "choices": choices, "gold": gold, "record_in": record_in, "letters": letters}))
Should we first get A and B, and then run C on top of A, as you proposed in the chat before?
Yes, that's right. In the new version it's refactored.
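A rough sketch of that two-phase layout; build_payload and run_jobs are assumed helpers, not names from the PR:

```python
# Branches A and B are scheduled first; each C job is then seeded with the
# A result for the same row.
payloads = {row.Index: build_payload(row) for row in df.itertuples()}   # assumed helper

jobs_ab = [(idx, branch, p) for idx, p in payloads.items() for branch in ("A", "B")]
a_results = run_jobs(jobs_ab)   # assumed executor returning {row_index: branch-A output}

jobs_c = [
    (idx, "C", {**p, "branch_a_output": a_results[idx]})
    for idx, p in payloads.items()
]
run_jobs(jobs_c)
```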
Return JSON ONLY with the following schema:
{{
"answer": "{letters}",
"rationale": "concise 1-2 sentence justification (no fluff)",
Shall we ask for the final answer straight away if we are using a reasoning model and we can extract its reasoning chain?
Return JSON only:
{{"correct_answer":"{letters}",
"why_correct": "step-by-step reasoning showing why the gold option is correct",
"distractor_analysis": {distractor_tpl} }}
Could you help me understand distractor_analysis vs why_correct?
distractor_analysis explains each of the non-gold answer options; why_correct explains the correct answer.
distractor_tpl = "{" + ", ".join([f'"{L}":"..."' for L in letters if L != gold]) + "}"

prompt1 = (p_json_guardrails() + "\n" + base + p_branch_c_one(allowed))
j1 = _openrouter_json("You return STRICT JSON.", user_prompt=prompt1, model=model, max_tokens=max_tokens)
Could we include p_json_guardrails as system prompt? What is the point of repeating it?
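A sketch of the suggested change, keeping the _openrouter_json signature from the hunk above, where the first positional argument is the system prompt:

```python
# Pass the guardrails once as the system prompt instead of prepending them to
# every user prompt (sketch of the suggestion, not the PR's implementation).
prompt1 = base + p_branch_c_one(allowed)
j1 = _openrouter_json(
    p_json_guardrails(),     # system prompt: strict-JSON instructions
    user_prompt=prompt1,
    model=model,
    max_tokens=max_tokens,
)
```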
parts = [p.strip().strip("'").strip('"') for p in s.split(",")]
return [p for p in parts if p]

def render_mc_prompt(question, choices, letters):
Could we replicate (or better generalize and re-use) prompts in mmlu_single_token_answer.py?
@@ -0,0 +1,31 @@
from pathlib import Path
Could we have a dedicated file for each one of the models we want to experiment with? Like experiment_aug_mmlu_qwen3_235B.py
Could you also compile a list of SOTA models we want to run this with?
43f426e to bee9b3d
@@ -0,0 +1,120 @@
{"input": {"row_id": 0, "question": "Which of the following criticisms of Llewellyn's distinction between the grand and formal styles of legal reasoning is the most compelling?", "options": {"A": "There is no distinction between the two forms of legal reasoning.", "B": "Judges are appointed to interpret the law, not to make it.", "C": "It is misleading to pigeon-hole judges in this way.", "D": "Judicial reasoning is always formal."}, "gold": "C", "model": "moonshotai/kimi-k2-thinking", "branch": "A"}, "output": {"error": "Error code: 429 - {'error': {'message': 'Provider returned error', 'code': 429, 'metadata': {'raw': 'moonshotai/kimi-k2-thinking is temporarily rate-limited upstream. Please retry shortly, or add your own key to accumulate your rate limits: https://openrouter.ai/settings/integrations', 'provider_name': 'Parasail'}}, 'user_id': 'user_33KptDxZoHBCy6yNXVDTvErwjNn'}"}}
Rate limit error
def parse_options(s):
    try:
        lst = ast.literal_eval(s)
Why try-catch? Just literal eval works https://github.com/LabARSS/reasoning-fine-tune/blob/34d415f501ab894a561c50f7f6ab246b9ce3a179/src/experiments/distill/deepseek_v3_mmlu.py#L16
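A sketch of the simplified parser along the lines of the linked example; dropping the fallback is the suggestion, not the PR's current behavior:

```python
import ast

def parse_options(s):
    # Options are stored as a Python list literal; let a malformed row raise
    # instead of being silently swallowed by a fallback.
    return ast.literal_eval(s)
```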
    return ""

def _subject_from_row(row_dict: dict) -> str | None:
    return (row_dict.get("category") or row_dict.get("subject") or row_dict.get("src") or "").strip() or None
Let's stick with base_cluster like in the other experiments https://github.com/LabARSS/reasoning-fine-tune/blob/34d415f501ab894a561c50f7f6ab246b9ce3a179/src/experiments/distill/deepseek_v3_mmlu.py#L14
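A sketch of the suggested alignment; the fallback order is an assumption:

```python
def _subject_from_row(row_dict: dict) -> str | None:
    # Prefer the "base_cluster" column used by the other distill experiments,
    # falling back to the older fields if it is absent (assumption).
    value = (
        row_dict.get("base_cluster")
        or row_dict.get("category")
        or row_dict.get("subject")
        or row_dict.get("src")
        or ""
    )
    return value.strip() or None
```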
opts = "\n".join([f"{oid}. {text}".strip() for oid, text in zip(option_ids, options)])
return [
    {"role": "user", "content": f"Question: {question.strip()}\nOptions:\n{opts}\n"},
    {"role": "assistant", "content": f"Answer: {model_letter}"},
Let's add the reasoning chain to the assistant output:
{"role": "assistant", "content": f"{previous_reasoning}\nAnswer: {model_letter}"},