<feat WIP>: augmenting mmlu #29
Conversation
if field_response not in df.columns:
    df[field_response] = ""
if field_response not in df.columns:
if field_ans not in df.columns:
Great catch!
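Judging from the hunk above, the second guard presumably should check field_ans rather than repeating field_response; a minimal sketch of the presumed fix:

```python
# Presumed fix (assumption): each column gets its own existence guard.
if field_response not in df.columns:
    df[field_response] = ""
if field_ans not in df.columns:
    df[field_ans] = ""
```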
Return JSON only with:
{{"answer":"{letters}","rationale":"1-3 sentences (concise)","key_steps":["fact1","fact2","fact3"]}}
Answer the MCQ briefly and factually (no step-by-step reasoning).
Why? I thought we wanted to elicit step-by-step reasoning
If we use step-by-step, it makes sense to enable thinking, which will be very expensive on MMLU-Pro (try changing the prompt and setting the -1 flag to experiment).
I mean:
thinking_config=types.ThinkingConfig(thinking_budget=0)
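For context, a minimal sketch of where that would be passed, assuming the google-genai SDK that the earlier Gemini-based version presumably used (model name and prompt are placeholders):

```python
from google import genai
from google.genai import types

client = genai.Client()  # reads GEMINI_API_KEY from the environment

mcq_prompt = "Question: ...\nOptions:\nA. ...\nB. ..."  # placeholder MCQ prompt

response = client.models.generate_content(
    model="gemini-2.5-flash",  # placeholder model name
    contents=mcq_prompt,
    config=types.GenerateContentConfig(
        # 0 disables thinking entirely; -1 enables dynamic thinking
        thinking_config=types.ThinkingConfig(thinking_budget=0)
    ),
)
print(response.text)
```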
I guess we need to better align on the goal of the experiment first, then. Could you add a design doc to docs with: hypothesis, execution plan, expected results?
},
}

def process_tsv(tsv_path, out_jsonl, limit=None):
Unify it with the existing distill_on_dataset?
Yes, we can, but I think the CLI call is more useful.
Shall we use it as a function call from a script in experiments instead of a plain CLI call, then?
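For illustration, a sketch of the suggested pattern, with the module path and argument values as assumptions:

```python
# experiments/aug_mmlu/run_aug_mmlu.py -- hypothetical experiment script
from core.synth_mmlu import synth_on_dataset  # assumed import path

if __name__ == "__main__":
    synth_on_dataset(
        in_filename="data/mmlu_pro.tsv",     # placeholder paths
        out_jsonl="out/mmlu_pro_aug.jsonl",
        limit=100,
    )
```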
I accidentally renamed the file and made changes in one commit (sorry about that). In general: I changed Gemini to OpenRouter, changed the logic to step-by-step reasoning, and left key_steps as the summary. In the future, I want to merge branch A into branch C, since the first part of C duplicates A.
@@ -0,0 +1,17 @@
1) **Main point**

Obtain a synthetic dataset (answers + brief explanations + analysis of erroneous answers + CoT tokens) for training subsequent models.
We actually want to elicit the full reasoning chain, don't we?
Could you also add why we want to do it? AFAIU, we want to fine-tune small models on different versions of the distilled CoT and compare the performance. Right?
Do we still need brief explanations?
from core.prompts.mmlu_branches_aug import *

# defaults
DEFAULT_MODEL = os.getenv("OPENROUTER_MODEL", "deepseek/deepseek-r1:free")
Shall we pass it as a config? As discussed before, we want reproducible results, and it is extremely easy to forget which options we used if we pass them as env vars or CLI args.
It's just a default argument in the file; the function itself accepts and works with explicit arguments. So using a config is the responsibility of whoever calls the code.
def synth_on_dataset(
    in_filename: str,
    out_jsonl: str,
    model: str = DEFAULT_MODEL,
    max_tokens: int = DEFAULT_MAX_TOKENS,
    dump_every: int = DUMP_EVERY,
    limit: int | None = None,
    branches: tuple[str, ...] = DEFAULT_BRANCHES
):
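If we do go the config route, a minimal sketch of a reproducible caller, assuming a JSON config file and the import path (both hypothetical):

```python
# experiments/aug_mmlu/run_from_config.py -- hypothetical caller
import json

from core.synth_mmlu import synth_on_dataset  # assumed import path

# The config pins every option, so a run can be reproduced from the file alone.
with open("configs/aug_mmlu.json") as f:      # assumed config file
    cfg = json.load(f)

synth_on_dataset(
    in_filename=cfg["in_filename"],
    out_jsonl=cfg["out_jsonl"],
    model=cfg["model"],
    max_tokens=cfg["max_tokens"],
    limit=cfg.get("limit"),
)
```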
CHUNK_SIZE = int(os.getenv("SYNTH_CHUNK_SIZE", "16"))
DUMP_EVERY = int(os.getenv("SYNTH_DUMP_EVERY", "10"))

ALL_LETTERS = [chr(c) for c in range(ord("A"), ord("Z")+1)]
Re-use what we have in https://github.com/LabARSS/reasoning-fine-tune/blob/85cc151cdfcac6a5ec409a9f2583486318fe7ed0/src/reasoning_fine_tune/prompts/mmlu_single_token_answer.py#L34?
Extract it into a separate file for better readability?
Let's discuss this in a conference call.
@SemyonEpanov we already have option_ids
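A sketch of the suggested reuse; the import path follows the linked file, and treating option_ids as an importable sequence is an assumption:

```python
# Assumed reuse: take the shared option ids instead of a local A..Z list.
from reasoning_fine_tune.prompts.mmlu_single_token_answer import option_ids  # assumed symbol

letters = option_ids[: len(options)]  # one letter per provided option
```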
pass
j = j or {}

if reasoning_text and "thinking" not in j:
Could you help me understand what we are doing here?
Validation of the JSON received from the LLM.
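For clarity, a sketch of what that validation appears to do; raw_text and reasoning_text follow the snippet above, and the back-fill of "thinking" is an assumption:

```python
import json

try:
    j = json.loads(raw_text)      # the model is asked to return strict JSON
except json.JSONDecodeError:
    j = None
j = j or {}                       # fall back to an empty dict on parse failure

# If the provider returned a separate reasoning trace and the JSON has no
# "thinking" field, back-fill it so the trace is not lost (assumption).
if reasoning_text and "thinking" not in j:
    j["thinking"] = reasoning_text
```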
record_in = _build_record_in(row_dict, question, choices, letters, gold, model)
jobs.append((row.Index, "A", {"question": question, "choices": choices, "gold": gold, "record_in": record_in, "letters": letters}))
jobs.append((row.Index, "B", {"question": question, "choices": choices, "gold": gold, "record_in": record_in, "letters": letters}))
jobs.append((row.Index, "C", {"question": question, "choices": choices, "gold": gold, "record_in": record_in, "letters": letters}))
Should we first get A and B, and then run C on top of A, as you proposed in the chat before?
Yes, that's right. In the new version it's refactored.
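A rough sketch of that two-phase layout; build_payload and run_jobs are assumed helpers, not names from the PR:

```python
# Branches A and B are scheduled first; each C job is then seeded with the
# A result for the same row.
payloads = {row.Index: build_payload(row) for row in df.itertuples()}   # assumed helper

jobs_ab = [(idx, branch, p) for idx, p in payloads.items() for branch in ("A", "B")]
a_results = run_jobs(jobs_ab)   # assumed executor returning {row_index: branch-A output}

jobs_c = [
    (idx, "C", {**p, "branch_a_output": a_results[idx]})
    for idx, p in payloads.items()
]
run_jobs(jobs_c)
```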
Return JSON ONLY with the following schema:
{{
"answer": "{letters}",
"rationale": "concise 1-2 sentence justification (no fluff)",
Shall we ask for the final answer straight away if we are using a reasoning model and we can extract its reasoning chain?
Return JSON only:
{{"correct_answer":"{letters}",
"why_correct": "step-by-step reasoning showing why the gold option is correct",
"distractor_analysis": {distractor_tpl} }}
Could you help me understand distractor_analysis vs why_correct?
distractor_analysis explains each of the non-gold answer options; why_correct explains the correct answer.
distractor_tpl = "{" + ", ".join([f'"{L}":"..."' for L in letters if L != gold]) + "}"

prompt1 = (p_json_guardrails() + "\n" + base + p_branch_c_one(allowed))
j1 = _openrouter_json("You return STRICT JSON.", user_prompt=prompt1, model=model, max_tokens=max_tokens)
Could we include p_json_guardrails as system prompt? What is the point of repeating it?
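A sketch of the suggested change, keeping the _openrouter_json signature from the hunk above, where the first positional argument is the system prompt:

```python
# Pass the guardrails once as the system prompt instead of prepending them to
# every user prompt (sketch of the suggestion, not the PR's implementation).
prompt1 = base + p_branch_c_one(allowed)
j1 = _openrouter_json(
    p_json_guardrails(),     # system prompt: strict-JSON instructions
    user_prompt=prompt1,
    model=model,
    max_tokens=max_tokens,
)
```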
parts = [p.strip().strip("'").strip('"') for p in s.split(",")]
return [p for p in parts if p]

def render_mc_prompt(question, choices, letters):
Could we replicate (or better generalize and re-use) prompts in mmlu_single_token_answer.py?
@@ -0,0 +1,31 @@
from pathlib import Path
Could we have a dedicated file for each one of the models we want to experiment with? Like experiment_aug_mmlu_qwen3_235B.py
Could you also compile a list of SOTA models we want to run this with?
43f426e to bee9b3d
@@ -0,0 +1,120 @@
{"input": {"row_id": 0, "question": "Which of the following criticisms of Llewellyn's distinction between the grand and formal styles of legal reasoning is the most compelling?", "options": {"A": "There is no distinction between the two forms of legal reasoning.", "B": "Judges are appointed to interpret the law, not to make it.", "C": "It is misleading to pigeon-hole judges in this way.", "D": "Judicial reasoning is always formal."}, "gold": "C", "model": "moonshotai/kimi-k2-thinking", "branch": "A"}, "output": {"error": "Error code: 429 - {'error': {'message': 'Provider returned error', 'code': 429, 'metadata': {'raw': 'moonshotai/kimi-k2-thinking is temporarily rate-limited upstream. Please retry shortly, or add your own key to accumulate your rate limits: https://openrouter.ai/settings/integrations', 'provider_name': 'Parasail'}}, 'user_id': 'user_33KptDxZoHBCy6yNXVDTvErwjNn'}"}}
Rate limit error
def parse_options(s):
    try:
        lst = ast.literal_eval(s)
Why try-catch? Just literal eval works https://github.com/LabARSS/reasoning-fine-tune/blob/34d415f501ab894a561c50f7f6ab246b9ce3a179/src/experiments/distill/deepseek_v3_mmlu.py#L16
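A sketch of the simplified parser along the lines of the linked example; dropping the fallback is the suggestion, not the PR's current behavior:

```python
import ast

def parse_options(s):
    # Options are stored as a Python list literal; let a malformed row raise
    # instead of being silently swallowed by a fallback.
    return ast.literal_eval(s)
```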
    return ""

def _subject_from_row(row_dict: dict) -> str | None:
    return (row_dict.get("category") or row_dict.get("subject") or row_dict.get("src") or "").strip() or None
Let's stick with base_cluster like in the other experiments https://github.com/LabARSS/reasoning-fine-tune/blob/34d415f501ab894a561c50f7f6ab246b9ce3a179/src/experiments/distill/deepseek_v3_mmlu.py#L14
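A sketch of the suggested alignment; the fallback order is an assumption:

```python
def _subject_from_row(row_dict: dict) -> str | None:
    # Prefer the "base_cluster" column used by the other distill experiments,
    # falling back to the older fields if it is absent (assumption).
    value = (
        row_dict.get("base_cluster")
        or row_dict.get("category")
        or row_dict.get("subject")
        or row_dict.get("src")
        or ""
    )
    return value.strip() or None
```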
opts = "\n".join([f"{oid}. {text}".strip() for oid, text in zip(option_ids, options)])
return [
    {"role": "user", "content": f"Question: {question.strip()}\nOptions:\n{opts}\n"},
    {"role": "assistant", "content": f"Answer: {model_letter}"},
Let's add the reasoning chain to the assistant output:
{"role": "assistant", "content": f"{previous_reasoning}\nAnswer: {model_letter}"},