
Conversation

@SemyonEpanov (Collaborator)

No description provided.

 if field_response not in df.columns:
     df[field_response] = ""
-if field_response not in df.columns:
+if field_ans not in df.columns:

Member:

Great catch!

Return JSON only with:
{{"answer":"{letters}","rationale":"1-3 sentences (concise)","key_steps":["fact1","fact2","fact3"]}}
Answer the MCQ briefly and factually (no step-by-step reasoning).

Member:

Why? I thought we wanted to elicit step-by-step reasoning

Collaborator Author:

If we use step-by-step, it makes sense to enable thinking, which will be very expensive on MMLU-Pro (try changing the prompt and setting the flag to -1 for the experiment).

Collaborator Author:

I mean:

            thinking_config=types.ThinkingConfig(thinking_budget=0)
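
For context, a minimal sketch of what the -1 flag refers to in the google-genai SDK (the model name below is only an example, not necessarily the one used in this PR): thinking_budget=0 disables thinking, while -1 enables dynamic thinking, which is what makes step-by-step elicitation expensive on MMLU-Pro.

from google import genai
from google.genai import types

client = genai.Client()  # reads the API key from the environment
response = client.models.generate_content(
    model="gemini-2.5-flash",  # example model
    contents="Answer the MCQ ...",
    config=types.GenerateContentConfig(
        # 0 disables thinking; -1 lets the model decide how much to think
        thinking_config=types.ThinkingConfig(thinking_budget=-1),
    ),
)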

Member:

I guess we need to better align on the goal of the experiment first, then. Could you add a design doc to docs with: hypothesis, execution plan, expected results?

},
}

def process_tsv(tsv_path, out_jsonl, limit=None):

Member:

Unify it with the existing distill_on_dataset?

Collaborator Author:

Yes, we can, but I think the CLI call is more useful.

Member:

Shall we use it as a function call from a script in experiments instead of a plain CLI call, then?

@SemyonEpanov (Collaborator Author):

I accidentally renamed the file and made the changes in one commit (sorry about that).

In general: I switched from Gemini to OpenRouter, changed the logic to step-by-step reasoning, and kept key_steps as the summary.

In the future, I want to merge branch a into branch c, since the first part of c duplicates a.

@@ -0,0 +1,17 @@
1) **Main point**

Obtain a synthetic dataset (answers + brief explanations + analysis of erroneous answers + CoT tokens) for training subsequent models.

Member:

We actually want to elicit the full reasoning chain, don't we?

Could you also add why we want to do it? AFAIU, we want to fine-tune small models on different versions of the distilled CoT and compare the performance. Right?

Member:

Do we still need brief explanations?

from core.prompts.mmlu_branches_aug import *

# defaults
DEFAULT_MODEL = os.getenv("OPENROUTER_MODEL", "deepseek/deepseek-r1:free")

Member:

Shall we pass it as config? As discussed before, we want reproducible results, and it is extremely easy to forget what options we used if we pass them as env or CLI args.

Collaborator Author:

It's just a default argument in the file; the function itself accepts and works with explicit arguments. So using a config is the responsibility of whoever calls the code.
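
A minimal, hypothetical caller sketch for the synth_on_dataset signature quoted below (the config path and keys are illustrative, not part of this PR): a reproducible run loads its options from a config file and passes them through explicitly.

import json

with open("configs/synth_mmlu.json") as f:  # hypothetical config file
    cfg = json.load(f)

synth_on_dataset(
    in_filename=cfg["in_filename"],
    out_jsonl=cfg["out_jsonl"],
    model=cfg.get("model", DEFAULT_MODEL),
    max_tokens=cfg.get("max_tokens", DEFAULT_MAX_TOKENS),
    limit=cfg.get("limit"),
)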

def synth_on_dataset(
    in_filename: str,
    out_jsonl: str,
    model: str = DEFAULT_MODEL,
    max_tokens: int = DEFAULT_MAX_TOKENS,
    dump_every: int = DUMP_EVERY,
    limit: int | None = None,
    branches: tuple[str, ...] = DEFAULT_BRANCHES
):

CHUNK_SIZE = int(os.getenv("SYNTH_CHUNK_SIZE", "16"))
DUMP_EVERY = int(os.getenv("SYNTH_DUMP_EVERY", "10"))

ALL_LETTERS = [chr(c) for c in range(ord("A"), ord("Z")+1)]

Collaborator Author:

Let's discuss this in a conference call.

Member:

@SemyonEpanov we already have option_ids

pass
j = j or {}

if reasoning_text and "thinking" not in j:

Member:

Could you help me understand what we are doing here?

Collaborator Author:

Validation of the JSON received from the LLM.
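
For context, a minimal sketch of the kind of lenient parsing/validation being described (the helper name is illustrative, not the PR's actual function):

import json

def parse_llm_json(raw_text: str, reasoning_text: str | None = None) -> dict:
    # Try to parse the model output as JSON; fall back to an empty dict on failure.
    try:
        j = json.loads(raw_text)
    except (json.JSONDecodeError, TypeError):
        j = {}
    j = j or {}
    # If the provider exposed a separate reasoning trace, keep it alongside
    # the parsed fields without overwriting anything the model returned.
    if reasoning_text and "thinking" not in j:
        j["thinking"] = reasoning_text
    return j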

record_in = _build_record_in(row_dict, question, choices, letters, gold, model)
jobs.append((row.Index, "A", {"question": question, "choices": choices, "gold": gold, "record_in": record_in, "letters": letters}))
jobs.append((row.Index, "B", {"question": question, "choices": choices, "gold": gold, "record_in": record_in, "letters": letters}))
jobs.append((row.Index, "C", {"question": question, "choices": choices, "gold": gold, "record_in": record_in, "letters": letters}))

Member:

Should we first get A and B? And then run C on top of A, as you proposed in the chat before?

Collaborator Author:

Yes, that's right. In the new version, it's refactored.

Return JSON ONLY with the following schema:
{{
"answer": "{letters}",
"rationale": "concise 1-2 sentence justification (no fluff)",

Member:

Shall we ask for the final answer straight away if we are using a reasoning model and we can extract its reasoning chain?

Return JSON only:
{{"correct_answer":"{letters}",
"why_correct": "step-by-step reasoning showing why the gold option is correct",
"distractor_analysis": {distractor_tpl} }}

Member:

Could you help me understand distractor_analysis vs why_correct?

Collaborator Author:

distractor_analysis explains why each of the other answer options is wrong; why_correct explains why the correct answer is right.
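
For illustration only (content invented), the branch-C schema quoted earlier would produce something like:

{"correct_answer": "C",
 "why_correct": "Step 1: ... Step 2: ... therefore C is correct.",
 "distractor_analysis": {"A": "why A is wrong", "B": "why B is wrong", "D": "why D is wrong"}}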

distractor_tpl = "{" + ", ".join([f'"{L}":"..."' for L in letters if L != gold]) + "}"

prompt1 = (p_json_guardrails() + "\n" + base + p_branch_c_one(allowed))
j1 = _openrouter_json("You return STRICT JSON.", user_prompt=prompt1, model=model, max_tokens=max_tokens)

Member:

Could we include p_json_guardrails as the system prompt? What is the point of repeating it in every user prompt?
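
A minimal sketch of that suggestion, keeping the names from the diff above and assuming _openrouter_json's first argument is the system message:

system_prompt = p_json_guardrails()
prompt1 = base + p_branch_c_one(allowed)
j1 = _openrouter_json(system_prompt, user_prompt=prompt1, model=model, max_tokens=max_tokens)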

parts = [p.strip().strip("'").strip('"') for p in s.split(",")]
return [p for p in parts if p]

def render_mc_prompt(question, choices, letters):

Member:

Could we replicate (or, better, generalize and re-use) the prompts in mmlu_single_token_answer.py?

@@ -0,0 +1,31 @@
from pathlib import Path

Member:

Could we have a dedicated file for each one of the models we want to experiment with? Like experiment_aug_mmlu_qwen3_235B.py
Could you also compile a list of SOTA models we want to run this with?

@SemyonEpanov force-pushed the feat/data-aug-think branch 2 times, most recently from 43f426e to bee9b3d on November 8, 2025 at 21:28
@@ -0,0 +1,120 @@
{"input": {"row_id": 0, "question": "Which of the following criticisms of Llewellyn's distinction between the grand and formal styles of legal reasoning is the most compelling?", "options": {"A": "There is no distinction between the two forms of legal reasoning.", "B": "Judges are appointed to interpret the law, not to make it.", "C": "It is misleading to pigeon-hole judges in this way.", "D": "Judicial reasoning is always formal."}, "gold": "C", "model": "moonshotai/kimi-k2-thinking", "branch": "A"}, "output": {"error": "Error code: 429 - {'error': {'message': 'Provider returned error', 'code': 429, 'metadata': {'raw': 'moonshotai/kimi-k2-thinking is temporarily rate-limited upstream. Please retry shortly, or add your own key to accumulate your rate limits: https://openrouter.ai/settings/integrations', 'provider_name': 'Parasail'}}, 'user_id': 'user_33KptDxZoHBCy6yNXVDTvErwjNn'}"}}

Member:

Rate limit error
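
Not part of this PR, but one possible mitigation sketch: retry the provider call with exponential backoff on 429 responses (the helper name is illustrative):

import time

def call_with_retry(request_fn, max_retries=5, base_delay=2.0):
    # Retry on 429 rate-limit responses with exponential backoff.
    for attempt in range(max_retries):
        try:
            return request_fn()
        except Exception as exc:  # ideally narrow this to the client's rate-limit error type
            if "429" not in str(exc) or attempt == max_retries - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))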


def parse_options(s):
try:
lst = ast.literal_eval(s)
return ""

def _subject_from_row(row_dict: dict) -> str | None:
return (row_dict.get("category") or row_dict.get("subject") or row_dict.get("src") or "").strip() or None
opts = "\n".join([f"{oid}. {text}".strip() for oid, text in zip(option_ids, options)])
return [
{"role": "user", "content": f"Question: {question.strip()}\nOptions:\n{opts}\n"},
{"role": "assistant", "content": f"Answer: {model_letter}"},

Member:

Let's add the reasoning chain to the assistant output:

{"role": "assistant", "content": f"{previous_reasoning}\nAnswer: {model_letter}"},

@SemyonEpanov merged commit 41a3cf2 into main on Nov 13, 2025