Minimal demonstration with RL & CoT repair with reflection & correction-model #19
andreaskoepf started this conversation in Research
I like your initiative & would potentially be interested in collaborating, conducting experiments, and sharing findings. I am actively tracking the system-2 research field with the Open-Thought initiative: here.
Our primary challenge is generating a large set of high-quality (logically correct) CoTs for diverse problems in a scalable fashion.
Things I would be interested in exploring/testing:
a) when training on correction data, exclude all CoT tokens from the loss and train only on the output (similar to masking user queries and training only on assistant answers in chat models)
b) use a separate classification/repair model trained on wrong CoTs and their corrections. During inference, the CoT input part would always be passed as part of the prompt and never generated by the correction model itself. The main reasoning model would never be intentionally trained on corrupted CoTs.
c) potentially a single model could be trained for both correction capabilities and main reasoning, provided the correction data is always correctly conditioned on a mode-specific sequence (or marker tokens), e.g. in correction mode the original CoT would be passed AFTER a correction marker, while valid main reasoning would follow a clean-thought marker (basically you specify which mode to use with a prefix)
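Idea (a) above can be sketched with the usual label-masking trick: positions that should not contribute to the loss get the ignore index (-100, the convention used by PyTorch's cross-entropy and HF Transformers). This is a minimal pure-Python sketch over toy token ids; `build_labels` and the example ids are hypothetical, not from any existing codebase.

```python
# Ignore index convention from torch.nn.CrossEntropyLoss / HF Transformers:
# label positions set to -100 are excluded from the loss.
IGNORE_INDEX = -100

def build_labels(input_ids, answer_start):
    """Copy input_ids into labels, but mask everything before the
    final-answer span (prompt + CoT tokens) so only the output is trained."""
    labels = list(input_ids)
    labels[:answer_start] = [IGNORE_INDEX] * answer_start
    return labels

# toy sequence layout: [prompt: 3 tokens][CoT: 4 tokens][answer: 2 tokens]
ids = [11, 12, 13, 21, 22, 23, 24, 31, 32]
labels = build_labels(ids, answer_start=7)
print(labels)  # → [-100, -100, -100, -100, -100, -100, -100, 31, 32]
```

The same mechanics already handle user-query masking in chat fine-tuning; here the CoT span is simply added to the masked region.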
(I had been discussing some ideas with christophschuhmann lately ... saw he also posted here - but I think our posts are complementary.)