Minimal demonstration with RL & CoT repair with reflection & correction-model #19
andreaskoepf started this conversation in Research
I like your initiative & would potentially be interested in collaborating, conducting experiments, and sharing findings. I am actively tracking the system-2 research field with the Open-Thought initiative: here.
Our primary challenge is generating a large set of high-quality (logically correct) CoTs for diverse problems in a scalable fashion.
Things I would be interested in exploring/testing:
a) when training on correction data, exclude all CoT tokens from the loss and train only on the output (similar to masking user queries and training only on assistant answers in chat models)
b) use a separate classification/repair model trained on wrong CoTs and their corrections. During inference, the CoT input part would always be passed as part of the prompt and never generated by the correction model itself. The main reasoning model would never be intentionally trained on corrupted CoTs.
c) potentially a single model could be trained for both correction capabilities and main reasoning, provided the correction data is always correctly conditioned on a mode-specific sequence (or marker tokens), e.g. in correction mode the original CoT would be passed AFTER a correction marker, while valid main reasoning would follow a clean-thought marker (basically you specify which mode to use with a prefix)
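Idea (a) above can be sketched with the usual label-masking trick: positions that should not contribute to the loss get the ignore index (-100, the convention used by PyTorch's cross-entropy and HF Transformers). This is a minimal pure-Python sketch over toy token ids; `build_labels` and the example ids are hypothetical, not from any existing codebase.

```python
# Ignore index convention from torch.nn.CrossEntropyLoss / HF Transformers:
# label positions set to -100 are excluded from the loss.
IGNORE_INDEX = -100

def build_labels(input_ids, answer_start):
    """Copy input_ids into labels, but mask everything before the
    final-answer span (prompt + CoT tokens) so only the output is trained."""
    labels = list(input_ids)
    labels[:answer_start] = [IGNORE_INDEX] * answer_start
    return labels

# toy sequence layout: [prompt: 3 tokens][CoT: 4 tokens][answer: 2 tokens]
ids = [11, 12, 13, 21, 22, 23, 24, 31, 32]
labels = build_labels(ids, answer_start=7)
print(labels)  # → [-100, -100, -100, -100, -100, -100, -100, 31, 32]
```

The same mechanics already handle user-query masking in chat fine-tuning; here the CoT span is simply added to the masked region.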
(I had been discussing some ideas with christophschuhmann lately ... saw he also posted here - but I think our posts are complementary.)