Getting good CoT - Eval models & good CoT data #17
-
That is one approach, particularly if you have existing data. However, I have never had access to good data, so I've always had to synthesize data. I suspect that we can get started (bootstrap) with synthetic data. As for training a reward predictor, my original idea that I discussed in this video ( https://youtu.be/FJTZP7ZdQf0?si=gc_Y0QhF5g9Uazsd ) was to create a "surface plate" series of models. It's very much like a GAN but with more models, such as:
Now, this is all based on finetuning pipelines, not RL. However, I've had excellent results in the past with finetuning, and I'd be happy to adapt all of this to RL.
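To make the bootstrap idea a bit more concrete, here is a minimal sketch of one round of a generator/grader loop over synthetic data, roughly in the GAN-like spirit described above. The function names, the `keep_threshold` value, and the loop structure are illustrative assumptions, not the actual "surface plate" pipeline:

```python
from typing import Callable, List, Tuple

# Hypothetical bootstrap round: a generator model writes CoT answers to seed
# prompts, a grader (reward predictor) scores them, and only the top-scoring
# pairs are kept as synthetic finetuning data for the next round.
def bootstrap_round(
    seed_prompts: List[str],
    generate: Callable[[str], str],       # generator model: prompt -> CoT answer
    score: Callable[[str, str], float],   # grader: (prompt, answer) -> quality in [0, 1]
    keep_threshold: float = 0.7,          # placeholder cutoff, tune per task
) -> List[Tuple[str, str]]:
    kept: List[Tuple[str, str]] = []
    for prompt in seed_prompts:
        answer = generate(prompt)
        if score(prompt, answer) >= keep_threshold:
            kept.append((prompt, answer))
    return kept  # feed this back into a standard finetuning pipeline
```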
-
You could use Monte Carlo tree search for the 'reasoner' model and a smaller but fine-tuned 'grader', either Prometheus or a custom-made fact checker built with the LFUD dataset (link below). You will run into the algorithmic instability problem with MCTS: MCTS is very good in closed-ended scenarios (e.g. chess, Go) but struggles with open-ended chain of thought, where you don't know how many steps you need to reach the answer. That step count therefore needs to be estimated as well. You can estimate it with another LLM (i.e. a 'proxy estimator' model), but personally I think it needs to be human-led for the first 500 or so questions. So the steps are as follows:
Once you have a usable 'grader' and 'reasoner', use the DeepMind method of temporal learning to keep increasing quality with more and better synthetic data, i.e. run the two models against their older versions.
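One way to sidestep the unknown-step-count problem is to treat each reasoning step as a node, let the grader's score act as the value estimate, and stop either when a step states a final answer or when a step budget (proxy-estimated, or human-set for the first few hundred questions) runs out. The sketch below uses a simplified best-first search rather than full MCTS, and every callable is a hypothetical placeholder for the actual models:

```python
import heapq
from typing import Callable, List, Tuple

# Simplified best-first search over reasoning steps (not full MCTS): the
# 'reasoner' proposes next-step candidates, the 'grader' scores partial chains,
# and max_steps stands in for the unknown chain length. The frontier is left
# unbounded here for brevity; a real run would prune it.
def search_chain(
    question: str,
    propose_steps: Callable[[str, List[str]], List[str]],  # reasoner: candidates for the next step
    grade: Callable[[str, List[str]], float],              # grader: quality of a partial chain
    is_final: Callable[[str], bool],                        # does this step state a final answer?
    max_steps: int = 8,                                     # proxy-estimated or human-set budget
) -> List[str]:
    frontier: List[Tuple[float, List[str]]] = [(0.0, [])]
    best: Tuple[float, List[str]] = (float("-inf"), [])
    while frontier:
        neg_score, chain = heapq.heappop(frontier)
        if chain and is_final(chain[-1]):
            if -neg_score > best[0]:
                best = (-neg_score, chain)
            continue
        if len(chain) >= max_steps:
            continue
        for step in propose_steps(question, chain):
            new_chain = chain + [step]
            heapq.heappush(frontier, (-grade(question, new_chain), new_chain))
    return best[1]  # empty list if no chain reached a final answer within budget
```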
-
I'm not sure if you lads came across a Google DeepMind research paper released 5 days ago. It basically challenges the current CoT and self-correction abilities of the models. This finding is probably important when you build the dataset for fine-tuning: Claude and the others may not be able to give you diverse enough self-correction examples, as all these models (except probably o1) have a limited ability to challenge their current chain of thought and reflect on it properly, i.e. models will not think outside the box unless you push them. As a result, you will not get good-quality synthetic data.
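One cheap way to "push" a model past its default agreement bias when generating synthetic self-correction examples is to frame the critique adversarially and sample several critiques at a higher temperature. This is only an illustrative sketch; the `sample` callable and the prompt wording are assumptions, not from the paper:

```python
from typing import Callable, List

# Force the model to assume something is wrong, then collect several
# independent critiques to get more diverse self-correction data.
CRITIQUE_PROMPT = (
    "Here is a question and a chain of thought.\n"
    "Question: {question}\n"
    "Chain of thought: {chain}\n"
    "Assume at least one step is wrong. Identify the weakest step, explain why "
    "it may be wrong, and propose a corrected step."
)

def diverse_critiques(
    question: str,
    chain: str,
    sample: Callable[[str, float], str],  # (prompt, temperature) -> completion
    n: int = 4,
    temperature: float = 1.0,
) -> List[str]:
    prompt = CRITIQUE_PROMPT.format(question=question, chain=chain)
    return [sample(prompt, temperature) for _ in range(n)]
```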
-
Your project sounds awesome and I would love to support it. I had been talking to others about similar ideas yesterday, and I think one of the most crucial elements will be a model that can evaluate a given chain of thought and tell you how good or how broken it is. A second important element would be a reward function: given a target you want to predict and the prediction from your model, it takes the prediction and the target and tells you how well the prediction matches the target semantically, not on a token level but in the core idea. Then you could do something like Stanford's Self-Taught Reasoner (STaR), but not only with tokens as targets but with ideas as targets. And even if a chain of thought predicts the correct result, there is still the question of whether the chain of thought that led to the correct prediction is actually of high quality, or whether it is in itself weird or broken and just got lucky in reaching the correct answer. So if we could solve this, getting good reward models and good chain-of-thought evaluators, it would help a lot.
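As a first pass at a "semantic, not token-level" reward, embedding cosine similarity is a cheap approximation. The sketch below assumes the sentence-transformers library; the model name is only an example, and a stronger setup would add an LLM judge on top of this:

```python
# Cheap semantic reward: embed prediction and target, return cosine similarity.
from sentence_transformers import SentenceTransformer, util

_model = SentenceTransformer("all-MiniLM-L6-v2")  # example model, swap as needed

def semantic_reward(prediction: str, target: str) -> float:
    emb = _model.encode([prediction, target], convert_to_tensor=True)
    return float(util.cos_sim(emb[0], emb[1]))
```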
One promising inspiration could be the Prometheus paper ( https://arxiv.org/abs/2310.08491 ), which showed that you can take a much smaller model, 13 billion parameters for example, and get it to GPT-4 performance on grading responses to instructions, if you give it a criteria catalog and some additional information, for example a reference answer whose quality you know. Getting such criteria catalogs could be achieved by first taking some high-quality data, then generating an instruction that would produce this data, and then generating the criteria catalog so that the response to the instruction would match a certain level in this catalog. So you do not start with the instruction and the criteria catalog; you start with the data that you know is pretty good. By playing around with tricks like this, I think we could get pretty far in evaluating chains of thought.
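A rough sketch of that "start from the data" direction: reverse-engineer an instruction and a rubric from a known-good response, then grade new responses against the rubric plus the reference. The `llm` and `judge` callables and the prompt wording are hypothetical, not the Prometheus setup itself:

```python
from typing import Callable, Dict

def build_rubric(good_response: str, llm: Callable[[str], str]) -> Dict[str, str]:
    # Invent the instruction the response answers, then a rubric calibrated
    # so that the known-good response scores at the top.
    instruction = llm(
        "Write the instruction that this response is answering:\n" + good_response
    )
    rubric = llm(
        "Write a 1-5 scoring rubric for responses to this instruction, such that "
        f"the reference response below scores a 5.\nInstruction: {instruction}\n"
        f"Reference response: {good_response}"
    )
    return {"instruction": instruction, "rubric": rubric, "reference": good_response}

def grade(response: str, pack: Dict[str, str], judge: Callable[[str], str]) -> str:
    # `judge` could be a small Prometheus-style grader prompted with the rubric.
    return judge(
        f"Instruction: {pack['instruction']}\nRubric: {pack['rubric']}\n"
        f"Reference answer: {pack['reference']}\nResponse to grade: {response}\n"
        "Give a score from 1 to 5 with a short justification."
    )
```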
One possible approach: take some interesting data point as the solution, then tell the model to write an instruction that would require some chain-of-thought reasoning and in the end arrive at this target data point that you already have. Once you have the instruction and the target, you let it generate a plausible chain of thought that would lead to this conclusion. The question is whether these chains of thought, where you know the target and the instruction and just have to make up a plausible chain of thought, are higher quality than the chains of thought you would get if you started with the instruction only. If that is the case, we could generate lots of data.
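A minimal sketch of that back-generation idea, with the `llm` callable and prompts as assumptions; whether these target-aware chains really beat instruction-only chains is exactly the open question above:

```python
from typing import Callable, Dict

def back_generate(target: str, llm: Callable[[str], str]) -> Dict[str, str]:
    # Start from a trusted data point, invent an instruction whose answer is
    # that data point, then write a chain of thought that ends at the target.
    instruction = llm(
        "Write a question that requires step-by-step reasoning and whose correct "
        f"final answer is:\n{target}"
    )
    chain = llm(
        f"Question: {instruction}\nThe correct final answer is: {target}\n"
        "Write a plausible step-by-step chain of thought that arrives at this answer."
    )
    return {"instruction": instruction, "chain_of_thought": chain, "target": target}
```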