Getting good CoT - Eval models & good CoT data #17
-
That is one approach, particularly if you have existing data. However, I have never had access to good data, so I've always had to synthesize data. I suspect that we can get started (bootstrap) with synthetic data. As for training a reward predictor, my original idea that I discussed in this video ( https://youtu.be/FJTZP7ZdQf0?si=gc_Y0QhF5g9Uazsd ) was to create a "surface plate" series of models. It's very much like a GAN but with more models, such as:
Now, this is all based on finetuning pipelines, not RL. However, I've had excellent results in the past with finetuning, and I'd be happy to adapt all of this to RL.
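To make the bootstrap idea a bit more concrete, here is a minimal sketch of one round of a generator/grader loop over synthetic data, roughly in the GAN-like spirit described above. The function names, the `keep_threshold` value, and the loop structure are illustrative assumptions, not the actual "surface plate" pipeline:

```python
from typing import Callable, List, Tuple

# Hypothetical bootstrap round: a generator model writes CoT answers to seed
# prompts, a grader (reward predictor) scores them, and only the top-scoring
# pairs are kept as synthetic finetuning data for the next round.
def bootstrap_round(
    seed_prompts: List[str],
    generate: Callable[[str], str],       # generator model: prompt -> CoT answer
    score: Callable[[str, str], float],   # grader: (prompt, answer) -> quality in [0, 1]
    keep_threshold: float = 0.7,          # placeholder cutoff, tune per task
) -> List[Tuple[str, str]]:
    kept: List[Tuple[str, str]] = []
    for prompt in seed_prompts:
        answer = generate(prompt)
        if score(prompt, answer) >= keep_threshold:
            kept.append((prompt, answer))
    return kept  # feed this back into a standard finetuning pipeline
```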
-
You could use Monte Carlo tree search for the 'reasoner' model and a smaller but fine-tuned 'grader', either Prometheus or a custom-made fact checker built with the LFUD dataset (link below). You will run into the algorithmic instability problem with MCTS: MCTS is very good in closed-ended scenarios (e.g. chess, Go) but struggles with open-ended chain of thought, where you don't know how many steps you need to reach the answer. That step count therefore needs to be estimated as well. You can estimate it with another LLM (i.e. a 'proxy estimator' model), but personally I think it needs to be human-led for the first 500 or so questions. So the steps are as follows:
Once you have a usable 'grader' and 'reasoner', use the DeepMind method of temporal learning to keep increasing quality with more and better synthetic data, i.e. run the two models against their older versions.
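One way to sidestep the unknown-step-count problem is to treat each reasoning step as a node, let the grader's score act as the value estimate, and stop either when a step states a final answer or when a step budget (proxy-estimated, or human-set for the first few hundred questions) runs out. The sketch below uses a simplified best-first search rather than full MCTS, and every callable is a hypothetical placeholder for the actual models:

```python
import heapq
from typing import Callable, List, Tuple

# Simplified best-first search over reasoning steps (not full MCTS): the
# 'reasoner' proposes next-step candidates, the 'grader' scores partial chains,
# and max_steps stands in for the unknown chain length. The frontier is left
# unbounded here for brevity; a real run would prune it.
def search_chain(
    question: str,
    propose_steps: Callable[[str, List[str]], List[str]],  # reasoner: candidates for the next step
    grade: Callable[[str, List[str]], float],              # grader: quality of a partial chain
    is_final: Callable[[str], bool],                        # does this step state a final answer?
    max_steps: int = 8,                                     # proxy-estimated or human-set budget
) -> List[str]:
    frontier: List[Tuple[float, List[str]]] = [(0.0, [])]
    best: Tuple[float, List[str]] = (float("-inf"), [])
    while frontier:
        neg_score, chain = heapq.heappop(frontier)
        if chain and is_final(chain[-1]):
            if -neg_score > best[0]:
                best = (-neg_score, chain)
            continue
        if len(chain) >= max_steps:
            continue
        for step in propose_steps(question, chain):
            new_chain = chain + [step]
            heapq.heappush(frontier, (-grade(question, new_chain), new_chain))
    return best[1]  # empty list if no chain reached a final answer within budget
```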
-
I'm not sure if you lads came across a Google DeepMind research paper released 5 days ago. It basically challenges the current CoT and self-correction abilities of the models. This finding is probably important when you build the dataset for fine-tuning: Claude and the others may not be able to give you diverse enough self-correction examples, as all these models (except probably o1) have a limited ability to challenge their current chain of thought and reflect on it properly, i.e. models will not think outside the box unless you push them. As a result, you will not get good-quality synthetic data.
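One cheap way to "push" a model past its default agreement bias when generating synthetic self-correction examples is to frame the critique adversarially and sample several critiques at a higher temperature. This is only an illustrative sketch; the `sample` callable and the prompt wording are assumptions, not from the paper:

```python
from typing import Callable, List

# Force the model to assume something is wrong, then collect several
# independent critiques to get more diverse self-correction data.
CRITIQUE_PROMPT = (
    "Here is a question and a chain of thought.\n"
    "Question: {question}\n"
    "Chain of thought: {chain}\n"
    "Assume at least one step is wrong. Identify the weakest step, explain why "
    "it may be wrong, and propose a corrected step."
)

def diverse_critiques(
    question: str,
    chain: str,
    sample: Callable[[str, float], str],  # (prompt, temperature) -> completion
    n: int = 4,
    temperature: float = 1.0,
) -> List[str]:
    prompt = CRITIQUE_PROMPT.format(question=question, chain=chain)
    return [sample(prompt, temperature) for _ in range(n)]
```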
-
Your project sounds awesome and I would love to support it. I had been talking to others about similar ideas yesterday, and I think one of the most crucial elements will be a model that can evaluate a given chain of thought and tell you how good or how broken it is. A second important element would be a reward function: given a target you want to predict and the prediction from your model, it takes the prediction and the target and tells you how well the prediction matches the target semantically, not on a token level but in the core idea. Then you could do something like Stanford's Self-Taught Reasoner (STaR), but not only with tokens as targets but with ideas as targets. And even if a chain of thought predicts the correct result, there is still the question of whether the chain of thought that led to the correct prediction is actually of high quality, or whether it is in itself weird or broken and just got lucky in reaching the correct answer. So if we could solve this, getting good reward models and good chain-of-thought evaluators, it would help a lot.
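As a first pass at a "semantic, not token-level" reward, embedding cosine similarity is a cheap approximation. The sketch below assumes the sentence-transformers library; the model name is only an example, and a stronger setup would add an LLM judge on top of this:

```python
# Cheap semantic reward: embed prediction and target, return cosine similarity.
from sentence_transformers import SentenceTransformer, util

_model = SentenceTransformer("all-MiniLM-L6-v2")  # example model, swap as needed

def semantic_reward(prediction: str, target: str) -> float:
    emb = _model.encode([prediction, target], convert_to_tensor=True)
    return float(util.cos_sim(emb[0], emb[1]))
```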
One promising inspiration could be the Prometheus paper ( https://arxiv.org/abs/2310.08491 ), which showed that you can take a much smaller model, 13 billion parameters for example, and get it to GPT-4 performance on grading responses to instructions, if you give it a criteria catalog and some additional information, for example a reference answer whose quality you know. Getting such criteria catalogs could be achieved by first taking some high-quality data, then generating an instruction that would produce this data, and then generating the criteria catalog so that the response to the instruction would match a certain level in this catalog. So you do not start with the instruction and the criteria catalog; you start with the data that you know is pretty good. By playing around with tricks like this, I think we could get pretty far in evaluating chains of thought.
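A rough sketch of that "start from the data" direction: reverse-engineer an instruction and a rubric from a known-good response, then grade new responses against the rubric plus the reference. The `llm` and `judge` callables and the prompt wording are hypothetical, not the Prometheus setup itself:

```python
from typing import Callable, Dict

def build_rubric(good_response: str, llm: Callable[[str], str]) -> Dict[str, str]:
    # Invent the instruction the response answers, then a rubric calibrated
    # so that the known-good response scores at the top.
    instruction = llm(
        "Write the instruction that this response is answering:\n" + good_response
    )
    rubric = llm(
        "Write a 1-5 scoring rubric for responses to this instruction, such that "
        f"the reference response below scores a 5.\nInstruction: {instruction}\n"
        f"Reference response: {good_response}"
    )
    return {"instruction": instruction, "rubric": rubric, "reference": good_response}

def grade(response: str, pack: Dict[str, str], judge: Callable[[str], str]) -> str:
    # `judge` could be a small Prometheus-style grader prompted with the rubric.
    return judge(
        f"Instruction: {pack['instruction']}\nRubric: {pack['rubric']}\n"
        f"Reference answer: {pack['reference']}\nResponse to grade: {response}\n"
        "Give a score from 1 to 5 with a short justification."
    )
```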
One possible approach: take some interesting data point as the solution, then tell the model to write an instruction that would require some chain-of-thought reasoning and in the end arrive at this target data point that you already have. Once you have the instruction and the target, you let it generate a plausible chain of thought that would lead to this conclusion. The question is whether these chains of thought, where you know the target and the instruction and just have to make up a plausible chain of thought, are higher quality than the chains of thought you would get if you started with the instruction only. If that is the case, we could generate lots of data.
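A minimal sketch of that back-generation idea, with the `llm` callable and prompts as assumptions; whether these target-aware chains really beat instruction-only chains is exactly the open question above:

```python
from typing import Callable, Dict

def back_generate(target: str, llm: Callable[[str], str]) -> Dict[str, str]:
    # Start from a trusted data point, invent an instruction whose answer is
    # that data point, then write a chain of thought that ends at the target.
    instruction = llm(
        "Write a question that requires step-by-step reasoning and whose correct "
        f"final answer is:\n{target}"
    )
    chain = llm(
        f"Question: {instruction}\nThe correct final answer is: {target}\n"
        "Write a plausible step-by-step chain of thought that arrives at this answer."
    )
    return {"instruction": instruction, "chain_of_thought": chain, "target": target}
```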