- logprob standalone job (sketch at the end of these notes)
- mc-question standalone job (sketch below)
- wrapper: 0-100 judge (sketch below)
- integrate into viseval
- single-token eval (covered by the mc-question sketch below)
- train model on reward = -sft loss(f(sampled text)) (sketch below)
- f(sampled text) = remove_cot(sampled text), i.e. strip the chain-of-thought from the sample before scoring
- use a very small model
- the target text contains some hard tokens and some predictable ones
- the model should learn something like: "What is 123 * 456?" -> "The answer is <reasoning...> x", where the reasoning is the CoT that f strips out and x is the final answer
- we can initialize with synthetic sft
- merge chat.py and temporary_api.py
- add cpu instances
- make the "keep worker running for X mins" option customisable (sketch below)
- deleting an API key revokes access (sketch below)
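
A minimal sketch of the logprob standalone job, assuming an HF-style causal LM: sum the log-probabilities the model assigns to a target completion given a prompt. The model name, prompt, and target are placeholders; the real job would presumably loop over a dataset.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def target_logprob(model, tokenizer, prompt: str, target: str) -> float:
    """Sum of log p(target tokens | prompt) under the model."""
    # assumes tokenizing the prompt alone yields a prefix of tokenizing prompt+target
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    full_ids = tokenizer(prompt + target, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits                   # [1, seq, vocab]
    logprobs = torch.log_softmax(logits[:, :-1], dim=-1)  # position i predicts token i+1
    token_lp = logprobs.gather(2, full_ids[:, 1:].unsqueeze(-1)).squeeze(-1)
    n_prompt = prompt_ids.shape[1]
    return token_lp[0, n_prompt - 1:].sum().item()        # keep only the target positions

if __name__ == "__main__":
    name = "sshleifer/tiny-gpt2"                          # placeholder tiny model
    tok = AutoTokenizer.from_pretrained(name)
    lm = AutoModelForCausalLM.from_pretrained(name).eval()
    print(target_logprob(lm, tok, "What is 2 + 2?", " The answer is 4."))
```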
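
A minimal sketch covering the mc-question job and the single-token eval together: take the next-token distribution after "Answer:" and score each option letter by the logprob of its single token. The prompt format, and the assumption that " A", " B", ... each encode to one token, are placeholders rather than the real job's conventions.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def mc_predict(model, tokenizer, question: str, options: dict[str, str]) -> str:
    """Pick the option letter with the highest next-token logprob after 'Answer:'."""
    prompt = question + "\n"
    prompt += "\n".join(f"{letter}. {text}" for letter, text in options.items())
    prompt += "\nAnswer:"
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        next_logprobs = torch.log_softmax(model(ids).logits[0, -1], dim=-1)
    scores = {}
    for letter in options:
        # single-token eval: assumes " A", " B", ... each encode to exactly one token
        tok_ids = tokenizer(" " + letter, add_special_tokens=False).input_ids
        scores[letter] = next_logprobs[tok_ids[0]].item()
    return max(scores, key=scores.get)

if __name__ == "__main__":
    name = "sshleifer/tiny-gpt2"                          # placeholder tiny model
    tok = AutoTokenizer.from_pretrained(name)
    lm = AutoModelForCausalLM.from_pretrained(name).eval()
    q = "Which planet is known as the Red Planet?"
    print(mc_predict(lm, tok, q, {"A": "Venus", "B": "Mars", "C": "Jupiter"}))
```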
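
A minimal sketch of the 0-100 judge wrapper: it wraps any text-completion callable (a stand-in for whatever chat.py / the API actually exposes) and parses the reply into a clamped integer score. The prompt template is an illustration, not the real one.

```python
import re
from typing import Callable

JUDGE_TEMPLATE = (
    "Rate the following answer to the question on a scale from 0 to 100.\n"
    "Reply with a single integer only.\n\n"
    "Question: {question}\nAnswer: {answer}\nScore:"
)

def judge_0_100(complete: Callable[[str], str], question: str, answer: str) -> int:
    """Ask a judge model for a 0-100 score and parse/clamp the reply."""
    reply = complete(JUDGE_TEMPLATE.format(question=question, answer=answer))
    match = re.search(r"\d+", reply)
    if match is None:
        raise ValueError(f"judge returned no number: {reply!r}")
    return max(0, min(100, int(match.group())))

if __name__ == "__main__":
    fake_judge = lambda prompt: " 87"   # stand-in for a real completion call
    print(judge_0_100(fake_judge, "What is 2 + 2?", "4"))
```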
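
A minimal sketch of the reward = -sft loss(remove_cot(sampled text)) idea. The `<cot>...</cot>` markers and the choice to score the cleaned text with a small model's own next-token cross-entropy are assumptions; the real setup may instead compute the SFT loss against the fixed target text mentioned above.

```python
import re
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

COT_RE = re.compile(r"<cot>.*?</cot>", flags=re.DOTALL)

def remove_cot(sampled_text: str) -> str:
    """f(sampled text): drop everything between the CoT markers."""
    return COT_RE.sub("", sampled_text).strip()

def reward(scorer, tokenizer, sampled_text: str) -> float:
    """reward = -sft_loss(remove_cot(sampled_text))."""
    cleaned = remove_cot(sampled_text)
    ids = tokenizer(cleaned, return_tensors="pt").input_ids
    with torch.no_grad():
        out = scorer(ids, labels=ids)   # HF computes the shifted next-token cross-entropy
    return -out.loss.item()

if __name__ == "__main__":
    name = "sshleifer/tiny-gpt2"        # "use very small model" -- placeholder choice
    tok = AutoTokenizer.from_pretrained(name)
    scorer = AutoModelForCausalLM.from_pretrained(name).eval()
    sample = "What is 123 * 456? <cot>123*400 = 49200, 123*56 = 6888</cot> The answer is 56088."
    print(reward(scorer, tok, sample))
```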
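
A minimal sketch of the customisable keep-alive: the worker loop shuts itself down after `keep_alive_mins` of inactivity. The queue-based wiring is a stand-in for the real worker code.

```python
import queue
import time
from typing import Callable

def worker_loop(jobs: "queue.Queue[Callable[[], None]]", keep_alive_mins: float = 10.0) -> None:
    """Run jobs from the queue; exit after keep_alive_mins with no work."""
    idle_since = time.monotonic()
    while True:
        try:
            job = jobs.get(timeout=1.0)
        except queue.Empty:
            if time.monotonic() - idle_since > keep_alive_mins * 60:
                print("idle timeout reached, shutting worker down")
                return
            continue
        job()
        idle_since = time.monotonic()   # reset the idle clock after each job
```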
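
A minimal sketch of the intended API-key behaviour: the key is looked up on every request, so deleting it revokes access immediately. The in-memory dict is a stand-in for whatever store temporary_api.py actually uses.

```python
import secrets

API_KEYS: dict[str, str] = {}           # key -> owner; stand-in for the real store

def create_key(owner: str) -> str:
    key = secrets.token_urlsafe(32)
    API_KEYS[key] = owner
    return key

def delete_key(key: str) -> None:
    API_KEYS.pop(key, None)             # once removed, authorize() rejects the key

def authorize(key: str) -> str:
    """Look the key up on every request, so deletion revokes access immediately."""
    if key not in API_KEYS:
        raise PermissionError("invalid or revoked API key")
    return API_KEYS[key]
```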