Latent Capability Resurfacing — Code

Reference implementation of the prompt-free BOS-only self-training pipeline described in the paper, restricted to Qwen and LLaMA families.

Repository layout

.
├── setup.sh                          # one-time environment setup
├── pipeline.sh                       # interactive pipeline: gen + audit + subset + train
├── evaluate.sh                       # interactive eval over base model + checkpoints
├── requirements.txt
├── configs/
│   └── ds_z3.json                    # deepspeed ZeRO-3 config (from llamafactory)
└── src/
    ├── generate_synthetic.py         # vLLM unconditional sampler + shard merge
    ├── audit.py                      # n-gram contamination scanner
    └── create_subsets.py             # random training subset builder

Quick start

bash setup.sh                # asks for env name, creates conda env, installs deps
conda activate <env_name>
bash pipeline.sh             # generate → audit → subset → train
bash evaluate.sh             # evaluate base model + every checkpoint

Both interactive scripts ask every question up front, print a summary, ask one confirmation, then run.

What setup.sh does

Asks for a conda env name (defaults to lcr) and creates it with Python 3.11.
Installs torch 2.9.0 from the cu128 wheel index. This matches the versions verified to work on our reference machine (A100 GPUs, NVIDIA driver 550.x, which supports CUDA 12.4).
Installs deepspeed 0.16.9.
Installs the remaining pinned requirements (requirements.txt).
Clones LLaMA-Factory to third_party/LLaMA-Factory at the pinned commit 246192abd2371d9729cb8cf256061d0a070517d4 and installs it editable (without the [torch] and [deepspeed] extras so our pins are preserved).
Runs a smoke test: imports, torch.cuda.is_available(), GPU count, a small GPU matmul, and CLI availability for llamafactory-cli and lm_eval.
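The smoke test in the last step can be approximated in a few lines of Python. This is a minimal sketch, not the code in setup.sh: the function name `smoke_test`, the report keys, and the 256x256 matrix size are assumptions made for illustration. Only the checks themselves (imports, torch.cuda.is_available(), GPU count, a small GPU matmul, and the two CLIs) come from the description above.

```python
import shutil


def smoke_test():
    """Run the environment checks and return a report dict (hypothetical helper)."""
    report = {
        "torch_importable": False,
        "cuda_available": False,
        "gpu_count": 0,
        "gpu_matmul": False,
    }

    # Import check: a driver/wheel mismatch can make the import itself fail.
    try:
        import torch
        report["torch_importable"] = True
    except Exception:
        torch = None

    if torch is not None:
        try:
            report["cuda_available"] = bool(torch.cuda.is_available())
            if report["cuda_available"]:
                report["gpu_count"] = int(torch.cuda.device_count())
                # Small matmul on the first GPU; size chosen arbitrarily.
                a = torch.randn(256, 256, device="cuda")
                b = torch.randn(256, 256, device="cuda")
                report["gpu_matmul"] = bool(torch.isfinite(a @ b).all().item())
        except Exception:
            # Any CUDA runtime error counts as a failed check.
            report["gpu_matmul"] = False

    # CLI availability: both commands should be on PATH after installation.
    for cli in ("llamafactory-cli", "lm_eval"):
        report[cli] = shutil.which(cli) is not None

    return report


if __name__ == "__main__":
    for name, value in smoke_test().items():
        print(f"{name}: {value}")
```

Running this inside the activated environment gives a quick way to re-check an installation later without re-running setup.sh.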

The cu128 wheel needs an NVIDIA driver supporting CUDA ≥ 12.4. If your driver is older, the script will warn and torch will likely fail to import.

Neither pipeline.sh nor evaluate.sh activates an environment automatically. Activate the conda env yourself before running them. If you forget, they will warn and prompt before continuing.
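One way such a check can work is to read the CONDA_DEFAULT_ENV variable, which conda sets when an environment is activated. The sketch below only illustrates the idea: the function names are hypothetical, the expected name `lcr` is simply the setup.sh default, and the scripts' actual detection logic may differ.

```python
import os


def active_conda_env(environ=None):
    """Return the active conda env name, or None if no env is active."""
    env = os.environ if environ is None else environ
    # conda sets CONDA_DEFAULT_ENV on activation; unset or empty means inactive.
    return env.get("CONDA_DEFAULT_ENV") or None


def needs_warning(environ=None, expected="lcr"):
    """True if the expected env (assumed name) is not the active one."""
    return active_conda_env(environ) != expected


if __name__ == "__main__":
    if needs_warning():
        print(f"warning: active conda env is {active_conda_env()!r}, expected 'lcr'")
```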

pipeline.sh

Four stages, each independently toggleable. For any stage whose output already exists, the script asks whether to reuse, re-run (overwriting), or abort.

Generate. vLLM sampling from <BOS> at temperature τ, fanned out across all visible GPUs with per-GPU seed seed + gpu_id * 1000. Per-GPU shards are stitched into a single corpus.
Audit. N-gram overlap scan against the test/dev/train splits of the benchmarks listed in the paper. Writes a clean corpus to a parallel directory tree.
Subset. Builds N independent training subsets at a fixed token budget. Each subset is built by drawing samples uniformly at random from the corpus (one sample per entry) until the token budget is reached.
Train. Continued pretraining with llamafactory-cli and ZeRO-3. The training subset is registered dynamically in data/dataset_info.json. The pipeline targets an effective batch size of 64 by default (paper Appendix C), auto-computing gradient accumulation steps based on per-device batch size and GPU count.
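Each of the four stages rests on a small computation, sketched below in Python under stated assumptions. The per-GPU seed formula and the effective batch size of 64 come from the descriptions above. Everything else is hypothetical: the function names, whitespace tokenization for n-grams, sampling without replacement, letting the last sample overshoot the budget, and rejecting batch sizes that do not divide 64 evenly. The scripts in src/ may do any of these differently.

```python
import random


def per_gpu_seed(seed, gpu_id):
    # Generate: seed + gpu_id * 1000, as stated above.
    return seed + gpu_id * 1000


def ngrams(text, n):
    # Assumption: lowercased whitespace tokens; the real scanner may differ.
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def is_contaminated(sample, benchmark_texts, n):
    # Audit: flag a sample that shares any n-gram with any benchmark text.
    sample_grams = ngrams(sample, n)
    return any(sample_grams & ngrams(ref, n) for ref in benchmark_texts)


def build_subset(token_counts, token_budget, rng):
    # Subset: visit entries in uniformly random order (assumed: without
    # replacement) and stop once the running total reaches the budget.
    order = list(range(len(token_counts)))
    rng.shuffle(order)
    chosen, total = [], 0
    for idx in order:
        if total >= token_budget:
            break
        chosen.append(idx)
        total += token_counts[idx]  # the last sample may overshoot the budget
    return chosen, total


def grad_accum_steps(per_device_batch, num_gpus, effective_batch=64):
    # Train: effective batch = per-device batch * GPUs * accumulation steps.
    per_step = per_device_batch * num_gpus
    if effective_batch % per_step != 0:
        raise ValueError("effective batch must be a multiple of per-step batch")
    return effective_batch // per_step


if __name__ == "__main__":
    print([per_gpu_seed(42, g) for g in range(4)])
    print(build_subset([120, 80, 200, 50], 300, random.Random(0)))
    print(grad_accum_steps(per_device_batch=4, num_gpus=8))
```

With a per-device batch size of 4 on 8 GPUs, for example, one optimizer step already covers 32 sequences, so 2 accumulation steps reach the target of 64.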

evaluate.sh

Standalone evaluation script. It scans output/ for training runs and their checkpoints, lets you pick which to evaluate, runs lm-evaluation-harness through the vLLM backend, and writes results to evals/.

Benchmark options include the seven paper-core tasks (ARC-Challenge, MMLU, TruthfulQA, HellaSwag, GSM8K, MATH, HumanEval) plus ARC-Easy and WinoGrande, with two named suites:

Suite A — zero-shot on all paper-core benchmarks (default).
Suite B — standard few-shot: ARC-Challenge 25-shot, GSM8K 8-shot, HellaSwag 5-shot, MATH 4-shot.

Both suites can be selected together (AB), in which case every Suite B benchmark runs at both settings; for example, arc_challenge runs at 0-shot and at 25-shot.
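The suite selection amounts to expanding a string such as `A`, `B`, or `AB` into a list of (benchmark, shots) runs. The Python sketch below shows one way to do that. Apart from `arc_challenge`, the benchmark identifiers are illustrative placeholders and not necessarily the task names lm-evaluation-harness expects, and `expand_suites` is a hypothetical name.

```python
# Seven paper-core benchmarks (identifiers are placeholders, see note above).
PAPER_CORE = [
    "arc_challenge", "mmlu", "truthfulqa", "hellaswag",
    "gsm8k", "math", "humaneval",
]

# Suite B: standard few-shot settings listed above.
SUITE_B = {"arc_challenge": 25, "gsm8k": 8, "hellaswag": 5, "math": 4}


def expand_suites(selection):
    """Expand 'A', 'B', or 'AB' into an ordered list of (benchmark, shots)."""
    runs = []
    for letter in selection.upper():
        if letter == "A":
            new_runs = [(task, 0) for task in PAPER_CORE]
        elif letter == "B":
            new_runs = list(SUITE_B.items())
        else:
            raise ValueError(f"unknown suite: {letter!r}")
        for run in new_runs:
            if run not in runs:  # skip duplicates if a suite is repeated
                runs.append(run)
    return runs


if __name__ == "__main__":
    for task, shots in expand_suites("AB"):
        print(f"{task}: {shots}-shot")
```

Selecting `AB` yields 11 runs: the seven zero-shot runs plus the four few-shot runs.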

Output layout

synthetic_data/{model_short}/temp{T}/                        # stage 1
synthetic_data_clean/{model_short}/temp{T}_clean_n{N}/       # stage 2
└── subsets/                                                  # stage 3
output/{run_name}/checkpoint-*/                              # stage 4
evals/{model_short}/{base|run_name/checkpoint-*}/            # evaluate.sh
    {benchmark}_fs{N}/results.json
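
When post-processing results, the evals/ pattern above can be turned into a small path helper. This Python sketch assumes that `N` in `_fs{N}` is the few-shot count and that a checkpoint directory name such as `checkpoint-100` is passed in verbatim. The function name and the example values are hypothetical.

```python
from pathlib import PurePosixPath


def results_path(model_short, benchmark, num_fewshot, run_name=None, checkpoint=None):
    """Build evals/{model_short}/{base|run_name/checkpoint-*}/{benchmark}_fs{N}/results.json."""
    root = PurePosixPath("evals") / model_short
    if run_name is None:
        # Base model results live under a fixed "base" directory.
        target = root / "base"
    else:
        if checkpoint is None:
            raise ValueError("a checkpoint directory name is required for a run")
        target = root / run_name / checkpoint
    return target / f"{benchmark}_fs{num_fewshot}" / "results.json"


if __name__ == "__main__":
    print(results_path("my-model", "arc_challenge", 25))
    print(results_path("my-model", "arc_challenge", 0, "my-run", "checkpoint-100"))
```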

Reproducing Qwen2.5-0.5B at τ=1.25 (paper main run)

Defaults in pipeline.sh are pre-set to this configuration. Run pipeline.sh and accept the defaults at every prompt. After training completes, run evaluate.sh and accept the defaults (Suite A) to evaluate zero-shot on the seven paper-core benchmarks.
