finetune.cpp command-line arg #14773


Closed (wanted to merge 1 commit)

Conversation

@lexasub lexasub commented Jul 19, 2025

add a learning-rate (AdamW alpha) command-line arg to ggml-opt, and an optimizer enum defaulting to AdamW, preparatory to work supporting SGD

these live in common args as a set of optimizer options active only for the new FINETUNE example (which, as a precaution, includes all the PERPLEXITY options the previous finetune.cpp used)

perhaps breaking with precedent, the ggml_opt_optimizer_params struct is included directly in the args; if desired, we can instead just add learning rate and optimizer type to a struct independent of ggml-opt.h

as proposed in
#13835
rebases #13873 (graehl:finelayer) onto master

add unit-tested GGML_OPT_OPTIMIZER_SGD to ggml; it avoids allocating the m and v moment tensors.
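The memory saving can be sketched outside ggml. This illustrative Python (not the ggml implementation; the hyperparameter names are assumptions) shows that AdamW must keep two persistent moment tensors m and v per parameter, while plain SGD keeps no optimizer state:

```python
import math

def adamw_step(p, g, m, v, t, alpha=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, wd=0.0):
    # m and v are persistent per-parameter state AdamW must allocate
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    m_hat = m / (1 - beta1 ** t)   # bias correction, t = step count (1-based)
    v_hat = v / (1 - beta2 ** t)
    p = p * (1 - alpha * wd) - alpha * m_hat / (math.sqrt(v_hat) + eps)
    return p, m, v                 # state must survive between steps

def sgd_step(p, g, alpha=1e-3, wd=0.0):
    # no optimizer state at all: this is the m/v memory saving
    return p * (1 - alpha * wd) - alpha * g
```

For a model with N parameters, AdamW's extra state is roughly 2N floats; SGD's is zero, which matches the lower GPU RAM observed below.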

support the finetune.cpp arg -opt SGD (or sgd); the default remains adamw as before.

llama 3.2-1b-F32 result: observed 11 GB GPU RAM (41 sec/epoch) with SGD versus 19 GB (55 sec/epoch) with AdamW (finetuning on 100 lines of Wikipedia).

(
with the same GPU memory, AdamW can only manage 512 batch/context before OOM, reaching:
train: [███████▉] data=0000140/0000140 loss=0.02575±0.00099 acc=99.52±0.03% t=00:00:47 ETA=00:00:00
val:   [███████▉] data=0000008/0000008 loss=4.76565±0.28810 acc=41.46±0.77% t=00:00:00 ETA=00:00:00

SGD fares better: though it converges more slowly, its max before OOM is 1728 batch/context (note especially the better validation performance):
train: [███████▉] data=0000039/0000039 loss=0.00371±0.00010 acc=99.96±0.01% t=00:00:41 ETA=00:00:00
val:   [███████▉] data=0000003/0000003 loss=5.11406±0.76034 acc=48.01±0.69% t=00:00:01 ETA=00:00:00
)

note: when finetuning long enough (or with a high enough -lr), validation accuracy *eventually* drops ('catastrophic forgetting')

the -lr-half (half-life) option is useful for SGD to avoid oscillation or very slow underdamped learning (it makes setting -lr more forgiving). The terminal -lr is for now set by -lr-halvings, i.e. if you want at most 1/8 the initial -lr you set -lr-halvings 3.
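One way to read that schedule is an exponential half-life decay floored at the terminal rate; this is a sketch under assumptions (the exact decay curve used by the finetune code is not spelled out here), not the actual implementation:

```python
def scheduled_lr(lr0, epoch, lr_half, lr_halvings):
    # halve the learning rate every lr_half epochs...
    decayed = lr0 * 0.5 ** (epoch / lr_half)
    # ...but never drop below the terminal rate lr0 / 2**lr_halvings
    floor = lr0 * 0.5 ** lr_halvings
    return max(decayed, floor)
```

With lr0 = 1e-3, lr_half = 1, and lr_halvings = 3, the rate halves each epoch until it reaches 1.25e-4 and then stays there.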

note: objective loss may not be directly comparable between AdamW and SGD; check perplexity or accuracy, or consider relative improvements, when judging convergence
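For a number that is easier to compare across runs, perplexity is simply the exponential of the mean per-token cross-entropy loss (a standard identity, not specific to this PR):

```python
import math

def perplexity(mean_ce_loss):
    # exp of the mean per-token cross-entropy (natural log base);
    # a loss of ln(2) corresponds to a perplexity of exactly 2
    return math.exp(mean_ce_loss)
```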

new finetune args: -wd 1e-9 to enable weight decay in SGD or AdamW, and -epochs N for max epochs (default 2 as before)

cached (1 - wd*alpha) in the 'adamw' opt struct; there was no noticeable perf benefit, so it is disabled there (but still done for the new SGD)
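The cached factor amounts to computing the decoupled weight-decay multiplier once per step instead of once per parameter; a minimal illustrative sketch (not the ggml code):

```python
def sgd_step_with_cached_decay(params, grads, alpha, wd):
    keep = 1.0 - wd * alpha   # computed once, reused for every parameter
    return [keep * p - alpha * g for p, g in zip(params, grads)]
```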

since optimizer memory is pre-allocated, ggml_opt_get_optimizer_params could probably switch between SGD and AdamW each epoch, but would need to use AdamW for the first (unconfirmed; there is no cmdline arg to set such a policy yet)

test-opt checks AdamW as before and now SGD (except for a few tests disabled for SGD only; these probably just need logged values and alternate reference values added); tolerance on the 'regression' test is broader for SGD (so we don't need many more epochs)
@github-actions github-actions bot added the labels build (Compilation issues), testing (Everything test related), Nvidia GPU (Issues specific to Nvidia GPUs), Vulkan (Issues specific to the Vulkan backend), examples, and ggml (changes relating to the ggml tensor library for machine learning) on Jul 19, 2025
lexasub commented Jul 19, 2025

@graehl @JohannesGaessler rebased the original graehl:finelayer onto master

lexasub commented Jul 19, 2025

@graehl this needs a fix for webgpu.
Some insights from GitHub Copilot follow (these read like complete garbage):

1. "Unrecognized schema" Error in json-schema-to-grammar.mjs
Cause:
The error is raised at line 716 in tools/server/public_legacy/json-schema-to-grammar.mjs when the schema has an unexpected type, e.g. { "type": "kaboom" } or { "type": 123 }.

Solution:

Update your tests to only use supported schema types.
If you intend to support more types, extend your schema converter:
```js
// At or before line 716 in json-schema-to-grammar.mjs
switch (schema.type) {
  case "string":
  case "number":
  case "object":
    // handle known types
    break;
  default:
    throw new Error(`Unrecognized schema: ${JSON.stringify(schema)}`);
}
```
Document valid schema types or add test cases for expected failures.
2. "Failed to infer a tool call example (possible template bug)" in chat template tests
Cause:
The template logic in /build/bin/test-chat-template can't infer example tool calls, suggesting a bug in the template definition or missing/invalid test parameters.

Solution:

Check your chat template logic for missing example cases.
Ensure all expected tool call examples are defined.
Add fallback logic or improve error messaging:
```cpp
// Pseudocode: if no example can be inferred, return a generic message or handle gracefully.
if (!can_infer_example()) {
    std::cerr << "Tool call example could not be inferred. Check template and test data." << std::endl;
}
```
3. GGUF File Format Errors in test-gguf
Cause:
Multiple errors (invalid magic, version mismatch, tensor issues) indicate the test is using bad or outdated GGUF files, or the GGUF reader code needs updates.

Solution:

Update or regenerate GGUF files used in the tests to match supported versions and formats.
In your GGUF handling code, add more robust validation and clearer error messages.
Example: Ensure version checks and magic headers are correct:
```cpp
if (magic != EXPECTED_MAGIC) {
    throw std::runtime_error("Invalid GGUF magic header.");
}
if (version > SUPPORTED_VERSION) {
    throw std::runtime_error("GGUF version not supported. Please update your software.");
}
```
Review tensor constraints (name length, dimensions, alignment) in your test files and code.
4. "test-opt Subprocess Aborted"
Cause:
Test #29 (test-opt) crashed, likely due to a segmentation fault or unhandled exception.

Solution:

Check /build/Testing/Temporary/LastTest.log for details.
Review code in the test-opt binary for unsafe operations, uninitialized pointers, or memory issues.
Add error handling and assertions to catch bad states before crashing:
```cpp
assert(ptr != nullptr); // Example safety check
```
5. Argument Parsing Errors
Cause:
Errors like error while handling argument "-m": expected value for argument indicate missing or invalid command-line arguments in your test invocation scripts.

Solution:

Ensure all required arguments are passed in your CI scripts and test definitions.
Add checks for missing or invalid arguments, and provide usage help.
Example fix for test script:

```sh
# Make sure to pass a valid model path to -m
./test-binary -m /path/to/model
```
Summary of Immediate Actions:

Fix schema types in tests or extend schema handling logic.
Check template and tool call example logic for completeness.
Update GGUF test files and handling code.
Investigate and fix the crash in test-opt.
Ensure all test scripts pass valid arguments.

@lexasub lexasub closed this Jul 19, 2025