finetune.cpp command-line arg #14773

lexasub · 2025-07-19T15:37:24Z

add a learning rate (adamw alpha) cmdline arg to ggml-opt, and an optimizer enum defaulting to adamw,
in preparation for SGD support

in common args, these form a set of optimizer options that is active only for the new FINETUNE example (which, as a precaution, also includes all the PERPLEXITY options that finetune.cpp accepted previously)

perhaps breaking with precedent, the ggml_opt_optimizer_params struct is included directly as args - if desired, we can instead just add learning rate and optimizer type to a struct independent of ggml-opt.h

as proposed in #13835
rebase of #13873 (graehl:finelayer) onto master

add unit-tested GGML_OPT_OPTIMIZER_SGD to ggml; it avoids allocating the m, v tensors. finetune.cpp supports -opt SGD (or sgd); the default is adamw as before.

llama 3.2-1b-F32 result (finetune on 100 lines of wikipedia): observed 11gb gpu ram (41 sec/epoch) when using SGD instead of 19gb (55 sec/epoch) using adamw.

using the same GPU memory, adamw can only do 512 batch/context before OOM, reaching:

train: [███████▉] data=0000140/0000140 loss=0.02575±0.00099 acc=99.52±0.03% t=00:00:47 ETA=00:00:00
val: [███████▉] data=0000008/0000008 loss=4.76565±0.28810 acc=41.46±0.77% t=00:00:00 ETA=00:00:00

SGD is superior, though it converges slower, with a max of 1728 batch/context before OOM (esp. see the better validation perf):

train: [███████▉] data=0000039/0000039 loss=0.00371±0.00010 acc=99.96±0.01% t=00:00:41 ETA=00:00:00
val: [███████▉] data=0000003/0000003 loss=5.11406±0.76034 acc=48.01±0.69% t=00:00:01 ETA=00:00:00

note: when finetuning long enough (or w/ enough -lr), validation accuracy *eventually* drops ('catastrophic forgetting').

the -lr-half (halflife) option is useful for SGD to avoid oscillation or super slow underdamped learning (it makes setting -lr more forgiving). the terminal -lr is for now set by -lr-halvings, i.e. if you want at most 1/8 the initial -lr you set -lr-halvings 3.

note: objective loss may not be directly comparable between adamw and sgd; check perplexity or accuracy, or consider relative improvements, to judge convergence.

new finetune args: -wd 1e-9 to enable weight decay in sgd or adamw, and max -epochs N (default 2 as before).

caching (1 - wd*alpha) in the 'adamw' opt struct gave no noticeable perf benefit and is disabled (it is still done for the new SGD though).

since opt. memory is pre-allocated, ggml_opt_get_optimizer_params would probably be able to change between SGD and AdamW with each epoch, but would need to use adamw for the first (unconfirmed; no cmdline arg to set such a policy yet).

test-opt checks adamw as before and now sgd (except for a few tests disabled for sgd only; these probably just need logging values and adding alternate reference values). the tolerance on the 'regression' test is broader for sgd (so we don't need many more epochs).
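
for readers who want the arithmetic behind -lr-half, -lr-halvings and -wd spelled out, here is a minimal sketch. it is not the code from this PR: the names decayed_lr and sgd_step are hypothetical, and the schedule assumes the halflife is counted in epochs and that the decay is a smooth exponential capped at the given number of halvings. the update rule follows the cached (1 - wd*alpha) factor mentioned above.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper (not from the PR): learning rate after `epoch` epochs,
// halving every `halflife` epochs and never dropping below
// lr0 / 2^max_halvings. Assumes the halflife is measured in epochs.
inline float decayed_lr(float lr0, float epoch, float halflife, float max_halvings) {
    if (halflife <= 0.0f) {
        return lr0; // decay disabled
    }
    const float halvings = std::min(epoch / halflife, max_halvings);
    return lr0 * std::exp2(-halvings);
}

// Hypothetical helper (not from the PR): one plain SGD step with weight decay,
//   w <- w * (1 - alpha * wd) - alpha * g
// Only weights and gradients are touched; there are no m, v moment buffers,
// which is where the memory saving over AdamW comes from.
inline void sgd_step(std::vector<float> & w, const std::vector<float> & g,
                     float alpha, float wd) {
    const float keep = 1.0f - alpha * wd; // computed once per step, reused per element
    for (std::size_t i = 0; i < w.size(); ++i) {
        w[i] = w[i] * keep - alpha * g[i];
    }
}
```

under these assumptions, decayed_lr(lr0, epoch, 2, 3) bottoms out at lr0/8 from epoch 6 onward, which matches the "-lr-halvings 3 gives at most 1/8 of the initial -lr" description.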

lexasub · 2025-07-19T15:38:19Z

@graehl @JohannesGaessler this rebases the original graehl:finelayer onto master

lexasub · 2025-07-19T15:54:03Z

@graehl this needs a fix for webgpu
some insights from github copilot (this seems like complete garbage):

1. "Unrecognized schema" Error in json-schema-to-grammar.mjs
Cause:
The error is raised at line 716 in tools/server/public_legacy/json-schema-to-grammar.mjs when the schema has an unexpected type, e.g. { "type": "kaboom" } or { "type": 123 }.

Solution:

Update your tests to only use supported schema types.
If you intend to support more types, extend your schema converter:
js
// At or before line 716 in json-schema-to-grammar.mjs
switch (schema.type) {
  case "string":
  case "number":
  case "object":
    // handle known types
    break;
  default:
    throw new Error(`Unrecognized schema: ${JSON.stringify(schema)}`);
}
Document valid schema types or add test cases for expected failures.
2. "Failed to infer a tool call example (possible template bug)" in chat template tests
Cause:
The template logic in /build/bin/test-chat-template can't infer example tool calls, suggesting a bug in the template definition or missing/invalid test parameters.

Solution:

Check your chat template logic for missing example cases.
Ensure all expected tool call examples are defined.
Add fallback logic or improve error messaging:
C++
// Pseudocode: If no example can be inferred, return a generic message or handle gracefully.
if (!can_infer_example()) {
    std::cerr << "Tool call example could not be inferred. Check template and test data." << std::endl;
}
3. GGUF File Format Errors in test-gguf
Cause:
Multiple errors (invalid magic, version mismatch, tensor issues) indicate the test is using bad or outdated GGUF files, or the GGUF reader code needs updates.

Solution:

Update or regenerate GGUF files used in the tests to match supported versions and formats.
In your GGUF handling code, add more robust validation and clearer error messages.
Example: Ensure version checks and magic headers are correct:
C++
if (magic != EXPECTED_MAGIC) {
    throw std::runtime_error("Invalid GGUF magic header.");
}
if (version > SUPPORTED_VERSION) {
    throw std::runtime_error("GGUF version not supported. Please update your software.");
}
Review tensor constraints (name length, dimensions, alignment) in your test files and code.
4. "test-opt Subprocess Aborted"
Cause:
Test #29 (test-opt) crashed, likely due to a segmentation fault or unhandled exception.

Solution:

Check /build/Testing/Temporary/LastTest.log for details.
Review code in the test-opt binary for unsafe operations, uninitialized pointers, or memory issues.
Add error handling and assertions to catch bad states before crashing:
C++
assert(ptr != nullptr); // Example safety check
5. Argument Parsing Errors
Cause:
Errors like error while handling argument "-m": expected value for argument indicate missing or invalid command-line arguments in your test invocation scripts.

Solution:

Ensure all required arguments are passed in your CI scripts and test definitions.
Add checks for missing or invalid arguments, and provide usage help.
Example fix for test script:

sh
# Make sure to pass a valid model path to -m
./test-binary -m /path/to/model
Summary of Immediate Actions:

Fix schema types in tests or extend schema handling logic.
Check template and tool call example logic for completeness.
Update GGUF test files and handling code.
Investigate and fix the crash in test-opt.
Ensure all test scripts pass valid arguments.

lexasub requested a review from JohannesGaessler as a code owner July 19, 2025 15:37

github-actions bot added the build, testing, Nvidia GPU, Vulkan, examples, and ggml labels Jul 19, 2025

lexasub closed this Jul 19, 2025
