Apply optimizer to model weights without data copy #222
base: main
Conversation
Another issue I just found is that the arrays of the optimizer, e.g., …
@jvdp1 Indeed, that's the extra bookkeeping needed. However, I'm now wondering whether we should instantiate an optimizer per layer rather than one for the whole network. That would take care of the bookkeeping altogether, because the "memory" arrays like …
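A minimal Fortran sketch of that per-layer idea (all names here, including the `velocity` component, are illustrative rather than the actual neural-fortran API): each layer owns its own optimizer instance, so the optimizer's memory arrays are sized to that layer's parameters and no offsets into a whole-network parameter vector are needed.

```fortran
module per_layer_optimizer_sketch
  implicit none

  ! Illustrative stand-in for the library's optimizer class.
  type :: sgd_momentum_sketch
    real :: learning_rate = 0.01
    real :: momentum = 0.9
    real, allocatable :: velocity(:)  ! sized to this layer only
  end type

  type :: dense_layer_sketch
    real, allocatable :: params(:)
    type(sgd_momentum_sketch) :: optimizer  ! one instance per layer
  end type

contains

  subroutine update_layer(layer, gradients)
    type(dense_layer_sketch), intent(inout) :: layer
    real, intent(in) :: gradients(:)
    associate (opt => layer % optimizer)
      ! The "memory" array is local to the layer, so no start/end
      ! bookkeeping over the whole network's parameter vector is needed.
      if (.not. allocated(opt % velocity)) &
        allocate(opt % velocity(size(layer % params)), source=0.0)
      opt % velocity = opt % momentum * opt % velocity &
                     - opt % learning_rate * gradients
      layer % params = layer % params + opt % velocity
    end associate
  end subroutine update_layer

end module per_layer_optimizer_sketch
```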
… switched to per-layer optimizer instances
…ed 2 calls per batch; this is now generalized to allow any number of calls until size(params) is exhausted
All layers, with the exception of embedding and MHA, now implement getting parameters and gradients as pointers. This removes the need for data copies in those layers. MHA and embedding are left alone because they are a bit more complex and I don't yet feel comfortable refactoring them, but they should be switched to the new approach as well. Note that these layers were not previously integrated with the optimizer either.
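As a rough sketch of what pointer-based access to a layer's parameters can look like (the module, type, and procedure names below are illustrative assumptions, not necessarily the exact API introduced by this PR), a layer can hand out flat pointer views into its own storage instead of copying parameters out:

```fortran
module params_pointer_sketch
  implicit none

  type :: dense_sketch
    real, allocatable :: weights(:,:)
    real, allocatable :: biases(:)
  contains
    procedure :: get_params_ptr
  end type

contains

  ! Return flat pointers into the layer's own storage; the optimizer
  ! can then update the weights and biases without any copy.
  subroutine get_params_ptr(self, w_ptr, b_ptr)
    class(dense_sketch), intent(inout), target :: self
    real, pointer, intent(out) :: w_ptr(:), b_ptr(:)
    ! Rank-remapping pointer assignment (Fortran 2008): view the
    ! rank-2 weights as a rank-1 array without copying.
    w_ptr(1:size(self % weights)) => self % weights
    b_ptr => self % biases
  end subroutine get_params_ptr

end module params_pointer_sketch
```

Because `w_ptr` aliases the layer's own `weights` array, anything the optimizer writes through it lands directly in the layer.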
Alternative approach to #184
Currently this is implemented only for the dense layer, so a lot of other things are not working yet. We now pass pointers to each layer's weights and biases to the optimizer, which updates them in place.
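To make the in-place idea concrete, here is a tiny self-contained example in plain Fortran (no neural-fortran API assumed): an SGD step applied through a pointer modifies the layer's own storage, with no temporary copy of the parameters.

```fortran
program in_place_update_sketch
  implicit none
  real, allocatable, target :: weights(:)
  real, allocatable :: gradients(:)
  real, pointer :: w_ptr(:)
  real :: learning_rate

  learning_rate = 0.1
  weights = [1.0, 2.0, 3.0]
  gradients = [0.1, 0.1, 0.1]

  ! The optimizer would receive this pointer instead of a copy.
  w_ptr => weights

  ! A plain SGD step through the pointer updates the layer's own
  ! storage; no temporary array of parameters is created.
  w_ptr = w_ptr - learning_rate * gradients

  print '(3f6.2)', weights  ! 0.99  1.99  2.99
end program in_place_update_sketch
```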
Since this approach runs the optimizer on a layer-by-layer basis, optimizer memory such as velocity, RMS gradients, etc. is not implemented yet, as it originally assumed a whole-network update. Additional bookkeeping is needed there. A possible approach to this bookkeeping is to require the caller of `optimizer % minimize()` to pass explicit start and end indices over which the optimizer will run. This may seem tedious, but helper functions can make it easy.
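A hedged sketch of what such an interface could look like, assuming an SGD-with-momentum optimizer whose `velocity` array spans the whole network; `istart` and `iend` select the slice belonging to the layer being updated (all names here are made up for illustration):

```fortran
module indexed_minimize_sketch
  implicit none

  type :: sgd_sketch
    real :: learning_rate = 0.01
    real :: momentum = 0.9
    real, allocatable :: velocity(:)  ! spans all network parameters
  contains
    procedure :: minimize => minimize_slice
  end type

contains

  ! Update one layer's parameters in place, using only this layer's
  ! slice [istart:iend] of the whole-network velocity array.
  subroutine minimize_slice(self, params, gradients, istart, iend)
    class(sgd_sketch), intent(inout) :: self
    real, intent(inout) :: params(:)  ! pointer view into the layer
    real, intent(in) :: gradients(:)
    integer, intent(in) :: istart, iend
    associate (v => self % velocity(istart:iend))
      v = self % momentum * v - self % learning_rate * gradients
      params = params + v
    end associate
  end subroutine minimize_slice

end module indexed_minimize_sketch
```

A helper on the network side could accumulate these offsets from each layer's `size(params)`, so user code would never pass raw indices; the per-layer-optimizer idea discussed above would do away with the indices entirely.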
On MNIST training (`examples/dense_mnist`), the difference is only ~1% in run time. However, running the profiler on both the main branch's `dense_mnist` and this branch's `dense_mnist` shows that the data copies (mostly in `get_params`; two redundant copies of all model parameters happening there) are now gone:

main branch `dense_mnist`: [profiler output omitted]

This PR's `dense_mnist`: [profiler output omitted]

The above runs were compiled using gfortran 14.2.0 and the `-pg -Ofast` flags.