Conversation

amitsrivastava78 (Collaborator)

  • Modified load_own_variables() to use _direct_assign() for sharded variables (see the sketch after this list)
  • Prevents loading full weight tensors on single device before distribution
  • Resolves RESOURCE_EXHAUSTED errors when loading large models with ModelParallel
  • Maintains backward compatibility for non-sharded variables
  • Enables loading of models like Gemma2 2B/7B without OOM errors
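
A minimal sketch of the loading split described above, assuming a typical layer's load_own_variables and the internal _layout / _direct_assign attributes discussed in the review below (the helper shape is illustrative, not the PR's literal code):

def load_own_variables(self, store):
    # Sketch only: `_layout` marks a sharded variable and `_direct_assign`
    # assigns shard-by-shard; treat their exact semantics as assumptions.
    target_variables = [self._kernel]
    if self.use_bias:
        target_variables.append(self.bias)
    for i, variable in enumerate(target_variables):
        weight_data = store[str(i)]
        if getattr(variable, "_layout", None) is not None:
            # Sharded variable: assign directly so no single device
            # materializes the full weight tensor.
            variable._direct_assign(weight_data)
        else:
            # Non-sharded variable: behavior is unchanged.
            variable.assign(weight_data)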

@github-actions bot added the Gemma (Gemma model specific issues) label on Oct 3, 2025
gemini-code-assist bot (Contributor)

Summary of Changes

Hello @amitsrivastava78, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces crucial improvements to Keras's JAX backend, specifically targeting Out-Of-Memory (OOM) issues that arise when loading large models in a ModelParallel distributed setup. By modifying how sharded variables are initialized and assigned, the system now avoids loading entire weight tensors onto a single device, instead distributing them directly across available devices. This change, encapsulated in a new load_variable_with_sharded_support utility, ensures efficient memory usage and enables the successful loading of very large models like Gemma2 2B/7B. Additionally, it enhances the robustness of dynamic shape handling and prevents premature garbage collection of sharded JAX arrays, solidifying Keras's distributed capabilities on JAX.
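
For context, a ModelParallel setup of the kind this loading path targets might look roughly like the following on the JAX backend (a sketch using the public keras.distribution API; the mesh shape and sharding rule are illustrative):

from keras import distribution

devices = distribution.list_devices()
mesh = distribution.DeviceMesh(
    shape=(1, len(devices)), axis_names=("batch", "model"), devices=devices
)
layout_map = distribution.LayoutMap(mesh)
# Illustrative rule: shard Dense/EinsumDense kernels along the "model" axis.
layout_map[".*kernel"] = (None, "model")

distribution.set_distribution(
    distribution.ModelParallel(layout_map=layout_map, batch_dim_name="batch")
)
# Models built and loaded after this point get sharded variables; the loading
# path in this PR assigns each shard to its own device instead of staging the
# full tensor on device:0.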

Highlights

  • OOM Fix for ModelParallel: Modified weight loading logic to prevent Out-Of-Memory (OOM) errors when using ModelParallel with large models, especially on the JAX backend.
  • Sharded Variable Loading: Implemented a new utility function, load_variable_with_sharded_support, to ensure sharded variables are loaded directly onto their respective devices, avoiding full tensor materialization on a single device.
  • JAX Backend Enhancements: Updated JaxVariable and NnxVariable to manage strong references to sharded JAX arrays, preventing premature garbage collection and ensuring data availability during inference.
  • Dynamic Shape Handling: Improved compute_output_spec in the JAX backend to more robustly handle dynamic shapes during symbolic tracing.
  • Broad Layer Support: Integrated the new sharded loading mechanism across various Keras layers (e.g., Dense, Conv, Embedding, EinsumDense, BatchNormalization) and optimizers.
  • Comprehensive Testing: Added new tests to validate the correct behavior of sharded array protection, strong references, and the end-to-end sharded variable loading process for ModelParallel.

@gemini-code-assist bot (Contributor) left a comment


Code Review

This pull request addresses an out-of-memory (OOM) issue during weight loading for models using ModelParallel by introducing sharded variable assignment. The changes primarily involve modifying load_own_variables methods to use a new _direct_assign approach for sharded variables, preventing the full weight tensor from being loaded onto a single device. A new helper function, load_variable_with_sharded_support, centralizes this logic, and it has been integrated into various layers and the base optimizer.

My review has identified a few areas for improvement, including a potential memory leak, duplicated code, and some inconsistencies. Addressing these points will enhance the robustness and maintainability of the solution. Overall, the changes are well-structured and include thorough testing, which is excellent.

@codecov-commenter commented Oct 3, 2025

Codecov Report

❌ Patch coverage is 77.13004% with 51 lines in your changes missing coverage. Please review.
✅ Project coverage is 82.56%. Comparing base (3fac66f) to head (8beda65).
⚠️ Report is 7 commits behind head on master.

Files with missing lines                          Patch %    Lines
keras/src/backend/jax/core.py                     72.78%     31 Missing and 12 partials ⚠️
keras/src/layers/core/dense.py                    91.66%     1 Missing and 1 partial ⚠️
keras/src/layers/core/einsum_dense.py             91.66%     1 Missing and 1 partial ⚠️
keras/src/layers/core/embedding.py                75.00%     1 Missing and 1 partial ⚠️
keras/src/layers/preprocessing/index_lookup.py    0.00%      2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master   #21712      +/-   ##
==========================================
- Coverage   82.60%   82.56%   -0.04%     
==========================================
  Files         572      572              
  Lines       58326    58710     +384     
  Branches     9134     9195      +61     
==========================================
+ Hits        48179    48474     +295     
- Misses       7817     7887      +70     
- Partials     2330     2349      +19     
Flag                Coverage Δ
keras               82.36% <77.13%> (-0.04%) ⬇️
keras-jax           63.22% <76.68%> (-0.10%) ⬇️
keras-numpy         57.47% <32.73%> (-0.19%) ⬇️
keras-openvino      34.25% <7.62%> (-0.06%) ⬇️
keras-tensorflow    63.83% <32.73%> (-0.22%) ⬇️
keras-torch         63.38% <33.18%> (-0.26%) ⬇️

Flags with carried forward coverage won't be shown.

☔ View full report in Codecov by Sentry.

@hertschuh (Collaborator) left a comment


Thanks for the PR!

It's unfortunate that this is combining 3 different things (which you describe in your doc):

  1. initialization of sharded variables
  2. reloading of sharded variables
  3. reference counting of variables

In particular, I don't think we should do 3 because both Python and JAX already track usage of arrays. Therefore:

  • I believe it's hiding some other bug
  • Adding our own tracking system on top is error-prone and will probably introduce memory leaks, because it's very easy to forget to clear references (see the short illustration after this list).
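
To illustrate the point about reference counting: a plain jax.device_put with a NamedSharding already yields a sharded array whose per-device shards live exactly as long as the Python object is referenced (a short sketch using only standard JAX APIs; shapes are arbitrary):

import jax
import numpy as np
from jax.sharding import Mesh, NamedSharding, PartitionSpec

devices = np.array(jax.devices())
mesh = Mesh(devices, axis_names=("model",))
sharding = NamedSharding(mesh, PartitionSpec("model"))

host_value = np.ones((len(devices) * 4, 8), dtype="float32")
sharded = jax.device_put(host_value, sharding)  # one shard per device

# The shards stay alive as long as `sharded` (e.g. the variable's value slot)
# is referenced; no separate `_shard_references` list is needed, and dropping
# the reference is what lets JAX free the memory.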

import ml_dtypes
import numpy as np
from jax import export as jax_export
from absl import logging
Collaborator

This PR is undoing a lot of changes that were made in this file. It wasn't rebased correctly.

Collaborator Author

OK, rebased again and ensured no fixes are missing; also restored jax_export and the other fixes that had been overwritten.

IS_THREAD_SAFE = True


def _is_jax_tracer(x):
Collaborator

Can't you use (and potentially change) the one from jax_utils?

Collaborator Author

ok

# Check if variable has a layout (is sharded)
if hasattr(variable, "_layout") and variable._layout is not None:
    # Use _direct_assign for sharded variables to avoid OOM
    logging.info(
Collaborator

This logging will be very noisy, let's remove it.

Collaborator Author

ok

from absl import logging


def load_variable_with_sharded_support(variable, weight_data):
Collaborator

I don't think we need this; you can just do variable._direct_assign(weight_data) everywhere you used load_variable_with_sharded_support.

That check for if hasattr(variable, "_layout") and variable._layout is not None: is already done within _direct_assign.

Collaborator Author

ok

)

# Ensure value is on host (numpy array)
if not isinstance(value, np.ndarray):
Collaborator

But that's too late. The initializer has already been called on device:0. For this to work the way you intend it to, you need to run the initializer on CPU using a with device(...) scope.
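
A sketch of the kind of fix being suggested here, assuming standard JAX APIs (initialize_on_host is a hypothetical helper, not the PR's actual code):

import jax
import numpy as np

def initialize_on_host(initializer, shape, dtype, sharding):
    # Run the initializer under a CPU default-device scope so the full value
    # is materialized in host memory rather than on device:0.
    cpu = jax.devices("cpu")[0]
    with jax.default_device(cpu):
        value = np.asarray(initializer(shape, dtype))
    # Then place it according to the variable's layout, so each accelerator
    # only receives its own shard.
    return jax.device_put(value, sharding)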

Collaborator Author

ok

and loaded_var._shard_references
)

logging.debug(
Collaborator

Remove all logging.

Collaborator Author

ok

Comment on lines 310 to 336
if self.quantization_mode == "gptq":
    # GPTQ: bias first, then quantized_kernel
    target_variables = [self.bias] if self.use_bias else []
    target_variables.append(self.quantized_kernel)
else:
    target_variables = [self._kernel]
if self.use_bias and self.quantization_mode != "gptq":
    target_variables.append(self.bias)
if self.quantization_mode is not None:
    if self.quantization_mode in ("int8", "int4"):
        target_variables.append(self.kernel_scale)
    elif self.quantization_mode == "float8":
        target_variables.append(self.inputs_scale)
        target_variables.append(self.inputs_amax_history)
        target_variables.append(self.kernel_scale)
        target_variables.append(self.kernel_amax_history)
        target_variables.append(self.outputs_grad_scale)
        target_variables.append(self.outputs_grad_amax_history)
    elif self.quantization_mode == "gptq":
        target_variables.append(self.kernel_scale)
        target_variables.append(self.kernel_zero)
        target_variables.append(self.g_idx)
    else:
        raise self._quantization_mode_error(self.quantization_mode)
for i, variable in enumerate(target_variables):
    weight_data = store[str(i)]
    load_variable_with_sharded_support(variable, weight_data)
Collaborator

This code is being changed, I believe.

But why couldn't you do a 1-line change:

< variable.assign(store[str(i)])
---
> load_variable_with_sharded_support(variable, store[str(i)])

Collaborator Author

ok

        raise self._quantization_mode_error(self.quantization_mode)
for i, variable in enumerate(target_variables):
    weight_data = store[str(i)]
    load_variable_with_sharded_support(variable, weight_data)
Collaborator

Same comment about this code being changed and a 1-line change.

Collaborator Author

ok

        raise self._quantization_mode_error(self.quantization_mode)
for i, variable in enumerate(target_variables):
    weight_data = store[str(i)]
    load_variable_with_sharded_support(variable, weight_data)
Collaborator

Same comment about this code being changed and a 1-line change.

f"{has_shard_refs_loaded}"
)

self.assertTrue(
Collaborator

Using assertTrue forces you to put a specific error message to provide context.

I would replace all of this:

                    has_shard_refs_orig = (
                        hasattr(orig_var, "_shard_references")
                        and orig_var._shard_references
                    )
                    logging.debug(
                        f"  Original has shard references: "
                        f"{has_shard_refs_orig}"
                    )
                    self.assertTrue(
                        has_shard_refs_orig,
                        f"Original {var_name} should have shard references",
                    )
                    self.assertGreater(
                        len(orig_var._shard_references),
                        0,
                        f"Original {var_name} has empty shard references",
                    )

With this line:

self.assertLen(orig_var._shard_references, 1)

Not only is it a lot less code, it actually gives you more information in case of an error:

  • if orig_var doesn't have _shard_references as an attribute, it will raise an error telling you exactly that
  • if orig_var._shard_references is None, len will fail and tell you it's None, so you'll know (has_shard_refs_orig won't tell you directly whether the attribute is missing or None; you'd have to look at the debug logging)
  • assertLen will tell you what you're taking the len of, whereas assertGreater will just tell you 0 < 0, which is not super helpful.

Collaborator Author

ok

Labels: Gemma (Gemma model specific issues), size:XL

5 participants