
Extend add_linear_biases to support a dictionary of sub-layers to which linear bias should be added. #158


Draft
bigximik wants to merge 2 commits into main

Conversation

@bigximik (Contributor) commented Feb 24, 2025

✨ Description

Adds the ability to specify, per layer and per attention and MLP sub-layer, whether a linear bias is added. The configuration is a dictionary whose keys are sub-layer names and whose values describe the layers in which the bias is present, like this:

add_linear_biases:
  "layers.self_attn.query": "*"
  "layers.mlp.layer_1": "1:10:3, 9"
  "layers.mlp.layer_2": "5:7"

🔍 Type of change

Select all that apply:

  • 🐛 Bug fix (non-breaking change that addresses a specific issue)
  • 🚀 New feature (non-breaking change that adds functionality)
  • ⚠️ Breaking change (a change that could affect existing functionality)
  • 📈 Performance improvement/optimization (improves speed, memory usage, or efficiency)
  • 🛠️ Code refactor (non-functional changes that improve code readability, structure, etc.)
  • 📦 Dependency bump (updates dependencies, including Dockerfile or package changes)
  • 📝 Documentation change (updates documentation, including new content or typo fixes)
  • 🔧 Infrastructure/Build change (affects build process, CI/CD, or dependencies)

📝 Changes

List the key changes introduced in this PR:

  1. Adds support for specifying add_linear_biases as a dictionary.

✅ Checklist

Make sure the following tasks are completed before submitting the PR:

General

  • 📜 I have read and followed the contributing guidelines.
  • 🏷️ I am using a clear and descriptive PR title that summarizes the key change or feature introduced.
  • 🎉 The functionality is complete, and I have tested the changes. (tested only new test cases)
  • 📝 I have updated the documentation if needed. (not applicable)
  • ⚠️ The change does not introduce any new issues (e.g., runtime warnings, type checker errors, linting problems, unhandled edge cases).
  • 🧩 I have commented my code, especially in hard-to-understand areas.

Testing

  • 🧪 I have added or updated tests to cover my changes.
  • ✔️ New and existing tests pass locally with my changes. (tested only new test cases)
  • 🚦 I have tested these changes on GPUs and verified training stability. (not applicable)
  • 🏋️ I have tested the changes on realistic training workloads, if applicable. (not applicable)

@bigximik (Contributor, Author)

I have two questions:

  • Do we want to specify sub-layers like this: "layers.self_attn.query" or use a simpler format like attention_query?
  • Transformer layer indexes in Fast-LLM start from 1, as layer 0 is the embedding layer. I’ve implemented zero-based indexes in the config, but should we switch to one-based indexes in the config as well, to align with Fast-LLM's internal nomenclature? (A short sketch of this mapping follows below.)
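
As a small illustration of the indexing question above (the helper below is hypothetical; the `+ 1` offset only reflects the convention described in this comment):

```python
# Illustration only: with zero-based config indices, config layer i would map
# to Fast-LLM's internal layer i + 1, since internal layer 0 is the embedding layer.
def config_index_to_internal(config_index: int) -> int:
    return config_index + 1
```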

@tscholak (Collaborator)

Thanks @bigximik. Do we need regex pattern matching for Qwen2 for layer-wise application of linear biases? I thought that Qwen2 used bias terms in all q,k,v layers, so this feature is technically not needed, or is it?

@bigximik (Contributor, Author)

Qwen2 does not need it. I made it generic to be more in line with #155.

@bigximik (Contributor, Author)

This will remain in draft; for now, Qwen2 support is implemented in #160 instead.

@jlamypoirier (Collaborator)

I think #168 will make these kinds of things unnecessary?
