
Conversation

@vovanphuc
Contributor

PR Description

Summary

Add support for LiquidAI's LFM2.5-Audio model, with full training capability via integration with the liquid_audio package.

  • Add LFM2AudioPlugin for audio placeholder handling with correct token boundaries
  • Add lfm2_audio template with ChatML-style formatting (see the sketch after this list)
  • Add custom model loader wrapper for liquid_audio.LFM2AudioModel
  • Add model group registration in constants
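
For orientation, template registration in LLaMA-Factory goes through _register_template in src/llamafactory/data/template.py. A hedged sketch of what the lfm2_audio entry might look like follows; the slot strings are generic ChatML markers and the plugin wiring is assumed, not copied from this PR:

# Sketch only: the real registration is part of this PR and may use
# different special tokens and formatters.
from llamafactory.data.formatter import StringFormatter
from llamafactory.data.mm_plugin import get_mm_plugin
from llamafactory.data.template import _register_template

_register_template(
    name="lfm2_audio",
    format_user=StringFormatter(slots=["<|im_start|>user\n{{content}}<|im_end|>\n<|im_start|>assistant\n"]),
    format_system=StringFormatter(slots=["<|im_start|>system\n{{content}}<|im_end|>\n"]),
    stop_words=["<|im_end|>"],
    mm_plugin=get_mm_plugin(name="lfm2_audio", audio_token="<audio>"),
)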

Model Information

Model: LiquidAI/LFM2.5-Audio-1.5B
Architecture: FastConformer encoder + LFM2-1.2B backbone
Modality: Audio-to-text (ASR / instruction following)
Parameters: 1.5B

Requirements

LFM2.5-Audio requires the liquid_audio package for model loading:

pip install liquid-audio
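
Because the dependency is optional at install time, the loader presumably needs to fail fast when the package is missing. A minimal guard of that kind might look like this (the helper name is hypothetical):

import importlib.util

def require_liquid_audio() -> None:
    # Hypothetical helper: surface a clear install hint instead of a bare ModuleNotFoundError.
    if importlib.util.find_spec("liquid_audio") is None:
        raise ImportError("LFM2.5-Audio requires liquid_audio. Install it with: pip install liquid-audio")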

Token Structure

The plugin correctly handles LFM2.5-Audio's audio boundary markers:
<|audio_start|><|reserved_1|><|text_start|>
- <|audio_start|> (token 128): Audio region start
- <|reserved_1|> (token 17): Audio placeholder token
- <|text_start|> (token 129): Audio region end / text start
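
A minimal sketch of the expansion this implies (the <audio> placeholder string and the helper name follow LLaMA-Factory conventions and are assumptions; the actual logic lives in LFM2AudioPlugin):

AUDIO_START = "<|audio_start|>"  # token 128
AUDIO_PAD = "<|reserved_1|>"     # token 17, repeated once per audio frame
TEXT_START = "<|text_start|>"    # token 129

def expand_audio_placeholders(content: str, seq_lengths: list[int]) -> str:
    # Replace each <audio> marker with the boundary tokens plus one
    # placeholder token per encoder frame, so text and audio features align.
    for seq_len in seq_lengths:
        content = content.replace("<audio>", AUDIO_START + AUDIO_PAD * seq_len + TEXT_START, 1)
    return content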

Validation

- make style - passed
- make quality - passed
- Plugin unit tests - passed
- LoRA training smoke test - passed (100 steps, loss: 10.07 → 0.0004)

Training Verification

llamafactory-cli train \
  --model_name_or_path LiquidAI/LFM2.5-Audio-1.5B \
  --template lfm2_audio \
  --finetuning_type lora \
  --lora_target q_proj,k_proj,v_proj,out_proj,w1,w2,w3 \
  --dataset mllm_audio_demo \
  --output_dir saves/lfm2-audio

Training output:
trainable params: 5,550,080 || all params: 1,459,055,744 || trainable%: 0.3804
train_loss: 1.69 (100 steps)

Related

- LFM2.5 text model: #9726
- LFM2.5-VL vision model: #9729
- LFM2.5-Audio-1.5B model card: https://huggingface.co/LiquidAI/LFM2.5-Audio-1.5B
- liquid-audio package: https://pypi.org/project/liquid-audio/

@gemini-code-assist
Contributor

Summary of Changes

Hello @vovanphuc, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly expands the framework's capabilities by introducing comprehensive support for the LiquidAI LFM2.5-Audio model. It integrates the liquid_audio package to facilitate audio processing and model loading, allowing users to train and fine-tune this audio-to-text model within the existing ecosystem. The changes ensure proper handling of audio input, tokenization, and model configuration, making the LFM2.5-Audio model a first-class citizen for multimodal tasks.

Highlights

  • LFM2.5-Audio Model Integration: Added comprehensive support for the LiquidAI LFM2.5-Audio model, enabling full training and inference capabilities through integration with the liquid_audio package.
  • New Audio Plugin: Introduced LFM2AudioPlugin to manage audio input processing, including validation, feature extraction (prioritizing liquid_audio's processor), and intelligent replacement of audio placeholders with appropriate boundary tokens.
  • ChatML-style Template: Registered a new lfm2_audio template, defining ChatML-style formatting for user, assistant, system, function, and observation messages, ensuring consistent interaction with the LFM2.5-Audio model.
  • Custom Model Loader: Implemented a custom model loader wrapper (LFM2AudioModelForCausalLM) to seamlessly integrate liquid_audio.LFM2AudioModel with the HuggingFace PreTrainedModel and GenerationMixin interfaces, facilitating its use within the framework's training and generation pipelines (a rough skeleton follows this list).
  • Model Group Registration: Registered the LFM2.5-Audio-1.5B model within the system's constants, linking it to the lfm2_audio template and marking it as multimodal for improved discoverability and configuration.
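
A wrapper of that shape might look roughly like the skeleton below. This is hedged: only the class name and liquid_audio.LFM2AudioModel come from the PR description; the constructor, forward signature, and delegation details are assumptions:

from transformers import GenerationMixin, PretrainedConfig, PreTrainedModel

class LFM2AudioModelForCausalLM(PreTrainedModel, GenerationMixin):
    # Skeleton only: the real wrapper is defined in this PR.
    def __init__(self, config: PretrainedConfig):
        super().__init__(config)
        from liquid_audio import LFM2AudioModel  # imported lazily so the dependency stays optional

        self.model = LFM2AudioModel(config)  # constructor assumed; liquid_audio's actual API may differ

    def forward(self, *args, **kwargs):
        # Delegate to the wrapped model so HF Trainer and generate() interact
        # with a familiar PreTrainedModel interface.
        return self.model(*args, **kwargs)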


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request adds support for the LFM2.5-Audio model, including a new data plugin, chat template, and model loader. The implementation is well-structured and follows the existing patterns in the codebase. I've identified a potential correctness issue in how audio sequence lengths are calculated, which might lead to incorrect placeholder expansion for variable-length audios. Additionally, I've noticed the use of magic numbers for token IDs in the model wrapper and a test case that could be strengthened to cover more scenarios. My detailed feedback and suggestions are in the comments below.

Comment on lines 2223 to 2225
if hasattr(features, "shape"):
    seq_len = (features.shape[-1] - 1) // 8 + 1
    mm_inputs["audio_seq_lengths"] = [seq_len] * len(audios)

Severity: high

The current implementation for calculating audio_seq_lengths assumes all audios in a batch have the same length. It computes a single seq_len based on the padded feature length and applies it to all audio files. This can lead to an incorrect number of placeholder tokens for shorter audio files in a batch with variable-length audios.

The fallback path for Hugging Face's feature_extractor is more robust as it uses the attention_mask to determine the actual length of each audio. I recommend a similar approach here. Please check if the liquid_audio processor can return an attention mask or a list of lengths. If not, you might need to compute the sequence lengths based on the lengths of the audios_regularized list before they are padded and passed to the audio_processor.
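
A hedged sketch of the per-audio variant this suggests, as a drop-in for the snippet above (the mel hop size and the semantics of audios_regularized are assumptions; only the (n - 1) // 8 + 1 downsampling rule is taken from the reviewed code):

# Derive each audio's own feature length before padding, then apply the
# downsampling rule per item instead of once for the padded batch.
hop_length = 160  # assumed mel hop size; read it from the processor in practice
frame_counts = [1 + len(audio) // hop_length for audio in audios_regularized]
mm_inputs["audio_seq_lengths"] = [(n - 1) // 8 + 1 for n in frame_counts]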

Comment on lines +82 to +85
self.generation_config = GenerationConfig(
    eos_token_id=config.eos_token_id if hasattr(config, "eos_token_id") else 7,
    pad_token_id=config.pad_token_id if hasattr(config, "pad_token_id") else 0,
)

Severity: medium

The GenerationConfig is initialized with hardcoded fallback values for eos_token_id (7) and pad_token_id (0). Using such magic numbers is not ideal as it can lead to subtle bugs if the model's actual token IDs are different. The PR description mentions special tokens with IDs 17, 128, and 129, but not 7 or 0, which makes these defaults more concerning.

It would be more robust to ensure these values are correctly populated from the model's configuration. If these are indeed fixed values for this model family, consider defining them as named constants with explanatory comments.

Suggested change

- self.generation_config = GenerationConfig(
-     eos_token_id=config.eos_token_id if hasattr(config, "eos_token_id") else 7,
-     pad_token_id=config.pad_token_id if hasattr(config, "pad_token_id") else 0,
- )
+ self.generation_config = GenerationConfig(
+     eos_token_id=getattr(config, "eos_token_id", None),
+     pad_token_id=getattr(config, "pad_token_id", None),
+ )

Follow-up commits:

- Fix the audio seq_lengths calculation to handle variable-length audios (it previously assumed all audios in a batch had the same length).
- Add comments documenting the magic-number token IDs (7 = <|im_end|>, 0 = <unk>).
- Improve test coverage with three additional test cases: multiple audio placeholders, text-only messages, and get_mm_inputs with no processor.
- Handle tied weights in depth_embeddings when saving a merged model: embedding.weight and to_logits.weight are shared in each depth-embedding layer, which caused save_pretrained to fail without this fix (see the sketch below).
- Add detection of merged/exported models (safetensors format) and load them by first creating the base model structure from liquid_audio, then applying the merged weights from the safetensors files.
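
One common way to declare such weight sharing is the _tied_weights_keys hook on PreTrainedModel. A hedged sketch follows; the key pattern is guessed from the commit message above, not taken from the PR:

from transformers import GenerationMixin, PreTrainedModel

class LFM2AudioModelForCausalLM(PreTrainedModel, GenerationMixin):
    # Declare the shared depth-embedding tensors so save_pretrained serializes
    # a single copy instead of failing on duplicated storage under safetensors.
    _tied_weights_keys = ["depth_embeddings.*.to_logits.weight"]  # pattern assumed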
