Enable multiple LoRa adapters #2010
Merged · 41 commits
- db3d8e6 (drbh) feat: first draft load multiple lora
- 0a6ea7f (drbh) feat: load weights within layer and refactor lora pass
- a046c30 (drbh) fix: refactor and reduce lora math
- c661631 (drbh) feat: baseline impl single request multi lora support
- 8b50f4b (drbh) feat: prefer lorax implementation and port loading logic
- d5f21d5 (drbh) fix: prefer adapter_data and refactors
- 8984ce6 (drbh) feat: perfer loraxs custom punica kernels and add mlp loras
- ad088d5 (drbh) fix: adjust batch for bgmv
- c927376 (drbh) fix: adjust adapter_segments logic when in batch
- 73eb2ae (drbh) fix: refactor and move changes to v3 proto
- 88bd5c2 (drbh) fix: pass model_id for all flash causal lms
- dc0f765 (drbh) fix: pass model_id for all causal and seq2seq lms
- 9c45d34 (drbh) fix: add model_id to model test
- de56a81 (drbh) feat: add lora support to mistral and refactors
- 68399c1 (drbh) feat: prefer model id in request
- 81707bf (drbh) fix: include rust code for adapter id
- 43ec9df (drbh) feat: bump launcher and add new lora docs
- 611225f (drbh) feat: support base model generation and refactors
- a563a93 (drbh) fix: rename doc to retry ci build
- 91f4072 (drbh) feat: support if vlm models
- b116927 (drbh) fix: add adapter_data param and avoid missing layers
- 1deb372 (drbh) fix: add adapter_data param to phi and neox
- 101b95a (drbh) fix: update all models forwards to include adapter_data
- ce40ad2 (drbh) fix: add model_id to IdeficsCausalLM
- 1be1ebc (datavistics) Update lora.md
- d6cf63c (datavistics) Update lora.md
- aa88c4f (drbh) fix: add lora kernel to dockerfile, support running without kernels a…
- 06c3254 (drbh) fix: avoid dockerfile conflict
- 0e1c28c (drbh) fix: merge 'main' into lora-internal to resolve conflicts
- 1104885 (drbh) Merge branch 'main' into lora-internal
- 224455f (drbh) Merge branch 'main' into lora-internal
- 4f1543d (drbh) fix: refactors and adjust flash llama lora logic
- ce70fce (drbh) fix: skip llama test due to CI issue (temp)
- c9e4526 (drbh) fix: skip llama test CI (temp) 2
- a07b612 (drbh) fix: revert skips and prefer updated ci token for tests
- 3c9b28e (drbh) fix: refactors and helpful comments
- c927cff (drbh) fix: add noop in TensorParallelAdapterRowLinear too
- f94f2b3 (drbh) fix: refactor and move shard_lora_weights logic
- 0d496ba (drbh) Merge branch 'main' into lora-internal
- a2d821c (drbh) fix: exit early if no adapter_data
- 59575fe (drbh) Merge branch 'main' into lora-internal
The PR adds a new documentation page, `lora.md`:

# LoRA (Low-Rank Adaptation)

## What is LoRA?

LoRA is a technique for efficiently fine-tuning a model while updating only a small portion of its weights. This is useful when you have a large model that has been pre-trained on a large dataset, but you want to fine-tune it on a smaller dataset or for a specific task.

LoRA works by adding a small number of additional weights to the model, which are used to adapt it to the new dataset or task. These additional weights are learned during the fine-tuning process, while the rest of the model's weights are kept fixed.
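As a rough sketch (illustrative only, not TGI's implementation), a LoRA-adapted linear layer computes the frozen base projection plus a low-rank correction learned during fine-tuning:

```python
import numpy as np

# Illustrative LoRA forward pass: y = W x + scaling * B (A x).
# W is frozen; only the small matrices A and B are trained.
rng = np.random.default_rng(0)
d_in, d_out, rank = 16, 16, 4

W = rng.standard_normal((d_out, d_in))   # frozen base weight
A = rng.standard_normal((rank, d_in))    # trained down-projection
B = np.zeros((d_out, rank))              # trained up-projection, zero-initialized
scaling = 2.0                            # conventionally alpha / rank

x = rng.standard_normal(d_in)
y = W @ x + scaling * (B @ (A @ x))

# With B zero-initialized, the adapter starts out as a no-op,
# so fine-tuning begins from the base model's behavior.
assert np.allclose(y, W @ x)
```

Because `A` and `B` together hold only `rank * (d_in + d_out)` parameters instead of `d_in * d_out`, the adapter is a tiny fraction of the base layer's size.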
## How is it used?

LoRA can be used in many ways, and the community is always finding new ones. Here are some examples of how you can use LoRA:

At its core, LoRA fine-tunes a large language model on a smaller dataset, but the use cases span a wide range of applications, such as:

- fine-tuning a language model on a small dataset
- fine-tuning a language model on a domain-specific dataset
- fine-tuning a language model on a dataset with limited labels
## Optimizing Inference with LoRA

LoRA adapters can be used during inference by multiplying the adapter weights with the model weights at each specified layer. This process can be computationally expensive, but thanks to work by [punica-ai](https://github.com/punica-ai/punica) and the [lorax](https://github.com/predibase/lorax) team, optimized kernels and frameworks have been developed to make this process more efficient. TGI leverages these optimizations to provide fast and efficient inference with multiple LoRA models.
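The core idea behind these batched kernels (a toy sketch, not the actual CUDA implementation) is that each request in a batch can select a different adapter, and the low-rank updates are applied in one gathered operation:

```python
import numpy as np

# Toy sketch of batched multi-adapter LoRA, the idea behind punica-style
# gathered (BGMV) kernels: every request picks its own adapter's A/B pair.
rng = np.random.default_rng(0)
batch, d_in, d_out, rank, n_adapters = 4, 8, 8, 2, 3

W = rng.standard_normal((d_out, d_in))                # shared frozen base weight
A = rng.standard_normal((n_adapters, rank, d_in))     # stacked down-projections
B = rng.standard_normal((n_adapters, d_out, rank))    # stacked up-projections
adapter_idx = np.array([0, 2, 2, 1])                  # per-request adapter choice

X = rng.standard_normal((batch, d_in))
base = X @ W.T
# Gather each request's adapter and apply its low-rank update in one shot.
delta = np.einsum("bor,bri,bi->bo", B[adapter_idx], A[adapter_idx], X)
out = base + delta
assert out.shape == (batch, d_out)
```

The real kernels fuse this gather-and-multiply on the GPU and segment the batch by adapter, which is what the `adapter_segments` logic in the commit list refers to.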
## Serving multiple LoRA adapters with TGI

Once a LoRA model has been trained, it can be used to generate text or perform other tasks just like a regular language model. However, because the model has been fine-tuned on a specific dataset, it may perform better on that dataset than a model that has not been fine-tuned.

In practice, it's often useful to have multiple LoRA models, each fine-tuned on a different dataset or for a different task. This allows you to use the model best suited to a particular task or dataset.

Text Generation Inference (TGI) now supports loading multiple LoRA models at startup that can be used in generation requests. This feature is available starting from version `~2.0.6` and is compatible with LoRA models trained using the `peft` library.
### Specifying LoRA models

To use LoRA in TGI, specify the list of LoRA models to load via the `LORA_ADAPTERS` environment variable when starting the server. For example:

```bash
LORA_ADAPTERS=predibase/customer_support,predibase/dbpedia
```
In the server logs, you will see the following messages:

```txt
Loading adapter weights into model: predibase/customer_support
Loading adapter weights into model: predibase/dbpedia
```
## Generate text

You can then use these adapters in generation requests by specifying the `adapter_id` parameter in the request payload. For example:

```bash
curl 127.0.0.1:3000/generate \
    -X POST \
    -H 'Content-Type: application/json' \
    -d '{
        "inputs": "Hello who are you?",
        "parameters": {
            "max_new_tokens": 40,
            "adapter_id": "predibase/customer_support"
        }
    }'
```
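The same request can be driven from Python. The helper below is hypothetical (not part of TGI or its client library); it just builds the JSON body for the `/generate` route shown above:

```python
import json

def build_generate_payload(inputs, adapter_id=None, max_new_tokens=40):
    """Build the JSON body for TGI's /generate route (hypothetical helper)."""
    parameters = {"max_new_tokens": max_new_tokens}
    if adapter_id is not None:
        # Omitting adapter_id routes the request to the base model.
        parameters["adapter_id"] = adapter_id
    return {"inputs": inputs, "parameters": parameters}

body = build_generate_payload(
    "Hello who are you?", adapter_id="predibase/customer_support"
)
print(json.dumps(body, indent=2))
```

POSTing this body to `127.0.0.1:3000/generate` with a `Content-Type: application/json` header is equivalent to the curl command above.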
> **Note:** The LoRA feature is new and still being improved. If you encounter any issues or have any feedback, please let us know by opening an issue on the [GitHub repository](https://github.com/huggingface/text-generation-inference/issues/new/choose). Additionally, documentation and an improved client library will be published soon.

An updated tutorial with detailed examples will be published soon. Stay tuned!
The PR also adds make targets for building lorax's punica kernels:

```makefile
lorax_punica_commit := c71861a653412267dc27ec86013dd945ce3474bc

build-lorax-punica:
	if [ ! -d 'lorax-punica' ]; then \
		git clone --no-checkout https://github.com/predibase/lorax.git lorax-punica; \
	fi
	cd lorax-punica && git sparse-checkout set server/punica_kernels && git checkout $(lorax_punica_commit)
	cd lorax-punica && git submodule update --init --recursive
	cd lorax-punica/server/punica_kernels && python setup.py build

install-lorax-punica: build-lorax-punica
	cd lorax-punica/server/punica_kernels && python setup.py install
```
A new adapters package `__init__.py`, ported from lorax:

```python
# Origin: https://github.com/predibase/lorax
# Path: lorax/server/lorax_server/adapters/__init__.py
# License: Apache License Version 2.0, January 2004

from text_generation_server.adapters.weights import (
    AdapterBatchData,
    AdapterBatchMetadata,
)

__all__ = [
    "AdapterBatchData",
    "AdapterBatchMetadata",
]
```
The adapter configuration base classes, also ported from lorax:

```python
# Origin: https://github.com/predibase/lorax
# Path: lorax/server/lorax_server/adapters/config.py
# License: Apache License Version 2.0, January 2004

from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import TYPE_CHECKING, Dict, Optional, Set, Tuple

import torch

from text_generation_server.adapters.weights import AdapterWeights

if TYPE_CHECKING:
    from text_generation_server.models.model import Model


@dataclass
class ModuleMap:
    module_name: str
    module_weights: Dict[str, Tuple[torch.Tensor, str]]


@dataclass
class AdapterConfig(ABC):
    base_model_name_or_path: str

    @abstractmethod
    def map_weights_for_model(
        self,
        adapter_weights: Dict[int, AdapterWeights],
        weight_names: Tuple[str],
    ) -> Tuple[ModuleMap, Set[str]]:
        pass

    @abstractmethod
    def load_batched_adapter_weights(
        self,
        model: "Model",
        module_map: ModuleMap,
        layer_type: str,
        unused_weight_names: Set[str],
        dynamic: bool,
    ) -> Optional[AdapterWeights]:
        pass
```
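To illustrate the abstract contract, here is a hypothetical subclass (simplified names, not code from this PR): a concrete adapter type must supply a weight-mapping implementation before it can be instantiated.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass

@dataclass
class AdapterConfigSketch(ABC):
    """Simplified stand-in for the AdapterConfig ABC above."""
    base_model_name_or_path: str

    @abstractmethod
    def map_weights_for_model(self, adapter_weights, weight_names):
        ...

@dataclass
class DummyLoraConfig(AdapterConfigSketch):
    """Hypothetical concrete config; a real one would map peft weight names."""
    r: int = 8  # hypothetical LoRA rank field

    def map_weights_for_model(self, adapter_weights, weight_names):
        # Trivial mapping for illustration: pass weights through,
        # reporting no unused weight names.
        return adapter_weights, set()

cfg = DummyLoraConfig(base_model_name_or_path="meta-llama/Llama-2-7b-hf")
mapped, unused = cfg.map_weights_for_model({"q_proj": [1, 2]}, ("q_proj",))
```

The real subclass in TGI resolves each adapter's weight names against the target model's layers and returns the set of names it could not place.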