Loading bigger models is very slow using AutoModelForCausalLM.from_pretrained
#562
Comments
This feels like an accelerate issue, so pinging @SunMarc and @muellerzr once again, but yell if I should ping someone else!
It shouldn't take that much time to load these models. Can you try to reproduce this on Colab with a smaller model, or is it only happening on your machine? From what you said, loading a 7B might take you 10 min, which is way too long when using device_map.
cc @Wauplin 🤗
cc @Narsil
Hi @Narsil, @Wauplin & @ArthurZucker, please let me know if any fixes are possible for this. I suspect the same thing is happening when I am loading the model with vLLM as well - so there might be a broader issue.
Hi Team, 🙋🏻 Circling back on this issue again - please let me know of any potential fixes available or any direction to explore it further.
Have you seen this?
This problem goes away when I converted the weights to |
I thought |
Problem

The issue here for everyone seems to be mounted network disks. A lot of providers (AWS, Colab, etc.) actually use network disks mounted as local disks, which have very high latency. Network is extremely slow, and therefore issuing many small reads is about the slowest possible access pattern. There are ways to mitigate that when you are in charge of the mounting (by forcing buffered reads, essentially).

The reason this doesn't occur right after you have downloaded the model is that the file is still kept in RAM by the OS (the OS does pretty much everything possible to never read the disk, as that's the slowest operation it can do after network). The issue also exists on Windows WSL, where the filesystem (or the memory mapping, I'm not sure) is also extremely slow.

Solution

1/ It is very easy to work around the issue, just replace your loading code with:

```python
from safetensors.torch import load

with open(filename, "rb") as f:
    tensors = load(f.read())
```

What this does is force a single read of the entire file, read everything into RAM first, and only then move everything wherever you want. This is what loading PTH files does by default and why they are "faster" in this use case (why we don't want to do this by default is explained in the note below).

2/ You could even just preamble the actual load with a plain sequential read of the file, so the OS caches it in RAM first (a rough sketch follows this comment).

3/ Another option, which I also recommend, is to save the models on the NVMe drives that AWS provides with its most expensive machines. Those are actually local and super fast; even if you actually have to read from disk, the loads are going to be quite fast.

Note

Now, there is an issue with loading the file directly into RAM all the time. The reason is that most large models (30B+) are usually too large to fit on a single GPU (it's always cheaper to run them on smaller hardware and multiple GPUs if they fit). Those speeds are only possible because we do not load everything at once, and load only what we need. The problems here are direct issues from various parts of the system masquerading as others.
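As a rough illustration of option 2/ above, here is a minimal sketch that warms the OS page cache by reading each safetensors shard once before calling from_pretrained. The local path, chunk size, and dtype are assumptions for illustration, not taken from this thread.

```python
# Sketch only: pre-read every shard so the subsequent (mmap-backed) load hits
# the OS page cache instead of issuing many small reads against a slow disk.
import glob
import os

import torch
from transformers import AutoModelForCausalLM

model_path = "/opt/ml/model"  # hypothetical local directory holding the shards

for shard in glob.glob(os.path.join(model_path, "*.safetensors")):
    with open(shard, "rb") as f:
        # One long sequential read per shard; 64 MiB chunks bound RAM usage.
        while f.read(64 * 1024 * 1024):
            pass

model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
```

This only helps if the machine has enough free RAM for the OS to keep the shards cached between the pre-read and the actual load.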
An example of a downstream fix: huggingface/diffusers#10305
On Windows (at least) the problem exists when reading from anything that isn't a locally-attached NVMe drive. Even SATA SSDs with a typical throughput of ~520 MB/s will read safetensors at perhaps 160-220 MB/s. This is the best case and it gets worse from there. Last summer I grabbed a Procmon64 capture of storage activity during a safetensors load and it wasn't pretty. NVMe is the ideal medium for dealing with this choppy, seeky, noncontiguous, and arguably pathological small-block I/O pattern; with everything else, the micro- (or milli-)seconds of extra latency for each I/O really add up. A spinning hard disk is the worst case: what should be a long, steady stream of large sequential I/Os (which hard drives handle quite well) is actually an I/O blender that confounds read-ahead caching and wastes time on excessive seek activity. I maintain that good development practice will accommodate HDDs along with any reasonable amount of latency getting to/from secondary storage. It's clear that many devs are blessed with entirely local low-latency storage, given the plethora of arm-twisting, gaslighting, and outright dismissal that surrounded this issue throughout the entirety of 2024.
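To observe the effect described above on your own storage, a small timing sketch comparing the default mmap-backed load_file with a single buffered read may help; the filename is a placeholder, and each variant should be run in a fresh process (or after dropping the OS page cache), otherwise the cache left by the first load will mask the difference.

```python
# Sketch: compare many small mmap-driven reads against one large sequential read.
import time

from safetensors.torch import load, load_file

filename = "model-00001-of-00002.safetensors"  # placeholder shard on the slow drive

t0 = time.perf_counter()
tensors_mmap = load_file(filename)      # mmap-backed: scattered small reads
t1 = time.perf_counter()

with open(filename, "rb") as f:
    tensors_buffered = load(f.read())   # one big sequential read into RAM
t2 = time.perf_counter()

print(f"load_file (mmap):     {t1 - t0:.1f}s")
print(f"single buffered read: {t2 - t1:.1f}s")
```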
System Info

transformers version: 4.45.0

Who can help?

@ArthurZucker @SunMarc
Information

Tasks

An officially supported task in the examples folder (such as GLUE/SQuAD, ...)

Reproduction
I am spawning a g5.12xlarge GPU machine on AWS SageMaker and I am loading a locally saved model using this script (a sketch of the kind of load appears below). This is the problem with almost all the models I am trying - rhymes-ai/Aria can be used to reproduce it.
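A minimal sketch of this kind of load, assuming a locally saved copy of the model and device_map="auto"; the path, dtype, and trust_remote_code flag are illustrative, not the reporter's actual script.

```python
# Illustrative reconstruction of the slow load; path and dtype are assumptions.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "/home/ec2-user/SageMaker/models/Aria",  # hypothetical local save of rhymes-ai/Aria
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,  # may be required for models that ship custom code
)
```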
Expected behavior

The last line takes forever to load the model (>40-50 mins). I have observed the same behaviour for multiple other models as well.
Things I have tried/observed:

- Extracted the device_map of an already loaded model and passed it on to the loading constructor as device_map instead of using auto, but it didn't solve the issue (see the sketch after this list).
- Made sure GPU memory usage is back to 0 before reloading, so no leftover layers are remaining from the first load.
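A sketch of the first bullet above, assuming a model first loaded with device_map="auto"; the path and dtype are placeholders, and hf_device_map is the attribute accelerate sets on the model when a device map is used.

```python
# Sketch: reuse the device map from a first load instead of recomputing "auto".
import torch
from transformers import AutoModelForCausalLM

model_path = "/path/to/local/model"  # placeholder

first = AutoModelForCausalLM.from_pretrained(
    model_path, torch_dtype=torch.bfloat16, device_map="auto"
)
saved_map = first.hf_device_map          # dict mapping module names to devices

del first
torch.cuda.empty_cache()                 # so no leftover layers remain from the first load

second = AutoModelForCausalLM.from_pretrained(
    model_path, torch_dtype=torch.bfloat16, device_map=saved_map
)
```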