Can MPS use FP16 when training? Why can't I? #32648
Comments
Keeping the other issue closed and commenting over here: #32035 (comment) TL;DR: it's in the torch nightlies; PyTorch only merged support last week. Once it's in a stable release we'll enable it.
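For anyone who wants to verify, here is a minimal sketch to check fp16 autocast on MPS — assuming a recent torch nightly where the support mentioned above has been merged (torch.autocast accepts device_type="mps" on those builds):

import torch

# Sketch: confirm fp16 autocast works on this torch build (assumes a
# nightly with MPS autocast support; older stable releases error here).
assert torch.backends.mps.is_available(), "MPS backend not available"
with torch.autocast(device_type="mps", dtype=torch.float16):
    x = torch.randn(4, 4, device="mps")
    y = x @ x
print(y.dtype)  # torch.float16 when autocast is active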
OK! Thanks a lot!
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Please look into this issue, I am getting the same problem.
@muellerzr I asked PyTorch and they told me to ask Hugging Face, since PyTorch has already added this feature. So what is the status now?
System Info
Device: Apple M3 Pro
OS: macOS Sonoma 14.1
packages:
datasets 2.20.1.dev0
evaluate 0.4.2
huggingface-hub 0.23.5
tokenizers 0.19.1
torch 2.5.0.dev20240717
torchaudio 2.4.0.dev20240717
torchvision 0.20.0.dev20240717
Who can help?
@ArthurZucker @muellerzr
Information
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
Reproduction
import os
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    TrainingArguments,
    Trainer,
    GenerationConfig,
    DataCollatorForSeq2Seq
)
from datasets import Dataset, load_dataset
from peft import LoraConfig, TaskType, get_peft_model, PeftModel, PeftConfig
import torch
ds_name = input('Enter the name of the dataset to train on (csv file, without the extension): ')
model_name = input('Enter the name of the model to train (subfolder name): ')
save_name = input('Enter the name to save the LoRA under: ')
current_dir = os.getcwd()
save_dir = os.path.join(current_dir, 'model_saved', save_name)
os.makedirs(save_dir, exist_ok=True)
target_file_path = os.path.join(current_dir, 'datasets', ds_name + '.csv')
model_dir = os.path.join(current_dir, 'model', model_name)
dataset = load_dataset("csv", data_files=target_file_path, split="train")
tokenizer = AutoTokenizer.from_pretrained(model_dir)
tokenizer.padding_side = "right"
tokenizer.pad_token_id = 2
def process_func(example):
    MAX_LENGTH = 384
    instruction = example.get("instruction", "")
    input_text = example.get("input", "")
    prompt = f"Human: {instruction}\n{input_text}".strip() if input_text else f"Human: {instruction}".strip()
    instruction_tokenized = tokenizer(prompt + "\n\nAssistant: ", add_special_tokens=False)
    response_tokenized = tokenizer(example["output"], add_special_tokens=False)
    # Mask the prompt tokens with -100 so only the response contributes to the loss
    input_ids = instruction_tokenized["input_ids"] + response_tokenized["input_ids"] + [tokenizer.eos_token_id]
    attention_mask = instruction_tokenized["attention_mask"] + response_tokenized["attention_mask"] + [1]
    labels = [-100] * len(instruction_tokenized["input_ids"]) + response_tokenized["input_ids"] + [tokenizer.eos_token_id]
    # Truncate to the maximum sequence length
    if len(input_ids) > MAX_LENGTH:
        input_ids = input_ids[:MAX_LENGTH]
        attention_mask = attention_mask[:MAX_LENGTH]
        labels = labels[:MAX_LENGTH]
    return {
        "input_ids": input_ids,
        "attention_mask": attention_mask,
        "labels": labels
    }
tokenized_dataset = dataset.map(process_func, remove_columns=dataset.column_names)
print(tokenized_dataset)
device = torch.device("mps")
model = AutoModelForCausalLM.from_pretrained(
    model_dir,
    low_cpu_mem_usage=True,
    torch_dtype=torch.half
)
model = model.to(device)
config = LoraConfig(task_type=TaskType.CAUSAL_LM)
model = get_peft_model(model, config)
model.print_trainable_parameters()
model = model.half()  # must call .half(); assigning the bound method (model.half) passes a function, not a model, to the Trainer
args = TrainingArguments(
    output_dir=save_dir,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    logging_steps=10,
    num_train_epochs=2,
)
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized_dataset,
    data_collator=DataCollatorForSeq2Seq(tokenizer=tokenizer, padding=True),
)
trainer.train()
Expected behavior
Please make transformers stop raising the error below. Thanks, everyone!
ValueError Traceback (most recent call last)
Cell In[16], line 1
----> 1 trainer = Trainer(
2 model=model,
3 args=args,
4 train_dataset=tokenized_dataset,
5 data_collator=DataCollatorForSeq2Seq(tokenizer=tokenizer, padding=True),
6 )
File ~/Data/AIHub/Trans-Penv/transformers/src/transformers/trainer.py:409, in Trainer.__init__(self, model, args, data_collator, train_dataset, eval_dataset, tokenizer, model_init, compute_metrics, callbacks, optimizers, preprocess_logits_for_metrics)
406 self.deepspeed = None
407 self.is_in_train = False
--> 409 self.create_accelerator_and_postprocess()
411 # memory metrics - must set up as early as possible
412 self._memory_tracker = TrainerMemoryTracker(self.args.skip_memory_metrics)
File ~/Data/AIHub/Trans-Penv/transformers/src/transformers/trainer.py:4648, in Trainer.create_accelerator_and_postprocess(self)
4645 args.update(accelerator_config)
4647 # create accelerator object
-> 4648 self.accelerator = Accelerator(**args)
   4649 # some Trainer classes need to use `gather` instead of `gather_for_metrics`, thus we store a flag
   4650 self.gather_function = self.accelerator.gather_for_metrics
File /opt/anaconda3/envs/tfs/lib/python3.12/site-packages/accelerate/accelerator.py:467, in Accelerator.__init__(self, device_placement, split_batches, mixed_precision, gradient_accumulation_steps, cpu, dataloader_config, deepspeed_plugin, fsdp_plugin, megatron_lm_plugin, rng_types, log_with, project_dir, project_config, gradient_accumulation_plugin, dispatch_batches, even_batches, use_seedable_sampler, step_scheduler_with_optimizer, kwargs_handlers, dynamo_backend)
...
--> 467 raise ValueError(f"fp16 mixed precision requires a GPU (not {self.device.type!r}).")
468 kwargs = self.scaler_handler.to_kwargs() if self.scaler_handler is not None else {}
469 if self.distributed_type == DistributedType.FSDP:
ValueError: fp16 mixed precision requires a GPU (not 'mps').
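Until that Accelerate support lands in a stable release (per the maintainer comment above), one hedged workaround sketch is to train without fp16 mixed precision on MPS — load the base model in full precision and leave fp16 out of TrainingArguments. This is an assumption-based sketch, not a confirmed fix; variable names reuse the reproduction above:

# Workaround sketch (assumption, not a confirmed fix): skip fp16 mixed
# precision so Accelerator's device check never triggers on 'mps'.
model = AutoModelForCausalLM.from_pretrained(
    model_dir,
    low_cpu_mem_usage=True,
    torch_dtype=torch.float32,  # full precision avoids half/mixed-precision issues on MPS
).to(torch.device("mps"))
model = get_peft_model(model, LoraConfig(task_type=TaskType.CAUSAL_LM))
args = TrainingArguments(
    output_dir=save_dir,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    logging_steps=10,
    num_train_epochs=2,
    # do not pass fp16=True here: Accelerate rejects fp16 mixed precision on 'mps'
)
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized_dataset,
    data_collator=DataCollatorForSeq2Seq(tokenizer=tokenizer, padding=True),
)
trainer.train()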