Device Movement Error with 4-bit Quantized LLaMA 3.1 Model Loading #36272

Pritidhrita opened this issue Feb 19, 2025 · 1 comment


Pritidhrita commented Feb 19, 2025

System Info

I'm running into a persistent issue when trying to load the LLaMA 3.1 8B model with 4-bit quantization. No matter what configuration I try, I get this error during initialization:

```
ValueError: `.to` is not supported for `4-bit` or `8-bit` bitsandbytes models. Please use the model as it is, since the model has already been set to the correct devices and casted to the correct `dtype`.
```
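For context, transformers raises this error whenever `.to()` or `.cuda()` is called on a bitsandbytes-quantized model after loading, whether directly in user code or indirectly by a wrapper (e.g. a Trainer placing the model on a device). A minimal sketch that reproduces the message — the explicit `.to("cuda")` call here is illustrative, not taken from the scripts below:

```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
    device_map="auto",  # accelerate already places the quantized weights
)

# Any post-hoc device move triggers the reported ValueError:
model.to("cuda")
```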

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Environment:

Python: 3.10
Transformers: Latest version
PyTorch: Latest version
GPU: 85.05 GB memory available
CUDA: Properly installed and available

What I've tried:

Loading with a BitsAndBytesConfig:
```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    llm_int8_has_fp16_weight=True,
)

base_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",
    quantization_config=bnb_config,
    trust_remote_code=True,
    use_cache=True,
    device_map="auto",
    max_memory={0: "24GiB"},
)
```
Loading without device mapping:
```python
model_kwargs = {
    "trust_remote_code": True,
    "load_in_4bit": True,
    "torch_dtype": torch.float16,
    "use_cache": True,
}

# Presumably passed through like this (the call itself was not shown above):
base_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",
    **model_kwargs,
)
```

Clearing CUDA cache and running garbage collection beforehand (see the sketch below).
Experimenting with different device mapping strategies.
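For completeness, the cache-clearing step mentioned above typically looks like this (a minimal sketch; the exact code was not shared in the report):

```python
import gc
import torch

gc.collect()              # release Python references to dead tensors
torch.cuda.empty_cache()  # return cached, unused CUDA memory to the driver
```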
Expected behavior

Even with ample GPU memory (85.05 GB) and confirmed CUDA availability, I can't get the model to load without hitting this device-movement error. Other models load fine with quantization, so I'm not sure what's special about this setup.

Any ideas on how to resolve this or work around the error? Thanks in advance for your help!


@Rocketknight1 (Member) commented:

cc @SunMarc @muellerzr
