
Commit 9df70d9

Refactor bigdl.llm to ipex_llm (#24)
* Rename `bigdl/llm` to `ipex_llm`
* Remove `python/llm/src/bigdl`
* Change `from bigdl.llm` imports to `from ipex_llm`
1 parent cc5806f commit 9df70d9
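
In practice the rename is a drop-in import change for downstream code; a minimal before/after sketch (assuming the `ipex-llm` package is installed in place of `bigdl-llm`):

```python
# Before (bigdl-llm):
#   from bigdl.llm.transformers import AutoModelForCausalLM
# After (ipex-llm), same transformers-style API:
from ipex_llm.transformers import AutoModelForCausalLM

# '/path/to/model/' is the placeholder path used throughout the docs changed below.
model = AutoModelForCausalLM.from_pretrained('/path/to/model/', load_in_4bit=True)
```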

464 files changed (+918, -940 lines)


README.md (+2, -2)

@@ -86,7 +86,7 @@ You may apply INT4 optimizations to any Hugging Face *Transformers* models as fo

 ```python
 #load Hugging Face Transformers model with INT4 optimizations
-from bigdl.llm.transformers import AutoModelForCausalLM
+from ipex_llm.transformers import AutoModelForCausalLM
 model = AutoModelForCausalLM.from_pretrained('/path/to/model/', load_in_4bit=True)

 #run the optimized model on CPU
@@ -113,7 +113,7 @@ You may apply INT4 optimizations to any Hugging Face *Transformers* models as fo

 ```python
 #load Hugging Face Transformers model with INT4 optimizations
-from bigdl.llm.transformers import AutoModelForCausalLM
+from ipex_llm.transformers import AutoModelForCausalLM
 model = AutoModelForCausalLM.from_pretrained('/path/to/model/', load_in_4bit=True)

 #run the optimized model on Intel GPU

docker/llm/README.md (+4, -4)

@@ -223,7 +223,7 @@ This controller manages the distributed workers.

 ##### Launch the model worker(s)
 ```bash
-python3 -m bigdl.llm.serving.model_worker --model-path lmsys/vicuna-7b-v1.3 --device cpu
+python3 -m ipex_llm.serving.model_worker --model-path lmsys/vicuna-7b-v1.3 --device cpu
 ```
 Wait until the process finishes loading the model and you see "Uvicorn running on ...". The model worker will register itself to the controller.

@@ -252,7 +252,7 @@ python3 -m fastchat.serve.controller
 Then, launch the model worker(s):

 ```bash
-python3 -m bigdl.llm.serving.model_worker --model-path lmsys/vicuna-7b-v1.3 --device cpu
+python3 -m ipex_llm.serving.model_worker --model-path lmsys/vicuna-7b-v1.3 --device cpu
 ```

 Finally, launch the RESTful API server
@@ -319,7 +319,7 @@ This controller manages the distributed workers.

 ##### Launch the model worker(s)
 ```bash
-python3 -m bigdl.llm.serving.model_worker --model-path lmsys/vicuna-7b-v1.3 --device xpu
+python3 -m ipex_llm.serving.model_worker --model-path lmsys/vicuna-7b-v1.3 --device xpu
 ```
 Wait until the process finishes loading the model and you see "Uvicorn running on ...". The model worker will register itself to the controller.

@@ -346,7 +346,7 @@ python3 -m fastchat.serve.controller
 Then, launch the model worker(s):

 ```bash
-python3 -m bigdl.llm.serving.model_worker --model-path lmsys/vicuna-7b-v1.3 --device xpu
+python3 -m ipex_llm.serving.model_worker --model-path lmsys/vicuna-7b-v1.3 --device xpu
 ```

 Finally, launch the RESTful API server

docker/llm/inference/xpu/docker/chat.py (+1, -1)

@@ -23,7 +23,7 @@
 from transformers.tools.agents import StopSequenceCriteria
 from transformers.generation.stopping_criteria import StoppingCriteriaList
 from colorama import Fore
-from bigdl.llm import optimize_model
+from ipex_llm import optimize_model
 SYSTEM_PROMPT = "A chat between a curious human <human> and an artificial intelligence assistant <bot>.\
 The assistant gives helpful, detailed, and polite answers to the human's questions."
 HUMAN_ID = "<human>"

docker/llm/serving/cpu/docker/entrypoint.sh (+4, -4)

@@ -135,9 +135,9 @@ else
 done

 if [ "$worker_type" == "model_worker" ]; then
-worker_type="bigdl.llm.serving.model_worker"
+worker_type="ipex_llm.serving.model_worker"
 elif [ "$worker_type" == "vllm_worker" ]; then
-worker_type="bigdl.llm.serving.vllm_worker"
+worker_type="ipex_llm.serving.vllm_worker"
 fi

 if [[ -n $CONTROLLER_HOST ]]; then
@@ -220,9 +220,9 @@ else
 echo "Worker type: $worker_type"
 echo "Worker address: $worker_address"
 echo "Controller address: $controller_address"
-if [ "$worker_type" == "bigdl.llm.serving.model_worker" ]; then
+if [ "$worker_type" == "ipex_llm.serving.model_worker" ]; then
 python3 -m "$worker_type" --model-path $model_path --device cpu --host $worker_host --port $worker_port --worker-address $worker_address --controller-address $controller_address --stream-interval $stream_interval
-elif [ "$worker_type" == "bigdl.llm.serving.vllm_worker" ]; then
+elif [ "$worker_type" == "ipex_llm.serving.vllm_worker" ]; then
 python3 -m "$worker_type" --model-path $model_path --device cpu --host $worker_host --port $worker_port --worker-address $worker_address --controller-address $controller_address
 fi
 fi

docker/llm/serving/cpu/docker/model_adapter.py.patch (+1, -1)

@@ -9,7 +9,7 @@
 generation_config = GenerationConfig.from_pretrained(
 model_path, trust_remote_code=True
 )
-+ from bigdl.llm.transformers import AutoModelForCausalLM
++ from ipex_llm.transformers import AutoModelForCausalLM
 model = AutoModelForCausalLM.from_pretrained(
 model_path,
 config=config,

docker/llm/serving/xpu/docker/entrypoint.sh (+4, -4)

@@ -66,9 +66,9 @@ else
 done

 if [ "$worker_type" == "model_worker" ]; then
-worker_type="bigdl.llm.serving.model_worker"
+worker_type="ipex_llm.serving.model_worker"
 elif [ "$worker_type" == "vllm_worker" ]; then
-worker_type="bigdl.llm.serving.vllm_worker"
+worker_type="ipex_llm.serving.vllm_worker"
 fi

 if [[ -n $CONTROLLER_HOST ]]; then
@@ -127,9 +127,9 @@ else
 echo "Worker address: $worker_address"
 echo "Controller address: $controller_address"

-if [ "$worker_type" == "bigdl.llm.serving.model_worker" ]; then
+if [ "$worker_type" == "ipex_llm.serving.model_worker" ]; then
 python3 -m "$worker_type" --model-path $model_path --device xpu --host $worker_host --port $worker_port --worker-address $worker_address --controller-address $controller_address --stream-interval $stream_interval
-elif [ "$worker_type" == "bigdl.llm.serving.vllm_worker" ]; then
+elif [ "$worker_type" == "ipex_llm.serving.vllm_worker" ]; then
 python3 -m "$worker_type" --model-path $model_path --device xpu --host $worker_host --port $worker_port --worker-address $worker_address --controller-address $controller_address
 fi
 fi

docs/readthedocs/source/doc/LLM/Overview/KeyFeatures/finetune.md (+4, -4)

@@ -21,7 +21,7 @@ To help you better understand the finetuning process, here we use model [Llama-2
 First, load model using `transformers`-style API and **set it to `to('xpu')`**. We specify `load_in_low_bit="nf4"` here to apply 4-bit NormalFloat optimization. According to the [QLoRA paper](https://arxiv.org/pdf/2305.14314.pdf), using `"nf4"` could yield better model quality than `"int4"`.

 ```python
-from bigdl.llm.transformers import AutoModelForCausalLM
+from ipex_llm.transformers import AutoModelForCausalLM

 model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf",
 load_in_low_bit="nf4",
@@ -33,14 +33,14 @@ model = model.to('xpu')

 Then, we have to apply some preprocessing to the model to prepare it for training.
 ```python
-from bigdl.llm.transformers.qlora import prepare_model_for_kbit_training
+from ipex_llm.transformers.qlora import prepare_model_for_kbit_training
 model.gradient_checkpointing_enable()
 model = prepare_model_for_kbit_training(model)
 ```

 Next, we can obtain a Peft model from the optimized model and a configuration object containing the parameters as follows:
 ```python
-from bigdl.llm.transformers.qlora import get_peft_model
+from ipex_llm.transformers.qlora import get_peft_model
 from peft import LoraConfig
 config = LoraConfig(r=8,
 lora_alpha=32,
@@ -54,7 +54,7 @@ model = get_peft_model(model, config)
 ```eval_rst
 .. important::

-Instead of ``from peft import prepare_model_for_kbit_training, get_peft_model`` as we did for regular QLoRA using bitandbytes and cuda, we import them from ``bigdl.llm.transformers.qlora`` here to get a BigDL-LLM compatible Peft model. And the rest is just the same as regular LoRA finetuning process using ``peft``.
+Instead of ``from peft import prepare_model_for_kbit_training, get_peft_model`` as we did for regular QLoRA using bitandbytes and cuda, we import them from ``ipex_llm.transformers.qlora`` here to get a BigDL-LLM compatible Peft model. And the rest is just the same as regular LoRA finetuning process using ``peft``.
 ```

 ```eval_rst
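
For readers following the finetuning doc, a consolidated sketch of the snippets above with the renamed imports; the argument lists are trimmed to what the hunks show (the original doc passes additional arguments), so treat this as illustrative only:

```python
# Minimal QLoRA setup with the renamed ipex_llm imports; assumes an Intel GPU
# ("xpu") environment with ipex-llm, transformers and peft installed.
from ipex_llm.transformers import AutoModelForCausalLM
from ipex_llm.transformers.qlora import prepare_model_for_kbit_training, get_peft_model
from peft import LoraConfig

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf",
                                             load_in_low_bit="nf4")
model = model.to('xpu')

# Preprocessing before training, as in the doc.
model.gradient_checkpointing_enable()
model = prepare_model_for_kbit_training(model)

# Obtain a PEFT model; only the LoRA hyperparameters visible in the diff are kept.
config = LoraConfig(r=8, lora_alpha=32)
model = get_peft_model(model, config)
```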

docs/readthedocs/source/doc/LLM/Overview/KeyFeatures/hugging_face_format.md (+1, -1)

@@ -5,7 +5,7 @@ You may apply INT4 optimizations to any Hugging Face *Transformers* models as fo

 ```python
 # load Hugging Face Transformers model with INT4 optimizations
-from bigdl.llm.transformers import AutoModelForCausalLM
+from ipex_llm.transformers import AutoModelForCausalLM

 model = AutoModelForCausalLM.from_pretrained('/path/to/model/', load_in_4bit=True)
 ```

docs/readthedocs/source/doc/LLM/Overview/KeyFeatures/inference_on_gpu.md (+5, -5)

@@ -29,7 +29,7 @@ You could choose to use [PyTorch API](./optimize_model.html) or [`transformers`-

 # Take Llama-2-7b-chat-hf as an example
 from transformers import LlamaForCausalLM
-from bigdl.llm import optimize_model
+from ipex_llm import optimize_model

 model = LlamaForCausalLM.from_pretrained('meta-llama/Llama-2-7b-chat-hf', torch_dtype='auto', low_cpu_mem_usage=True)
 model = optimize_model(model) # With only one line to enable BigDL-LLM INT4 optimization
@@ -40,14 +40,14 @@ You could choose to use [PyTorch API](./optimize_model.html) or [`transformers`-

 When running LLMs on Intel iGPUs for Windows users, we recommend setting ``cpu_embedding=True`` in the ``optimize_model`` function. This will allow the memory-intensive embedding layer to utilize the CPU instead of iGPU.

-See the `API doc <../../../PythonAPI/LLM/optimize.html#bigdl.llm.optimize_model>`_ for ``optimize_model`` to find more information.
+See the `API doc <../../../PythonAPI/LLM/optimize.html#ipex_llm.optimize_model>`_ for ``optimize_model`` to find more information.

 Especially, if you have saved the optimized model following setps `here <./optimize_model.html#save>`_, the loading process on Intel GPUs maybe as follows:

 .. code-block:: python

 from transformers import LlamaForCausalLM
-from bigdl.llm.optimize import low_memory_init, load_low_bit
+from ipex_llm.optimize import low_memory_init, load_low_bit

 saved_dir='./llama-2-bigdl-llm-4-bit'
 with low_memory_init(): # Fast and low cost by loading model on meta device
@@ -65,7 +65,7 @@ You could choose to use [PyTorch API](./optimize_model.html) or [`transformers`-
 .. code-block:: python

 # Take Llama-2-7b-chat-hf as an example
-from bigdl.llm.transformers import AutoModelForCausalLM
+from ipex_llm.transformers import AutoModelForCausalLM

 # Load model in 4 bit, which convert the relevant layers in the model into INT4 format
 model = AutoModelForCausalLM.from_pretrained('meta-llama/Llama-2-7b-chat-hf', load_in_4bit=True)
@@ -82,7 +82,7 @@ You could choose to use [PyTorch API](./optimize_model.html) or [`transformers`-

 .. code-block:: python

-from bigdl.llm.transformers import AutoModelForCausalLM
+from ipex_llm.transformers import AutoModelForCausalLM

 saved_dir='./llama-2-bigdl-llm-4-bit'
 model = AutoModelForCausalLM.load_low_bit(saved_dir) # Load the optimized model

docs/readthedocs/source/doc/LLM/Overview/KeyFeatures/langchain_api.md (+4, -4)

@@ -7,8 +7,8 @@ You may run the models using the LangChain API in `bigdl-llm`.
 You may run any Hugging Face *Transformers* model (with INT4 optimiztions applied) using the LangChain API as follows:

 ```python
-from bigdl.llm.langchain.llms import TransformersLLM
-from bigdl.llm.langchain.embeddings import TransformersEmbeddings
+from ipex_llm.langchain.llms import TransformersLLM
+from ipex_llm.langchain.embeddings import TransformersEmbeddings
 from langchain.chains.question_answering import load_qa_chain

 embeddings = TransformersEmbeddings.from_model_id(model_id=model_path)
@@ -37,8 +37,8 @@ You may also convert Hugging Face *Transformers* models into native INT4 format,
 ```

 ```python
-from bigdl.llm.langchain.llms import LlamaLLM
-from bigdl.llm.langchain.embeddings import LlamaEmbeddings
+from ipex_llm.langchain.llms import LlamaLLM
+from ipex_llm.langchain.embeddings import LlamaEmbeddings
 from langchain.chains.question_answering import load_qa_chain

 # switch to ChatGLMEmbeddings/GptneoxEmbeddings/BloomEmbeddings/StarcoderEmbeddings to load other models

docs/readthedocs/source/doc/LLM/Overview/KeyFeatures/native_format.md (+2, -2)

@@ -10,13 +10,13 @@ You may also convert Hugging Face *Transformers* models into native INT4 format

 ```python
 # convert the model
-from bigdl.llm import llm_convert
+from ipex_llm import llm_convert
 bigdl_llm_path = llm_convert(model='/path/to/model/',
 outfile='/path/to/output/', outtype='int4', model_family="llama")

 # load the converted model
 # switch to ChatGLMForCausalLM/GptneoxForCausalLM/BloomForCausalLM/StarcoderForCausalLM to load other models
-from bigdl.llm.transformers import LlamaForCausalLM
+from ipex_llm.transformers import LlamaForCausalLM
 llm = LlamaForCausalLM.from_pretrained("/path/to/output/model.bin", native=True, ...)

 # run the converted model

docs/readthedocs/source/doc/LLM/Overview/KeyFeatures/optimize_model.md (+3, -3)

@@ -14,7 +14,7 @@ model = LlamaForCausalLM.from_pretrained('meta-llama/Llama-2-7b-chat-hf', torch_

 Then, just need to call `optimize_model` to optimize the loaded model and INT4 optimization is applied on model by default:
 ```python
-from bigdl.llm import optimize_model
+from ipex_llm import optimize_model

 # With only one line to enable BigDL-LLM INT4 optimization
 model = optimize_model(model)
@@ -31,7 +31,7 @@ Currently, ``low_bit`` supports options 'sym_int4', 'asym_int4', 'sym_int5', 'as
 You may apply symmetric INT8 optimization as follows:

 ```python
-from bigdl.llm import optimize_model
+from ipex_llm import optimize_model

 # Apply symmetric INT8 optimization
 model = optimize_model(model, low_bit="sym_int8")
@@ -51,7 +51,7 @@ model.save_low_bit(saved_dir)

 We recommend to use the context manager `low_memory_init` to quickly initiate a model instance with low cost, and then use `load_low_bit` to load the optimized low-bit model as follows:
 ```python
-from bigdl.llm.optimize import low_memory_init, load_low_bit
+from ipex_llm.optimize import low_memory_init, load_low_bit
 with low_memory_init(): # Fast and low cost by loading model on meta device
 model = LlamaForCausalLM.from_pretrained(saved_dir,
 torch_dtype="auto",

docs/readthedocs/source/doc/LLM/Overview/llm.md (+1, -1)

@@ -11,7 +11,7 @@ Here, let's take a relatively small LLM model, i.e [open_llama_3b_v2](https://hu
 Simply use one-line `transformers`-style API in `bigdl-llm` to load `open_llama_3b_v2` with INT4 optimization (by specifying `load_in_4bit=True`) as follows:

 ```python
-from bigdl.llm.transformers import AutoModelForCausalLM
+from ipex_llm.transformers import AutoModelForCausalLM

 model = AutoModelForCausalLM.from_pretrained(pretrained_model_name_or_path="openlm-research/open_llama_3b_v2",
 load_in_4bit=True)

docs/readthedocs/source/doc/LLM/Quickstart/install_linux_gpu.md (+2, -2)

@@ -112,7 +112,7 @@ Install the Miniconda as follows if you don't have conda installed on your machi

 python

-> from bigdl.llm.transformers import AutoModel, AutoModelForCausalLM
+> from ipex_llm.transformers import AutoModel, AutoModelForCausalLM
 ```

 > <img src="https://llm-assets.readthedocs.io/en/latest/_images/verify_bigdl_import.png" alt="image-20240221102252562" width=100%; />
@@ -170,7 +170,7 @@ Now let's play with a real LLM. We'll be using the [phi-1.5](https://huggingface
 ```python
 # Copy/Paste the contents to a new file demo.py
 import torch
-from bigdl.llm.transformers import AutoModelForCausalLM
+from ipex_llm.transformers import AutoModelForCausalLM
 from transformers import AutoTokenizer, GenerationConfig
 generation_config = GenerationConfig(use_cache = True)

docs/readthedocs/source/doc/LLM/Quickstart/install_windows_gpu.md (+3, -3)

@@ -130,7 +130,7 @@ You can verify if `bigdl-llm` is successfully installed by simply running a few
 * Step 5: Copy following code to Anaconda prompt **line by line** and press Enter **after copying each line**.
 ```python
 import torch
-from bigdl.llm.transformers import AutoModel,AutoModelForCausalLM
+from ipex_llm.transformers import AutoModel,AutoModelForCausalLM
 tensor_1 = torch.randn(1, 1, 40, 128).to('xpu')
 tensor_2 = torch.randn(1, 1, 128, 40).to('xpu')
 print(torch.matmul(tensor_1, tensor_2).size())
@@ -200,7 +200,7 @@ Now let's play with a real LLM. We'll be using the [Qwen-1.8B-Chat](https://hugg

 # Copy/Paste the contents to a new file demo.py
 import torch
-from bigdl.llm.transformers import AutoModelForCausalLM
+from ipex_llm.transformers import AutoModelForCausalLM
 from transformers import AutoTokenizer, GenerationConfig
 generation_config = GenerationConfig(use_cache=True)

@@ -260,7 +260,7 @@ Now let's play with a real LLM. We'll be using the [Qwen-1.8B-Chat](https://hugg

 # Copy/Paste the contents to a new file demo.py
 import torch
-from bigdl.llm.transformers import AutoModelForCausalLM
+from ipex_llm.transformers import AutoModelForCausalLM
 from transformers import GenerationConfig
 from modelscope import AutoTokenizer
 generation_config = GenerationConfig(use_cache=True)
