(Experimental) Example of running GGUF models using llama.cpp C++ API on NPU

In this directory, you will find a simple C++ example of how to run GGUF models on Intel NPUs using the llama.cpp C++ API. See the table below for verified models.

Verified Models

Model         Model link
LLaMA 3.2     meta-llama/Llama-3.2-3B-Instruct
DeepSeek-R1   deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B, deepseek-ai/DeepSeek-R1-Distill-Qwen-7B

Please refer to Quickstart for details about verified platforms.

0. Prerequisites

For ipex-llm NPU support, please refer to Quickstart for details about the required preparations.

1. Install & Runtime Configurations

1.1 Installation on Windows

We suggest using conda to manage the environment:

conda create -n llm python=3.11
conda activate llm

:: for building the example
pip install cmake

:: install ipex-llm with 'npu' option
pip install --pre --upgrade ipex-llm[npu]

Please refer to Quickstart for more details about ipex-llm installation on Intel NPU.
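If you want to confirm that the package is visible in the active environment, a quick pip query serves as an optional sanity check:

:: optional sanity check inside the activated llm environment
pip show ipex-llm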

1.2 Runtime Configurations

Please refer to Quickstart for how to set the environment variables required for your device.
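The snippet below only illustrates the pattern in cmd; the variable name is a placeholder, and the actual names and values for your device are listed in the Quickstart:

:: placeholder name only -- use the variables listed in the Quickstart for your device
:: note that "set" only affects the current cmd session, so configure the same window
:: in which you will later build and run the example
set IPEX_LLM_NPU_EXAMPLE_FLAG=1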

2. Build C++ Example simple

  • You can run the CMake script below in cmd to build simple yourself; don't forget to replace <CONDA_ENV_DIR> below with your own path. A quick check of the build output is shown after this list.
:: under current directory
:: please replace below conda env dir with your own path
set CONDA_ENV_DIR=C:\Users\arda\miniforge3\envs\llm\Lib\site-packages
mkdir build
cd build
cmake ..
cmake --build . --config Release -j
cd Release
  • You can also directly use our released simple.exe, which has the same usage as this example simple.cpp.
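If you built simple yourself, you can confirm the binary was produced; this assumes the default layout from the script above, where the Release binaries land under build\Release:

:: run from the example's top-level directory
dir build\Release\simple.exe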

3. Run simple

Once simple is built, you can run the GGUF model:

# Run simple text completion
simple.exe -m <gguf_model_path> -n 64 -p "Once upon a time,"
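The same flags work for any of the verified models. For example, with one of the DeepSeek-R1 distilled GGUF files (the file name below is only illustrative; pass the path of the GGUF file you actually downloaded):

# illustrative model path -- replace with your own GGUF file
simple.exe -m DeepSeek-R1-Distill-Qwen-7B-Q4_K_M.gguf -n 64 -p "Why is the sky blue?"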

Note:

Warmup on first run: When running certain GGUF models on the NPU for the first time, you might notice delays of up to several minutes before the first token is generated. This delay occurs because of blob compilation.