
blogs/a-note-on-deepseek-r1 #8

Open
utterances-bot opened this issue Feb 3, 2025 · 41 comments

Comments

@utterances-bot
Copy link

A Note on DeepSeek R1 Deployment

This is a (minimal) note on deploying DeepSeek R1 671B (the full version without distillation) locally with olla...

https://snowkylin.github.io/blogs/a-note-on-deepseek-r1.html

Copy link

SlavikCA commented Feb 3, 2025

Thank you for the guide.
I'm running it now, but see these warnings in the Ollama log:

WARN source=server.go:216 msg="flash attention enabled but not supported by model"

WARN source=server.go:234 msg="quantized kv cache requested but flash attention disabled" type=q8_0

So it looks like the DeepSeek model doesn't support flash attention.
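
To silence the warnings, I'll probably just turn both options off before restarting the server (assuming they come from the OLLAMA_FLASH_ATTENTION and OLLAMA_KV_CACHE_TYPE settings that show up in the server logs later in this thread):

# flash attention isn't supported by this model, and the q8_0 KV cache only applies when flash attention is active
export OLLAMA_FLASH_ATTENTION=0
unset OLLAMA_KV_CACHE_TYPE
ollama serve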

Copy link

Nice article, now following!

Copy link

Great work, thanks for the numbers. I'm planning to build a dual EPYC 9654 box to run the full Q8 version, but from what I've seen so far that setup tops out at only 8-9 tps. If hybrid inference can speed it up, adding four 2080 Ti cards should make it usable.

Copy link

lcgogo commented Feb 6, 2025

From https://huggingface.co/unsloth/DeepSeek-R1-GGUF/tree/main/DeepSeek-R1-UD-IQ1_M I downloaded three or four files ending in .gguf, so which one does /home/snowkylin/DeepSeek-R1-UD-IQ1_M.gguf refer to exactly?

Copy link

lcgogo commented Feb 6, 2025

Ignore that, I missed that the Note already covers the merge procedure.

Copy link

lcgogo commented Feb 6, 2025

llama-gguf-split --merge DeepSeek-R1-UD-IQ1_M-00001-of-00004.gguf DeepSeek-R1-UD-IQ1_S.gguf

This should be:
llama-gguf-split --merge DeepSeek-R1-UD-IQ1_M-00001-of-00004.gguf DeepSeek-R1-UD-IQ1_M.gguf

Copy link
Owner

@lcgogo Thanks for pointing it out, it's been fixed.

Copy link

Why is my DeepSeek-R1-UD-IQ1_M version so dumb? There's no thinking process, and the output is wrong.

How many ‘r’s is in the word ‘strawberry’?

The word "strawberry" contains 2 'r's.

Copy link
Owner

@KKIverson 1. Please check that the model was downloaded from https://huggingface.co/unsloth/DeepSeek-R1-GGUF/tree/main/DeepSeek-R1-UD-IQ1_M (about 158 GB in total; on a typical connection the download takes tens of hours). 2. Check that the TEMPLATE is correct. 3. Generate a few more times and compare. 4. Try other questions.
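
For point 1, something like the following can verify or resume the download (the local path here is just an example):

# resume/verify the IQ1_M shards from Hugging Face
huggingface-cli download unsloth/DeepSeek-R1-GGUF \
    --include "DeepSeek-R1-UD-IQ1_M/*" \
    --local-dir /home/snowkylin/DeepSeek-R1-GGUF
# the downloaded shards should add up to roughly 158 GB
du -ch /home/snowkylin/DeepSeek-R1-GGUF/DeepSeek-R1-UD-IQ1_M/*.gguf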

@KKIverson
Copy link

@snowkylin Thanks! After running ollama run again it seems to be working properly.

Copy link

One more question: the post says DeepSeek-R1-Q4_K_M runs at 2-4 tokens/s. Is that with GPU hybrid inference or with pure CPU inference? If it's the hybrid speed, does the GPU give the same kind of speedup as it does for the dynamically quantized versions, or does the limited PCIe bandwidth actually drag inference down?

Copy link

nilecui commented Feb 7, 2025

How fast is it for everyone? It loads about 50 GB onto the card but the GPU isn't well utilized. How can I use the GPU more efficiently? With long prompts I only get 1-2 tokens/s.

Copy link

lcgogo commented Feb 7, 2025

I gave it a try on an A100, running DeepSeek-R1-UD-IQ1_M.
PARAMETER num_gpu 28 # fails to start with OOM; changing it to 14 works, but inference is very slow

# ollama ps
NAME                           ID              SIZE      PROCESSOR          UNTIL
DeepSeek-R1-UD-IQ1_M:latest    ffd382f7c6af    213 GB    68%/32% CPU/GPU    Forever
llama3.2:3b                    a80c4f17acd5    4.0 GB    100% GPU           Forever

The result is very slow; inference takes a long time (17m with PARAMETER num_gpu 14).

This method confirms that **9.8** is greater than **9.11**.

total duration:       16m56.764149465s
load duration:        20.737375ms
prompt eval count:    17 token(s)
prompt eval duration: 4.992s
prompt eval rate:     3.41 tokens/s
eval count:           1449 token(s)
eval duration:        16m51.75s
eval rate:            1.43 tokens/s
>>> /bye

The odd thing is that I have 2 A100s, but ollama only used one card's VRAM. PARAMETER num_gpu 14 uses about 50 GB, and PARAMETER num_gpu 20 about 70 GB.

# nvidia-smi
Fri Feb  7 09:57:08 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.86.15              Driver Version: 570.86.15      CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A100-SXM4-80GB          Off |   00000000:00:05.0 Off |                    0 |
| N/A   34C    P0             71W /  400W |    3741MiB /  81920MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA A100-SXM4-80GB          Off |   00000000:00:06.0 Off |                    0 |
| N/A   32C    P0             72W /  400W |   70553MiB /  81920MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A           27168      C   ...a_v12_avx/ollama_llama_server       3732MiB |
|    1   N/A  N/A           74796      C   ...a_v12_avx/ollama_llama_server      70544MiB |
+-----------------------------------------------------------------------------------------+


@KKIverson
Copy link

How fast is it for everyone? It loads about 50 GB onto the card but the GPU isn't well utilized. How can I use the GPU more efficiently? With long prompts I only get 1-2 tokens/s.

@nilecui We're running DeepSeek-R1-UD-IQ1_M.gguf on two A800s, configured as follows:
PARAMETER num_gpu 48
PARAMETER num_ctx 4096

I originally wanted to set the context length to 8k or 16k, but that failed with a KV cache out of memory error, so I settled on 4k. Each card uses 60-70+ GB. Once it's serving, the chat behaviour is erratic: sometimes the think process is printed, sometimes not, and the speed varies between slow and slightly faster.
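
For reference, the Modelfile is shaped roughly like this (the FROM path and the TEMPLATE line below are illustrative, not copied verbatim from the blog's DeepSeekQ1_Modelfile):

FROM /path/to/DeepSeek-R1-UD-IQ1_M.gguf
PARAMETER num_gpu 48
PARAMETER num_ctx 4096
PARAMETER temperature 0.6
TEMPLATE """<|User|>{{ .Prompt }}<|Assistant|>"""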

Copy link
Owner

The post says DeepSeek-R1-Q4_K_M runs at 2-4 tokens/s. Is that with GPU hybrid inference or with pure CPU inference?

@canghaimeng That's the speed with GPU hybrid inference. Because the test machine's RAM is limited (384 GB), the 4-bit model can't be run with pure CPU inference on it, so I couldn't actually compare pure-CPU and hybrid speeds at 4-bit. My inclination, though, is that the GPU does provide some speedup.

Copy link

qfb594 commented Feb 7, 2025

Why does my model fail to run?
Could not respond to message.
Your Ollama instance could not be reached or is not responding. Please make sure it is running the API server and your connection information is correct in AnythingLLM.

Copy link
Owner

@qfb594 Are you using AnythingLLM? I haven't used it yet, so I'm not sure what the problem is. Maybe try ollama run your-model-name in a console to see whether ollama itself runs correctly.
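
You could also hit the endpoint AnythingLLM talks to directly, for example (the model name is whatever you created with ollama create):

curl http://127.0.0.1:11434/api/generate -d '{"model": "DeepSeek-R1-UD-IQ1_M", "prompt": "hello", "stream": false}'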

Copy link

qfb594 commented Feb 8, 2025

Running the model fails with: Error: Post "http://127.0.0.1:11434/api/generate": read tcp 127.0.0.1:58813->127.0.0.1:11434: wsarecv: An existing connection was forcibly closed by the remote host. However, the Qwen models run fine. Any idea how to fix this?

Copy link

CHN-STUDENT commented Feb 8, 2025

Thanks everyone. I've borrowed 8 L20 cards and am going to try a deployment. llama.cpp is already set up, and I'm about to download UD-IQ2_XXS to play with. For faster downloads inside China, do I need to switch to the hf-mirror source? Any pointers would be appreciated, I'm just getting started.

UPDATE: Here are my deployment notes, for the record and for reference. Thanks again to @snowkylin for the pioneering work, which I drew on heavily and found very inspiring!

RunDeepSeek-R1-UD-IQ2_XXS_OnUbuntu24.04.md

Copy link

1 + 1 equals 2. That is the result of basic addition.

DeepSeek-R1-IQ1_M_671b:latest
Use Ctrl + d or /bye to exit.
How many ‘r’s are in the word ‘strawberry’?
%>-+><|begin▁of▁sentence|>+5@A

How many ‘r’s are in the word ‘strawberry’?
2<|begin▁of▁sentence|>-!/=7?<|▁pad▁|>C#(2<|begin▁of▁sentence|><|▁pad▁|>98)8(#!B=%!48-AE&:?0@AD

The output is a pile of garbage characters. Does anyone know what's going on?

Copy link

When the first question's output doesn't close properly, the second question comes out garbled.

Copy link

root@196cbfc9c720:/deepseek# ollama run DeepSeek-R1-UD-IQ1_S:latest
Error: llama runner process has terminated: GGML_ASSERT(hparams.n_expert <= LLAMA_MAX_EXPERTS) failed
I ran into this problem.

Copy link

Cool. I deployed DeepSeek R1 on a single GPU, an AMD Instinct MI300X, following this post. Sharing my experience at https://medium.com/@alexhe.amd/deploy-deepseek-r1-in-one-gpu-amd-instinct-mi300x-7a9abeb85f78

Copy link

Hi, I'm running DeepSeek-R1-UD-IQ1_S on an 8x 4090 machine with num_gpu 61, and I get about 14 tokens per second. But why is CPU utilization so high, constantly over 2000%? Is there any way to fix that?
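
One thing that might be worth trying (just a guess, not verified): in hybrid inference the layers that don't fit on the GPUs still run on the CPU, so heavy CPU use is expected, but the number of CPU threads can be capped in the Modelfile:

PARAMETER num_thread 16    # illustrative value; trades some CPU-side speed for lower overall load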

Copy link

CHN-STUDENT commented Feb 12, 2025

Current status: I keep running into context shift is disabled, still investigating.
I tried llama-server and it feels far faster than the ollama + openwebui setup; I'll look around for other frontends.

# Environment variables
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
export GGML_CUDA_ENABLE_UNIFIED_MEMORY=1

llama-server \
    --model /data/models/deepseek/DeepSeek-R1-GGUF/DeepSeek-R1-UD-IQ2_XXS/DeepSeek-R1-UD-IQ2_XXS.gguf \
    --cache-type-k q4_0 \
    --threads 16 \
    --n-gpu-layers 62 \
    --temp 0.6 \
    --ctx-size 8192 \
    --prio 2 \
    --seed 3407 \
    --host 0.0.0.0 \
    --port 8088 \
    --tensor-split 0.125,0.125,0.125,0.125,0.125,0.125,0.125,0.125 \
    --mlock \
    --flash-attn \
    --np 4
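
Once it's up, a quick way to test it is llama-server's OpenAI-compatible chat endpoint:

curl http://localhost:8088/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"messages": [{"role": "user", "content": "hello"}], "temperature": 0.6}'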

Copy link

ollama create DeepSeek-R1-UD-IQ1_M -f DeepSeekQ1_Modelfile
When creating the model with this, how long did everyone wait on this step? I've been stuck at gathering model components the whole time.

@LKAMING97
Copy link

1 + 1 equals 2. That is the result of basic addition.

DeepSeek-R1-IQ1_M_671b:latest
Use Ctrl + d or /bye to exit.
How many ‘r’s are in the word ‘strawberry’?
%>-+><|begin▁of▁sentence|>+5@A

How many ‘r’s are in the word ‘strawberry’?
2<|begin▁of▁sentence|>-!/=7?<|▁pad▁|>C#(2<|begin▁of▁sentence|><|▁pad▁|>98)8(#!B=%!48-AE&:?0@AD

The output is a pile of garbage characters. Does anyone know what's going on?

Hi, I also deployed R1 on a server and the replies come out garbled. How did you solve it?

Copy link

What is the ollama version? 0.5.7?

Copy link
Owner

What is the ollama version? 0.5.7?

@zengqingfu1442 yes, it is 0.5.7

@lcgogo
Copy link

lcgogo commented Feb 13, 2025

ollama create DeepSeek-R1-UD-IQ1_M -f DeepSeekQ1_Modelfile: when creating the model, how long did everyone wait on this? I've been stuck at gathering model components the whole time.

Depends on the machine. This step seems fairly heavy on CPU and disk I/O; I didn't watch it closely, but it took roughly 10+ minutes.
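
If you want to confirm it's still making progress, something like this should work (the path below is the default models directory for a user install; adjust it if OLLAMA_MODELS points elsewhere):

# watch the ollama models directory grow while the GGUF is copied in
watch -n 10 du -sh ~/.ollama/models
# or watch disk throughput (iostat is in the sysstat package)
iostat -x 5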

@zengqingfu1442
Copy link

zengqingfu1442 commented Feb 13, 2025

I use ollama to run DeepSeek-R1-Q4_K_M. Why does ollama say the model architecture is deepseek2? DeepSeek-V2 has 61 layers, but DeepSeek-R1 should have 32 layers, right?

>>> /show info
  Model
    architecture        deepseek2
    parameters          671.0B
    context length      163840
    embedding length    7168
    quantization        Q4_K_M

  Parameters
    num_ctx        2048
    num_gpu        12
    num_predict    8192
    temperature    0.6

Copy link

CHN-STUDENT commented Feb 14, 2025

Currently using llama.cpp server as the backend: tests show about 5 tokens/s generation on 8 * L20, no concurrency testing yet.
llama.cpp feels much faster than ollama, and memory usage also seems a bit lower.
I haven't seen context shift is disabled here, though maybe my testing just isn't thorough enough.

# Environment variables
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
export GGML_CUDA_ENABLE_UNIFIED_MEMORY=1

llama-server \
    --model /data/models/deepseek/DeepSeek-R1-GGUF/DeepSeek-R1-UD-IQ2_XXS/DeepSeek-R1-UD-IQ2_XXS.gguf \
    --cache-type-k q4_0 \
    --threads 16 \
    --n-gpu-layers 62 \
    --temp 0.6 \
    --ctx-size 8192 \
    --prio 2 \
    --seed 3407 \
    --host 0.0.0.0 \
    --port 8088 \
    --tensor-split 0.125,0.125,0.125,0.125,0.125,0.125,0.125,0.125 \
    --mlock \
    --flash-attn \
    --np 4

Copy link

@lcgogo The idle card is probably because there was only one context during your test. On my side, also with two A100s, ollama uses both cards when multiple terminals access it. Also, with num_gpu set to 61 I get about 14 tokens/s.

@zengqingfu1442
Copy link

Currently using llama.cpp server as the backend: about 5 tokens/s generation on 8 * L20, no concurrency testing. llama.cpp feels much faster than ollama and seems to use a bit less memory; I haven't seen context shift is disabled, though my testing may not be thorough.

Doesn't ollama use llama.cpp as its inference engine under the hood? Shouldn't the performance be about the same?

@CHN-STUDENT
Copy link

Currently using llama.cpp server as the backend: about 5 tokens/s generation on 8 * L20, no concurrency testing. llama.cpp feels much faster than ollama and seems to use a bit less memory; I haven't seen context shift is disabled, though my testing may not be thorough.

Doesn't ollama use llama.cpp as its inference engine under the hood? Shouldn't the performance be about the same?

Right, but openwebui calling ollama feels quite slow to me; it doesn't respond as quickly as llama.cpp. I'm not going to benchmark it further; I'd rather get back to building an enterprise RAG application.

Copy link

Beginner question: among the unsloth models, are the ones with -UD- in the name the dynamically quantized ones? And for a server with 512 GB of RAM, which version would you recommend?

Copy link

wwl5600 commented Feb 21, 2025

I'm running DeepSeek-R1-UD-IQ1_M (671B, 1.73-bit dynamic quantization) on a machine with two 4090 24G cards. The ollama parameters are set, but the GPUs aren't used at runtime and I get under 1 token per second. Any suggestions?

total duration:       10m8.084566633s
load duration:        53.840382ms
prompt eval count:    27 token(s)
prompt eval duration: 1m32.39s
prompt eval rate:     0.29 tokens/s
eval count:           409 token(s)
eval duration:        8m35.62s
eval rate:            0.79 tokens/s
(base) [root@gromacs ~]# sudo systemctl status ollama
● ollama.service - Ollama AI Server
   Loaded: loaded (/etc/systemd/system/ollama.service; disabled; vendor preset: disabled)
   Active: active (running) since Thu 2025-02-20 21:33:50 EST; 945ms ago
 Main PID: 113334 (ollama)
    Tasks: 13 (limit: 1226801)
   Memory: 29.5M
   CGroup: /system.slice/ollama.service
           └─113334 /usr/bin/ollama serve

2月 20 21:33:50 gromacs systemd[1]: Started Ollama AI Server.
2月 20 21:33:50 gromacs ollama[113334]: 2025/02/20 21:33:50 routes.go:1186: INFO server config env="map[CUDA_VISIBLE_DEVICES:GPU-1ae7740e-ee9e-2395-8f99-0614015a6eea,GPU-b1e6bf42-86e2-a44c-8604-0504732a4a59 GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:true OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://0.0.0.0:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:10m0s OLLAMA_KV_CACHE_TYPE: OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:1h0m0s OLLAMA_MAX_LOADED_MODELS:1 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:/home/DeepSeek/Models/ OLLAMA_MULTIUSER_CACHE:false OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:0 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://* vscode-webview://*] OLLAMA_SCHED_SPREAD:false ROCR_VISIBLE_DEVICES: http_proxy: https_proxy: no_proxy:]"
(gmxMMPBSA) [root@gromacs ~]# ollama ps
NAME                           ID              SIZE      PROCESSOR          UNTIL
DeepSeek-R1-UD-IQ1_S:latest    559ae1a9bd0c    182 GB    78%/22% CPU/GPU    6 minutes from now
(gmxMMPBSA) [root@gromacs ~]# nvidia-smi
Thu Feb 20 22:35:55 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.127.05             Driver Version: 550.127.05     CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 4090        Off |   00000000:3B:00.0 Off |                  Off |
|  0%   49C    P8             28W /  450W |     165MiB /  24564MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA GeForce RTX 4090        Off |   00000000:AF:00.0 Off |                  Off |
| 30%   31C    P8             29W /  450W |      15MiB /  24564MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A      3924      G   /usr/libexec/Xorg                              57MiB |
|    0   N/A  N/A      4361      G   /usr/bin/gnome-shell                           87MiB |
|    1   N/A  N/A      3924      G   /usr/libexec/Xorg                               4MiB |
+-----------------------------------------------------------------------------------------+

@rodickmini
Copy link

@wwl5600 Try adding these to the environment:

export OLLAMA_KEEP_ALIVE=-1
export OLLAMA_SCHED_SPREAD=1

so that Ollama fills the GPUs as much as possible, then run the model with ollama run again.
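
One caveat (an assumption based on the systemctl output above, not a confirmed fix): ollama appears to be running as a systemd service there, and variables exported in a shell won't reach that process. They would need to be set on the unit itself, for example:

sudo systemctl edit ollama
# in the override file, add:
# [Service]
# Environment="OLLAMA_SCHED_SPREAD=1"
# Environment="OLLAMA_KEEP_ALIVE=-1"
sudo systemctl daemon-reload
sudo systemctl restart ollama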

@wwl5600
Copy link

wwl5600 commented Feb 21, 2025

@rodickmini I tried changing both environment variables, but it still doesn't work and the GPUs still aren't used. Thanks anyway!

@CHN-STUDENT
Copy link

CHN-STUDENT commented Feb 22, 2025 via email
