-
Hello, I did not find anything related to this question online, and I am surprised by that, so maybe this question is a bit dumb without me realizing it. The issue is with how the server example streams a response, more specifically the number of updates it provides about new tokens. Currently, on any UI I use with a llama.cpp server backend, the streamed response arrives in chunks of 50 or so tokens. Why not send an update on every token, like any other API? If I want to chat with a big model with slow inference speed, I wait a few minutes before receiving anything, which is surely not intentional, and when something does arrive, it's a group of around 50 tokens. Not practical :/ The llama-cli example writes token by token just fine in a terminal, and since the server example is based on the same inference code, I wonder what I am doing wrong, or not doing, to end up with a half-streamed response. Any help is appreciated.
Replies: 2 comments
-
On my end, manually starting the llama.cpp server from the command line, the output is updated on each token, with the stream sent in the body of the response to the POST request.
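An easy way to confirm this is to talk to the server directly, bypassing any UI or reverse proxy in between. Below is a minimal sketch, assuming the server listens on the default 127.0.0.1:8080 and the request goes to the /completion endpoint with "stream": true; the prompt and token count are just placeholders.

```python
# Minimal sketch: stream tokens straight from a locally running llama.cpp
# server (assumed at 127.0.0.1:8080, /completion endpoint, "stream": true).
import json
import requests

resp = requests.post(
    "http://127.0.0.1:8080/completion",
    json={"prompt": "Building a website can be done in", "n_predict": 64, "stream": True},
    stream=True,  # tell requests not to buffer the whole response body
    timeout=600,
)

for line in resp.iter_lines():
    # The server emits server-sent-event style lines: b'data: {"content": "...", ...}'
    if line.startswith(b"data: "):
        chunk = json.loads(line[len(b"data: "):])
        print(chunk.get("content", ""), end="", flush=True)
        if chunk.get("stop"):
            break
print()
```

If tokens appear one by one here but in large bursts through your UI, the buffering is happening somewhere between the server and the client, not in llama.cpp itself.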
-
I did a bit more research and was able to locate the issue: it was nginx's fault. Adding the following to my configuration solved my issue:
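A sketch of the relevant directives, assuming the llama.cpp server is reverse-proxied under a /llama/ location to 127.0.0.1:8080 (the path and upstream address are placeholders for your own setup); the essential part is turning off proxy buffering so nginx forwards each chunk as it arrives:

```nginx
# Assumed layout: nginx reverse-proxies the llama.cpp server running on
# 127.0.0.1:8080 under /llama/. Adjust the location and upstream to match
# your setup; the key directive is proxy_buffering off.
location /llama/ {
    proxy_pass http://127.0.0.1:8080/;

    # Pass streamed chunks through as they arrive instead of
    # accumulating them in nginx's proxy buffers.
    proxy_buffering off;
    proxy_cache off;

    # Keep a single long-lived connection open for the whole stream.
    proxy_http_version 1.1;
    proxy_set_header Connection "";
    proxy_read_timeout 3600s;
}
```

With buffering left on, nginx collects the upstream response into its buffers before flushing it to the client, which is exactly the "chunks of ~50 tokens" behavior described above.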