Replies: 4 comments
-
Sounds more like a binding problem. The C++ API (in llama.h) only ever predicts a single token at a time anyway; multi-token generation is a loop handled by application code.
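To make that concrete, here is a minimal Python sketch of what that application-side loop looks like. `predict_next_token` and `is_end_of_sequence` are hypothetical stand-ins (passed in as callables) for the evaluate-and-sample step and EOS check that llama.h exposes; they are not actual library calls.

```python
def generate(predict_next_token, is_end_of_sequence, prompt_tokens, max_tokens=256):
    """Sketch of the caller-side loop: the library only ever yields one token
    per call, so stopping generation is just a matter of leaving this loop."""
    tokens = list(prompt_tokens)
    for _ in range(max_tokens):
        next_token = predict_next_token(tokens)  # one single-token prediction
        if is_end_of_sequence(next_token):       # model signalled end of text
            break
        tokens.append(next_token)
        yield next_token                         # caller can stop iterating at any point
```

The relevant point for the original question: the stop condition lives entirely in the caller, so an API server can abort a generation simply by not asking for the next token.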
-
AFAIK, if you are willing to "stream", then stopping the next-token requests will indeed stop generation; i.e. if you simply don't request the next token, no additional processing is done.
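For example, with llama-cpp-python's high-level streaming interface (the model path below is illustrative, and the chunk layout is the OpenAI-style completion schema the binding returns), "not requesting the next token" is just breaking out of the loop:

```python
from llama_cpp import Llama

llm = Llama(model_path="./models/7B/ggml-model.bin")  # illustrative path

stream = llm("Q: Name the planets in the solar system. A:",
             max_tokens=256, stream=True)

collected = []
for chunk in stream:
    collected.append(chunk["choices"][0]["text"])
    # In a real API server this condition would be "the client disconnected";
    # breaking out here means no further tokens are requested or evaluated.
    if len(collected) >= 32:
        break

print("".join(collected))
```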
-
Thanks; I'll look into llama-cpp-python's bindings to the API then, and see what I can do from there.
-
@spirilis I think the llama-cpp-python binding works differently in that it doesn't call the server example, which has the "next-token" behavior. I looked at https://github.com/keldenl/gpt-llama.cpp/blob/master/routes/chatRoutes.js and, for example, they kill the spawned "main" process when the connection closes. So each project might handle this differently.
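A rough Python sketch of that process-per-request pattern (the flags, model path, and `write_chunk` callback are illustrative assumptions, not an exact reproduction of what gpt-llama.cpp does):

```python
import subprocess

def run_one_request(prompt: str, write_chunk) -> None:
    """Spawn llama.cpp's `main` example for a single request and kill it if the
    client goes away. write_chunk is whatever callable forwards bytes to the
    HTTP response; flags and model path are illustrative."""
    proc = subprocess.Popen(
        ["./main", "-m", "./models/7B/ggml-model.bin", "-p", prompt],
        stdout=subprocess.PIPE,
    )
    try:
        for line in proc.stdout:
            write_chunk(line)
    except BrokenPipeError:
        pass              # client hung up mid-stream
    finally:
        proc.terminate()  # stop inference immediately, roughly what gpt-llama.cpp does on closure
        proc.wait()
```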
-
I asked this as an issue over in the llama-cpp-python project - abetlen/llama-cpp-python#313 - one minor issue I see is that when running llama.cpp as a library behind an API server, if a client decides to terminate the connection, the model seems to keep running.
This could be handled better if the llama.cpp backend code supported something like SIGHUP (hangup - a perfect analogy here, since the client "hung up") and returned a null result, thereby allowing the API server to serve another query immediately afterward.
Any thoughts? Does llama.cpp already support this in another manner, and we just need to find/implement it in the Python bindings?
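For what it's worth, a dedicated signal may not be strictly necessary at the library level: since generation is pulled one token at a time, a streaming handler that simply stops pulling tokens when the client disconnects gets much the same effect. Below is a minimal, hedged sketch using llama-cpp-python; the model path is illustrative, and the disconnect detection relies on the WSGI/ASGI server calling `close()` on an abandoned response generator, which is framework-dependent behaviour.

```python
from llama_cpp import Llama

llm = Llama(model_path="./models/7B/ggml-model.bin")  # illustrative path

def stream_completion(prompt: str):
    """Generator handed to the web framework's streaming response.
    When the client disconnects, most servers close() the generator, which
    raises GeneratorExit at the yield point and halts further token evaluation."""
    try:
        for chunk in llm(prompt, max_tokens=512, stream=True):
            yield chunk["choices"][0]["text"]
    except GeneratorExit:
        # Client "hung up": stop asking for tokens, freeing the model to serve
        # the next query without needing a SIGHUP-style mechanism.
        raise
```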