how can I get the best inference speed in my situation #9503
Unanswered · FranzKafkaYu asked this question in Q&A
Hello guys, I am running llama.cpp on my Android device, and every inference starts with the same pattern: the `prompt.prefix` and `prompt.suffix` are both constant and never change; the only thing that changes is the user input. Currently I am using the code below, which is taken from `simple.cpp` in the examples. Two questions here:

1. `llama_decode` costs 1000 ms+ each time.
2. The `input_prefix` and `input_suffix` are tokenized/decoded repeatedly on every call. Is there any way to reuse the output from tokenizing/decoding the `input_prefix` and `input_suffix`?

Hoping you guys can give me some advice, thanks!
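A minimal sketch of the prefix-reuse idea, assuming the 2024-era llama.cpp C API (`llama_tokenize`, `llama_decode`, `llama_batch_get_one`, `llama_kv_cache_seq_rm`); exact signatures differ between versions, and the `tokenize` helper, the wrapper functions, and the variable names are illustrative, not from the original post. The constant prefix is decoded once so its KV-cache entries stay in the context; each user turn then only clears and decodes the tokens that come after it. Note that the suffix follows the variable user input, so only its token ids, not its KV entries, can be reused.

```cpp
// Sketch: cache the constant prefix's KV state, then decode only the new
// tokens per request. API names follow the 2024-era llama.cpp C API; exact
// signatures vary between versions.
#include "llama.h"
#include <algorithm>
#include <string>
#include <vector>

// Illustrative helper (not from the original post): tokenize a string.
static std::vector<llama_token> tokenize(llama_context * ctx, const std::string & text, bool add_bos) {
    const llama_model * model = llama_get_model(ctx);
    std::vector<llama_token> tokens(text.size() + 2);
    const int n = llama_tokenize(model, text.c_str(), (int) text.size(),
                                 tokens.data(), (int) tokens.size(),
                                 add_bos, /*parse_special=*/true);
    tokens.resize(std::max(n, 0));
    return tokens;
}

// Call once at startup: decode the constant prefix so its KV-cache entries
// occupy positions [0, n_prefix) and never have to be recomputed.
static int decode_prefix(llama_context * ctx, const std::string & prompt_prefix) {
    std::vector<llama_token> prefix = tokenize(ctx, prompt_prefix, /*add_bos=*/true);
    llama_decode(ctx, llama_batch_get_one(prefix.data(), (int) prefix.size(), /*pos_0=*/0, /*seq_id=*/0));
    return (int) prefix.size();
}

// Call per request: drop everything after the prefix from the KV cache and
// decode only the new tokens (user input + suffix) starting at n_prefix.
static void decode_turn(llama_context * ctx, int n_prefix,
                        const std::string & user_input, const std::string & prompt_suffix) {
    llama_kv_cache_seq_rm(ctx, /*seq_id=*/0, /*p0=*/n_prefix, /*p1=*/-1);
    std::vector<llama_token> turn = tokenize(ctx, user_input + prompt_suffix, /*add_bos=*/false);
    llama_decode(ctx, llama_batch_get_one(turn.data(), (int) turn.size(), /*pos_0=*/n_prefix, /*seq_id=*/0));
    // ...sample and generate from here, as in simple.cpp...
}
```

After `decode_turn`, generation would continue from position `n_prefix + turn.size()`, sampling tokens exactly as `simple.cpp` does; only the new tokens pay the `llama_decode` cost on each request.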