Difference in different quantization methods #2094
-
Hello,
Replies: 4 comments 10 replies
-
The llama.cpp team does some "perplexity" testing, which approximately measures output quality; lower scores are better...
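To make the metric concrete, here is a minimal sketch of how perplexity is typically computed from per-token log-probabilities: it is the exponential of the mean negative log-likelihood. The function name and toy probabilities below are illustrative, not llama.cpp's actual implementation.

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp(mean negative log-likelihood) over the tokens."""
    nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(nll)

# Toy natural-log probabilities, for illustration only.
logprobs = [math.log(0.5), math.log(0.25), math.log(0.5)]
print(perplexity(logprobs))
```

A model that assigned probability 1.0 to every observed token would score a perplexity of exactly 1, which is the theoretical floor.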
-
The ppl column is perplexity increase relative to unquantized.
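In other words, the column is a delta against the fp16 baseline. A small sketch of that calculation (the numbers below are hypothetical, not real measurements):

```python
def ppl_delta(ppl_quant, ppl_fp16):
    """Return (absolute increase, percent increase) vs. the unquantized baseline."""
    delta = ppl_quant - ppl_fp16
    return delta, 100.0 * delta / ppl_fp16

# Hypothetical perplexities, for illustration only.
print(ppl_delta(6.0, 5.9))
```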
-
Brother, how do I make it pull information in real time?
-
Why are q8, f16 and f32 not recommended, even though they have low quality loss?
K-quantizations should be better, at the same file size, than the other ones. S, M, L mean small, medium, large :)
more details can be found here: #1684
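For reference, producing the S/M/L K-quant variants is just a matter of passing the type name to the `quantize` tool. The paths below are hypothetical, and this sketch only echoes the commands rather than running them, so you can check them before committing to a long quantization run:

```shell
# Hypothetical model path; adjust to your setup.
SRC=models/7B/ggml-model-f16.bin

# Build and print one quantize command per K-quant variant (echoed, not executed).
for Q in q4_K_S q4_K_M q5_K_M; do
  OUT="${SRC%f16.bin}${Q}.bin"
  echo "./quantize $SRC $OUT $Q"
done
```

Dropping the `echo` runs the commands for real once you are happy with them.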