-
I do see efforts to add this in the code base, but it doesn't seem to be ready yet. Can any maintainer comment on when binary and scalar quantization will be made available?
-
Hello. I would like to know more about this. Currently I'm using infinity to locally host a JinaAI embedding model (on a single Radeon 7900 XTX 24 GB). It would be great to be able to decrease the VRAM usage per batch by converting the model to INT8 or INT4 if possible. Does infinity currently support this? I would gladly test any suggestion on my local setup. Thanks!
-
I’m exploring embedding quantization strategies to optimize storage and computation efficiency while keeping accuracy high. Does anyone have experience applying embedding quantization with infinity? Based on https://sbert.net/examples/applications/embedding-quantization/README.html, I see two options:
- Binary quantization: converts the embeddings to a binary format, significantly reducing the data size but also altering the dimensionality.
- Scalar (int8) quantization: converts the data type from float32 to int8 while maintaining the original dimensionality (1024 in this case), which offers a balance between size reduction and precision retention.
Here is how to do binary quantization using sentence_transformers; it is quite straightforward:
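Something along these lines; the model name is just the 1024-dim example from the SBERT quantization docs, so swap in whatever you actually serve:

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.quantization import quantize_embeddings

# Stand-in model: the 1024-dim example from the SBERT quantization docs.
model = SentenceTransformer("mixedbread-ai/mxbai-embed-large-v1")

embeddings = model.encode(["I am driving to the lake.", "It is a beautiful day."])

# Binary quantization: each float32 dimension becomes a single bit, and the
# bits are packed 8-per-byte into int8, so 1024 floats -> 128 int8 values.
binary_embeddings = quantize_embeddings(embeddings, precision="binary")
print(binary_embeddings.shape, binary_embeddings.dtype)  # (2, 128) int8
```

Note the dimensionality change mentioned above: retrieval over binary embeddings is then done with Hamming distance rather than cosine similarity.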
And here is the implementation for Scalar (int8) Quantization:
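Again a sketch; the four calibration sentences below just stand in for a real calibration corpus (see the recommendations right after):

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.quantization import quantize_embeddings

model = SentenceTransformer("mixedbread-ai/mxbai-embed-large-v1")

# Toy calibration set; in practice this should be a large corpus that is
# representative of the data you will embed.
calibration_embeddings = model.encode([
    "The lake is calm today.",
    "Storms are forecast for tonight.",
    "I am driving to the lake.",
    "It is a beautiful day.",
])

embeddings = model.encode(["I am driving to the lake."])

# Scalar quantization: per-dimension min/max ranges are derived from the
# calibration embeddings, and each float32 value is mapped onto int8 within
# its dimension's range, so the dimensionality (1024 here) is preserved.
int8_embeddings = quantize_embeddings(
    embeddings,
    precision="int8",
    calibration_embeddings=calibration_embeddings,
)
print(int8_embeddings.shape, int8_embeddings.dtype)  # (1, 1024) int8
```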
For scalar quant to produce good results it is recommended to provide it with either:
a) a large set of embeddings to quantize all at once, or
b) min and max ranges for each of the embedding dimensions, or
c) a large calibration dataset of embeddings from which the min and max ranges can be computed.
As our embedding API here produces embeddings “on the fly”, usually for just one text or a small batch, option a) seems out. Option c) is also out, as we want our embedding API to stay flexible (handle various datasets). Which leaves us with option b).
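For what it's worth, quantize_embeddings already supports option b) directly via its ranges parameter: a (2, dim) array with per-dimension minimums in row 0 and maximums in row 1. So one approach would be to compute the ranges once offline and load them at serving time; the file name and helper below are hypothetical:

```python
import numpy as np
from sentence_transformers.quantization import quantize_embeddings

# One-off, offline (pseudocode):
#   calibration_embeddings = model.encode(large_corpus)        # (N, 1024)
#   ranges = np.vstack([calibration_embeddings.min(axis=0),
#                       calibration_embeddings.max(axis=0)])   # (2, 1024)
#   np.save("embedding_ranges.npy", ranges)

# At serving time: fixed ranges make quantization deterministic even for a
# single embedding, which is what an on-the-fly API needs.
ranges = np.load("embedding_ranges.npy")  # hypothetical precomputed file

def quantize_single(embedding: np.ndarray) -> np.ndarray:
    """Quantize one (1, dim) float32 embedding to int8 using fixed ranges."""
    return quantize_embeddings(embedding, precision="int8", ranges=ranges)
```

The ranges still have to come from somewhere, of course, so b) is really c) done once per model rather than per request or per dataset.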
Does anyone have an idea how to implement this with infinity? Or a better idea for quantization altogether?