-
I do see efforts to add this in the code base, but it doesn't seem to be ready yet. Can any maintainer comment on when binary and scalar quantization will be made available?
-
Hello. I would like to know more about this. Currently I'm using infinity to locally host a JinaAI embedding model (on a single Radeon 7900 XTX 24 GB). It would be great to be able to decrease the VRAM usage per batch by converting the model to INT8 or INT4 if possible. Does infinity currently support this? I would gladly test any suggestion on my local setup. Thanks!
-
I’m exploring embedding quantization strategies to optimize storage and computation efficiency while keeping accuracy high. Does anyone have experience applying embedding quantization with infinity? Based on https://sbert.net/examples/applications/embedding-quantization/README.html, I see two options:
- Binary quantization: converts the embeddings to a binary format, significantly reducing the data size but also altering the dimensionality.
- Scalar (int8) quantization: converts the data type from float32 to int8 while maintaining the original dimensionality (1024 in this case), which offers a balance between size reduction and precision retention.
Here is how to do binary quantization using sentence_transformers; it is quite straightforward:
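Something along these lines; the model name is just the 1024-dim example from the SBERT quantization docs, so swap in whatever you actually serve:

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.quantization import quantize_embeddings

# Stand-in model: the 1024-dim example from the SBERT quantization docs.
model = SentenceTransformer("mixedbread-ai/mxbai-embed-large-v1")

embeddings = model.encode(["I am driving to the lake.", "It is a beautiful day."])

# Binary quantization: each float32 dimension becomes a single bit, and the
# bits are packed 8-per-byte into int8, so 1024 floats -> 128 int8 values.
binary_embeddings = quantize_embeddings(embeddings, precision="binary")
print(binary_embeddings.shape, binary_embeddings.dtype)  # (2, 128) int8
```

Note the dimensionality change mentioned above: retrieval over binary embeddings is then done with Hamming distance rather than cosine similarity.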
And here is the implementation for Scalar (int8) Quantization:
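Again a sketch; the four calibration sentences below just stand in for a real calibration corpus (see the recommendations right after):

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.quantization import quantize_embeddings

model = SentenceTransformer("mixedbread-ai/mxbai-embed-large-v1")

# Toy calibration set; in practice this should be a large corpus that is
# representative of the data you will embed.
calibration_embeddings = model.encode([
    "The lake is calm today.",
    "Storms are forecast for tonight.",
    "I am driving to the lake.",
    "It is a beautiful day.",
])

embeddings = model.encode(["I am driving to the lake."])

# Scalar quantization: per-dimension min/max ranges are derived from the
# calibration embeddings, and each float32 value is mapped onto int8 within
# its dimension's range, so the dimensionality (1024 here) is preserved.
int8_embeddings = quantize_embeddings(
    embeddings,
    precision="int8",
    calibration_embeddings=calibration_embeddings,
)
print(int8_embeddings.shape, int8_embeddings.dtype)  # (1, 1024) int8
```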
For scalar quant to produce good results it is recommended to provide it with either:
a) a large set of embeddings to quantize all at once, or
b) min and max ranges for each of the embedding dimensions, or
c) a large calibration dataset of embeddings from which the min and max ranges can be computed.
As our embedding API here produces embeddings “on the fly”, usually for just one text or a small batch, option a) seems out. Option c) is also out, as we want our embedding API to stay flexible (handle various datasets). Which leaves us with option b).
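For what it's worth, quantize_embeddings already supports option b) directly via its ranges parameter: a (2, dim) array with per-dimension minimums in row 0 and maximums in row 1. So one approach would be to compute the ranges once offline and load them at serving time; the file name and helper below are hypothetical:

```python
import numpy as np
from sentence_transformers.quantization import quantize_embeddings

# One-off, offline (pseudocode):
#   calibration_embeddings = model.encode(large_corpus)        # (N, 1024)
#   ranges = np.vstack([calibration_embeddings.min(axis=0),
#                       calibration_embeddings.max(axis=0)])   # (2, 1024)
#   np.save("embedding_ranges.npy", ranges)

# At serving time: fixed ranges make quantization deterministic even for a
# single embedding, which is what an on-the-fly API needs.
ranges = np.load("embedding_ranges.npy")  # hypothetical precomputed file

def quantize_single(embedding: np.ndarray) -> np.ndarray:
    """Quantize one (1, dim) float32 embedding to int8 using fixed ranges."""
    return quantize_embeddings(embedding, precision="int8", ranges=ranges)
```

The ranges still have to come from somewhere, of course, so b) is really c) done once per model rather than per request or per dataset.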
Does anyone have an idea how to implement this with infinity? Or a better idea for quantization altogether?