
How to accelerate the bge-m3 sparse embedding module during inference? #294

Open
seetimee opened this issue Jul 2, 2024 · 2 comments

Comments


seetimee commented Jul 2, 2024

Feature request

How can I accelerate the bge-m3 sparse embedding module during inference?

Motivation

The sparse embedding step is still too slow when running bge-m3 inference, even after the dense embedding inference has been accelerated.

Your contribution

If you can suggest an approach, I will learn how to make it work.


greptile-apps bot commented Jul 2, 2024

To accelerate the bge-m3 sparse embedding module during inference, consider the following steps:

  1. Optimize Model Loading:
    • Ensure the model is loaded efficiently in select_model.py.
    • Use an optimized runtime such as ONNX Runtime or TensorRT if supported (see the export sketch after this list).
  2. Batch Processing:
    • Increase the batch size in the BatchHandler initialization in engine.py.
    • Adjust batch_delay to balance latency and throughput.
  3. Caching:
    • Implement or enhance caching mechanisms in batch_handler.py to reuse embeddings for repeated inputs (see the caching sketch below).
  4. Hardware Acceleration:
    • Utilize a GPU or specialized hardware (e.g., AWS Neuron) for inference (see the fp16 sketch below).
    • Check transformer/embedder/neuron.py for Neuron-specific optimizations.
  5. Parallel Processing:
    • Use multi-threading or asynchronous processing in the BatchHandler and ModelWorker classes.
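For step 1, a minimal sketch of exporting the bge-m3 backbone to ONNX with Hugging Face Optimum is shown below; this is an illustration, not infinity_emb's built-in path. Note that bge-m3's sparse (lexical) weights are produced by an extra head on top of the per-token hidden states in FlagEmbedding, so this export only accelerates the shared transformer backbone.

# onnx_export_sketch.py (illustrative only, not part of infinity_emb)
from optimum.onnxruntime import ORTModelForFeatureExtraction
from transformers import AutoTokenizer

model_id = "BAAI/bge-m3"
tokenizer = AutoTokenizer.from_pretrained(model_id)
# export=True converts the PyTorch checkpoint to ONNX and runs it with ONNX Runtime
model = ORTModelForFeatureExtraction.from_pretrained(model_id, export=True)

inputs = tokenizer(["what is bge-m3?"], padding=True, return_tensors="pt")
# per-token hidden states; the sparse/lexical weights are derived from these
token_states = model(**inputs).last_hidden_state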

Example Code Adjustments:

# engine.py
self._batch_handler = BatchHandler(
    max_batch_size=self._engine_args.batch_size * 2,  # Increase batch size
    model=self._model,
    batch_delay=self._min_inference_t / 4,  # Adjust batch delay
    vector_disk_cache_path=self._engine_args.vector_disk_cache_path,
    verbose=logger.level <= 10,
    lengths_via_tokenize=self._engine_args.lengths_via_tokenize,
)
# batch_handler.py
from concurrent.futures import ThreadPoolExecutor

class BatchHandler:
    def __init__(self, ...):
        ...
        self._threadpool = ThreadPoolExecutor(max_workers=8)  # Increase thread pool size
        ...
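For step 3, a hypothetical in-process cache that skips the sparse-embedding forward pass for repeated inputs could look like the sketch below; the class and parameter names are illustrative and are not infinity_emb's actual interfaces, and it would complement the existing vector_disk_cache_path option rather than replace it.

# caching_sketch.py (hypothetical, not infinity_emb's API)
from functools import lru_cache

class CachedSparseEmbedder:
    def __init__(self, embed_fn, maxsize=10_000):
        # embed_fn is assumed to map a single text to its sparse embedding
        self._embed = lru_cache(maxsize=maxsize)(embed_fn)

    def encode(self, texts):
        # repeated texts are served from the LRU cache instead of the model
        return [self._embed(t) for t in texts]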

These changes should help in accelerating the sparse embedding process for the bge-m3 model.
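For step 4, if you call bge-m3 directly through FlagEmbedding rather than through infinity, running on a GPU with fp16 enabled and skipping the ColBERT multi-vector output already reduces the cost of the sparse (lexical) weights; a minimal sketch of that standard usage, shown here only for comparison:

# flagembedding_fp16_sketch.py (standard FlagEmbedding usage, for comparison)
from FlagEmbedding import BGEM3FlagModel

model = BGEM3FlagModel("BAAI/bge-m3", use_fp16=True)  # fp16 is intended for GPU inference
output = model.encode(
    ["what is bge-m3?"],
    batch_size=64,
    return_dense=True,
    return_sparse=True,        # token -> weight dictionaries (lexical weights)
    return_colbert_vecs=False, # skipping the multi-vector output saves time
)
sparse_weights = output["lexical_weights"]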

References

/libs/infinity_emb/infinity_emb/engine.py
/libs/infinity_emb/infinity_emb/inference/select_model.py
/libs/infinity_emb/infinity_emb/inference/batch_handler.py
/libs/infinity_emb/infinity_emb/inference
/libs/infinity_emb/infinity_emb/transformer/embedder



seetimee commented Jul 2, 2024

These methods are difficult to get working.
