
Processing in image encoding for Florence 2 #1170

Closed
ir2718 opened this issue Jan 27, 2025 · 7 comments
Labels
question Further information is requested

Comments

ir2718 commented Jan 27, 2025

Question

Hi,

While having a look at the code for generation with the Florence 2 model, I noticed something odd. The original inference code uses the _encode_image method to create image features. However, looking at the encode_image used in transformers.js, the postprocessing after the model forward pass appears to be missing. Here's a minimal reproducible example:

import onnxruntime as ort

from transformers import AutoModelForCausalLM, AutoProcessor
from PIL import Image

# The vision encoder was downloaded from:
# https://huggingface.co/onnx-community/Florence-2-base-ft/resolve/main/onnx/vision_encoder.onnx
ONNX_MODEL_PATH = "models/onnx/original/vision_encoder.onnx"
MODEL_NAME = "microsoft/Florence-2-base-ft"
# Image download link:
# https://upload.wikimedia.org/wikipedia/en/7/7d/Lenna_%28test_image%29.png
IMG_PATH = "lena.png"
PROMPT = "<MORE_DETAILED_CAPTION>"

processor = AutoProcessor.from_pretrained(
    MODEL_NAME, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, trust_remote_code=True)

image = Image.open(IMG_PATH)
inputs = processor(text=PROMPT, images=image, return_tensors="pt")

hf_out = model._encode_image(inputs["pixel_values"])

ort_vision_tower = ort.InferenceSession(ONNX_MODEL_PATH)
ort_out = ort_vision_tower.run(
    None, {"pixel_values": inputs["pixel_values"].numpy()})[0]

print(hf_out.cpu().detach().numpy())
print()
print(ort_out)

The feature differences are pretty big:

[[[-0.4047455   0.51958734 -0.23121671 ...  1.0019573  -0.46846968
    0.5289913 ]
  [-0.08135182 -2.0622678  -0.50597775 ...  0.38061845 -0.7858853
   -1.247189  ]
  [ 0.69417834 -1.926735   -0.691345   ... -0.17574754 -0.98472327
   -1.2420652 ]
  ...
  [ 0.018062    1.2185848  -0.04483193 ...  0.61767036 -0.1832848
    0.9324351 ]
  [-0.13765828  0.7120823   0.12478658 ... -0.44853052 -0.6390534
    0.37095645]
  [ 0.58084226  1.6617624  -0.43527135 ... -0.92560166 -0.47037867
   -0.81996024]]]

[[[-0.52661824  0.508744   -0.24130312 ...  0.91191643 -0.39472336
    1.1632534 ]
  [-0.18091503 -2.2187433  -0.7923498  ...  0.6103708  -0.49637306
   -0.9830185 ]
  [ 0.3002218  -1.9726763  -1.1151179  ... -0.11572987 -0.6870862
   -0.96058726]
  ...
  [-0.08202907  0.8105656  -0.1748765  ...  1.0833437  -0.41167092
    1.2495995 ]
  [-0.01531404  0.6044417  -0.06392197 ... -0.30775025 -0.5735508
    0.6775356 ]
  [ 0.74322057  1.4011574  -0.5277405  ... -0.61488384 -0.40253094
   -0.8440974 ]]]
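
For reference, the postprocessing I'm referring to looks roughly like the sketch below. This is a simplified reading of the remote modeling code's _encode_image (it omits the temporal-embedding and feature-source pooling branches, so treat the attribute names and shapes as approximate):

def encode_image_with_postprocessing(model, pixel_values):
    # Sketch only: mirrors the main steps of Florence 2's _encode_image.
    batch_size = pixel_values.shape[0]

    # 1. Raw vision tower features; this appears to be all the exported
    #    ONNX vision encoder returns.
    x = model.vision_tower.forward_features_unpool(pixel_values)

    # 2. Learned 2D positional embedding over the token grid.
    if model.image_pos_embed is not None:
        num_tokens = x.shape[-2]
        h = w = int(num_tokens ** 0.5)
        x = x.view(batch_size, h, w, x.shape[-1])
        x = x + model.image_pos_embed(x)
        x = x.view(batch_size, h * w, x.shape[-1])

    # 3. Projection into the language model's hidden size, plus layer norm.
    x = x @ model.image_projection
    x = model.image_proj_norm(x)
    return x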

Am I missing something here, or is this a potential bug?

ir2718 added the question label on Jan 27, 2025
xenova (Collaborator) commented Feb 8, 2025

This might be due to image-reading differences in JavaScript vs. Python. Could you try passing the exact same data (e.g., an all-zero tensor) to see if the difference is there too? Also, remember to load the full-precision model in Transformers.js, as this could be another source of differences.
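
For example, something along these lines (a minimal sketch reusing model and ort_vision_tower from the snippet above; the 1x3x768x768 shape assumes Florence 2's default processor size, so check inputs["pixel_values"].shape if unsure):

import numpy as np
import torch

# All-zero input, so both paths see identical data.
pixel_values = torch.zeros(1, 3, 768, 768)

hf_out = model._encode_image(pixel_values).cpu().detach().numpy()
ort_out = ort_vision_tower.run(
    None, {"pixel_values": pixel_values.numpy()})[0]

print("max abs diff:", np.abs(hf_out - ort_out).max())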

ir2718 (Author) commented Feb 8, 2025

I've modified the minimal example by creating a blank image as follows:

image = Image.new("RGB", (512, 512))

However, the results are still different:

[[[ 0.7425719   0.01071071 -0.06678165 ...  1.2542837  -0.2884356
   -0.69630283]
  [ 0.9451557  -0.30704248  0.69962746 ... -0.72856545  0.15360388
   -1.0232862 ]
  [ 1.2876275   0.5174419  -0.2222641  ... -0.32981807  0.44000283
   -1.1317426 ]
  ...
  [ 0.20901655 -0.39984626  0.1699695  ...  1.923425   -0.6329966
   -0.91588783]
  [ 0.20724754 -0.40770236  0.42595854 ...  1.7196184  -0.38901007
   -1.0207707 ]
  [-0.08099215 -0.3391677  -0.17075935 ...  1.9568288  -0.02066579
   -1.1172475 ]]]

[[[ 0.0817447   0.41585156 -0.03429735 ...  1.6622943  -0.43160683
    0.08325118]
  [ 0.1999208   0.37867606  0.47249985 ... -0.29732558 -0.00243429
   -0.6437535 ]
  [ 0.27014455  1.0596321   0.04975559 ...  0.2688354   0.25734758
   -0.3757942 ]
  ...
  [-0.09077676  0.39021495  0.19065166 ...  1.6975157  -0.41929
   -0.6461764 ]
  [-0.18240517  0.71244407  0.34832954 ...  1.4980354  -0.24869794
   -0.6761538 ]
  [-0.51288855  0.36046848 -0.42776367 ...  0.80509955 -0.21319357
   -0.94580245]]]
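
To quantify the mismatch instead of eyeballing the printed tensors, a comparison along these lines can be added to the example above (a small sketch, with hf_out and ort_out as defined there):

import numpy as np

hf = hf_out.cpu().detach().numpy()
print("max abs diff:", np.abs(hf - ort_out).max())
print("allclose (atol=1e-3):", np.allclose(hf, ort_out, atol=1e-3))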

To clear up any misunderstanding: the model I used was converted in full precision. Unfortunately, using the model in transformers.js is not an option for me, as my use case requires Python.

Md-Sayeed-Khan commented

Hey, has the issue been resolved? Can you show how to run inference using the ONNX model?

Md-Sayeed-Khan commented

How do you decode the embeddings?

ir2718 (Author) commented Feb 28, 2025

@Md-Sayeed-Khan

The issue has been resolved, and the conversion script has been updated. However, I'm not sure whether the models on the Hub have been updated, as I used the script directly.

Unfortunately, I cannot share the inference code, as I worked on this at my current company. You will have to do some trial and error, but I can confirm that you can get it working in Python.

For decoding, you will have to replicate a decoding strategy such as greedy decoding or beam search.
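
As a rough illustration, a greedy loop over an onnxruntime decoder session might look like the sketch below. The input/output names (input_ids, encoder_hidden_states, logits) and the start token id are assumptions, not the actual signature of the exported Florence 2 decoder; inspect sess.get_inputs() on your export, which may also require attention masks and past key/value tensors:

import numpy as np

def greedy_decode(decoder_sess, encoder_hidden_states, eos_token_id,
                  max_new_tokens=128):
    # Assumed decoder start token id; check the model config.
    tokens = [2]
    for _ in range(max_new_tokens):
        logits = decoder_sess.run(
            ["logits"],  # hypothetical output name
            {
                # Hypothetical input names; verify against your export.
                "input_ids": np.array([tokens], dtype=np.int64),
                "encoder_hidden_states": encoder_hidden_states,
            },
        )[0]
        # Greedy step: pick the highest-scoring next token.
        next_token = int(logits[0, -1].argmax())
        tokens.append(next_token)
        if next_token == eos_token_id:
            break
    return tokens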

Md-Sayeed-Khan commented

@ir2718 Thank you for the update. Can these converted models be used for an object detection task?

xenova (Collaborator) commented Mar 2, 2025

> The issue has been resolved, and the conversion script has been updated.

Since this has been resolved, I'll close the issue, but feel free to continue the discussion here.

> Thank you for the update. Can these converted models be used for an object detection task?

Yes, the models are capable of this; you just need to specify the correct task prompts. See the original model card for more details.
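
For example, with the PyTorch model and processor from the first snippet, the model card's object detection usage looks roughly like this (a sketch following the Florence 2 model card; please verify the details against the card itself):

# Object detection uses the <OD> task prompt.
task = "<OD>"
inputs = processor(text=task, images=image, return_tensors="pt")
generated_ids = model.generate(
    input_ids=inputs["input_ids"],
    pixel_values=inputs["pixel_values"],
    max_new_tokens=1024,
    do_sample=False,
    num_beams=3,
)
generated_text = processor.batch_decode(
    generated_ids, skip_special_tokens=False)[0]
# Parse the generated text into boxes and labels.
parsed = processor.post_process_generation(
    generated_text, task=task, image_size=(image.width, image.height))
print(parsed)  # e.g. {'<OD>': {'bboxes': [...], 'labels': [...]}}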

xenova closed this as completed on Mar 2, 2025