Hello, I have a question about image reconstruction via VAR.
I want the transformer model to predict the ground-truth tokens, just as in training: obtain the image tokens through the VQ encoder, interpolate the tokens across scales, and then feed them into the transformer (similar to inversion in diffusion models).
However, when I set the code up this way, the result differed from the original image.
Could I have missed something, or is this approach not feasible?
Here are the original image and the reconstructed image.
original image
recon image
And here is my code: tr_input_embed has shape [B, 679, 32], and I implemented it inside the VAR class.
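In case it helps, here is a minimal sketch of what I am doing. It is written against the helper names I found in the training script (`img_to_idxBl`, `quantize.idxBl_to_var_input`, `idxBl_to_img`); my actual code may differ slightly, and these signatures are my reading of the repo, not a definitive reference.

```python
import torch

@torch.no_grad()
def teacher_forced_reconstruct(vae, var, img_B3HW, label_B):
    # 1) Encode the image into ground-truth multi-scale token indices,
    #    one LongTensor of shape [B, l_i] per scale.
    gt_idx_Bl = vae.img_to_idxBl(img_B3HW)

    # 2) Build the interpolated teacher-forcing input, which is where my
    #    tr_input_embed of shape [B, 679, 32] comes from (all scales except
    #    the first token, channel dim = Cvae).
    x_BLCv_wo_first_l = vae.quantize.idxBl_to_var_input(gt_idx_Bl)

    # 3) Run the transformer exactly as in training; logits cover all
    #    680 positions because the first scale is fed via the class token.
    logits_BLV = var(label_B, x_BLCv_wo_first_l)   # [B, 680, V]
    pred_idx_BL = logits_BLV.argmax(dim=-1)

    # 4) Split the flat prediction back into per-scale lists and decode.
    lens = [idx.shape[1] for idx in gt_idx_Bl]
    pred_idx_Bl = list(pred_idx_BL.split(lens, dim=1))
    return vae.idxBl_to_img(pred_idx_Bl, same_shape=True, last_one=True)
```

If I am misusing any of these helpers, that might already explain the mismatch, so corrections are welcome.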
Again, thanks for your great work!