OpenAI's GPT-4o is not open-source. I was recently reading a Reddit thread speculating on its architecture, which unifies text, vision, and voice modalities in one model.
One user speculates:
> I wonder if it's something closer to the original DALL-E where the image was decomposed into image tokens ... The embeddings of the image tokens and text tokens could share the same latent space, so that model was "natively" multimodal.
Another replies:
> Yes, I think that's exactly it ... 'Just' train a encoder tokenizer for each modality, maybe define some of the extra 100k BPEs as modality-specific delimiters similar to delimiting prompts/end-of-text tokens - and then it's just 'tokenize all the things'
Which all got me thinking... what would it look like to "tokenize all the things" in a genomic context? We have modalities like scATAC-seq, scRNA-seq, and methylation, plus the textual metadata associated with these datasets. I've proposed two multi-modal architectures in the past: one being the scRNA-seq tokenizer (https://github.com/databio/geniml_dev/issues/123), and the other a CLIP-like model. But could we come up with ideas to "tokenize all the things" such that a model could take anything as input (scRNA-seq, scATAC-seq, methylation, or text) and then either 1) output an embedding, or 2) generate another modality?
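To make the idea concrete, here is a minimal sketch of what "tokenize all the things" might look like: each modality keeps its own vocabulary (a region universe for scATAC-seq, gene symbols for scRNA-seq, BPE pieces for text), everything is mapped into one shared id space, and modality-specific delimiter tokens mark where each modality starts, as suggested in the quote above. All names here are hypothetical and not geniml APIs; this assumes per-modality vocabularies already exist.

```python
# Hypothetical sketch: one shared token id space across genomic modalities,
# with modality-specific delimiter tokens. Not a geniml API.

SPECIAL_TOKENS = ["<pad>", "<eos>", "<atac>", "<rna>", "<meth>", "<text>"]

class UnifiedTokenizer:
    def __init__(self, modality_vocabs: dict[str, list[str]]):
        # Shared id space: special tokens first, then each modality's
        # vocabulary appended in a fixed order, namespaced by modality.
        self.token_to_id = {tok: i for i, tok in enumerate(SPECIAL_TOKENS)}
        for modality, vocab in modality_vocabs.items():
            for tok in vocab:
                self.token_to_id.setdefault(f"{modality}:{tok}", len(self.token_to_id))

    def encode(self, modality: str, tokens: list[str]) -> list[int]:
        # Open with the modality delimiter, close with <eos>, so a single
        # transformer can consume interleaved modalities in one stream.
        ids = [self.token_to_id[f"<{modality}>"]]
        ids += [self.token_to_id[f"{modality}:{tok}"] for tok in tokens]
        ids.append(self.token_to_id["<eos>"])
        return ids


# Toy example: a cell represented by accessible regions plus a text label.
tok = UnifiedTokenizer({
    "atac": ["chr1:100-600", "chr1:900-1500"],  # toy region universe
    "rna": ["SOX2", "NANOG"],                   # toy gene vocabulary
    "text": ["embryonic", "stem", "cell"],      # word-level stand-in for BPE
})
sequence = (
    tok.encode("atac", ["chr1:100-600", "chr1:900-1500"])
    + tok.encode("text", ["embryonic", "stem", "cell"])
)
print(sequence)  # one interleaved token stream
```

With a scheme like this, either goal falls out of standard training setups: mask-or-contrast over the shared stream for embeddings, or autoregressive decoding conditioned on one modality's span to generate another.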
Of course at the end of the day... we need the datasets :/