OpenAI's GPT-4o is not open-source. I was recently reading a Reddit thread speculating on its architecture, which unifies text, vision, and voice modalities in one model.
One user speculates:
> I wonder if it's something closer to the original DALL-E where the image was decomposed into image tokens ... The embeddings of the image tokens and text tokens could share the same latent space, so that model was "natively" multimodal.
Another replies:
> Yes, I think that's exactly it ... 'Just' train a encoder tokenizer for each modality, maybe define some of the extra 100k BPEs as modality-specific delimiters similar to delimiting prompts/end-of-text tokens - and then it's just 'tokenize all the things'
Which all got me thinking... what would it look like to "tokenize all the things" in a genomic context? We have modalities like scATAC-seq, scRNA-seq, and methylation, plus the textual metadata associated with these datasets. I've proposed two multi-modal architectures in the past: one being the scRNA-seq tokenizer (https://github.com/databio/geniml_dev/issues/123), and the other a CLIP-like model. But could we come up with ideas to "tokenize all the things" such that a model could take anything as input (scRNA-seq, scATAC-seq, methylation, or text) and then either 1) output an embedding, or 2) generate another modality?
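To make the idea concrete, here is a minimal sketch of what "tokenize all the things" might look like: each modality keeps its own vocabulary (a region universe for scATAC-seq, gene symbols for scRNA-seq, BPE pieces for text), everything is mapped into one shared id space, and modality-specific delimiter tokens mark where each modality starts, as suggested in the quote above. All names here are hypothetical and not geniml APIs; this assumes per-modality vocabularies already exist.

```python
# Hypothetical sketch: one shared token id space across genomic modalities,
# with modality-specific delimiter tokens. Not a geniml API.

SPECIAL_TOKENS = ["<pad>", "<eos>", "<atac>", "<rna>", "<meth>", "<text>"]

class UnifiedTokenizer:
    def __init__(self, modality_vocabs: dict[str, list[str]]):
        # Shared id space: special tokens first, then each modality's
        # vocabulary appended in a fixed order, namespaced by modality.
        self.token_to_id = {tok: i for i, tok in enumerate(SPECIAL_TOKENS)}
        for modality, vocab in modality_vocabs.items():
            for tok in vocab:
                self.token_to_id.setdefault(f"{modality}:{tok}", len(self.token_to_id))

    def encode(self, modality: str, tokens: list[str]) -> list[int]:
        # Open with the modality delimiter, close with <eos>, so a single
        # transformer can consume interleaved modalities in one stream.
        ids = [self.token_to_id[f"<{modality}>"]]
        ids += [self.token_to_id[f"{modality}:{tok}"] for tok in tokens]
        ids.append(self.token_to_id["<eos>"])
        return ids


# Toy example: a cell represented by accessible regions plus a text label.
tok = UnifiedTokenizer({
    "atac": ["chr1:100-600", "chr1:900-1500"],  # toy region universe
    "rna": ["SOX2", "NANOG"],                   # toy gene vocabulary
    "text": ["embryonic", "stem", "cell"],      # word-level stand-in for BPE
})
sequence = (
    tok.encode("atac", ["chr1:100-600", "chr1:900-1500"])
    + tok.encode("text", ["embryonic", "stem", "cell"])
)
print(sequence)  # one interleaved token stream
```

With a scheme like this, either goal falls out of standard training setups: mask-or-contrast over the shared stream for embeddings, or autoregressive decoding conditioned on one modality's span to generate another.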
Of course at the end of the day... we need the datasets :/