NLP/Hugging Face tokenizer vocabularies are often distributed as `.json` configuration files. The reason for this is that modern language tokenizers are configurable beyond their respective vocabularies (e.g. pre-processors, special tokens, post-processors, etc.).
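For reference, here is a heavily trimmed illustration of the kind of structure a Hugging Face `tokenizer.json` carries; the values are illustrative, not copied from a real model, but they show how the vocabulary is just one field among several pipeline components:

```json
{
  "normalizer": { "type": "Lowercase" },
  "pre_tokenizer": { "type": "Whitespace" },
  "post_processor": { "type": "TemplateProcessing" },
  "added_tokens": [ { "id": 0, "content": "<unk>", "special": true } ],
  "model": {
    "type": "BPE",
    "vocab": { "<unk>": 0, "the": 1 }
  }
}
```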
Should we distribute `gtokenizers` the same way? Instead of a single BED file, it's a `.yaml` file that points to a BED file, in addition to other things like maybe a list of `exclude_ranges`, secondary universes (hierarchical tokenization), etc.
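A hypothetical sketch of what such a file could look like; every field name below is illustrative, not an existing schema:

```yaml
# tokenizer.yaml -- hypothetical gtokenizers config; field names are illustrative
universe: universe.bed            # primary vocabulary, a plain BED file
exclude_ranges: exclude.bed       # regions to mask before tokenization
special_tokens:
  unk: "<unk>"
  pad: "<pad>"
secondary_universes:              # coarser tiers for hierarchical tokenization
  - tier2_universe.bed
```

This mirrors the Hugging Face pattern: the vocabulary itself stays a plain BED file, and the `.yaml` layer carries everything around it.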
This could be a way to implement hierarchical universes in addition to enhancing the fragment tokenizers.
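As a rough sketch of what hierarchical tokenization could mean here, assuming universes are ordered finest to coarsest; the interval representation and `overlaps` helper are stand-ins, not the `gtokenizers` API:

```python
from typing import Optional

Region = tuple[str, int, int]  # (chrom, start, end), half-open

def overlaps(a: Region, b: Region) -> bool:
    """True if two intervals on the same chromosome overlap."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]

def tokenize(region: Region, universes: list[list[Region]]) -> Optional[Region]:
    """Map a region to the first matching token, trying each universe tier
    in order: the primary universe first, then coarser secondary universes."""
    for universe in universes:
        for token in universe:
            if overlaps(region, token):
                return token
    return None  # falls through every tier -> treat as <unk>

# Example: a fragment missing from the primary universe falls back to tier 2.
primary = [("chr1", 100, 200)]
tier2 = [("chr1", 0, 1000)]
print(tokenize(("chr1", 500, 600), [primary, tier2]))  # ('chr1', 0, 1000)
```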