NLP/Hugging Face tokenizer vocabularies are often distributed as `.json` configuration files. The reason for this is that modern language tokenizers are configurable beyond their respective vocabularies (e.g. pre-processors, special tokens, post-processors, etc.).
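For reference, here is a heavily trimmed illustration of the kind of structure a Hugging Face `tokenizer.json` carries; the values are illustrative, not copied from a real model, but they show how the vocabulary is just one field among several pipeline components:

```json
{
  "normalizer": { "type": "Lowercase" },
  "pre_tokenizer": { "type": "Whitespace" },
  "post_processor": { "type": "TemplateProcessing" },
  "added_tokens": [ { "id": 0, "content": "<unk>", "special": true } ],
  "model": {
    "type": "BPE",
    "vocab": { "<unk>": 0, "the": 1 }
  }
}
```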
Should we distribute `gtokenizers` the same way? Instead of a single BED file, it's a `.yaml` file that points to a BED file, in addition to other things like maybe a list of `exclude_ranges`, secondary universes (hierarchical tokenization), etc.
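A hypothetical sketch of what such a file could look like; every field name below is illustrative, not an existing schema:

```yaml
# tokenizer.yaml -- hypothetical gtokenizers config; field names are illustrative
universe: universe.bed            # primary vocabulary, a plain BED file
exclude_ranges: exclude.bed       # regions to mask before tokenization
special_tokens:
  unk: "<unk>"
  pad: "<pad>"
secondary_universes:              # coarser tiers for hierarchical tokenization
  - tier2_universe.bed
```

This mirrors the Hugging Face pattern: the vocabulary itself stays a plain BED file, and the `.yaml` layer carries everything around it.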
This could be a way to implement hierarchical universes in addition to enhancing the fragment tokenizers.
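As a rough sketch of what hierarchical tokenization could mean here, assuming universes are ordered finest to coarsest; the interval representation and `overlaps` helper are stand-ins, not the `gtokenizers` API:

```python
from typing import Optional

Region = tuple[str, int, int]  # (chrom, start, end), half-open

def overlaps(a: Region, b: Region) -> bool:
    """True if two intervals on the same chromosome overlap."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]

def tokenize(region: Region, universes: list[list[Region]]) -> Optional[Region]:
    """Map a region to the first matching token, trying each universe tier
    in order: the primary universe first, then coarser secondary universes."""
    for universe in universes:
        for token in universe:
            if overlaps(region, token):
                return token
    return None  # falls through every tier -> treat as <unk>

# Example: a fragment missing from the primary universe falls back to tier 2.
primary = [("chr1", 100, 200)]
tier2 = [("chr1", 0, 1000)]
print(tokenize(("chr1", 500, 600), [primary, tier2]))  # ('chr1', 0, 1000)
```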