GCGC

GCGC is a tool for feature processing on biological sequences.

Installation

GCGC is primarily intended to be used as part of a larger workflow inside Python.

To install via pip:

$ pip install gcgc

If you'd like to use the code that helps gcgc's tokenizers integrate with common third-party libraries, either install those packages separately or use gcgc's extras.

$ pip install 'gcgc[hf]'

Documentation

The GCGC documentation is at gcgc.trenthauck.com; see it for examples.

Quick Start

The easiest way to get started is to import the kmer tokenizer, configure it, then start tokenizing.

from gcgc import KmerTokenizer

kmer_tokenizer = KmerTokenizer(alphabet="unambiguous_dna")
encoded = kmer_tokenizer.encode("ATCG")
print(encoded)

sample output:

[1, 6, 7, 8, 5, 2]

This output starts with the "bos" (beginning-of-sequence) token and ends with the "eos" (end-of-sequence) token, with the four nucleotide tokens in between.
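To make the mapping concrete, here is a minimal plain-Python sketch that reproduces the sample output without importing gcgc. It assumes the vocabulary printed in this README for alphabet="unambiguous_dna", and that '>' and '<' are the bos and eos tokens, as the decoded sample suggests. The helper name encode_sketch is hypothetical and is not part of gcgc's API.

```python
# Vocabulary copied from this README's sample output for alphabet="unambiguous_dna".
STOI = {'|': 0, '>': 1, '<': 2, '#': 3, '?': 4, 'G': 5, 'A': 6, 'T': 7, 'C': 8}


def encode_sketch(sequence, stoi=STOI, bos=">", eos="<"):
    """Hypothetical helper: wrap the sequence in bos/eos, then look up each symbol."""
    tokens = [bos, *sequence, eos]
    return [stoi[token] for token in tokens]


encoded = encode_sketch("ATCG")
print(encoded)
```

This is only an illustration of the id layout for single-nucleotide tokens; the real tokenizer may handle longer kmers and unknown symbols differently.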

You can go the other way and convert the integers to strings.

from gcgc import KmerTokenizer

kmer_tokenizer = KmerTokenizer(alphabet="unambiguous_dna")
decoded = kmer_tokenizer.decode(kmer_tokenizer.encode("ATCG"))
print(decoded)

sample output:

['>', 'A', 'T', 'C', 'G', '<']

The kmer tokenizer also exposes its vocab, which maps each token string to its integer id.

from gcgc import KmerTokenizer

kmer_tokenizer = KmerTokenizer(alphabet="unambiguous_dna")
print(kmer_tokenizer.vocab.stoi)

sample output:

{'|': 0, '>': 1, '<': 2, '#': 3, '?': 4, 'G': 5, 'A': 6, 'T': 7, 'C': 8}
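Because the mapping above goes from string to integer, decoding amounts to inverting it. The sketch below shows that inversion in plain Python, assuming the sample vocabulary printed above. The names ITOS and decode_sketch are hypothetical and are not taken from gcgc.

```python
# Vocabulary copied from the sample output above.
STOI = {'|': 0, '>': 1, '<': 2, '#': 3, '?': 4, 'G': 5, 'A': 6, 'T': 7, 'C': 8}

# Invert the string-to-index mapping to get index-to-string.
ITOS = {index: token for token, index in STOI.items()}


def decode_sketch(ids, itos=ITOS):
    """Hypothetical helper: map each integer id back to its token string."""
    return [itos[i] for i in ids]


decoded = decode_sketch([1, 6, 7, 8, 5, 2])
print(decoded)
```

The inversion is lossless here because every token has a distinct id.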
