Draft bed2zarr code #281

Draft - wants to merge 5 commits into main
Conversation

percyfal

Draft code to convert Bed3-format to Zarr. Addresses sgkit-dev/sgkit#1219 where we briefly discuss the need for a tool to convert bed to Zarr. As suggested, I started a draft (at least that was my interpretation, or would you have preferred an issue to start with @jeromekelleher?).

Before going too far in development, I submit this draft to discuss some of the issues I have at the moment:

  1. currently the tool throws an error if contig lengths are undefined; this information is not available in BED files, but could be supplied by passing a fasta index as an argument. That should probably go in a different tool (e.g. fai2zarr) for filling in missing contig lengths? There are therefore only two small tests at the moment but I plan to add more once we agree on the implementation.
  2. I am still a bit confused by Zarr terminology. The tool CLI is currently bed2zarr [OPTIONS] BED_PATH ZARR_PATH BED_ARRAY where BED_ARRAY is the name of the Zarr dataset.
  3. By default BED_ARRAY is bed_mask, where the BED file is stored as a 0/1-array, along with bed_mask_contig, which contains the contig for each site, modeled after variant_contig. Both arrays have length equal to the total genome length.
  4. Should there be a specification for these array types, either in a schema or other format?

I guess one could add support for other BED formats later on. Here, I focus on the more specific task of generating 0/1-based sequence masks to indicate missing data / genome accessibility.
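For concreteness, the per-site representation described in point 3 could be sketched roughly as follows (the contig lengths and BED intervals here are made up, and the array names simply mirror the draft's bed_mask / bed_mask_contig):

```python
import numpy as np

# Hypothetical contig lengths; not available from the BED file itself (see point 1)
contig_lengths = {"chr1": 10, "chr2": 5}
bed_intervals = [("chr1", 2, 5), ("chr2", 0, 3)]  # BED3: chrom, start, end

total = sum(contig_lengths.values())
bed_mask = np.zeros(total, dtype="i1")         # 0/1 per nucleotide position
bed_mask_contig = np.zeros(total, dtype="i4")  # contig index per position

# Lay contigs out end to end and record each contig's offset
offset, offsets = 0, {}
for i, (name, length) in enumerate(contig_lengths.items()):
    offsets[name] = offset
    bed_mask_contig[offset : offset + length] = i
    offset += length

# Mark BED intervals (half-open, 0-based) in the genome-length mask
for chrom, start, end in bed_intervals:
    bed_mask[offsets[chrom] + start : offsets[chrom] + end] = 1
```

Both arrays span the full concatenated genome, which is what makes this representation heavy for fragmented assemblies.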

@jeromekelleher
Contributor

Thanks for this @percyfal, and welcome to sgkit-dev 👋!

Some high-level thoughts:

  1. Can we use pandas or numpy to parse the BED file? I don't want to get into the business of text parsing, it's tedious and hard to do well/fast.
  2. It looks like you are trying to integrate the BED data into an existing VCF Zarr store? I would think that first making a 1-1 mapping of the BED data into its equivalent Zarr would be a simpler approach, and then make the problem of integrating the different types of data together the problem of a different tool (keeping bio2zarr focused on this one-to-one literal translation of bio formats into Zarr).
  3. As a general guiding principle, we want to do what htslib does. So, we interpret a given BED file in the same way that it does (rather than what we think the Right Way should be).

Does that clarify at all?
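The pandas route suggested in point 1 might look like this (a sketch only; the column names follow the BED spec, and the two-row BED content is invented for illustration):

```python
import io

import pandas as pd

# Stand-in for an on-disk BED3 file
bed_text = "chr1\t5\t10\nchr1\t15\t20\n"

df = pd.read_csv(
    io.StringIO(bed_text),
    sep="\t",
    header=None,
    names=["chrom", "chromStart", "chromEnd"],
    dtype={"chrom": str, "chromStart": "int64", "chromEnd": "int64"},
)
```

pandas handles the tokenising and type conversion, so no hand-written text parsing is needed.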

@jeromekelleher
Contributor

As a follow-up, we probably would want to write a spec somewhere that lays out exactly the array names etc, but I don't think there's any need initially. Let's get something working first and spec it out later.

@percyfal
Author

Thanks for the feedback. Let me address your points one by one.

  1. Sure - for simplicity I'll use pandas from now on and see later on whether numpy is necessary

  2. Agreed, this would be simpler: generate a separate Zarr store for bed and then have a separate tool to combine with, say, VcfZarr store? Then the spec you mentioned could be tied to a store defined by a BedZarr (say) class, modeled on VcfZarr? This approach would require an additional required option, a fasta index or file with genome lengths (genome_info), since we want to make sure the arrays map to every nucleotide position in the reference. I wonder if chromosomes should be stored as different datasets? In any case, this would make it easy to add any number of Bed datasets, from masks to genomic features (e.g., exons, CDS).

  3. this I would need to read up on

Once this tool works, my long-term wish is to have functionality in sgkit that corrects for summary statistics for accessible sites (be it in windows or not) but that also summarizes statistics across features from annotations. I would be happy to draft that too if that is a good fit, but I think in this case opening a discussion on sgkit prior to development would be beneficial?

@percyfal
Author

To clarify: with chromosomes as different datasets I mean having a group for a bed feature (say mask) where a dataset is created within the group, instead of as in the current draft where I make one long array equal to the total genome length. Some organisms I work on have very fragmented genomes (+100k scaffolds/contigs).

@jeromekelleher
Contributor

jeromekelleher commented Sep 18, 2024

Agreed, this would be simpler: generate a separate Zarr store for bed and then have a separate tool to combine with, say, VcfZarr store? Then the spec you mentioned could be tied to a store defined by a BedZarr (say) class, modeled on VcfZarr? This approach would require an additional required option, a fasta index or file with genome lengths (genome_info), since we want to make sure the arrays map to every nucleotide position in the reference. I wonder if chromosomes should be stored as different datasets? In any case, this would make it easy to add any number of Bed datasets, from masks to genomic features (e.g., exons, CDS)

I'm hoping that BedZarr can be as lightweight as possible and purely just take the input text file and convert it to a Zarr with the corresponding arrays. I think we can then place the burden of checking the validity of these intervals on the tool that annotates a VCF Zarr with the things we're interested in.

So, suppose we have a BED that specifies an accessibility mask, accessibility.bed. The workflow might look like:

# convert on the command line
$ bed2zarr accessibility.bed accessibility.zarr

then in sgkit we have something like

access_zarr = zarr.open("accessibility.zarr")
# NB: These are names from the top of my head, not an actual proposal!
ds = sgkit.add_accessibility_mask(ds, start=access_zarr["start"], end=access_zarr["end"])

That is, sgkit just takes the start and end coordinates of the intervals as arrays, and is decoupled from the BedZarr format. Sgkit then adds a new boolean variant_accessibility_mask and marks each variant as true or false, depending on whether it falls in one of those intervals. (Let's brush the details of zero-based vs one-based and closed-vs-open intervals aside for now!). That way, there's no need to worry too much about whether the intervals cover the entire genome or not, they just have to cover the variants that are in the dataset.

Does that help clarify?
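One way the interval lookup behind such an add_accessibility_mask could work (purely illustrative, not sgkit's actual implementation; assumes sorted, non-overlapping, half-open intervals on a single contig):

```python
import numpy as np

starts = np.array([5, 15])   # interval starts
ends = np.array([10, 20])    # interval ends (half-open)
variant_position = np.array([3, 7, 15, 25])

# Index of the last interval whose start is at or before each position
idx = np.searchsorted(starts, variant_position, side="right") - 1

# A variant is inside if such an interval exists and the position is before its end
inside = (idx >= 0) & (variant_position < ends[np.clip(idx, 0, None)])
# inside -> [False, True, True, False]
```

The searchsorted trick keeps the lookup vectorised, avoiding a Python-level loop over variants.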

@percyfal
Author

Ok, I see what you mean; basically convert the BED file columns to separate arrays:

chr1 5 10
chr1 15 20

would translate to

chrom: ['chr1', 'chr1']
start: [5, 15]
end: [10, 20]

in Zarr.
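A literal reading of that mapping in numpy (array creation only; the actual write to a Zarr store is omitted here):

```python
import numpy as np

# The two BED3 records from the example above
records = [("chr1", 5, 10), ("chr1", 15, 20)]

# One array per BED column, each with length equal to the number of records
chrom = np.array([r[0] for r in records])             # -> ['chr1', 'chr1']
start = np.array([r[1] for r in records], "int64")    # -> [5, 15]
end = np.array([r[2] for r in records], "int64")      # -> [10, 20]
```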

However, I don't think it is sufficient to just mark variants as accessible or not, as we also need to keep track of the accessibility of non-variant sites. If you want a genome-wide summary of pi, you need to know the accessibility up until the end of the chromosome.

VARIANT         *      *       *      * *   
POS            0123456789012345678901234567890
MASK           0000011111000001111100000000000

If *=variant site, 0=accessible, 1=masked, and chopping up in 10bp-windows, the first window has 1 variant site but the actual window size is 5bp, not 10.
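The window correction implied by the example can be sketched as follows (the mask string is copied from the diagram above; 1 = masked, 0 = accessible):

```python
import numpy as np

# Per-site mask from the example: True where masked
mask = np.frombuffer(b"0000011111000001111100000000000", dtype="S1") == b"1"
window = 10

# Number of accessible sites per 10bp window (the last window may be short)
accessible = [
    int((~mask[i : i + window]).sum()) for i in range(0, mask.size, window)
]
# accessible -> [5, 5, 10, 1]
```

So a naive per-window pi would divide by 10 everywhere, whereas the corrected denominator for the first window is 5.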

I'll focus on the BED conversion for now, provided we agree on the output format, and deal with the other stuff later.

@jeromekelleher
Contributor

However, I don't think it is sufficient to just mark variants as accessible or not as we also need to keep track of the accessibility of non-variant sites.

I see - well that's a different problem. Let's just convert the file to Zarr first and worry about how to use it later!

- use pandas for reading
- write to isolated zarr archive
- map BED columns to arrays named after BED specification (hts-specs)
@percyfal
Author

Ok, I updated the code to translate the three mandatory fields chrom, chromStart, and chromEnd (following naming conventions in https://samtools.github.io/hts-specs/BEDv1.pdf) to Zarr arrays. I created a class BedZarrWriter, modeled on VcfZarrWriter, and added placeholder classes and functions for field definitions and metadata (BedMetadata, BedField, mandatory_bed_field_definitions) to mimic the setup in bio2zarr.vcf2zarr.vcz. Do you want it simpler still?

@jeromekelleher
Contributor

This seems good for a starting point. I guess we should align with VCF Zarr in terms of names, so contig, start and end seems sensible? There's an annoying difference here where we use integer contig IDs in VCZ whereas we're using the string names here. We don't want to use the same name with different types, I guess. How should we handle this do you think?

Can you sketch out (maybe as comments in the file) how you envisage handling BED files with more columns?

@percyfal
Author

percyfal commented Sep 19, 2024

This seems good for a starting point. I guess we should align with VCF Zarr in terms of names, so contig, start and end seems sensible? There's an annoying difference here where we use integer contig IDs in VCZ whereas we're using the string names here. We don't want to use the same name with different types, I guess. How should we handle this do you think?

In that case I guess we could adopt the VCZ approach and generate an ID mapping to the chrom field (chrom <-> contig_id), using the IDs in a contig field with length equal to the number of entries, which replaces chrom for this purpose. It disrupts the 1-to-1 mapping for the chrom field though.
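A minimal sketch of that mapping (names loosely following VCZ's contig_id; illustrative only, and note that np.unique sorts the names, so preserving file order would need something like pandas.factorize instead):

```python
import numpy as np

# chrom column as read from the BED file, one entry per record
chrom = np.array(["chr1", "chr1", "chr2"])

# contig_id: unique contig names; contig: integer ID per record
contig_id, contig = np.unique(chrom, return_inverse=True)
# contig_id -> ['chr1', 'chr2'];  contig -> [0, 0, 1]
```

The contig array has the same length as the BED records, while contig_id carries the string names once, mirroring how VCZ separates the two.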

Can you sketch out (maybe as comments in the file) how you envisage handling BED files with more columns?

Sure. The remaining columns are well-defined (table 2 in the spec) and would make drafting a spec pretty straightforward:

Col  Name         Type         Valid values                                     Description
4    name         String       [\x20-\x7e]{1,255}                               Feature description
5    score        Int          [0, 1000]                                        A numerical value
6    strand       String       [-+.]                                            Feature strand
7    thickStart   Int          [0, 2^64 - 1]                                    Thick start position
8    thickEnd     Int          [0, 2^64 - 1]                                    Thick end position
9    itemRgb      Int,Int,Int  ([0, 255], [0, 255], [0, 255]) | 0               Display color
10   blockCount   Int          [0, chromEnd - chromStart]                       Number of blocks
11   blockSizes   List[Int]    ([[:digit:]]+,){blockCount-1}[[:digit:]]+,?      Block sizes
12   blockStarts  List[Int]    ([[:digit:]]+,){blockCount-1}[[:digit:]]+,?      Block start positions

Would you have the spec end up in a schema or stored separately, as you do with the VCF Zarr spec? A related question: I haven't figured out where the schema is set, e.g. self.schema.samples_chunk_size in VcfZarrWriter.encode_samples?
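For illustration, the optional-field definitions could be declared along these lines (a sketch only; BedField here is hypothetical and not the actual draft class in the PR):

```python
from dataclasses import dataclass


@dataclass
class BedField:
    name: str      # column name from the BED spec
    bed_type: str  # type from table 2 of the BED spec
    dtype: str     # target numpy/Zarr dtype (illustrative choices)


# A few of the optional columns, in spec order
optional_bed_field_definitions = [
    BedField("name", "String", "O"),
    BedField("score", "Int", "int32"),
    BedField("strand", "String", "S1"),
    BedField("thickStart", "Int", "int64"),
    BedField("thickEnd", "Int", "int64"),
]
```

A converter could then slice this list by the detected number of columns to decide which arrays to emit.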

- guess BED file type
- add draft schema
- add tests