Draft bed2zarr code #281

Draft - wants to merge 5 commits into main
Conversation

percyfal

Draft code to convert Bed3-format to Zarr. Addresses sgkit-dev/sgkit#1219 where we briefly discuss the need for a tool to convert bed to Zarr. As suggested, I started a draft (at least that was my interpretation, or would you have preferred an issue to start with @jeromekelleher?).

Before going too far in development, I submit this draft to discuss some of the issues I have at the moment:

  1. currently the tool throws an error if contig lengths are undefined; this information is not available in BED files, but could be supplied by passing a fasta index as an argument. That should probably go in a different tool (e.g. fai2zarr) for filling in missing contig lengths? There are therefore only two small tests at the moment but I plan to add more once we agree on the implementation.
  2. I am still a bit confused by Zarr terminology. The tool CLI is currently bed2zarr [OPTIONS] BED_PATH ZARR_PATH BED_ARRAY where BED_ARRAY is the name of the Zarr dataset.
  3. By default BED_ARRAY is bed_mask, where the BED file is stored as a 0/1-array, along with bed_mask_contig, which contains the contig for each site, modeled after variant_contig. Both arrays have length equal to the total genome length.
  4. Should there be a specification for these array types, either in a schema or other format?

I guess one could add support for other BED formats later on. Here, I focus on the more specific task of generating 0/1-based sequence masks to indicate missing data / genome accessibility.
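For concreteness, the per-site representation described in point 3 could be sketched roughly as follows (the contig lengths and BED intervals here are made up, and the array names simply mirror the draft's bed_mask / bed_mask_contig):

```python
import numpy as np

# Hypothetical contig lengths; not available from the BED file itself (see point 1)
contig_lengths = {"chr1": 10, "chr2": 5}
bed_intervals = [("chr1", 2, 5), ("chr2", 0, 3)]  # BED3: chrom, start, end

total = sum(contig_lengths.values())
bed_mask = np.zeros(total, dtype="i1")         # 0/1 per nucleotide position
bed_mask_contig = np.zeros(total, dtype="i4")  # contig index per position

# Lay contigs out end to end and record each contig's offset
offset, offsets = 0, {}
for i, (name, length) in enumerate(contig_lengths.items()):
    offsets[name] = offset
    bed_mask_contig[offset : offset + length] = i
    offset += length

# Mark BED intervals (half-open, 0-based) in the genome-length mask
for chrom, start, end in bed_intervals:
    bed_mask[offsets[chrom] + start : offsets[chrom] + end] = 1
```

Both arrays span the full concatenated genome, which is what makes this representation heavy for fragmented assemblies.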

@jeromekelleher
Contributor

Thanks for this @percyfal, and welcome to sgkit-dev 👋!

Some high-level thoughts:

  1. Can we use pandas or numpy to parse the BED file? I don't want to get into the business of text parsing, it's tedious and hard to do well/fast.
  2. It looks like you are trying to integrate the BED data into an existing VCF Zarr store? I would think that first making a 1-1 mapping of the BED data into its equivalent Zarr would be a simpler approach, and then make the problem of integrating the different types of data together the problem of a different tool (keeping bio2zarr focused on this one-to-one literal translation of bio formats into Zarr).
  3. As a general guiding principle, we want to do what htslib does. So, we interpret a given BED file in the same way that it does (rather than what we think the Right Way should be).

Does that clarify at all?
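The pandas route suggested in point 1 might look like this (a sketch only; the column names follow the BED spec, and the two-row BED content is invented for illustration):

```python
import io

import pandas as pd

# Stand-in for an on-disk BED3 file
bed_text = "chr1\t5\t10\nchr1\t15\t20\n"

df = pd.read_csv(
    io.StringIO(bed_text),
    sep="\t",
    header=None,
    names=["chrom", "chromStart", "chromEnd"],
    dtype={"chrom": str, "chromStart": "int64", "chromEnd": "int64"},
)
```

pandas handles the tokenising and type conversion, so no hand-written text parsing is needed.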

@jeromekelleher
Contributor

As a follow-up, we probably would want to write a spec somewhere that lays out exactly the array names etc, but I don't think there's any need initially. Let's get something working first and spec it out later.

@percyfal
Author

Thanks for the feedback. Let me address your points one by one.

  1. Sure - for simplicity I'll use pandas from now on and see later on whether numpy is necessary

  2. Agreed, this would be simpler: generate a separate Zarr store for bed and then have a separate tool to combine with, say, VcfZarr store? Then the spec you mentioned could be tied to a store defined by a BedZarr (say) class, modeled on VcfZarr? This approach would require an additional required option, a fasta index or file with genome lengths (genome_info), since we want to make sure the arrays map to every nucleotide position in the reference. I wonder if chromosomes should be stored as different datasets? In any case, this would make it easy to add any number of Bed datasets, from masks to genomic features (e.g., exons, CDS).

  3. this I would need to read up on

Once this tool works, my long-term wish is to have functionality in sgkit that corrects for summary statistics for accessible sites (be it in windows or not) but that also summarizes statistics across features from annotations. I would be happy to draft that too if that is a good fit, but I think in this case opening a discussion on sgkit prior to development would be beneficial?

@percyfal
Author

To clarify: with chromosomes as different datasets I mean having a group for a bed feature (say mask) where a dataset is created within the group, instead of as in the current draft where I make one long array equal to the total genome length. Some organisms I work on have very fragmented genomes (+100k scaffolds/contigs).

@jeromekelleher
Contributor

jeromekelleher commented Sep 18, 2024

Agreed, this would be simpler: generate a separate Zarr store for bed and then have a separate tool to combine with, say, VcfZarr store? Then the spec you mentioned could be tied to a store defined by a BedZarr (say) class, modeled on VcfZarr? This approach would require an additional required option, a fasta index or file with genome lengths (genome_info), since we want to make sure the arrays map to every nucleotide position in the reference. I wonder if chromosomes should be stored as different datasets? In any case, this would make it easy to add any number of Bed datasets, from masks to genomic features (e.g., exons, CDS)

I'm hoping that BedZarr can be as lightweight as possible and purely just take the input text file and convert it to a Zarr with the corresponding arrays. I think we can then place the burden of checking the validity of these intervals on the tool that annotates a VCF Zarr with the things we're interested in.

So, suppose we have a BED that specifies an accessibility mask, accessibility.bed. The workflow might look like:

# convert on the command line
$ bed2zarr accessibility.bed accessibility.zarr

then in sgkit we have something like

access_zarr = zarr.open("accessibility.zarr")
# NB: These are names from the top of my head, not an actual proposal!
ds = sgkit.add_accessibility_mask(ds, start=access_zarr["start"], end=access_zarr["end"])

That is, sgkit just takes the start and end coordinates of the intervals as arrays, and is decoupled from the BedZarr format. Sgkit then adds a new boolean variant_accessibility_mask and marks each variant as true or false, depending on whether it falls in one of those intervals. (Let's brush the details of zero-based vs one-based and closed-vs-open intervals aside for now!). That way, there's no need to worry too much about whether the intervals cover the entire genome or not, they just have to cover the variants that are in the dataset.

Does that help clarify?
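One way the interval lookup behind such an add_accessibility_mask could work (purely illustrative, not sgkit's actual implementation; assumes sorted, non-overlapping, half-open intervals on a single contig):

```python
import numpy as np

starts = np.array([5, 15])   # interval starts
ends = np.array([10, 20])    # interval ends (half-open)
variant_position = np.array([3, 7, 15, 25])

# Index of the last interval whose start is at or before each position
idx = np.searchsorted(starts, variant_position, side="right") - 1

# A variant is inside if such an interval exists and the position is before its end
inside = (idx >= 0) & (variant_position < ends[np.clip(idx, 0, None)])
# inside -> [False, True, True, False]
```

The searchsorted trick keeps the lookup vectorised, avoiding a Python-level loop over variants.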

@percyfal
Author

Ok, I see what you mean; basically convert the BED file columns to separate arrays:

chr1 5 10
chr1 15 20

would translate to

chrom: ['chr1', 'chr1']
start: [5, 15]
end: [10, 20]

in Zarr.
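A literal reading of that mapping in numpy (array creation only; the actual write to a Zarr store is omitted here):

```python
import numpy as np

# The two BED3 records from the example above
records = [("chr1", 5, 10), ("chr1", 15, 20)]

# One array per BED column, each with length equal to the number of records
chrom = np.array([r[0] for r in records])             # -> ['chr1', 'chr1']
start = np.array([r[1] for r in records], "int64")    # -> [5, 15]
end = np.array([r[2] for r in records], "int64")      # -> [10, 20]
```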

However, I don't think it is sufficient to just mark variants as accessible or not, as we also need to keep track of the accessibility of non-variant sites. If you want a genome-wide summary of pi, you need to know the accessibility up until the end of the chromosome.

VARIANT         *      *       *      * *   
POS            0123456789012345678901234567890
MASK           0000011111000001111100000000000

If *=variant site, 0=accessible, 1=masked, and chopping up in 10bp-windows, the first window has 1 variant site but the actual window size is 5bp, not 10.
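The window correction implied by the example can be sketched as follows (the mask string is copied from the diagram above; 1 = masked, 0 = accessible):

```python
import numpy as np

# Per-site mask from the example: True where masked
mask = np.frombuffer(b"0000011111000001111100000000000", dtype="S1") == b"1"
window = 10

# Number of accessible sites per 10bp window (the last window may be short)
accessible = [
    int((~mask[i : i + window]).sum()) for i in range(0, mask.size, window)
]
# accessible -> [5, 5, 10, 1]
```

So a naive per-window pi would divide by 10 everywhere, whereas the corrected denominator for the first window is 5.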

I'll focus on the BED conversion for now, provided we agree on the output format, and deal with the other stuff later.

@jeromekelleher
Contributor

However, I don't think it is sufficient to just mark variants as accessible or not as we also need to keep track of the accessibility of non-variant sites.

I see - well that's a different problem. Let's just convert the file to Zarr first and worry about how to use it later!

- use pandas for reading
- write to isolated zarr archive
- map BED columns to arrays named after BED specification (hts-specs)
@percyfal
Author

Ok, I updated the code to translate the three mandatory fields chrom, chromStart, and chromEnd (following naming conventions in https://samtools.github.io/hts-specs/BEDv1.pdf) to Zarr arrays. I created a class BedZarrWriter, modeled on VcfZarrWriter, and added placeholder classes and functions for field definitions and metadata (BedMetadata, BedField, mandatory_bed_field_definitions) to mimic the setup in bio2zarr.vcf2zarr.vcz. Do you want it simpler still?

@jeromekelleher
Contributor

This seems good for a starting point. I guess we should align with VCF Zarr in terms of names, so contig, start and end seems sensible? There's an annoying difference here where we use integer contig IDs in VCZ whereas we're using the string names here. We don't want to use the same name with different types, I guess. How should we handle this do you think?

Can you sketch out (maybe as comments in the file) how you envisage handling BED files with more columns?

@percyfal
Author

percyfal commented Sep 19, 2024

This seems good for a starting point. I guess we should align with VCF Zarr in terms of names, so contig, start and end seems sensible? There's an annoying difference here where we use integer contig IDs in VCZ whereas we're using the string names here. We don't want to use the same name with different types, I guess. How should we handle this do you think?

In that case I guess we could adopt the VCZ approach and generate an ID mapping to the chrom field (chrom <-> contig_id), using the IDs in a contig field with length equal to the number of entries, which replaces chrom for this purpose. It disrupts the 1-to-1 mapping for the chrom field though.
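A minimal sketch of that mapping (names loosely following VCZ's contig_id; illustrative only, and note that np.unique sorts the names, so preserving file order would need something like pandas.factorize instead):

```python
import numpy as np

# chrom column as read from the BED file, one entry per record
chrom = np.array(["chr1", "chr1", "chr2"])

# contig_id: unique contig names; contig: integer ID per record
contig_id, contig = np.unique(chrom, return_inverse=True)
# contig_id -> ['chr1', 'chr2'];  contig -> [0, 0, 1]
```

The contig array has the same length as the BED records, while contig_id carries the string names once, mirroring how VCZ separates the two.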

Can you sketch out (maybe as comments in the file) how you envisage handling BED files with more columns?

Sure. The remaining columns are well-defined (table 2 in the spec) and would make drafting a spec pretty straightforward:

Col  Name         Type         Valid values                                     Description
4    name         String       [\x20-\x7e]{1,255}                               Feature description
5    score        Int          [0, 1000]                                        A numerical value
6    strand       String       [-+.]                                            Feature strand
7    thickStart   Int          [0, 2^64 - 1]                                    Thick start position
8    thickEnd     Int          [0, 2^64 - 1]                                    Thick end position
9    itemRgb      Int,Int,Int  ([0, 255], [0, 255], [0, 255]) | 0               Display color
10   blockCount   Int          [0, chromEnd - chromStart]                       Number of blocks
11   blockSizes   List[Int]    ([[:digit:]]+,){blockCount-1}[[:digit:]]+,?      Block sizes
12   blockStarts  List[Int]    ([[:digit:]]+,){blockCount-1}[[:digit:]]+,?      Block start positions

Would you have the spec end up in a schema or stored separately, as you do with the VCF Zarr spec? A related question: I haven't figured out where the schema is set, e.g. self.schema.samples_chunk_size in VcfZarrWriter.encode_samples?
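For illustration, the optional-field definitions could be declared along these lines (a sketch only; BedField here is hypothetical and not the actual draft class in the PR):

```python
from dataclasses import dataclass


@dataclass
class BedField:
    name: str      # column name from the BED spec
    bed_type: str  # type from table 2 of the BED spec
    dtype: str     # target numpy/Zarr dtype (illustrative choices)


# A few of the optional columns, in spec order
optional_bed_field_definitions = [
    BedField("name", "String", "O"),
    BedField("score", "Int", "int32"),
    BedField("strand", "String", "S1"),
    BedField("thickStart", "Int", "int64"),
    BedField("thickEnd", "Int", "int64"),
]
```

A converter could then slice this list by the detected number of columns to decide which arrays to emit.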

- guess BED file type
- add draft schema
- add tests