HiC_Genome_Assembler

Code repository for a HiC based genome assembly algorithm I wrote during my time in the Baumann lab. Note, that documentation and code may change slightly between now and publication as bugs are encountered and fixed. https://www.baumannlab.org/links.html

Algorithm overview

This is an algorithm designed to assemble a near chromosome scale genome from HiC like data, and an already established primary genome assembly. The algorithm attempts to first group scaffolds to their respective chromosomes and then attempts to order and orient the grouped scaffolds with respect to one another
Our algorithm is dependent on the output of HiCPro, a well established bioinformatics software package that aligns HiC data to a primary genome assembly, removes artifacts, and normalizes the resulting data via a matrix balancing method known as ICED normalization. This particular approach to HiC data processing breaks the up the primary genome assembly into non-overlapping equal sized pieces. The size of these pieces is commonly referred to as the “resolution size”. Read pairs are then counted between any two pieces of the genome, and the resulting square matrix is then normalized to generate the input contact map for our assembly algorithm, which is broken up into two primary phases.
Briefly, the first phase of the algorithm attempts to cluster the genomic loci, defined by the contact map, into discrete chromosome groups by contact values alone. In other words, scaffold information is not considered at this point of the algorithm (apart from the initial alignment step), but only contact values between pairs of loci. Moreover, this phase does not require a pre-defined number of chromosomes to cluster within the data. Instead, the number of chromosomes is found dynamically by the algorithms employed. By not incorporating scaffold information, and a pre-defined chromosome count, our process allows for an unbiased initial approach in clustering genomic loci into chromosome sized groups. After this process, scaffold information can then be leveraged to assess the accuracy of the algorithm as well as identify potential errors in the primary genome assembly itself.
The second phase of the algorithm then attempts to order and orient scaffolds within the found chromosome groupings. Amongst all the different ordering and orientation possibilities, we define the best permutation as the maximum value of a function that weights contact values that are close in linear space more heavily than values that are further away. We employ several different algorithms in a sequential fasion in an attempt to find the optimal ordering.

General usage and install instructions

Code is entirely python based and can be run by downloading the HIC_ASSEMBLER directory and installing the relevant packages provided in the packageInstallCommands.txt file. Additionally, it is recommended to create a new enviroment before running any installs to avoid complications. Provided is a working example config file (hicAssembler_config_workingExample.txt), which is intended to provide the user with a template, but should be altered to fit ones needs. Once the config file has been updated with the relevant settings / parameters, one can run the following command(s) to run the program

Imporant note - Scaffolds that are smaller than the resulotion size can contribute a lot of noise to the contact map. Therefore, we found for our genomes (~1.6Gb in size) that using a resolution size of 100-500Kb and removing scaffolds < 10Kb in size prior to ICE normaliztion gives the best results. In essence, removal of smaller scaffolds prior to the ICE normalization step can remove a significant amount of noise from the contact map, which should improve assembly results.

To run the entire pipeline in one fell swoop, then run with the following command

python HIC_ASSEMBLER/run_hicAssembler.py -part1 -part2 -part3 -part4 -c file/path/to/hicAssembler_config.txt

Note that -part3 is optional and depends on a validpair file produced by HiCpro as well as data produced by a rescriction enzyme (if your data is of newer hic like data types then this might not be applicable). In this case, set the finalOrderingsFile option in the config file to be the same as the chromosomeOrderFile

python HIC_ASSEMBLER/run_hicAssembler.py -part1 -part2 -part4 -c file/path/to/hicAssembler_config.txt

Note that any part can be run independently, as long as the output from the previous parts has been generated. For example, to just run part2 then run the following command

python HIC_ASSEMBLER/run_hicAssembler.py -part2 -c file/path/to/hicAssembler_config.txt

Part1

Clusters contact map rows together via average clustering, and identifies matrix cut indices via hyper geometric cutting algorithm or HMM. Smaller groups are optionally resolved by modularity maximazation. Initial grouping assessment is also performed

Part2

Orders and orients scaffolds with respect to on another on any given chromsome

Part3

Attempts to correct potentially ambigously oriented scaffolds because of a resolution size that is too large. This step is optional and requires a restriction enzyme cut site file (that is also used by HiC-Pro) along with the valid pairs file produced by HiC-Pro that details read pair mappings

Part4

Writes out the final assembled genome with new gaps introduced as a sequence of 100 "N" characters. Scaffolds that were unable to be grouped, are still written to this new version with their original sequence name intact.

Name		Name	Last commit message	Last commit date
Latest commit History 38 Commits
HIC_ASSEMBLER		HIC_ASSEMBLER
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

HiC_Genome_Assembler

Algorithm overview

General usage and install instructions

Part1

Part2

Part3

Part4

About

Releases

Packages

Languages

AO33/HiC_Genome_Assembler

Folders and files

Latest commit

History

Repository files navigation

HiC_Genome_Assembler

Algorithm overview

General usage and install instructions

Part1

Part2

Part3

Part4

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages