Hifiasm Output
Output files
In general, hifiasm generates the following assembly graphs in the GFA format:
`prefix`.r_utg.gfa
: haplotype-resolved raw unitig graph. This graph keeps all haplotype information.
`prefix`.p_utg.gfa
: haplotype-resolved processed unitig graph without small bubbles. Small bubbles might be caused by somatic mutations or noise in the data, and do not represent real haplotype information. Hifiasm automatically pops such small bubbles based on coverage. The option --hom-cov affects the result. See homozygous coverage setting for more details. In addition, the option -p forcibly pops bubbles.
`prefix`.p_ctg.gfa
: assembly graph of primary contigs. This graph includes a complete assembly with long stretches of phased blocks.
`prefix`.a_ctg.gfa
: assembly graph of alternate contigs. This graph consists of all contigs that are discarded in the primary contig graph.
`prefix`.*hap*.p_ctg.gfa
: phased contig graph. This graph keeps the phased contigs.
Hifiasm outputs *.r_utg.gfa and *.p_utg.gfa in all cases. In trio-binning mode, hifiasm also outputs the following assembly graphs:
`prefix`.dip.hap1.p_ctg.gfa
: fully phased paternal/haplotype1 contig graph keeping the phased paternal/haplotype1 assembly.
`prefix`.dip.hap2.p_ctg.gfa
: fully phased maternal/haplotype2 contig graph keeping the phased maternal/haplotype2 assembly.
With Hi-C partition options, hifiasm outputs:
`prefix`.hic.p_ctg.gfa
: assembly graph of primary contigs.
`prefix`.hic.hap1.p_ctg.gfa
: fully phased contig graph of haplotype1 where each contig is fully phased.
`prefix`.hic.hap2.p_ctg.gfa
: fully phased contig graph of haplotype2 where each contig is fully phased.
`prefix`.hic.a_ctg.gfa (optional with --primary)
: assembly graph of alternate contigs.
By default, when given only HiFi reads, hifiasm generates the following assembly graphs:
`prefix`.bp.p_ctg.gfa
: assembly graph of primary contigs.
`prefix`.bp.hap1.p_ctg.gfa
: partially phased contig graph of haplotype1.
`prefix`.bp.hap2.p_ctg.gfa
: partially phased contig graph of haplotype2.
If the option --primary or -l0 is specified, hifiasm outputs:
`prefix`.p_ctg.gfa
: assembly graph of primary contigs.
`prefix`.a_ctg.gfa
: assembly graph of alternate contigs.
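The naming scheme listed above can be summarized in a small lookup table. The following is a minimal sketch, not part of hifiasm: the function name `expected_graphs` and the mode labels (`"trio"`, `"hic"`, `"hifi"`, `"primary"`) are made up for this illustration and are not hifiasm options.

```python
# Hypothetical helper: maps a run mode to the contig graph suffixes listed above.
# The mode labels are invented for this sketch; they are not hifiasm options.
CONTIG_GRAPHS = {
    "trio": ["dip.hap1.p_ctg.gfa", "dip.hap2.p_ctg.gfa"],
    # hic.a_ctg.gfa is left out because it is only written with --primary.
    "hic": ["hic.p_ctg.gfa", "hic.hap1.p_ctg.gfa", "hic.hap2.p_ctg.gfa"],
    "hifi": ["bp.p_ctg.gfa", "bp.hap1.p_ctg.gfa", "bp.hap2.p_ctg.gfa"],
    "primary": ["p_ctg.gfa", "a_ctg.gfa"],
}

# Unitig graphs are written in every mode.
UNITIG_GRAPHS = ["r_utg.gfa", "p_utg.gfa"]


def expected_graphs(prefix, mode):
    """Return the graph file names expected for `prefix` in the given mode."""
    if mode not in CONTIG_GRAPHS:
        raise ValueError(f"unknown mode: {mode!r}")
    suffixes = UNITIG_GRAPHS + CONTIG_GRAPHS[mode]
    return [f"{prefix}.{suffix}" for suffix in suffixes]
```

A call such as `expected_graphs("sample", "hic")` returns the two unitig graphs followed by the three Hi-C contig graphs, which can be handy for checking that a run finished.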
For each graph, hifiasm also outputs a simplified version (*noseq*gfa) without sequences for ease of visualization. The coordinates of low-quality regions are written to *lowQ.bed in BED format.
The concepts of different types of assemblies can be found here.
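Since every graph is a GFA file, contig or unitig sequences can be extracted from its S (segment) lines. The sketch below assumes only the GFA 1.0 layout (name in column 2, sequence in column 3, `*` for an omitted sequence). The names `gfa_to_fasta` and `convert` are illustrative, not part of hifiasm.

```python
def gfa_to_fasta(gfa_lines):
    """Yield FASTA text lines for every S line that carries a sequence."""
    for line in gfa_lines:
        fields = line.rstrip("\n").split("\t")
        # GFA 1.0 segment line: S <name> <sequence> [optional tags...]
        if len(fields) < 3 or fields[0] != "S":
            continue
        name, seq = fields[1], fields[2]
        if seq == "*":
            # GFA 1.0 uses "*" for an omitted sequence, so there is nothing to write.
            continue
        yield f">{name}"
        yield seq


def convert(gfa_path, fasta_path):
    """Write the segments of one GFA file to a FASTA file."""
    with open(gfa_path) as gfa, open(fasta_path, "w") as fasta:
        for out_line in gfa_to_fasta(gfa):
            fasta.write(out_line + "\n")
```

Streaming line by line keeps memory use proportional to the longest contig instead of the whole graph.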
Output file formats
Hifiasm broadly follows the GFA 1.0 specification, with several fields that are specific to hifiasm. For S (segment) lines:
rd:i:
: read coverage. It is calculated from the reads that come from the same contig/unitig.
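As an illustration of reading the rd:i: field, the following sketch collects the coverage of every segment. It assumes only that optional tags follow the name and sequence columns in `TAG:TYPE:VALUE` form, as in GFA 1.0. The function name `segment_coverage` is hypothetical.

```python
def segment_coverage(gfa_lines):
    """Map each segment name to its rd:i: value (None if the tag is absent)."""
    coverage = {}
    for line in gfa_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3 or fields[0] != "S":
            continue
        rd = None
        # Optional tags start after the name (column 2) and sequence (column 3).
        for tag in fields[3:]:
            if tag.startswith("rd:i:"):
                rd = int(tag[len("rd:i:"):])
                break
        coverage[fields[1]] = rd
    return coverage
```

Because the sequence column is not used, this works the same way on the smaller *noseq* graphs.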
Hifiasm outputs A lines, which describe the reads used to construct each contig/unitig. Each A line is plain text and tab-separated, with columns in the following order:
Col | Type | Description |
---|---|---|
1 | string | Should always be A |
2 | string | Contig/unitig name |
3 | int | Start coordinate on the contig/unitig of the subregion constructed by the read |
4 | char | Read strand: “+” or “-” |
5 | string | Read name |
6 | int | Start coordinate on the read of the subregion used to construct the contig/unitig |
7 | int | End coordinate on the read of the subregion used to construct the contig/unitig |
8 | id:i:int | Read ID |
9 | HG:A:char | Haplotype status of read. |
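The column layout above maps directly onto a small record type. The following is a minimal parsing sketch based only on that table. The names `ALine` and `parse_a_line` are made up for this example, and columns 8 and 9 are treated as optional so that a line without them still parses.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ALine:
    target: str                       # column 2: contig/unitig name
    target_start: int                 # column 3: start on the contig/unitig
    strand: str                       # column 4: "+" or "-"
    read_name: str                    # column 5
    read_start: int                   # column 6: start on the read
    read_end: int                     # column 7: end on the read
    read_id: Optional[int] = None     # column 8: id:i:
    hap_status: Optional[str] = None  # column 9: HG:A:


def parse_a_line(line):
    """Parse one tab-separated A line into an ALine record."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 7 or fields[0] != "A":
        raise ValueError("not an A line")
    record = ALine(
        target=fields[1],
        target_start=int(fields[2]),
        strand=fields[3],
        read_name=fields[4],
        read_start=int(fields[5]),
        read_end=int(fields[6]),
    )
    # Tagged columns are matched by prefix rather than by position.
    for tag in fields[7:]:
        if tag.startswith("id:i:"):
            record.read_id = int(tag[len("id:i:"):])
        elif tag.startswith("HG:A:"):
            record.hap_status = tag[len("HG:A:"):]
    return record
```

Grouping the parsed records by `target` gives the list of reads behind each contig/unitig.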
Hifiasm log interpretation
Hifiasm prints several pieces of information for quick debugging, including:
k-mer plot: shows how many k-mers appear a certain number of times. For homozygous samples, there should be one peak around the read coverage. For heterozygous samples, there should be two peaks, where the smaller peak is around the heterozygous read coverage and the larger peak is around the homozygous read coverage. For example, issue10 indicates that the heterozygous read coverage and the homozygous read coverage are 28 and 57, respectively. Issue49 is another good example. A weird k-mer plot like the one in issue93 is often caused by insufficient coverage or the presence of contaminants.
homozygous coverage: coverage threshold for homozygous reads. Hifiasm prints it as:
[M::purge_dups] homozygous read coverage threshold: X
If X is not close to the actual homozygous coverage, the final assembly might be either too large or too small. To fix this issue, please set --hom-cov to the homozygous coverage.
number of het/hom bases: how many bases in the unitig graph are heterozygous and homozygous during Hi-C phased assembly. Hifiasm prints it as:
[M::stat] # heterozygous bases: X; # homozygous bases: Y
For a heterozygous sample, if there are far more homozygous bases than heterozygous bases, hifiasm has failed to identify the correct coverage threshold for homozygous reads. In this case, please set --hom-cov to the homozygous coverage.
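The two log messages above can be pulled out of a saved log with regular expressions. This sketch assumes the messages appear exactly as quoted above, with X and Y printed as plain non-negative integers. The function name `summarize_log` and the dictionary keys are hypothetical.

```python
import re

# Patterns follow the two log messages quoted above.
# Assumption: X and Y are printed as plain non-negative integers.
HOM_COV_RE = re.compile(
    r"\[M::purge_dups\] homozygous read coverage threshold: (\d+)"
)
HET_HOM_RE = re.compile(
    r"\[M::stat\] # heterozygous bases: (\d+); # homozygous bases: (\d+)"
)


def summarize_log(log_lines):
    """Collect the coverage threshold and het/hom base counts from log lines.

    If a message occurs more than once, the last occurrence wins.
    """
    summary = {"hom_cov": None, "het_bases": None, "hom_bases": None}
    for line in log_lines:
        match = HOM_COV_RE.search(line)
        if match:
            summary["hom_cov"] = int(match.group(1))
            continue
        match = HET_HOM_RE.search(line)
        if match:
            summary["het_bases"] = int(match.group(1))
            summary["hom_bases"] = int(match.group(2))
    return summary
```

Comparing `hom_cov` with the larger peak of the k-mer plot, or `hom_bases` with `het_bases` for a heterozygous sample, is a quick way to decide whether setting --hom-cov by hand is worth trying.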