De Bruijn Graph for Haplotype Differences / Phasing

Each diploid cell carries two copies (haplotypes) of each chromosome. Those copies are nearly identical but have some differences. The de Bruijn graph structure described so far treated the two copies as identical. Sample preparation techniques for shotgun sequencing do not separate the two copies of a chromosome, so a short-read library contains reads from either copy. That is not an issue where the copies are identical. However, in genomic regions with haplotype differences, the reads contain two different versions of the same chromosomal region, whereas assembly programs try to collapse those differences into a single chromosome sequence.

What kind of distortion do haplotype differences introduce into the de Bruijn graph constructed from the reads? To understand this, we can conceptually treat one copy of the chromosome as the true chromosome and the other as error. We already discussed the impact of sequencing errors on de Bruijn graphs, and haplotype differences modify the graph structure in the same manner. One can visualize the graph structure by starting from the genome and drawing the graph based on k-mers. The only difference from the earlier examples is that, to account for haplotype differences, one needs to draw two near-identical copies of the chromosome side by side and then merge them into a combined graph. In regions where the two copies are identical, the graph structures merge. Where the copies differ, two branches of the graph emerge.

As an example, let us say a region differs between the two copies of a chromosome by a single nucleotide (SNP). The chromosomal regions, their individual de Bruijn graphs and the combined de Bruijn graph are shown below. Circles in the graph represent k-mer nodes, arrows represent their connections, and dark shaded lines show the read distribution. For simplicity, we have shown unidirectional arrows, but de Bruijn graphs of genomes are bidirectional to account for the two strands. In the combined graph, the SNP region shows up as a ‘bubble’ pattern.
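The merging step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two haplotype strings and the value k = 4 are made up for the example, nodes are k-mers, edges are unidirectional, and reverse complements are ignored.

```python
from collections import defaultdict

def kmers(seq, k):
    """List the k-mers of seq in order of appearance."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def add_path(succ, pred, seq, k):
    """Add the k-mer path of one sequence to a shared graph."""
    nodes = kmers(seq, k)
    for u, v in zip(nodes, nodes[1:]):
        succ[u].add(v)
        pred[v].add(u)

# Hypothetical haplotypes that differ at one base (A vs T at index 7).
hap_a = "ACGTTGCATGGCTAAC"
hap_b = "ACGTTGCTTGGCTAAC"
k = 4  # assumed k-mer size, chosen small for readability

# Combined graph: both haplotypes drawn into the same node set.
succ, pred = defaultdict(set), defaultdict(set)
for hap in (hap_a, hap_b):
    add_path(succ, pred, hap, k)

nodes_a, nodes_b = set(kmers(hap_a, k)), set(kmers(hap_b, k))
# Haplotype-specific k-mers form the two branches of the bubble.
branch_a = [km for km in kmers(hap_a, k) if km not in nodes_b]
branch_b = [km for km in kmers(hap_b, k) if km not in nodes_a]
# The bubble opens at a node with two successors and closes at a
# node with two predecessors.
forks = sorted(u for u, vs in succ.items() if len(vs) > 1)
merges = sorted(v for v, us in pred.items() if len(us) > 1)

print("shared nodes:", len(nodes_a & nodes_b))
print("branch A:", branch_a)
print("branch B:", branch_b)
print("bubble opens at", forks, "and closes at", merges)
```

In this toy example each branch contains four nodes, because a single base is covered by k consecutive k-mers. In real sequence the count can be smaller when flanking k-mers happen to coincide.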

Interestingly, the graph structure for the SNP shown above is no different from that for a sequencing error (shown below). The difference is in the read distribution. With a sequencing error, the erroneous branch is traveled by only one or two reads. With a SNP between the two chromosomes, each haplotype contributes roughly half of the reads, so the reads are distributed nearly equally between the two branches of the bubble.
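The read-distribution argument can be turned into a rough test on the two branches of a bubble. The sketch below is an assumption-laden illustration, not how any particular assembler does it: the function names, the toy reads, and the 0.25 cut-off for the minor branch are all hypothetical choices made for this example.

```python
from collections import Counter

def count_kmers(reads, k):
    """Count how many times each k-mer occurs across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

def branch_coverage(counts, branch):
    """Mean k-mer count over the nodes of one bubble branch."""
    return sum(counts[km] for km in branch) / len(branch)

def classify_bubble(cov_a, cov_b, min_minor_fraction=0.25):
    """Label a bubble from the share of reads on its weaker branch.

    min_minor_fraction is an assumed threshold, not a standard value.
    """
    total = cov_a + cov_b
    if total == 0:
        return "no coverage"
    minor = min(cov_a, cov_b) / total
    if minor >= min_minor_fraction:
        return "likely haplotype difference"
    return "likely sequencing error"

# Hypothetical sequences and bubble branches for k = 4.
hap_a = "ACGTTGCATGGCTAAC"
hap_b = "ACGTTGCTTGGCTAAC"
branch_a = ["TGCA", "GCAT", "CATG", "ATGG"]
branch_b = ["TGCT", "GCTT", "CTTG", "TTGG"]
k = 4

# Toy read sets: whole-sequence "reads" keep the arithmetic simple.
het_reads = [hap_a] * 10 + [hap_b] * 9   # both versions well covered
err_reads = [hap_a] * 10 + [hap_b] * 1   # second version seen once

results = {}
for label, reads in (("het", het_reads), ("err", err_reads)):
    counts = count_kmers(reads, k)
    cov_a = branch_coverage(counts, branch_a)
    cov_b = branch_coverage(counts, branch_b)
    results[label] = classify_bubble(cov_a, cov_b)
    print(label, cov_a, cov_b, results[label])
```

A fixed fraction is the simplest possible rule. With real data the cut-off would have to take sequencing depth and error rate into account.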

How about insertions and deletions? As a second example, let us say a region in one of the two copies of a chromosome has a small deletion. The chromosomal regions, their individual de Bruijn graphs and the combined de Bruijn graph are shown below. In the combined graph, the deletion region shows up as a ‘bubble’ pattern, but the two branches of the bubble have unequal lengths. We borrowed an old figure on alternatively spliced genes, because the graph structures for indels and alternative splicing are similar. Readers may find the SNP and deletion regions similar to sequencing errors. However, one critical difference is in the weight of the branches. With sequencing errors, the erroneous branch is less well-traveled than the correct branch. With a polymorphism, both branches carry similar numbers of reads.
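To see the unequal branch lengths concretely, one can walk each branch of a bubble from the node where it opens to the node where it closes. The following is a minimal sketch under assumptions: the sequences are invented, k = 4 is arbitrary, the graph is unidirectional, and the walk only handles simple bubbles whose branches have no further forks.

```python
from collections import defaultdict

def build_graph(sequences, k):
    """Build successor and predecessor maps over k-mer nodes."""
    succ, pred = defaultdict(set), defaultdict(set)
    for seq in sequences:
        for i in range(len(seq) - k):
            u, v = seq[i:i + k], seq[i + 1:i + k + 1]
            succ[u].add(v)
            pred[v].add(u)
    return succ, pred

def bubble_branch_lengths(succ, pred):
    """Map each fork node to the sorted node counts of its branches."""
    bubbles = {}
    for fork in sorted(succ):
        if len(succ[fork]) < 2:
            continue
        lengths = []
        for node in sorted(succ[fork]):
            steps = 0
            # Follow the unbranched path until the merge node, which
            # is recognised by having more than one predecessor.
            while len(pred[node]) < 2 and len(succ[node]) == 1:
                steps += 1
                node = next(iter(succ[node]))
            lengths.append(steps)
        bubbles[fork] = sorted(lengths)
    return bubbles

k = 4  # assumed k-mer size
reference = "ACGTTGCATGGCTAAC"    # hypothetical haplotype
snp_copy = "ACGTTGCTTGGCTAAC"     # one base changed (A -> T)
deletion_copy = "ACGTTGGGCTAAC"   # the three bases "CAT" removed

snp_bubbles = bubble_branch_lengths(*build_graph([reference, snp_copy], k))
del_bubbles = bubble_branch_lengths(*build_graph([reference, deletion_copy], k))

print("SNP bubble:", snp_bubbles)
print("deletion bubble:", del_bubbles)
```

In this example the SNP bubble has two branches of four nodes each, while the deletion bubble has one branch of six nodes (the copy that keeps the three bases) and one of three nodes (the copy that lacks them).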