Tutorials

Paper - Detection of ultra-rare mutations by next-generation sequencing

Michael W. Schmitt,
Scott R. Kennedy,
Jesse J. Salk,
Edward J. Fox,
Joseph B. Hiatt, and
Lawrence A. Loeb

http://www.pnas.org/content/early/2012/07/31/1208715109

Abstract

Data processing section -

Data Processing. Reads with intact Duplex Tags will consist of a 12-nucleotide random sequence, followed by a 5-nucleotide fixed sequence immediately upstream of captured DNA sequence. These reads were identified by filtering out reads that lack the expected fixed sequence at positions 13–17. The 12-nucleotide tag sequences from both the forward and reverse sequencing reads were computationally added to the read header to result in a combined 24-nt tag for each read, and the 5-nucleotide fixed sequence was removed. The first 4 nucleotides following the fixed adapter sequence were also removed to eliminate errors introduced during fragment end repair and ligation. Reads were then aligned to the reference genome with the Burrows-Wheeler aligner (BWA) and nonmapping reads were discarded. The entire human genome sequence (hg19) was used as reference for the mitochondrial DNA experiment, and reads that mapped to chromosomal DNA were removed.

Reads sharing identical tag sequences were then grouped together and collapsed to consensus reads. Sequencing positions were discounted if the consensus group covering that position consisted of fewer than three members or if fewer than 90% of the sequences at that position in the consensus group had the identical sequence. A minimum group size of three was selected because next-generation sequencing systems have an average base calling error rate of ∼1/100. Requiring the same base to be identified in three distinct reads decreases the frequency of single-strand consensus sequence (SSCS) errors arising from base-call errors to (1/100)^3 = 1 × 10^-6, which is below the frequency of spontaneous PCR errors that fundamentally limit the sensitivity of SSCSs. The requirement for 90% of sequences to agree to score a position is a highly conservative cutoff. For example, with a group size of eight, a single disagreeing read will lead to 87.5% agreement and the position will not be scored. If all groups in an experiment are of size nine or less, this cutoff will thus require perfect agreement at any given position to score the position. We anticipate that further development of our protocol may allow for less stringent parameters to be used to maximize the number of SSCS and duplex consensus sequence (DCS) reads that can be obtained from a given experiment.

Consensus reads were realigned with the BWA. The consensus sequences were then paired with their strand mate by grouping each 24-nucleotide tag of form αβ in read 1 with its corresponding tag of form βα in read 2. Resultant sequence positions were considered only when information from both DNA strands was in perfect agreement. An overview of the data processing workflow is provided below.
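The consensus rule described above (at least three family members, at least 90% agreement at a position) can be sketched in a few lines of Python. This is a minimal illustration, not the authors' pipeline: the function name sscs_consensus and its parameters are hypothetical, the reads in a family are assumed to be already aligned and of equal length, and marking an unscored position with "N" is an assumption made here for display.

```python
from collections import Counter


def sscs_consensus(family, min_family_size=3, min_agreement_percent=90):
    """Collapse one tag family (equal-length, aligned reads) to a consensus.

    Returns None when the family has fewer than min_family_size members.
    Positions where the most common base falls below the agreement cutoff
    are written as "N" (an illustrative choice, not taken from the paper).
    """
    if len(family) < min_family_size:
        return None
    if len({len(read) for read in family}) != 1:
        raise ValueError("all reads in a family must have the same length")

    size = len(family)
    consensus = []
    for column in zip(*family):
        base, count = Counter(column).most_common(1)[0]
        # Integer arithmetic avoids floating-point edge cases at exactly 90%.
        if count * 100 >= min_agreement_percent * size:
            consensus.append(base)
        else:
            consensus.append("N")
    return "".join(consensus)
```

With a family of eight reads and one disagreeing base, agreement at that position is 7/8 = 87.5%, so the sketch leaves the position unscored, as in the example from the text. With ten reads and one disagreement, agreement is exactly 90% and the position is scored. The overview below states the cutoff as ">90%"; the sketch follows the "fewer than 90% is discounted" wording instead.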

Overview of Duplex Sequencing Data Processing.

i) Discard reads that do not have the 5 nucleotide fixed sequence CAGTA present after exactly 12 random nucleotides, which comprise the Duplex Tag sequence.
ii) Combine the 12 nucleotide tags from read 1 and read 2 and transfer the combined 24-nucleotide tag sequence into the read header.
iii) Discard tags with inadequate complexity (i.e., those with >10 consecutive identical nucleotides).
iv) Remove the 5-nucleotide fixed sequence.
v) Trim an additional 4 nucleotides from the 5′ ends of each read pair (sites of error prone ligation and end repair).
vi) Align reads to the reference genome and discard nonmapping reads.
vii) Group together reads that have identical 24-nt tags, representing PCR duplicates of an individual single-stranded DNA fragment.
viii) Collapse tag families to SSCS reads, scoring only positions represented by three or more PCR duplicates and having >90% sequence identity among the duplicates.
ix) Realign reads to the reference genome.
x) For each read in read 1 file having tag sequence of format αβ, group with corresponding DCS partner in read 2 file with tag sequence of format βα.
xi) Only score positions with identical sequence among both DCS partners.
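Steps i through v operate on the raw read sequences before alignment, and can be sketched as below. The helper names (split_read, is_low_complexity, process_pair) are hypothetical, the complexity check is applied to the combined 24-nt tag (the text does not say whether it is applied per read or to the combined tag), and base qualities, which a real FASTQ pipeline would trim in parallel, are left out.

```python
import re

FIXED_SEQ = "CAGTA"  # fixed sequence expected at positions 13-17 (step i)
TAG_LEN = 12         # random Duplex Tag length contributed by each read
EXTRA_TRIM = 4       # extra 5' bases removed after the fixed sequence (step v)
MAX_RUN = 10         # tags with a longer run of one base are discarded (step iii)


def split_read(seq):
    """Return (tag, trimmed_sequence), or None if the fixed sequence is absent."""
    fixed_end = TAG_LEN + len(FIXED_SEQ)
    if seq[TAG_LEN:fixed_end] != FIXED_SEQ:
        return None
    return seq[:TAG_LEN], seq[fixed_end + EXTRA_TRIM:]


def is_low_complexity(tag, max_run=MAX_RUN):
    """True if the tag has more than max_run consecutive identical nucleotides."""
    pattern = r"(.)\1{%d,}" % max_run
    return re.search(pattern, tag) is not None


def process_pair(read1, read2):
    """Apply steps i-v to one read pair.

    Returns (combined_tag, trimmed_read1, trimmed_read2), or None if the
    pair is discarded.
    """
    parts1, parts2 = split_read(read1), split_read(read2)
    if parts1 is None or parts2 is None:
        return None
    tag = parts1[0] + parts2[0]  # 24-nt tag that would go into the read header
    if is_low_complexity(tag):
        return None
    return tag, parts1[1], parts2[1]
```

A pair whose reads both carry CAGTA at positions 13-17 comes back as a 24-nt tag plus two sequences that each have 12 + 5 + 4 = 21 bases removed from the 5′ end; any other pair returns None.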

Example: Duplex Sequencing Tag Pairs. Consider the 4-nucleotide tags below, with flow cell sequences 1 and 2 in the locations marked and dashes representing a ligated DNA fragment. The Duplex Sequencing adapters actually contain 12-nucleotide Duplex Tags. Shorter tags are used here for clarity:

    5′ 1-TAAC————TCCG-2 3′
    3′ 2-ATTG————AGGC-1 5′

The same molecules are shown again here, but with the lower strand now written in the 5′ → 3′ direction:

    5′ 1-TAAC————TCCG-2 3′
    5′ 1-CGGA————GTTA-2 3′

These molecules are then PCR amplified and sequenced. They will yield the following reads.

The “top” strand, 5′ 1-TAAC————TCCG-2 3′, will give:

    read 1 file: TAAC——
    read 2 file: CGGA——

Combining the read 1 and read 2 tags will produce the tag sequence: TAACCGGA.

The “bottom” strand, 5′ 1-CGGA————GTTA-2 3′, will give:

    read 1 file: 1-CGGA——
    read 2 file: 2-TAAC——

Combining the read 1 and read 2 tags will produce the tag sequence: CGGATAAC.

Note that the combined tags are of form αβ (read 1) and βα (read 2). The key concept is that read 2 is read by the sequencer as the complement of the strand containing read 1.
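The αβ / βα relationship means the strand mate of a tag family can be found by swapping the two halves of the combined tag. A small sketch using the 4-nucleotide tags from the example follows; the function names and the bases after each tag are made up for illustration.

```python
def combined_tag(read1, read2, tag_len=4):
    """Concatenate the leading tag of read 1 with the leading tag of read 2."""
    return read1[:tag_len] + read2[:tag_len]


def partner_tag(tag):
    """Swap the two halves of a combined tag: alpha-beta becomes beta-alpha."""
    half = len(tag) // 2
    return tag[half:] + tag[:half]


# Reads from the "top" strand family (insert bases are arbitrary placeholders).
top_tag = combined_tag("TAACGATC", "CGGATGTC")
# Reads from the "bottom" strand family.
bottom_tag = combined_tag("CGGATGTC", "TAACGATC")

print(top_tag, bottom_tag, partner_tag(top_tag) == bottom_tag)
```

With the real 12-nucleotide tags, tag_len would be 12 and the combined tag 24 nucleotides long; the half-swap works the same way.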

Example: Orientation of Paired Strand Mutations in Duplex Sequencing. In the initial DNA duplex shown above, now consider a mutation “x” paired to complementary nucleotide “y” that is on the “left” side of the DNA duplex:

    1-TAAC—x———————TCCG-2
    2-ATTG—y———————AGGC-1

x will appear in read 1, and the complementary mutation on the opposite strand, y, will be seen in read 2. However, the mutation will appear specifically as x in both the read 1 and read 2 data, because y in read 2 is read out as x by the sequencer owing to the asymmetric nature of the sequencing primers, which generate the complementary sequence of the “lower” strand during read 2 as opposed to the direct sequence of the “top” strand during read 1.
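The orientation argument can be checked with a toy model. The sketch below assumes a simplified sequencer in which read 1 is the 5′ start of a strand and read 2 is the 5′ start of its reverse complement. The insert sequence and the position of the mutation are invented for illustration; here x is a C on the top strand and y is the paired G on the bottom strand.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.translate(_COMPLEMENT)[::-1]


def sequence_molecule(strand, read_len):
    """Toy paired-end model: read 1 from the strand, read 2 from its complement."""
    return strand[:read_len], reverse_complement(strand)[:read_len]


# Top strand: tag TAAC, a made-up insert with mutation x = "C" at index 7, tag TCCG.
top = "TAAC" + "GATCTTGACA" + "TCCG"
bottom = reverse_complement(top)  # starts with CGGA and ends with GTTA

top_r1, top_r2 = sequence_molecule(top, 8)
bot_r1, bot_r2 = sequence_molecule(bottom, 8)

# The bottom strand carries y = "G", but its read 2 reports the base as "C".
print(top_r1, bot_r2, top_r1 == bot_r2)
```

In this model read 1 of the top-strand family and read 2 of the bottom-strand family are the same sequence, which is why the two can be compared base by base without complementing either one.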

If the identity of a base fails to match between the two reads, the position is considered undefined and is replaced by an “N” in the final sequence. For instance, with tag sequences denoted α and β, the sequence αβ-AACTGT in read 1 and βα-AAGTGT in read 2 would result in a final sequence of AANTGT.
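The final comparison between strand mates can be sketched as a position-by-position check. The function name duplex_consensus is hypothetical, and both inputs are assumed to be equal-length consensus sequences with tags already removed.

```python
def duplex_consensus(read1_seq, read2_seq):
    """Keep bases on which both strand consensus sequences agree; else write N."""
    if len(read1_seq) != len(read2_seq):
        raise ValueError("strand mates must have the same length")
    return "".join(
        a if a == b else "N" for a, b in zip(read1_seq, read2_seq)
    )


print(duplex_consensus("AACTGT", "AAGTGT"))  # AANTGT
```

The printed result reproduces the example above: the third position differs between the two strands (C versus G), so it becomes N.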