Various Developments in Bioinformatics (12/5/2012)

10. The effect of strand bias in Illumina short-read sequencing data h/t: @genetics_blog

We collected 22 breast cancer samples from 22 patients and sequenced their exome using the Illumina GAIIx machine. By comparing the consistency between the genotypes inferred from this sequencing data with the genotypes inferred from SNP chip data, we found that, when using sequencing data, SNPs with extreme strand bias did not have significantly lower consistency rates compared to SNPs with low or no strand bias. However, this result may be limited by the small subset of SNPs present in both the exome sequencing and the SNP chip data. We further compared the transition and transversion ratio and the number of novel non-synonymous SNPs between the SNPs with low or no strand bias and those with extreme strand bias, and found that SNPs with low or no strand bias have better overall quality. We also discovered that the strand bias occurs randomly at genomic positions across these samples, and observed no consistent pattern of strand bias location across samples. By comparing results from two different aligners, BWA and Bowtie, we found very consistent strand bias patterns. Thus strand bias is unlikely to be caused by alignment artifacts. We successfully replicated our results using two additional independent datasets with different capturing methods and Illumina sequencers.
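To make the two quantities in this abstract concrete, here is a minimal Python sketch. It is not the authors' code. The strand bias formula is one simple illustrative choice (the gap between the alternate-allele fraction on each strand, scaled by the overall alternate-allele fraction), and the function names and the input layout of four read counts per site are assumptions made for this example.

```python
PURINES = {"A", "G"}


def strand_bias_score(ref_fwd, ref_rev, alt_fwd, alt_rev):
    """Illustrative strand bias score from per-strand read counts at one site.

    Returns None when a strand has no reads or no alternate allele was seen,
    because the score is undefined in those cases.
    """
    fwd_total = ref_fwd + alt_fwd
    rev_total = ref_rev + alt_rev
    alt_total = alt_fwd + alt_rev
    if fwd_total == 0 or rev_total == 0 or alt_total == 0:
        return None
    fwd_alt_frac = alt_fwd / fwd_total
    rev_alt_frac = alt_rev / rev_total
    overall_alt_frac = alt_total / (fwd_total + rev_total)
    return abs(fwd_alt_frac - rev_alt_frac) / overall_alt_frac


def is_transition(ref, alt):
    """True for purine<->purine or pyrimidine<->pyrimidine substitutions."""
    ref, alt = ref.upper(), alt.upper()
    return ref != alt and (ref in PURINES) == (alt in PURINES)


def ti_tv_ratio(snps):
    """Transition/transversion ratio for an iterable of (ref, alt) base pairs."""
    ti = tv = 0
    for ref, alt in snps:
        if ref.upper() == alt.upper():
            continue  # not a substitution, skip it
        if is_transition(ref, alt):
            ti += 1
        else:
            tv += 1
    return ti / tv if tv else None
```

A site whose alternate allele is supported equally on both strands scores 0, and the score grows as the support shifts to a single strand. Comparing `ti_tv_ratio` between the low-bias and high-bias SNP sets is the kind of quality check the abstract describes.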

9. New PacBio paper from Baylor: Mind the Gap: Upgrading Genomes with Pacific Biosciences RS Long-Read Sequencing Technology

We present here an automated approach to finishing using long-reads from the Pacific Biosciences RS (PacBio) platform. Our algorithm and associated software tool, PBJelly, (publicly available at https://sourceforge.net/projects/pb-jelly/) automates the finishing process using long sequence reads in a reference-guided assembly process. PBJelly also provides lift-over co-ordinate tables to easily port existing annotations to the upgraded assembly. Using PBJelly and long PacBio reads, we upgraded the draft genome sequences of a simulated Drosophila melanogaster, the version 2 draft Drosophila pseudoobscura, an assembly of the Assemblathon 2.0 budgerigar dataset, and a preliminary assembly of the Sooty mangabey. With 24× mapped coverage of PacBio long-reads, we addressed 99% of gaps and were able to close 69% and improve 12% of all gaps in D. pseudoobscura. With 4× mapped coverage of PacBio long-reads we saw reads address 63% of gaps in our budgerigar assembly, of which 32% were closed and 63% improved. With 6.8× mapped coverage of mangabey PacBio long-reads we addressed 97% of gaps and closed 66% of addressed gaps and improved 19%. The accuracy of gap closure was validated by comparison to Sanger sequencing on gaps from the original D. pseudoobscura draft assembly and shown to be dependent on initial reference quality.
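For readers unfamiliar with what a "gap" is here: in a draft assembly, gaps are conventionally stored as runs of N characters inside scaffolds. The short Python sketch below locates such runs. It is only an illustration of the starting point for gap filling, not PBJelly's code, and the function name and `min_len` parameter are hypothetical.

```python
import re


def find_gaps(sequence, min_len=1):
    """Return (start, end) half-open coordinates of runs of N in a scaffold.

    Coordinates are 0-based; runs shorter than min_len are ignored.
    """
    pattern = re.compile(r"[Nn]{%d,}" % min_len)
    return [(m.start(), m.end()) for m in pattern.finditer(sequence)]


# Example: two gaps, of length 4 and 2, in a toy scaffold.
gaps = find_gaps("ACGTNNNNACGTNNAC")
```

A gap counts as "addressed" in the abstract's terms when long reads map to its flanks; it is closed when they span the whole run, or improved when they only extend into it.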

8. News of the Day

Stanford to rival Broad Institute with Big Data-focused genomics center

7. From Ivory Basement of Michigan

We just posted another pre-submission paper to arXiv.org:

Illumina Sequencing Artifacts Revealed by Connectivity Analysis of Metagenomic Datasets

Authors: Adina Chuang Howe, Jason Pell, Rosangela Canino-Koning, Rachel Mackelprang, Susannah Tringe, Janet Jansson, James M. Tiedje, and C. Titus Brown

Arxiv link

6. Auctioning your Paper using Twitter? Now that is a new idea !!

Richard Smith: Why not auction your paper?

5. A Review Article on Genome Assembly

Genome interpretation and assembly - recent progress and next steps

With over 50,000 human genomes and exomes resequenced and >600 animal or plant genomes sequenced de novo, generating genome sequence data is becoming increasingly commonplace. The question is whether the tools and infrastructure to analyze these data are keeping up. Nature Biotechnology asked experts in academia and industry to share their thoughts on two of the sequencing field’s key computational challenges: assembling genomes and developing pipelines to interpret genomes. The following edited compilation of their responses highlights the need for improved accuracy and centralized standards and the opportunities resulting from the rapid pace of innovation (Box 1).

4. PRIDE database of EMBL/EBI - status update

The Proteomics Identifications (PRIDE) database and associated tools: status in 2013

3. Medical Application of Metagenomics -

Integrated Metagenomics/Metaproteomics Reveals Human Host-Microbiota Signatures of Crohn’s Disease

Crohn’s disease (CD) is an inflammatory bowel disease of complex etiology, although dysbiosis of the gut microbiota has been implicated in chronic immune-mediated inflammation associated with CD. Here we combined shotgun metagenomic and metaproteomic approaches to identify potential functional signatures of CD in stool samples from six twin pairs that were either healthy, or that had CD in the ileum (ICD) or colon (CCD). Integration of these omics approaches revealed several genes, proteins, and pathways that primarily differentiated ICD from healthy subjects, including depletion of many proteins in ICD. In addition, the ICD phenotype was associated with alterations in bacterial carbohydrate metabolism, bacterial-host interactions, as well as human host-secreted enzymes. This eco-systems biology approach underscores the link between the gut microbiota and functional alterations in the pathophysiology of Crohn’s disease and aids in identification of novel diagnostic targets and disease specific biomarkers.

2. A ChIP-seq paper:

iASeq: integrative analysis of allele-specificity of protein-DNA interactions in multiple ChIP-seq datasets

And the top one in our top 10 list is from fisherman Lex of Norway !!

1. Informative Comparison of Sequencing Technologies -

Developments in next generation sequencing - a visualisation
