#Biodata14 Conference - Twitter Summary and Links
The Biodata14 conference is taking place at CSHL (picture from @gladrandomgraph). It is a new conference focusing on bioinformatics and the ‘big data’ aspects of data analysis. Readers may follow the #biodata14 hashtag on Twitter to get a broad overview of the topics and discussions at the conference.
This link has a list of all talks and posters. Overall there were 43 talks and 75 posters this time. Our blog has covered the publications of many of the conference's speakers and authors in the past. Below we give a list of talks, and we are in the process of updating each section with relevant links and any additional information we found on Twitter. The information below is partial and we are still updating it.
For posters, if any author is interested in having their poster included in this blog post, please tweet the link to us at @homolog_us and we will include it here.
@lisafederer Fun fact: about 100,000 whole human genomes have been sequenced to date. #biodata14
Talks -
-————————————————-
Haussler, D.H. - Global exchange of human genetic data for medicine and research
The Haussler group at UCSC has been developing algorithms to analyze large numbers of human genomes together. We previously covered some of those algorithms in the following posts.
HAL: a Hierarchical Format for Storing and Analyzing Multiple Genome Alignments
Mapping to a Graph-style Reference Genome Arxiv Paper
One important point of his keynote talk is that the ‘reference genome’ is neither a single genome nor a linear entity. Each individual has a unique genome, and the ‘reference genome’ is just a mosaic of many individuals. Therefore, his group is developing tools to represent genomes using graphs instead of a linear character array.
The genome not being ‘linear’ was a recurring theme in many of the other talks as well.
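To make the graph idea concrete, here is a minimal sketch of a reference stored as a sequence graph rather than a single string. The node ids, the sequences and the `spell_paths` helper are our own illustration; they are not taken from the Haussler group's tools.

```python
# Toy sequence graph: each node carries a piece of sequence and each edge
# says which node may follow which. All ids and sequences are hypothetical.
nodes = {
    "n1": "ACGT",
    "n2": "A",      # one allele at a variant site
    "n3": "G",      # the alternative allele at the same site
    "n4": "TTCA",
}
edges = {
    "n1": ["n2", "n3"],
    "n2": ["n4"],
    "n3": ["n4"],
    "n4": [],
}

def spell_paths(node, prefix=""):
    """Return every sequence spelled by a path from `node` to a sink node."""
    seq = prefix + nodes[node]
    if not edges[node]:
        return [seq]
    result = []
    for nxt in edges[node]:
        result.extend(spell_paths(nxt, seq))
    return result

haplotypes = spell_paths("n1")
print(haplotypes)
```

A linear reference would have to pick one of the two alleles; the graph keeps both, and each individual genome corresponds to one path through it.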
-————————————————-
Church, D.M. - Analog reporting in a digital age
Deanna Church has been working on the human reference genome for a long time. Again, the theme that ‘an assembly is not a genome, it’s a MODEL of a genome’ came up in her talk. Relevant tweets -
Lisa Federer @lisafederer Nov 6
Diagnostic exome sequence only solves the question 25-50% of the time #biodata14
Avinash @gatoravi Nov 6
DC: convention to shift indels farthest right in clinical data. discrepancy with vcf leads to duplicate records. #biodata14
Olga Botvinnik @olgabot Nov 6
.@deannachurch reminding us that genotype-phenotype associations started long before the genome project, with cDNAs #biodata14
Avinash @gatoravi Nov 6
great talk from DC. succinctly highlights many challenges with variant reporting that a lot of us face. #biodata14
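The indel tweet is easier to follow with a small example. Inside a repeat, the same deletion can be written at several positions, so a right-shifting convention and a left-aligning convention (which, to our understanding, is what VCF-based tools commonly use) produce different records for one event. The sketch below is our own toy code with hypothetical function names, not anything shown in the talk.

```python
def shift_deletion(ref, start, length, direction):
    """Slide a deletion of `length` bases starting at 0-based `start` as far
    as possible to the 'left' or 'right' without changing the resulting sequence."""
    if direction == "right":
        while start + length < len(ref) and ref[start] == ref[start + length]:
            start += 1
    elif direction == "left":
        while start > 0 and ref[start - 1] == ref[start + length - 1]:
            start -= 1
    else:
        raise ValueError("direction must be 'left' or 'right'")
    return start

def apply_deletion(ref, start, length):
    """Return the sequence obtained by deleting `length` bases at `start`."""
    return ref[:start] + ref[start + length:]

ref = "GCACACAT"                      # a short CA repeat
left = shift_deletion(ref, 3, 2, "left")
right = shift_deletion(ref, 3, 2, "right")
print(left, right)                    # two different coordinates ...
print(apply_deletion(ref, left, 2), apply_deletion(ref, right, 2))  # ... same allele
```

Both coordinates describe the same deleted allele, which is why records need to be normalized to one convention before databases are compared or merged.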
-————————————————-
Charlop-Powers, Z. - Creating a drug atlas: Applications of big data to natural product drug discovery
We have not covered much on this topic, but here is a media link.
Searching for drugs in dirt, researchers call on citizen scientists
Microbes are not only a rich source of disease, but also a rich source of medicines, and experts think many life-saving compounds produced by as-yet-unnamed bacteria are awaiting discovery. But they don’t always give up their secrets easily. Researchers must know where to look to find promising bacteria, and how to get them to grow in the lab, the traditional route to identifying potentially valuable molecules they produce.
Researchers in Sean Brady’s Laboratory of Genetically Encoded Small Molecules are working on a way around these roadblocks. By using genomic sequencing technology, they can investigate the organisms that live in habitats like soil without having to grow the microbes in the lab. They are using this information to map out the location of gene clusters they believe may encode novel antibiotics, and, with help from citizen scientists around the country, they are hoping to process soil samples from areas they would never be able to visit on their own.
-————————————————-
Chatterjee, S. - A large database informatics method for characterization of the human gut microbial proteome
Relevant tweets -
Rachel Melamed @rdmelamed 1 day ago
SC: phylogenetic diversity of microbiome is huge, but the protein function is conserved #biodata14
deannachurch @deannachurch 1 day ago
SC: microbial composition varies between individuals but gene fxn seems to be conserved. #biodata14
-————————————————-
Aghamirzaie, D. - An accurate support vector machine classifier for assessing coding potential of transcripts using several sequential and structural features
-————————————————-
Allen, J.E. - Designing new algorithms for emerging data-intensive computing architectures to improve the speed and accuracy of shotgun metagenomic analysis
-————————————————-
Carneiro, M.O. - Native GATK: why you should care about performance
GATK is immensely popular, as you can tell from the number of tweets (read them from bottom to top for context).
Matt Massie @matt_massie 2 hours ago
Kudos to the @broadinstitute for sharing the new GATK C++ engine w/an MIT license. A step in the right direction. @gatk_dev #biodata14
Avinash @gatoravi 2 hours ago
MC addresses questions from the floor on the closed/open sourcedness of GATK. #biodata14
Morgan Taschuk @morgantaschuk 2 hours ago
Next on #biodata14, Andrew Warren with Pan-genome graphs for bacteria & the web
Olga Botvinnik @olgabot 2 hours ago
@markgerstein @mauricinho as was discussed, not truly “open source” bc it limits who is able to contribute to the development #biodata14
Mark Gerstein @markgerstein 2 hours ago
Q for @mauricinho: Why GATK went from open source to not? A: Broad lawyers were experimenting! Now back to open source #biodata14
Rob Patro @nomad421 2 hours ago
#biodata14 Great talk on GATK by Mauricio Carneiro. Yup, C++ is still way faster than Java, at least for what GATK does.
Wendy Demos @DemosWM 2 hours ago
MC faster GATK is coming! #biodata14
Dan Evans @DanEvans0 2 hours ago
#biodata14 New GATK I/O C++ library is called “gamgee”, ‘cause it’s Sam-wise. Get it?
Avinash @gatoravi 2 hours ago
MC: pairhmm package freely available to use in other packages. #biodata14
Han Fang @Han_Fang_ 2 hours ago
@mauricinho : In gamgee, reading bam files is 17X faster, and mark duplicates is 5X faster. #biodata14
Morgan Taschuk @morgantaschuk 2 hours ago
MC: Gamgee source code available: https://github.com/broadinstitute/gamgee #biodata14
Jason Pitt @JasonJPitt 2 hours ago
MC: New joint C++/Java implementation of GATK speeds up HaplotypeCaller by 9 fold #biodata14
Morgan Taschuk @morgantaschuk 2 hours ago
MC: GATK C++ gamgee, Reading bam files is 17x faster, reading VCFs is 50x faster, calling varies by tech, from 9x to 720x faster #biodata14
Avinash @gatoravi 2 hours ago
MC: check out talk from cppcon on performance of gamgee. c++ gatk version worked on. #biodata14
Morgan Taschuk @morgantaschuk 2 hours ago
MC: Java cannot make use of modern hardware, no access to low-level concepts. Switched to C++ because it allows same. #biodata14
Avinash @gatoravi 2 hours ago
MC: more than 70% of instructions in GATK are memory access( like most tools) #biodata14
Richard Sever @cshperspectives 2 hours ago
The brown fat of the internet ;-) MT @morgantaschuk: MC: “data center..a machine whose only job is to turn electricity into heat” #biodata14
Morgan Taschuk @morgantaschuk 2 hours ago
MC: Accessing memory from CPU takes 100x longer than from L1 cache, many wasted cycles #biodata14
Sara Ballouz @SaraBallouz 2 hours ago
Carneiro: software and hardware - what’s happened to the arms race? #biodata14
James Taylor @jxtx 2 hours ago
My slides from #biodata14: https://speakerdeck.com/jxtx/adventures-in-scaling-galaxy-at-biological-data-science-2014#
Morgan Taschuk @morgantaschuk 2 hours ago
MC: Modern CPUs are “too fast”, difficult to utilize all of the processing power. Need faster memory. #biodata14
Avinash @gatoravi 2 hours ago
MC slide of memory access times. ( things EEs memorize) #biodata14
Lisa Federer @lisafederer 2 hours ago
Fun fact: modern CPUs can handle 36 billion instructions per second. Software isn’t able to take advantage of that. #biodata14
Dan Evans @DanEvans0 2 hours ago
#biodata14 M. Carneiro shows pic of data centre at rest: “This is a machine whose only job is to turn electricity into heat.”
B. Boutros-Blather @boutrosblather 2 hours ago
why aren’t we using the water vapor emanating from data centers to turn turbines?! #biodata14
Avinash @gatoravi 2 hours ago
MC: data centers are just steam rooms if processing not done right. #biodata14
Sam Minot @sminot 2 hours ago
MC: “We’re not doing particle physics here” #biodata14
Morgan Taschuk @morgantaschuk 2 hours ago
MC: “This is a data center. It’s a machine whose only job is to turn electricity into heat.” General applause. #biodata14
Avinash @gatoravi 2 hours ago
MC: majority of cpu time spent waiting. usually ‘write to disk’ time. #biodata14
Han Fang @Han_Fang_ 2 hours ago
@mauricinho : “Native GATK: why you should care about performance”. Talking about the C++14 library for GATK 4.0 (in progress) #biodata14
B. Boutros-Blather @boutrosblather 2 hours ago
GATK takes 2 days to process a single genome – which is a feature, not a bug #takeabreak #biodata14
Morgan Taschuk @morgantaschuk 2 hours ago
MC: It takes 44 hours to process a single genome from alignment through pre-processing. cpu is not efficiently used #biodata14
Avinash @gatoravi 2 hours ago
MC: highly encourages adopting best practices pipeline. problem - speed.(44hrs before variant calling) #biodata14
Jason Pitt @JasonJPitt 2 hours ago
MC: Single-sample processing followed by joint (multi-sample) genotyping is crucial for scalability #biodata14
Avinash @gatoravi 2 hours ago
MC: best practices pipeline most important contribution. #biodata14
Morgan Taschuk @morgantaschuk 2 hours ago
MC: Broad has active outreach to teach people how to use the GATK properly, videos available on website. https://www.broadinstitute.org/gatk/ #biodata14
Morgan Taschuk @morgantaschuk 2 hours ago
MC: “The whole world started using the GATK. We weren’t quite ready for that.” #biodata14
Avinash @gatoravi 2 hours ago
MC: -44000 exomes and ~2000 wgs in 2013 at broad. #biodata14
B. Boutros-Blather @boutrosblather 2 hours ago
carneiro: lean software, dense slides at #biodata14
Morgan Taschuk @morgantaschuk 2 hours ago
MC: You should care about performance because everyone has stopped caring about performance. noone talks to the hardware folks. #biodata14
Jason Pitt @JasonJPitt 2 hours ago
MC: Hardware is getting faster while software is getting slower. As programmers, we’re getting lazy #biodata14
GATK is not without critics, however. The program has a monopoly. Do the critics perceive it as the next Microsoft? Well, the number of lawyers involved in tweaking its various licenses gives them such an impression.
James Taylor @jxtx 2 hours ago
Thanks @StevenSalzberg1 for asking the right question at #biodata14 GATK IS NOT OPEN. Don’t use it. Reject papers that do. Bad for science
-————————————————-
Chilton, J.M. - Rapidly bringing software to biologists with Galaxy and Docker
Not sure whether this is the same topic, but here are the slides from James Taylor on Galaxy -
-———————————————–
Cox, A.J. - Compressed indexing of multiple human genomes: Practice and applications
Tony Cox from Illumina has been working on a BWT-based indexing scheme for large Illumina files so that they can be searched rapidly. We covered his work extensively here, along with algorithms and other details. Here are the earliest and latest links.
Academic Bioinformaticians Uncomfortable with Illumina’s Publication of Variant Caller
Latest from Tony Cox: BEETL-fastq
Relevant tweets -
Morgan Taschuk @morgantaschuk 1 hour ago
AC: BEETL available open source http://beetl.github.io/BEETL #biodata14
Mark Gerstein @markgerstein 1 hour ago
Cox: shows benefits of compressed indexing the reads. Validates by rapidly finding deletion breakpoints on NA12878 #biodata14
Morgan Taschuk @morgantaschuk 1 hour ago
AC: Compress 152GB gzipped fastq with BEETL-fastq compressed indexing to ~100GB. #biodata14
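For readers new to BWT-based indexing, here is a textbook-style sketch of the core idea: build the Burrows-Wheeler transform of a text, then count occurrences of a pattern by backward search using only the transformed string. This is our own toy code with hypothetical function names. It is not BEETL's implementation, which builds the transform for huge read collections with far more efficient methods.

```python
def bwt(text):
    """Burrows-Wheeler transform via sorted rotations (quadratic; toy inputs only).
    Assumes '$' does not occur in `text` and sorts before every other character."""
    text += "$"
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rot[-1] for rot in rotations)

def count_occurrences(bwt_string, pattern):
    """Count occurrences of `pattern` in the original text by backward search."""
    # smaller[c] = number of characters in the text strictly smaller than c
    smaller = {}
    total = 0
    for c in sorted(set(bwt_string)):
        smaller[c] = total
        total += bwt_string.count(c)
    lo, hi = 0, len(bwt_string)        # current interval of sorted rotations
    for c in reversed(pattern):
        if c not in smaller:
            return 0
        lo = smaller[c] + bwt_string[:lo].count(c)
        hi = smaller[c] + bwt_string[:hi].count(c)
        if lo >= hi:
            return 0
    return hi - lo

index = bwt("banana")
print(index)
print(count_occurrences(index, "ana"))
```

The transform tends to group identical characters into runs, which is what makes it compress well, while backward search answers queries without decompressing the text. A real index would replace the repeated `count` calls with precomputed rank tables.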
-————————————————-
De La Vega, F.M. - Scaling up genomic data management, indexing, and analysis for a million genomes
-————————————————-
Dobin, A. - STARtools: Ultra-fast comprehensive RNA-seq analysis suite
When the STAR aligner came out, we wrote about it.
STAR: Really Kick-ass RNA-seq Aligner
The author is now developing ‘STARtools’ on top of it to make it even more useful. You can access STAR at this github page.
STARtools appears to do expression analysis, but we are not sure why the tweets discussed a comparison with RSEM. The right comparison would be with Bowtie-Tophat-Cufflinks, given that this is reference-based, right?
-————————————————-
Dowling, J. - The secure analysis and storage of genomic data using BiobankCloud
-————————————————-
Fang, H. - Classifying INDELs to reduce calling errors in whole-genome and exome sequencing data
-————————————————-
Felix, V.M. - Open Science Data Framework (OSDF): A system for organizing, accessing, and querying scientific data
-————————————————-
Gerstein, M.B. - A computational framework to prioritize regulatory variants from whole-genome sequencing in cancer
-————————————————-
Ghose, K. - A community-driven framework for scalable and reproducible informatics in the cloud
-————————————————-
Haake, A.R. - User-centered design approaches for visual information processing
-————————————————-
Hwang, T. - An integrative somatic mutation analysis to identify pathways linked with survival outcomes across 19 cancer types
-————————————————-
Kim, M. - Parallel compression of metagenomic sequences via extended Golomb codes
They are using Kraken and other tools to classify metagenomes. Read tweets from bottom to top -
Morgan Taschuk @morgantaschuk 2 hours ago
MK: MetaCRAM available on http://web.engr.illinois.edu/~mkim158 #biodata14
Avinash @gatoravi 2 hours ago
MK: 3-6 fold compression improvement over gzip. #biodata14
Sam Minot @sminot 2 hours ago
Notably, the only disadvantage to MetaCRAM is the compression time, potentially making it a good option for long-term storage #biodata14
Morgan Taschuk @morgantaschuk 2 hours ago
MK: MetaCRAM is much slower than bzip2, gzip, because of blast to use reference based compression. Compression rates are 2x #biodata14
Sam Minot @sminot 2 hours ago
MK: MetaCRAM takes about 200X longer than gzip #biodata14
Morgan Taschuk @morgantaschuk 2 hours ago
MK: for reads and start positions, power law was closest model. #biodata14
Sam Minot @sminot 2 hours ago
MK: MetaCRAM improves on gzip by 2-3 fold for HMP data #biodata14
Avinash @gatoravi 2 hours ago
which distribution do i model with ? #classicbioinfo #biodata14
Morgan Taschuk @morgantaschuk 2 hours ago
MK: extended golomb encoding stores number of division operations between n and m and array of remainders, for power law dist #biodata14
Avinash @gatoravi 2 hours ago
MK: extended golomb encoding for ints with power law encoding. #biodata14
Avinash @gatoravi 2 hours ago
MK now gives a quick background about golomb encoding, useful for ints with geometric distribution. #biodata14
Morgan Taschuk @morgantaschuk 2 hours ago
MK: huffman encoding uses a priori probability distribution to use less bits for more frequently occurring symbols, can be large #biodata14
Avinash @gatoravi 2 hours ago
MK talks about huffman encoding, assign less bits to more frequent symbols based on pmf. #biodata14
Morgan Taschuk @morgantaschuk 2 hours ago
MK: 1. classify reads into taxonomy, 2. align reads to “relevant reference” 3. assemble metagenome of other reads, 4. compress #biodata14
B. Boutros-Blather @boutrosblather 2 hours ago
#metacram talk at #biodata14 is very clear!
Avinash @gatoravi 2 hours ago
MK: taxonomy classification tools - kraken and metaphyler, picked kraken. #biodata14
Mark Gerstein @markgerstein 2 hours ago
Kim cites: Data compression for sequencing data http://www.almob.org/content/8/1/25 A nice review mentioning reference-based compression #biodata14
Han Fang @Han_Fang_ 2 hours ago
Minji Kim: MetaCRAM, assemble/compress simultaneously, in iterative, parallel manner. #biodata14
Sam Minot @sminot 2 hours ago
MK: MetaCRAM uses Kraken for taxonomic identification #biodata14
Morgan Taschuk @morgantaschuk 2 hours ago
MK: MetaCRAM is first do novo, parallel, CRAM-like software specialized for FASTA-format metagenomic read compression #biodata14
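Several of the tweets above mention Golomb encoding, so here is a short sketch of the classical scheme for background. A non-negative integer n is split by a parameter m into a quotient, written in unary, and a remainder, written in truncated binary. This is the standard code that suits geometrically distributed integers. It is not the extended variant for power-law data described in the talk, whose details we do not have, and the function names are our own.

```python
def golomb_encode(n, m):
    """Encode a non-negative integer n with Golomb parameter m (m >= 2)
    and return the code as a string of '0' and '1' characters."""
    if n < 0 or m < 2:
        raise ValueError("need n >= 0 and m >= 2")
    q, r = divmod(n, m)
    bits = "1" * q + "0"                # quotient in unary, closed by a zero
    b = (m - 1).bit_length()            # ceil(log2(m)) for m >= 2
    cutoff = (1 << b) - m               # this many remainders need only b-1 bits
    if r < cutoff:
        bits += format(r, "0{}b".format(b - 1))
    else:
        bits += format(r + cutoff, "0{}b".format(b))
    return bits

def golomb_decode(bits, m):
    """Decode a single Golomb code word produced by golomb_encode."""
    q = 0
    pos = 0
    while bits[pos] == "1":             # read the unary quotient
        q += 1
        pos += 1
    pos += 1                            # skip the terminating zero
    b = (m - 1).bit_length()
    cutoff = (1 << b) - m
    r = int(bits[pos:pos + b - 1], 2) if b > 1 else 0
    pos += b - 1
    if r >= cutoff:                     # long form: one more bit follows
        r = 2 * r + int(bits[pos]) - cutoff
    return q * m + r

code = golomb_encode(7, 3)
print(code, golomb_decode(code, 3))
```

Small values get short code words and large values get long ones, which pays off when small values dominate, for example for gaps between neighbouring read start positions.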
-————————————————-
Knight, J. - RMS: Run My Samples
James Knight developed the de novo assembler for 454 reads and is now at Yale. We covered his work in the following blog post.
New Bioinformatics Blog to Keep an Eye on: James Knight
-————————————————-
Kural, D. - Self-Learning algorithms for millions of genomes
-————————————————-
Langmead, B.T. - Scalable software for uniform analysis of many RNA-seq samples
Ben Langmead has been working on several cutting-edge RNAseq tools. Here are the relevant tweets from his talk -
Mark Gerstein @markgerstein 2 hours ago
.@BenLangmead: Cloud-scale RNAseq… w/ Myrna http://genomebiology.com/2010/11/8/r83 AWS run w. Geuvadis gives $.34/Gb, less than $1/sample #biodata14
deannachurch @deannachurch 2 hours ago
.@BenLangmead wants to make it easy for biologist to reanalyze large scale public data. #biodata14 #reproducibility
deannachurch @deannachurch 2 hours ago
.@BenLangmead starting the morning session on motivation for building RNA-seq analysis tools. #biodata14
-————————————————-
Lauter, K. - Homomorphic encryption as a tool to preserve privacy in genomic computation
-————————————————-
Lee, H. - Sugarcane genome de novo assembly challenge
-————————————————-
Lovci, M.T. - Flotilla: An open-source toolkit for single-cell RNA-seq data analysis
-————————————————-
Mainzer, L. - Profiling accuracy and performance of human variant calling workflows on BlueWaters
-————————————————-
Margolis, R. - Designing a data discovery index to find and cite data
-————————————————-
Massie, M. - Building fast, petabyte-scale biological data systems
-————————————————-
Piccolo, S.R. - Gene set omic analysis: A gene-set analysis approach that can be applied to many omic types
-————————————————-
Pitt, J.J. - Robust scaling of next-generation sequencing analyses using the modular SwiftSeq workflow
-————————————————-
Ratsch, G. - Automatic summarization of cancer clinical notes to understand patient trajectories and the effect of somatic mutations
-————————————————-
Rendon, A. - 100,000 genomes of patients with cancer and rare heritable diseases
-————————————————-
Russell, D.P. - The Open Microscopy Environment: Open image informatics for the biological sciences
-————————————————-
Sadedin, S.P. - Cpipe: A bioinformatics platform for the analysis of clinical sequencing data in a diagnostic setting
-————————————————-
Sakhanenko, N. - An information theory method for efficient discovery of multivariable dependencies
-————————————————-
Shimizu, K. - Privacy preserving similarity search in biomedical data by homomorphic encryption
-————————————————-
Stombaugh, J.I. - Power Decoder: A simulator for the evaluation of pooled shRNA screen performance
-————————————————-
Tan, J. - Learning high-level biological principles from Pseudomonas aeruginosa using denoising autoencoders
-————————————————-
Veeraraghavan, N. - Staging the largest genomic cloud compute: And living to tell about it
-————————————————-
Warren, A.S. - Bacterial spaghetti: Pan-genome graphs for the web
-————————————————-
Williams, J. - Unleash your inner data scientist: Enabling scalable data driven collaborations with iPlant Cyberinfrastructure
-————————————————-
Wu, T. - Designing genomic data structures for fast computation
Tony Cox @coxtonyj 41 minutes ago
Thomas Wu - use “discriminating character array” instead of LCP as part of enhanced suffix array #biodata14
Zamin Iqbal @ZaminIqbal 7 minutes ago
@coxtonyj ANy paper on this?
Tony Cox @coxtonyj 4 minutes ago
@ZaminIqbal Did not seem like it, but will ask him if I get the chance
Morgan Taschuk @morgantaschuk 58 minutes ago
TW: Speeding GMAP/GSNAP with genomic data representation: added compression, longer k-mers, vertical columns, suffix array #biodata14
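As background for the tweets above, here is a small sketch of the two standard pieces of an enhanced suffix array: the suffix array itself and the LCP (longest common prefix) array. We do not know how the “discriminating character array” from the talk is defined, so it is not shown; the code below is only the textbook structure it is said to replace, written naively for toy inputs with function names of our own choosing.

```python
def suffix_array(text):
    """Start positions of all suffixes of `text`, in lexicographic order.
    Naive construction by sorting suffixes; fine only for short strings."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def lcp_array(text, sa):
    """lcp[k] = length of the longest common prefix of the suffixes at
    sa[k-1] and sa[k]; lcp[0] is defined as 0."""
    lcp = [0] * len(sa)
    for k in range(1, len(sa)):
        a, b = text[sa[k - 1]:], text[sa[k]:]
        n = 0
        while n < len(a) and n < len(b) and a[n] == b[n]:
            n += 1
        lcp[k] = n
    return lcp

sa = suffix_array("banana")
lcp = lcp_array("banana", sa)
print(sa)
print(lcp)
```

The LCP values tell a search how many characters neighbouring suffixes share, so those characters need not be compared again when moving between adjacent entries.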
-————————————————-
Yates, A.D. - The Ensembl REST API: “Gone gamma”