
HDF5 Format

HDF5 is a data format designed by the National Center for Supercomputing Applications (NCSA) at UIUC to store large scientific data sets on disk and access them rapidly. PacBio raw reads come in the HDF5 format.

Q. What is HDF?

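HDF stands for Hierarchical Data Format. If those three words do not mean much to you, Wikipedia has the best explanation of what they mean. In short, an HDF file is organized like a small file system: groups play the role of directories, and datasets hold the actual arrays.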
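To make the "hierarchical" part concrete, here is a minimal sketch in Python using the h5py library. The group and dataset names (run_001, reads, lengths, quality) and the attribute value are made up for illustration and are not the layout of a real PacBio file. The file lives only in memory, so nothing is written to disk.

```python
import h5py
import numpy as np


def build_example_file():
    """Create a small in-memory HDF5 file with a hypothetical layout."""
    # driver="core" with backing_store=False keeps the file in RAM only.
    f = h5py.File("example.h5", "w", driver="core", backing_store=False)
    run = f.create_group("run_001")        # a group behaves like a directory
    reads = run.create_group("reads")      # groups can be nested
    reads.create_dataset("lengths", data=np.array([5200, 4800, 5100]))
    reads.create_dataset("quality", data=np.array([0.91, 0.87, 0.89]))
    run.attrs["instrument"] = "hypothetical"   # metadata attached to a group
    return f


def list_contents(f):
    """Return the path of every group and dataset below the root."""
    paths = []
    # visit() calls the function once per member, passing its path.
    f.visit(paths.append)
    return paths


if __name__ == "__main__":
    f = build_example_file()
    for path in list_contents(f):
        print(path)
    # Datasets are addressed by path, much like files in a directory tree.
    print(f["run_001/reads/lengths"][:])
    f.close()
```

The same visit() call is a convenient first step when you open an unfamiliar HDF5 file and want to see what it contains.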

Q. Can't we just put the data into an SQL database instead of going through all the trouble of learning a new format?

SQL databases and HDF solve two different problems. HDF is for very large data sets that have some kind of uniform structure. For example, let us say you have 500 Gb of sequences in FASTA format with an average sequence length of 5000 nucleotides. Storing the data in an SQL database and querying it in client-server mode slows down access, so it is preferable to keep the entire data set on a local hard drive and read it directly. HDF works well on such data. It uses B-trees to index the data sets.
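As a rough illustration of local access to uniform data, the sketch below stores reads as a chunked two-dimensional array with h5py and then pulls out a single read by row number. It rests on several simplifying assumptions: the reads are random, tiny, and all the same length (real reads vary in length and would need a different layout), and the dataset name reads and the chunk size are arbitrary choices. With a chunked layout, the library only needs to fetch the chunk that holds the requested row, not the whole data set.

```python
import h5py
import numpy as np


def write_reads(f, n_reads=1000, read_len=50):
    """Store fixed-length reads as a 2-D array of bytes, one row per read."""
    rng = np.random.default_rng(0)
    bases = np.frombuffer(b"ACGT", dtype=np.uint8)
    # Pick a random base for every position of every read.
    data = bases[rng.integers(0, 4, size=(n_reads, read_len))]
    # chunks=(100, read_len): the data set is stored in blocks of 100 reads.
    return f.create_dataset("reads", data=data, chunks=(100, read_len))


def fetch_read(dset, i):
    """Read only row i from the data set and decode it to a string."""
    return dset[i].tobytes().decode("ascii")


if __name__ == "__main__":
    # In-memory file for the demo; a real data set would live on disk.
    f = h5py.File("reads.h5", "w", driver="core", backing_store=False)
    dset = write_reads(f)
    print(dset.shape, dset.chunks)
    print(fetch_read(dset, 123))
    f.close()
```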

On the other hand, let us say you have 20 different data sets with genes, expression information in five tissues, functional annotation, homology with other organisms, etc., and you would like to ask questions such as ‘which gene is present in human and mouse, expressed in human liver but not expressed in mouse brain, and has some keyword match with cancer’. SQL works better for such complex queries.
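The question above maps quite directly onto an SQL query. The sketch below uses Python's built-in sqlite3 module with an in-memory database. The two-table schema (gene and expression), the gene names, and the annotations are invented for this example, and the keyword match is applied to the human annotation only, which is our assumption.

```python
import sqlite3


def build_db():
    """Create an in-memory database with a hypothetical toy schema."""
    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE gene (name TEXT, species TEXT, annotation TEXT);
        CREATE TABLE expression (name TEXT, species TEXT,
                                 tissue TEXT, expressed INTEGER);
    """)
    con.executemany("INSERT INTO gene VALUES (?, ?, ?)", [
        ("GENE_A", "human", "associated with liver cancer"),
        ("GENE_A", "mouse", "ortholog of human GENE_A"),
        ("GENE_B", "human", "cancer-related kinase"),  # no mouse entry
        ("GENE_C", "human", "housekeeping gene"),      # no keyword match
        ("GENE_C", "mouse", "housekeeping gene"),
        ("GENE_D", "human", "cancer marker"),
        ("GENE_D", "mouse", "ortholog of human GENE_D"),
    ])
    con.executemany("INSERT INTO expression VALUES (?, ?, ?, ?)", [
        ("GENE_A", "human", "liver", 1),
        ("GENE_A", "mouse", "brain", 0),
        ("GENE_B", "human", "liver", 1),
        ("GENE_C", "human", "liver", 1),
        ("GENE_C", "mouse", "brain", 0),
        ("GENE_D", "human", "liver", 1),
        ("GENE_D", "mouse", "brain", 1),  # expressed in mouse brain
    ])
    return con


# Present in human and mouse, expressed in human liver, not expressed
# in mouse brain, and annotated with the keyword "cancer".
QUERY = """
SELECT h.name
FROM gene AS h
JOIN gene AS m ON m.name = h.name AND m.species = 'mouse'
WHERE h.species = 'human'
  AND h.annotation LIKE '%cancer%'
  AND EXISTS (SELECT 1 FROM expression e
              WHERE e.name = h.name AND e.species = 'human'
                AND e.tissue = 'liver' AND e.expressed = 1)
  AND NOT EXISTS (SELECT 1 FROM expression e
                  WHERE e.name = h.name AND e.species = 'mouse'
                    AND e.tissue = 'brain' AND e.expressed = 1)
ORDER BY h.name
"""


def candidate_genes(con):
    """Run the query and return the matching gene names."""
    return [row[0] for row in con.execute(QUERY)]


if __name__ == "__main__":
    print(candidate_genes(build_db()))
```

In this toy data only GENE_A satisfies every condition. Expressing the same question against arrays in an HDF file would mean writing the joins and filters by hand.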

Q. What is HDF5?

It is a much improved version of the original HDF format.

  1. The best source to learn about the format is the HDF Group.

  2. The user guide located here describes the C APIs for reading HDF5 files.

  3. If you use R, there is already a package named pbh5 for interacting directly with various PacBio libraries (suggested by reader Jim). The package h5r, suggested by readers Mengjuei Hsieh and Jim, should also be helpful. We originally identified the hdf5 package in R for reading HDF5 files, but the other two suggestions may be better.

  4. If you use Python, please try this source.

We also found this source helpful.

Q. I am a Perl user. What should I do?

Here is a Perl alternative for dealing with the HDF5 format (thanks to Inti Pedroso).

