A Generation Lost in the Bazaar

Often I download newly published bioinformatics programs or libraries from the github into my Windows laptop and try to compile them within its Cygwin UNIX environment. Over the years, I noticed that those C/C++ codes tend to fall into two distinct categories -

Codes written in C and come with a Makefile,
Codes written in C/C++ (and sometimes STL library) and come with the whole shebang of Makefile, configure and autoconf. All these extra files are supposed to make the program portable across different computing platforms.

I also noticed that codes in the first category compile without problem, whereas the codes in the second category often fail to compile. Think about it. The codes with extra layers of complexity to ensure portability fail to be portable.

I often wondered why it happened and recently found the best answer in a blog post by Poul-Henning Kamp, a core developer of FreeBSD Unix operating system. He argues that a generation of programmers grew up knowing certain “best practices”, which were actually the worst practices for writing clean and potable code. The primary reason for this confusion was the “bazaar” method for coding promoted by Eric Raymond in 1998.

Here is one example of an ironic piece of waste: Sam Leffler’s graphics/libtiff is one of the 122 packages on the road to www/firefox, yet the resulting Firefox browser does not render TIFF images. For reasons I have not tried to uncover, 10 of the 122 packages need Perl and seven need Python; one of them, devel/glib20, needs both languages for reasons I cannot even imagine.

Further down the shopping list are repeated applications of the Peter Principle, the idea that in an organization where promotion is based on achievement, success, and merit, that organization’s members will eventually be promoted beyond their level of ability. The principle is commonly phrased, “Employees tend to rise to their level of incompetence.” Applying the principle to software, you will find that you need three different versions of the make program, a macroprocessor, an assembler, and many other interesting packages. At the bottom of the food chain, so to speak, is libtool, which tries to hide the fact that there is no standardized way to build a shared library in Unix. Instead of standardizing how to do that across all Unixen—something that would take just a single flag to the ld(1) command—the Peter Principle was applied and made it libtool’s job instead. The Peter Principle is indeed strong in this case—the source code for devel/libtool weighs in at 414,740 lines. Half that line count is test cases, which in principle is commendable, but in practice it is just the Peter Principle at work: the tests elaborately explore the functionality of the complex solution for a problem that should not exist in the first place. Even more maddening is that 31,085 of those lines are in a single unreadably ugly shell script called configure. The idea is that the configure script performs approximately 200 automated tests, so that the user is not burdened with configuring libtool manually. This is a horribly bad idea, already much criticized back in the 1980s when it appeared, as it allows source code to pretend to be portable behind the veneer of the configure script, rather than actually having the quality of portability to begin with. It is a travesty that the configure idea survived.

The 1980s saw very different Unix implementations: Cray-1s with their 24-bit pointers, Amdahl UTS mainframe Unix, a multitude of more or less competently executed SysV+BSD mashups from the minicomputer makers, the almost—but not quite—Unix shims from vendors such as Data General, and even the genuine Unix clone Coherent from the paint company Mark Williams.

The configure scripts back then were written by hand and did things like figure out if this was most like a BSD- or a SysV-style Unix, and then copied one or the other Makefile and maybe also a .h file into place. Later the configure scripts became more ambitious, and as an almost predictable application of the Peter Principle, rather than standardize Unix to eliminate the need for them, somebody wrote a program, autoconf, to write the configure scripts.

Today’s Unix/Posix-like operating systems, even including IBM’s z/OS mainframe version, as seen with 1980 eyes are identical; yet the 31,085 lines of configure for libtool still check if <sys/stat.h> and exist, even though the Unixen, which lacked them, had neither sufficient memory to execute libtool nor disks big enough for its 16-MB source code.

How did that happen? Well, autoconf, for reasons that have never made sense, was written in the obscure M4 macro language, which means that the actual tests look like this:

## Whether `make' supports order-only prerequisites.
AC_CACHE_CHECK([whether ${MAKE-make} supports order-only prerequisites],
  [lt_cv_make_order_only],
  [mkdir conftest.dir
   cd conftest.dir
   touch b
   touch a
cat >confmk << 'END'
a: b | c
a b c:
       touch $[]@
END
  touch c
  if ${MAKE-make} -s -q -f confmk >/dev/null 2>&1; then
    lt_cv_make_order_only=yes
  else
    lt_cv_make_order_only=no
  fi
  cd ..
  rm -rf conftest.dir
])
if test $lt_cv_make_order_only = yes; then
  ORDER='|'
else
  ORDER=''
fi
AC_SUBST([ORDER])

Needless to say, this is more than most programmers would ever want to put up with, even if they had the skill, so the input files for autoconf happen by copy and paste, often hiding behind increasingly bloated standard macros covering “standard tests” such as those mentioned earlier, which look for compatibility problems not seen in the past 20 years.

This is probably also why libtool’s configure probes no fewer than 26 different names for the Fortran compiler my system does not have, and then spends another 26 tests to find out if each of these nonexistent Fortran compilers supports the -g option.

The blog of Poul-Henning Kamp has many other interesting posts, and I encourage readers to take a look.

‹»How does Multi-threaded Code Run in Assembly Language?« »Salmonberry Genomics by High-school Students«›