biopython v1.71.0 Bio.Entrez.Parser

Parser for XML results returned by NCBI’s Entrez Utilities.

This parser is used by the read() function in Bio.Entrez, and is not intended to be used directly.

The question is how to represent an XML file as Python objects. Some XML files returned by NCBI look like lists, others look like dictionaries, and others look like a mix of lists and dictionaries.

My approach is to classify each possible element in the XML as a plain string, an integer, a list, a dictionary, or a structure. A structure is a dictionary in which the same key can occur multiple times; in Python, it is represented as a dictionary where that key occurs once, pointing to a list of the values found in the XML file.
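The "structure" kind can be illustrated with a small standalone sketch. The class name `StructureSketch`, its `store` method, and the sample keys are hypothetical names chosen for this example; they are not the parser's own classes.

```python
class StructureSketch(dict):
    """Hypothetical stand-in for a 'structure': a dictionary in which
    some keys may occur several times in the XML."""

    def __init__(self, repeated_keys):
        super().__init__()
        self.repeated_keys = set(repeated_keys)
        # Each repeatable key occurs once and points to a list of values.
        for key in self.repeated_keys:
            self[key] = []

    def store(self, key, value):
        if key in self.repeated_keys:
            # Repeated key: collect every value found in the XML.
            self[key].append(value)
        else:
            # Ordinary key: behaves like a plain dictionary entry.
            self[key] = value


# Invented sample data: "Link" may repeat, "DbFrom" may not.
record = StructureSketch(repeated_keys=["Link"])
record.store("DbFrom", "pubmed")
record.store("Link", "111")
record.store("Link", "222")
```

After these calls, `record["Link"]` holds both values in document order, while `record["DbFrom"]` is a single string.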

The parser then goes through the XML and creates the appropriate Python object for each element. The different levels encountered in the XML are preserved on the Python side. So a subelement of a subelement of an element is a value in a dictionary that is stored in a list which is a value in some other dictionary (or a value in a list which itself belongs to a list which is a value in a dictionary, and so on). Attributes encountered in the XML are stored as a dictionary in a member .attributes of each element, and the tag name is saved in a member .tag.
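The nesting and the `.tag` and `.attributes` members can be sketched as follows. This is a simplified illustration, assuming a hand-written classification table (`KINDS`), an invented XML sample, and hypothetical wrapper classes. It uses `xml.etree.ElementTree` for brevity, which is not necessarily how the parser itself reads the XML.

```python
import xml.etree.ElementTree as ET


# Hypothetical wrappers: subclassing the built-in types lets each object
# carry the extra .tag and .attributes members.
class StringSketch(str):
    pass


class ListSketch(list):
    pass


class DictSketch(dict):
    pass


# Hand-written for this example only; element names are invented.
KINDS = {"Result": "list", "Record": "dict", "Id": "string", "Title": "string"}


def convert(node):
    """Recursively build the Python object matching each XML element."""
    kind = KINDS[node.tag]
    if kind == "string":
        obj = StringSketch(node.text or "")
    elif kind == "list":
        obj = ListSketch(convert(child) for child in node)
    else:
        obj = DictSketch((child.tag, convert(child)) for child in node)
    obj.tag = node.tag
    obj.attributes = dict(node.attrib)
    return obj


XML = """<Result>
  <Record status="ok"><Id>1</Id><Title>First</Title></Record>
  <Record status="ok"><Id>2</Id><Title>Second</Title></Record>
</Result>"""

result = convert(ET.fromstring(XML))
```

Here `result` is a list of dictionaries, so `result[1]["Title"]` reaches a subelement of a subelement, and `result[0].attributes` holds the attributes of the first `Record`.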

To decide which kind of Python object corresponds to each element in the XML, the parser analyzes the DTD referred to at the top of (almost) every XML file returned by the Entrez Utilities. This is preferred over a hand-written solution, since the number of DTDs is rather large and their contents may change over time. About half the code in this parser deals with parsing the DTD, and the other half with the XML itself.
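A minimal sketch of deriving the classification from a DTD uses the standard library's `xml.parsers.expat`, whose `ElementDeclHandler` reports each `<!ELEMENT ...>` declaration as a content-model tuple. The DTD here is an invented internal subset, and the rules in `classify` are deliberately simplified assumptions: they do not cover integers or structures, and they are not the parser's actual logic.

```python
from xml.parsers import expat

# Invented document with an inline DTD, so the example needs no network
# access or external DTD file.
DOC = """<!DOCTYPE Result [
  <!ELEMENT Result (Record*)>
  <!ELEMENT Record (Id, Title)>
  <!ELEMENT Id (#PCDATA)>
  <!ELEMENT Title (#PCDATA)>
]>
<Result><Record><Id>1</Id><Title>First</Title></Record></Result>"""

kinds = {}
REPEATING = (expat.model.XML_CQUANT_REP, expat.model.XML_CQUANT_PLUS)


def classify(name, content_model):
    """Simplified rules: text content -> string, one repeating child ->
    list, anything else -> dictionary."""
    ctype, quant, _child_name, children = content_model
    if ctype in (expat.model.XML_CTYPE_MIXED, expat.model.XML_CTYPE_EMPTY):
        kinds[name] = "string"
    elif len(children) == 1 and (
        quant in REPEATING or children[0][1] in REPEATING
    ):
        kinds[name] = "list"
    else:
        kinds[name] = "dictionary"


parser = expat.ParserCreate()
parser.ElementDeclHandler = classify
parser.Parse(DOC, True)
```

After parsing, `kinds` maps each declared element name to the kind of Python object that would represent it, without any per-element code written by hand.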