Ask Sawal


How to parse xml in python?

1 Answer(s) Available
Answer # 1 #

Considering that parsing XML documents using the DOM is arguably the most straightforward, you won’t be that surprised to find a DOM parser in the Python standard library. What is surprising, though, is that there are actually two DOM parsers.

The xml.dom package houses two modules to work with DOM in Python: xml.dom.minidom and xml.dom.pulldom.

The first is a stripped-down implementation of the DOM interface conforming to a relatively old version of the W3C specification. It provides common objects defined by the DOM API such as Document, Element, and Attr. This module is poorly documented and has quite limited usefulness, as you’re about to find out.

The second module has a slightly misleading name because it defines a streaming pull parser, which can optionally produce a DOM representation of the current node in the document tree.

There are two functions in minidom that let you parse XML data from various data sources. One accepts either a filename or a file object, while another one expects a Python string:
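As a minimal sketch of both calls, assuming a small made-up SVG snippet (SVG_TEXT is a hypothetical name) and an in-memory buffer standing in for a real file:

```python
import io
from xml.dom.minidom import parse, parseString

# Hypothetical sample document, used only for illustration.
SVG_TEXT = """\
<?xml version="1.0" encoding="UTF-8"?>
<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 10 10">
  <circle id="dot" cx="5" cy="5" r="4"/>
</svg>
"""

# parseString() takes the XML text itself.
doc_from_string = parseString(SVG_TEXT)

# parse() takes a filename or a file-like object; a binary
# in-memory buffer plays the role of an opened file here.
doc_from_file = parse(io.BytesIO(SVG_TEXT.encode("utf-8")))

print(doc_from_string.documentElement.tagName)
print(doc_from_file.documentElement.tagName)
```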

The triple-quoted string helps embed a multiline string literal without using the continuation character (\) at the end of each line. In any case, you’ll end up with a Document instance, which exhibits the familiar DOM interface, letting you traverse the tree.

Apart from that, you’ll be able to access the XML declaration, DTD, and the root element:
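A short sketch of those attributes, assuming a made-up one-element SVG document with a declaration and a DTD, plus a bare document for comparison:

```python
from xml.dom.minidom import parseString

# Hypothetical document with an XML declaration and a DOCTYPE.
doc = parseString("""\
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE svg PUBLIC "-//W3C//DTD SVG 1.1//EN"
  "http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd">
<svg xmlns="http://www.w3.org/2000/svg"/>
""")

# Values taken from the XML declaration.
print(doc.version, doc.encoding, doc.standalone)

# The DTD is exposed as a DocumentType node; it is not used for validation.
print(doc.doctype.name)
print(doc.doctype.publicId)
print(doc.doctype.systemId)

# The root element of the document.
print(doc.documentElement.tagName)

# A document without a declaration or DTD leaves these set to None.
bare = parseString("<root/>")
print(bare.version, bare.doctype)
```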

As you can see, even though the default XML parser in Python can’t validate documents, it still lets you inspect .doctype, the DTD, if it’s present. Note that the XML declaration and DTD are optional. If the XML declaration or a given XML attribute is missing, then the corresponding Python attributes will be None.

To find an element by ID, you must use the Document instance rather than a specific parent Element. The sample SVG image has two nodes with an id attribute, but you can’t find either of them:

That may be surprising for someone who has only worked with HTML and JavaScript but hasn’t worked with XML before. While HTML defines the semantics for certain elements and for attributes such as id, XML doesn’t attach any meaning to its building blocks. You need to mark an attribute as an ID explicitly using DTD or by calling .setIdAttribute() in Python, for example:
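The following sketch shows the lookup failing and then succeeding, assuming a tiny made-up document without a namespace:

```python
from xml.dom.minidom import parseString

# Hypothetical document; no DTD declares "id" as an ID attribute.
doc = parseString('<shapes><circle id="dot"/></shapes>')

# The attribute is just an ordinary attribute so far.
before = doc.getElementById("dot")
print(before)  # None

# Mark the attribute as this element's ID explicitly.
circle = doc.getElementsByTagName("circle")[0]
circle.setIdAttribute("id")

after = doc.getElementById("dot")
print(after is circle)  # True
```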

However, using a DTD isn’t enough to fix the problem if your document has a default namespace, which is the case for the sample SVG image. To address this, you can visit all elements recursively in Python, check whether they have the id attribute, and indicate it as their ID in one go:

Your custom set_id_attribute() function takes a parent element and an optional name for the identity attribute, which defaults to "id". When you call that function on your SVG document, all descendant elements that have an id attribute become accessible through the DOM API:
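The original listing isn't reproduced here, so the following is only one possible shape for such a helper, applied to a made-up SVG fragment with a default namespace:

```python
from xml.dom import Node
from xml.dom.minidom import parseString

def set_id_attribute(parent, attribute_name="id"):
    """Mark attribute_name as the ID on parent and on every descendant."""
    if parent.nodeType == Node.ELEMENT_NODE:
        if parent.hasAttribute(attribute_name):
            parent.setIdAttribute(attribute_name)
    # Text nodes and other leaves simply have no children to visit.
    for child in parent.childNodes:
        set_id_attribute(child, attribute_name)

# Hypothetical document with a default namespace.
doc = parseString(
    '<svg xmlns="http://www.w3.org/2000/svg">'
    '<g id="layer"><circle id="dot"/></g>'
    "</svg>"
)

set_id_attribute(doc)
print(doc.getElementById("layer").tagName)
print(doc.getElementById("dot").tagName)
```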

Now, you’re getting the expected XML element corresponding to the id attribute’s value.

Using an ID allows for finding at most one unique element, but you can also find a collection of similar elements by their tag name. Unlike the .getElementById() method, you can call .getElementsByTagName() on the document or a particular parent element to reduce the search scope:
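A brief sketch of both search scopes, assuming made-up element names:

```python
from xml.dom.minidom import parseString

# Hypothetical document used only for illustration.
doc = parseString(
    "<drawing>"
    "<group name='a'><circle/><circle/></group>"
    "<group name='b'><circle/></group>"
    "</drawing>"
)

# Search the whole document.
all_circles = doc.getElementsByTagName("circle")

# Narrow the search to one parent element.
first_group = doc.getElementsByTagName("group")[0]
scoped_circles = first_group.getElementsByTagName("circle")

# A tag that does not occur yields an empty collection, not None.
squares = doc.getElementsByTagName("square")

print(len(all_circles), len(scoped_circles), len(squares))  # 3 2 0
```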

Notice that .getElementsByTagName() always returns a list of elements instead of a single element or None. Forgetting about it when you switch between both methods is a common source of errors.

Unfortunately, elements whose tags carry a namespace prefix won’t be found by their local name alone. They must be searched using .getElementsByTagNameNS(), which expects different arguments:
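To illustrate the difference, here is a sketch that assumes a made-up namespace URI (EXAMPLE_NS) and an arbitrary ex prefix:

```python
from xml.dom.minidom import parseString

# Hypothetical namespace URI and element names, for illustration only.
EXAMPLE_NS = "http://example.com/custom"

doc = parseString(
    '<root xmlns:ex="http://example.com/custom">'
    "<ex:note>first</ex:note>"
    "<note>second</note>"
    "</root>"
)

# Matches only the unprefixed <note> element.
by_tag_name = doc.getElementsByTagName("note")

# Matches by namespace URI and local name; the prefix plays no role.
by_namespace = doc.getElementsByTagNameNS(EXAMPLE_NS, "note")

# "*" as the namespace matches elements in any namespace or none.
any_namespace = doc.getElementsByTagNameNS("*", "note")

print(len(by_tag_name), len(by_namespace), len(any_namespace))  # 1 1 2
```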

The first argument must be the XML namespace, which is a URI and often looks like a web address, while the second argument is the tag name without any prefix. Notice that the namespace prefix is irrelevant! To search all namespaces, you can provide a wildcard character (*).

Once you locate the element you’re interested in, you may use it to walk over the tree. However, another jarring quirk with minidom is how it handles whitespace characters between elements:
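The quirk shows up even in a tiny indented document, such as this made-up one:

```python
from xml.dom.minidom import parseString

# Hypothetical pretty-printed document.
doc = parseString("""\
<root>
  <item>one</item>
  <item>two</item>
</root>
""")

children = doc.documentElement.childNodes
node_names = [child.nodeName for child in children]

# Two elements, but five child nodes: the indentation and newlines
# between the tags become text nodes of their own.
print(len(children))
print(node_names)
```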

The newline characters and leading indentation are captured as separate text nodes in the tree, which is what the specification requires. Some parsers let you ignore these, but not the Python one. What you can do, however, is collapse whitespace in such nodes manually:

Note that you also have to call .normalize() on the document, which merges adjacent text nodes and discards empty ones. Otherwise, you could end up with a bunch of redundant text nodes left over from the whitespace. Document and Element objects aren’t directly iterable, so you visit the tree by walking .childNodes, typically with a recursive function. Finally, this should give you the expected result:
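As a rough sketch of that cleanup (the remove_whitespace() helper name and the sample document are assumptions made for illustration), the whole sequence might look like this:

```python
from xml.dom import Node
from xml.dom.minidom import parseString

def remove_whitespace(node):
    """Blank out text nodes that contain nothing but whitespace."""
    if node.nodeType == Node.TEXT_NODE:
        if node.data.strip() == "":
            node.data = ""
    for child in node.childNodes:
        remove_whitespace(child)

# Hypothetical pretty-printed document.
doc = parseString("""\
<root>
  <item>one</item>
  <item>two</item>
</root>
""")

remove_whitespace(doc)
doc.normalize()  # drops the now-empty text nodes

cleaned_names = [child.nodeName for child in doc.documentElement.childNodes]
print(cleaned_names)
print(doc.documentElement.firstChild.firstChild.data)
```

Text that carries real content, such as "one" inside the first item, is left untouched because only whitespace-only nodes are blanked.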

Elements expose a few helpful methods and properties to let you query their details:
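A few of those queries, sketched against a made-up SVG fragment:

```python
from xml.dom.minidom import parseString

# Hypothetical document used only for illustration.
doc = parseString(
    '<svg xmlns="http://www.w3.org/2000/svg">'
    '<circle id="dot" r="4"/>'
    "</svg>"
)
circle = doc.getElementsByTagName("circle")[0]

# The element inherits the default namespace of its parent.
print(circle.namespaceURI)
print(circle.tagName)

# Attribute checks and lookups.
print(circle.hasAttribute("r"))
print(circle.getAttribute("r"))
print(sorted(circle.attributes.keys()))

# A missing attribute comes back as an empty string, not None.
print(repr(circle.getAttribute("fill")))
```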

For instance, you can check an element’s namespace, tag name, or attributes. If you ask for a missing attribute, then you’ll get an empty string ('').

Dealing with namespaced attributes isn’t much different. You just have to remember to prefix the attribute name accordingly or provide the namespace URI:
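Both lookups are sketched below, again assuming a made-up namespace URI (EXAMPLE_NS) and prefix:

```python
from xml.dom.minidom import parseString

# Hypothetical namespace URI and attribute name, for illustration only.
EXAMPLE_NS = "http://example.com/custom"

doc = parseString(
    '<root xmlns:ex="http://example.com/custom">'
    '<item ex:level="3"/>'
    "</root>"
)
item = doc.getElementsByTagName("item")[0]

# Look the attribute up by its prefixed name...
by_prefixed_name = item.getAttribute("ex:level")

# ...or by namespace URI plus local name.
by_namespace = item.getAttributeNS(EXAMPLE_NS, "level")

# The wildcard is treated as a literal namespace here, so nothing matches.
by_wildcard = item.getAttributeNS("*", "level")

print(by_prefixed_name, by_namespace, repr(by_wildcard))
```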

Strangely enough, the wildcard character (*) doesn’t work here as it did with the .getElementsByTagNameNS() method before.
