Process XML In Python With ElementTree XPaths And tails

Process XML in Python with ElementTree

By David Mertz, Ph.D. - 2003-12-04 Page: 1 2 3 4 5

XPaths And tails

ElementTree implements a subset of XPath in its .find*() methods. Using this style can be much more concise than nesting code to look within levels of subnodes, especially for XPaths that contain wildcards. For example, if I were interested in all the timestamps of hits to my Web server, I could examine weblog.xml using:

Listing 9. Using XPath to find nested subelements


>>> from elementtree import ElementTree

>>> weblog = ElementTree.parse('weblog.xml').getroot()
>>> timestamps = weblog.findall('entry/dateTime')
>>> for ts in timestamps:
...     if ts.text.startswith('19/Aug'):
...         print ts.text

Of course, for a standard, shallow document like weblog.xml, it is easy to do the same thing with list comprehensions:

Listing 10. Using list comprehensions to find and filter nested subelements


>>> for ts in [ts.text for e in weblog
...            for ts in e.findall('dateTime')
...            if ts.text.startswith('19/Aug')]:
...     print ts

Prose-oriented XML documents, however, tend to have much more variable document structure, and typically nest tags at least five or six levels deep. For example, an XML schema like DocBook or TEI might have citations in sections, subsections, bibliographies, or sometimes within italics tags, or in blockquotes, and so on. Finding every <citation> element would require a cumbersome (probably recursive) search across levels. Or using XPath, you could just write:

Listing 11. Using XPath to find deeply nested subelements


>>> from elementtree import ElementTree
>>> weblog = ElementTree.parse('weblog.xml').getroot()
>>> cites = weblog.findall('.//citation')

However, XPath support in ElementTree is limited: You cannot use the various functions contained in full XPath, nor can you search on attributes. In what it does, though, the XPath subset in ElementTree greatly aids readability and expressiveness.

I want to mention one more quirk of ElementTree before I wrap up. XML documents can be mixed content. Prose-oriented XML, in particular, tends to intersperse PCDATA and tags rather freely. But where exactly should you store the text that comes between child nodes? Since an ElementTree Element instance has a single .text attribute -- which contains a string -- that does not really leave space for a broken sequence of strings. The solution ElementTree adopts is to give each node a .tail attribute, which contains all the text after a closing tag but before the next element begins or the parent element is closed. For example:

Listing 12. PCDATA stored in node.tail attribute



>>> xml = '<a>begin<b>inside</b>middle<c>inside</c>end</a>'
>>> open('doc.xml','w').write(xml)
>>> doc = ElementTree.parse('doc.xml').getroot()

>>> doc.text, doc.tail
('begin', None)
>>> doc.find('b').text, doc.find('b').tail
('inside', 'middle')
>>> doc.find('c').text, doc.find('c').tail
('inside', 'end')

View Process XML in Python with ElementTree Discussion

Page: 1 2 3 4 5 Next Page: Conclusion

First published by IBM developerWorks