
Process XML in Python with ElementTree

By David Mertz, Ph.D. - 2003-12-04

Working with an XML document object

A nice thing about ElementTree is that it can be round-tripped. That is, you can read in an XML instance, modify fairly native-feeling data structures, then call the .write() method to re-serialize to well-formed XML. DOM does this, of course, but gnosis.xml.objectify does not. It is not all that difficult to construct a custom output function for gnosis.xml.objectify that produces XML -- but doing so is not automatic. In ElementTree, besides the .write() method of ElementTree instances, the convenience function elementtree.ElementTree.dump() serializes individual Element instances. This lets you write XML fragments from individual object nodes -- including from the root node of the XML instance.
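
As a minimal sketch of that round trip, the following uses xml.etree.ElementTree, the standard-library module that current Python versions ship in place of the standalone elementtree package (an assumption on my part; the listings in this article use the original package and Python 2 syntax). The document content and the added <baz> element are invented for illustration.

```python
import io
import xml.etree.ElementTree as ET

# Parse a small, made-up document from a string instead of a file.
source = "<root><foo>this</foo><bar>that</bar></root>"
tree = ET.ElementTree(ET.fromstring(source))
root = tree.getroot()

# Modify the tree through ordinary attribute and method access.
root.find("foo").text = "changed"
ET.SubElement(root, "baz").text = "new"

# Re-serialize the whole document with .write() ...
buffer = io.StringIO()
tree.write(buffer, encoding="unicode")
whole = buffer.getvalue()

# ... or serialize a single node as a fragment.
# ET.dump(node) would print the same fragment to standard output.
fragment = ET.tostring(root.find("bar"), encoding="unicode")

print(whole)
print(fragment)
```

Writing to an in-memory buffer keeps the sketch self-contained; passing a filename to .write() works the same way.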

I present a simple task that contrasts the ElementTree and gnosis.xml.objectify APIs. The large weblog.xml document used for benchmark tests contains about 8,500 <entry> elements, each having the same collection of child fields -- a typical arrangement for a data-oriented XML document. In processing this file, one task might be to collect a few fields from each entry, but only if some other fields have particular values (or ranges, or match regexen). Of course, if you really only want to perform this one task, using a streaming API like SAX avoids the need to model the whole document in memory -- but assume that this task is one of several that an application performs on the large data structure. One <entry> element would look something like this:

Listing 3. Sample <entry> element



<entry>
  <host>64.172.22.154</host>
  <referer>-</referer>
  <userAgent>-</userAgent>
  <dateTime>19/Aug/2001:01:46:01</dateTime>
  <reqID>-0500</reqID>
  <reqType>GET</reqType>
  <resource>/</resource>
  <protocol>HTTP/1.1</protocol>
  <statusCode>200</statusCode>
  <byteCount>2131</byteCount>
</entry>

Using gnosis.xml.objectify, I might write a filter-and-extract application as:

Listing 4. Filter-and-extract application (select_hits_xo.py)


from gnosis.xml.objectify import XML_Objectify, EXPAT
weblog = XML_Objectify('weblog.xml', EXPAT).make_instance()
# Keep only the successful requests from one host
interesting = [entry for entry in weblog.entry
               if entry.host.PCDATA == '209.202.148.31'
               and entry.statusCode.PCDATA == '200']

for e in interesting:
    print "%s (%s)" % (e.resource.PCDATA, e.byteCount.PCDATA)

List comprehensions are quite convenient as data filters. In essence, ElementTree works the same way:

Listing 5. Filter-and-extract application (select_hits_et.py)


from elementtree import ElementTree
weblog = ElementTree.parse('weblog.xml').getroot()
# Keep only the successful requests from one host
interesting = [entry for entry in weblog.findall('entry')
               if entry.find('host').text == '209.202.148.31'
               and entry.find('statusCode').text == '200']

for e in interesting:
    print "%s (%s)" % (e.findtext('resource'), e.findtext('byteCount'))

Note the differences between the two listings. gnosis.xml.objectify attaches subelement nodes directly as attributes of nodes (every node is an instance of a custom class named after its tag). ElementTree, on the other hand, uses methods of the Element class to find child nodes. The .findall() method returns a list of all matching nodes; .find() returns just the first match; .findtext() returns the text content of the first matching node. If you only want the first match on a gnosis.xml.objectify subelement, you just need to index it -- for example, node.tag[0]. But if there is only one such subelement, you can also refer to it without the explicit indexing.
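
The three lookup methods can be tried on a small in-memory document. This is a sketch under assumptions: it uses the standard-library xml.etree.ElementTree module rather than the standalone elementtree package, the <weblog> root tag is a guess, and both entries are invented in the shape of the sample <entry> element.

```python
import xml.etree.ElementTree as ET

# Two made-up entries; the <weblog> root tag is an assumption.
weblog = ET.fromstring("""
<weblog>
  <entry>
    <host>209.202.148.31</host>
    <resource>/index.html</resource>
    <statusCode>200</statusCode>
    <byteCount>2131</byteCount>
  </entry>
  <entry>
    <host>64.172.22.154</host>
    <resource>/missing.html</resource>
    <statusCode>404</statusCode>
  </entry>
</weblog>
""")

entries = weblog.findall("entry")      # list of every matching child
first = weblog.find("entry")           # first match only
no_such = weblog.find("nonexistent")   # None when nothing matches

# findtext() combines find() and .text, and accepts a default
# for the case where the child is absent.
sizes = [e.findtext("byteCount", default="?") for e in entries]

# The same filter-and-extract step, written with findtext() throughout.
hits = [(e.findtext("resource"), e.findtext("byteCount"))
        for e in entries
        if e.findtext("host") == "209.202.148.31"
        and e.findtext("statusCode") == "200"]
print(hits)
```

Because .find() returns None for a missing child, an expression like entry.find('host').text raises AttributeError on incomplete entries; .findtext() sidesteps that by returning None or the supplied default.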

But in the ElementTree example, you do not really need to find all the <entry> elements explicitly; Element instances behave in a list-like way when iterated over. A point to note is that iteration takes place over all child nodes, whatever tags they may have. In contrast, a gnosis.xml.objectify node has no built-in method to step through all of its subelements. Still, it is easy to construct a one-line children() function (I will include one in future releases). Contrast Listing 6:

Listing 6. ElementTree iteration over node list and specific child type


>>> open('simple.xml','w').write('''<root>
... <foo>this</foo>
... <bar>that</bar>
... <foo>more</foo></root>''')
>>> from elementtree import ElementTree
>>> root = ElementTree.parse('simple.xml').getroot()
>>> for node in root:
...     print node.text,
...
this that more
>>> for node in root.findall('foo'):
...     print node.text,
...
this more

With Listing 7:

Listing 7. gnosis.xml.objectify lossy iteration over all children


>>> children=lambda o: [x for x in o.__dict__ if x!='__parent__']
>>> from gnosis.xml.objectify import XML_Objectify
>>> root = XML_Objectify('simple.xml').make_instance()
>>> for tag in children(root):
...     for node in getattr(root,tag):
...         print node.PCDATA,
...
this more that
>>> for node in root.foo:
...     print node.PCDATA,
...
this more

As you can see, gnosis.xml.objectify currently discards information about the original order of interspersed <foo> and <bar> elements (the order could be remembered in another magic attribute, as the parent is in .__parent__, but no one has needed this or sent a patch for it).
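
To see why grouping children by tag loses the interleaving, the following sketch walks the same document once in order and once grouped per tag. It uses the standard-library xml.etree.ElementTree module (an assumption; the article's listings use the standalone package), and the grouping dictionary only illustrates the per-tag layout. It is not how gnosis.xml.objectify is implemented.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(
    "<root>\n<foo>this</foo>\n<bar>that</bar>\n<foo>more</foo></root>"
)

# Iterating an Element walks its direct children in document order,
# whatever their tags are.
in_order = [(child.tag, child.text) for child in root]

# Collecting children under their tag name keeps the order within
# each tag but drops the interleaving between different tags.
grouped = {}
for child in root:
    grouped.setdefault(child.tag, []).append(child.text)

print(in_order)
print(grouped)
```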

ElementTree stores XML attributes in a node attribute called .attrib; the attributes are stored in a dictionary. gnosis.xml.objectify puts the XML attributes directly into node attributes of corresponding name. The style I use tends to flatten the distinction between XML attributes and element contents -- to my mind, that is something for XML, not my native data structure, to worry about. For example:

Listing 8. Differences in access to children and XML attributes


>>> xml = '<root foo="this"><bar>that</bar></root>'
>>> open('attrs.xml','w').write(xml)
>>> et = ElementTree.parse('attrs.xml').getroot()
>>> xo = XML_Objectify('attrs.xml').make_instance()
>>> et.find('bar').text, et.attrib['foo']
('that', 'this')
>>> xo.bar.PCDATA, xo.foo
(u'that', u'this')

gnosis.xml.objectify still makes some distinction between XML attributes, which create node attributes containing text, and XML element contents, which create node attributes containing objects (perhaps with subnodes that have .PCDATA).
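
For comparison, a flattened view can be layered on top of ElementTree's .attrib dictionary. The sketch below uses the standard-library xml.etree.ElementTree module (an assumption; the listings above use the standalone package), and flatten() is a hypothetical helper written for this example, not part of either library.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring('<root foo="this"><bar>that</bar></root>')

# XML attributes live in a plain dictionary on the node;
# .get() reads one attribute and returns None if it is absent.
foo_value = root.get("foo")
missing_value = root.get("missing")

def flatten(node):
    """Hypothetical helper: merge XML attributes and the text of
    child elements into one dictionary keyed by name."""
    merged = dict(node.attrib)
    for child in node:
        merged[child.tag] = child.text
    return merged

flat = flatten(root)
print(flat)
```

This naive merge overwrites earlier values when a child tag repeats or shares its name with an attribute, so it only suits documents where such names are unique.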


First published by IBM developerWorks