Process XML in Python with ElementTree

By David Mertz, Ph.D. - 2003-12-04

Some Benchmarks

My colleague Uche Ogbuji has written a short article on ElementTree for another publication. One of the tests he ran compared the relative speed and memory consumption of ElementTree to that of DOM. Uche chose to use his own cDomlette for the comparison. Unfortunately, I am unable to install 4Suite 1.0a1 on the Mac OS X machine I use (a workaround is in the works). However, I can use Uche's estimates to guess the likely performance: he indicates that ElementTree is 30% slower than cDomlette, but 30% more memory-friendly.

Mostly I was curious how ElementTree compares in speed and memory to gnosis.xml.objectify. I had never actually benchmarked my module very precisely before, since I never had anything concrete to compare it to. I selected two documents that I had used for benchmarking in the past: a 289 KB XML version of Shakespeare's Hamlet and a 3 MB XML Web log. I created scripts that simply parse an XML document into the object models of the various tools, but do not perform any additional manipulation:

Listing 1. Scripts to time XML object models for Python

% cat time_xo.py
import sys
from gnosis.xml.objectify import XML_Objectify,EXPAT
doc = XML_Objectify(sys.stdin,EXPAT).make_instance()
---
% cat time_et.py
import sys
from elementtree import ElementTree
doc = ElementTree.parse(sys.stdin).getroot()
---
% cat time_minidom.py
import sys
from xml.dom import minidom
doc = minidom.parse(sys.stdin)

Creating the program object is quite similar in all three cases, and also with cDomlette. I estimated memory usage by watching the output of top in another window. Each test was run three times to make sure the results were consistent, and the median time was used (memory usage was identical across runs).
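For readers who want to repeat this kind of measurement without watching top by hand, the following is a minimal sketch of an in-process harness. It is not the setup used for the figures in this article, and it rests on several assumptions: a modern Python 3, where ElementTree ships in the standard library as xml.etree.ElementTree (rather than the separate elementtree package used in Listing 1); a synthetic document built by the hypothetical helper make_sample() instead of Hamlet.xml or Weblog.xml; and tracemalloc as the memory probe. The names make_sample and measure are invented for illustration.

```python
import io
import statistics
import time
import tracemalloc
import xml.etree.ElementTree as ET  # assumption: stdlib ElementTree, Python 3
from xml.dom import minidom


def make_sample(n_items=2000):
    """Build a synthetic XML document (a stand-in for the article's test files)."""
    rows = "".join(
        '<entry id="%d"><host>example.org</host><path>/page/%d</path></entry>' % (i, i)
        for i in range(n_items)
    )
    return ("<log>%s</log>" % rows).encode("utf-8")


def measure(parse, data, runs=3):
    """Parse `data` several times; return (median seconds, median peak bytes)."""
    times, peaks = [], []
    for _ in range(runs):
        tracemalloc.start()
        start = time.perf_counter()
        doc = parse(io.BytesIO(data))  # keep a reference so the tree stays alive
        elapsed = time.perf_counter() - start
        _, peak = tracemalloc.get_traced_memory()
        tracemalloc.stop()
        del doc
        times.append(elapsed)
        peaks.append(peak)
    return statistics.median(times), statistics.median(peaks)


if __name__ == "__main__":
    data = make_sample()
    parsers = {
        "ElementTree": lambda f: ET.parse(f).getroot(),
        "minidom": minidom.parse,
    }
    for name, parse in parsers.items():
        seconds, peak = measure(parse, data)
        print("%-12s %.4f s  %.1f KB peak" % (name, seconds, peak / 1024.0))
```

Two caveats apply. tracemalloc only counts allocations made through Python's own allocator and adds noticeable overhead to the timings, so its numbers are not directly comparable with the process sizes reported by top. The sketch is therefore better suited to ranking libraries against one another on the same machine than to reproducing absolute figures.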

Figure 1. Benchmarks of XML object models in Python

One thing that is clear is that xml.dom.minidom quickly becomes quite impractical for moderately large XML documents. The rest stay (fairly) reasonable. gnosis.xml.objectify is the most memory-friendly, but that is not surprising since it does not preserve all the information in the original XML instance (data content is kept, but not all structural information).

I also ran a test of Ruby's REXML, using the following script:

Listing 2. Ruby REXML parsing script (time_rexml.rb)

require "rexml/document"
include REXML
doc = (Document.new File.new ARGV.shift).root

REXML proved about as resource-intensive as xml.dom.minidom: parsing Hamlet.xml took 10 seconds and used 14 MB; parsing Weblog.xml took 190 seconds and used 150 MB. Obviously, the choice of programming language usually takes precedence over the comparison of libraries.

First published by IBM developerWorks