Cleaning Up the PrettyPrint in Python’s xml.dom.minidom

python.jpg

OK... well, this has been an interesting exercise. I've been working on a Python script to grab data out of a system here, and generate an XML file for a vendor upload. It's a big file, and there's a lot of things that have to go into it, but I wanted to be able to do as much as possible from within the Python code itself. If I had to post-process it and add tags later, then that's OK, but I didn't want to start with that plan. I wanted to make the utmost of python's capabilities for this task.

For the most part, I haven't been disappointed. Python has a pretty nice XML DOM system (xml.dom.minidom) built-in, and I was able to create the document very easily. The problem came in dumping it to a file. There were two basic ways I could dump it:

<exchangeTraded><id type="SEDOL">B10RB15</id>
<positionName>1605=JP</positionName><amount>-214.0</amount></exchangeTraded>

where it all appeared on one 'line' and no spaces were to be found, for the 'simple' XML output, and:

<exchangeTraded>
    <id type="SEDOL">
        B10RB15
    </id>
    <name>
        1605=JP
    </name>
    <amount>
        -214.0
    </amount>
</exchangeTraded>

for the 'Pretty Print' output. Neither was really good, but as the lesser of two evils, the second one was far more readable for the size of the file I'd be sending. But I knew that I needed to get back to this as soon as reasonably possible and fix this up with a decent output processor.

Today has been that day. I was quite lucky that I got a good head-start with Google, and was able to put together the following replacement for writexml():

def new_writexml(self, writer, indent="", addindent="", newl=""):
    # indent = current indentation
    # addindent = indentation to add to higher levels
    # newl = newline string
    writer.write(indent + "<" + self.tagName)
 
    attrs = self._get_attributes()
    a_names = attrs.keys()
    a_names.sort()
    # lay down the attributes on the tag
    for a_name in a_names:
        writer.write(" %s=\"" % a_name)
        xml.dom.minidom._write_data(writer, attrs[a_name].value)
        writer.write("\"")
    # now lay down the child nodes
    if self.childNodes:
        if len(self.childNodes) == 1 and \
                self.childNodes[0].nodeType == xml.dom.minidom.Node.TEXT_NODE:
            writer.write(">%s</%s>%s" % (self.childNodes[0].data, self.tagName, \
                        newl))
            return
        writer.write(">%s" % (newl))
        for node in self.childNodes:
            node.writexml(writer, indent + addindent, addindent, newl)
        writer.write("%s</%s>%s" % (indent, self.tagName, newl))
    else:
        writer.write("/>%s" % (newl))

where the arguments are exactly the same as the original version, we've just cleaned up the printing of the child nodes when there's only one and that one is a text field. To install it into the proper place in the runtime, I simply need to:

#
# This is the main working section of the script
#
def main(argv):
    # hook in our new XML DOM writing code
    old_writexml = xml.dom.minidom.Element.writexml
    xml.dom.minidom.Element.writexml = new_writexml

and it's good to go with the same old code.

The results are perfect:

<exchangeTraded>
    <id type="SEDOL">B10RB15</id>
    <name>1605=JP</name>
    <amount>-214.0</amount>
</exchangeTraded>

I'm sure I'll be using this again very soon, and I wanted to document it so I didn't forget it. Really quite simple, but the effects are quite dramatic on a large file with thousands of nodes.