

Processing XML with Xerces and the DOM

by Q Ethan McCallum

From data storage to data exchange and from Perl to Java, it's rare to write software these days and not bump into XML. Adding XML capabilities to a C++ application, though, usually involves coding around a C-based API.

Even the cleanest C API takes some work to wrap in C++, often leaving you to choose between writing your own wrappers (which eats up time) or using third-party wrappers (which means one more dependency). Adopt the Xerces-C++ parser and you can skip these middlemen. This mature, robust toolkit is portable C++ and is available under the flexible Apache Software License (version 2.0).

Xerces' benefits extend beyond its C++ roots. It gives you a choice of SAX and DOM parsers, and supports XML namespaces. It also provides validation by DTD and XML Schema, as well as grammar caching for improved performance.

This article uses the context of loading, modifying, and storing an XML config file to demonstrate Xerces-C++'s DOM side. My first example shows some raw code for reading XML. Then I revise it a couple of times to address deficiencies. The last example demonstrates how to modify the XML document and write it back out to disk. Along the way, I've made some helper classes that make using Xerces a little easier. My next article will cover SAX and validation.

I compiled the sample code under Fedora Core 3/x86 using Xerces-C++ 2.6.0 and GCC 3.4.3.

A Quick DOM Primer

The Document Object Model (DOM) is a specification for XML parsing designed with portability in mind. That is, whether you're using Perl or Java or C++, the high-level DOM concepts are the same. This eases the learning curve when moving between DOM toolkits. (Of course, implementations are free to add special features and convenience above and beyond the requirements of the spec.)

DOM represents an XML document as a tree of nodes (Xerces class DOMNode). Consider Figure 1, an XML document of some airport information. DOM sees the entire document as a document node (DOMDocument), the only child of which is the root <airports> element node (DOMElement). Were there any document type declarations or comments at this level, they would also be child nodes of the document node.

Figure 1. The DOM of an XML document

The <airport> element is a child node of <airports>. Its only attribute, name, is an attribute node (DOMAttr). <airport> children include the <aliases>, <location>, and <comment> elements. <comment> has a child text node (DOMText), which contains the string "Terminal 1 has a very 1970's sci-fi decor."

DOM even makes XML comments available as nodes (DOMComment). The example comment block is another <airports> child node.

There are several other nodes between the elements, too: each chunk of white space (such as that between </location> and <comment>) is its own text node. It's a text node of white space, but it's still a valid node to the DOM.
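To make the tree shape concrete, here is a minimal sketch in standard C++ (C++11 assumed). It does not use Xerces at all: `ToyNode`, `NodeKind`, and the helper functions are hypothetical stand-ins invented for illustration, and the tree is built by hand rather than parsed. It models a cut-down `<airport>` fragment so you can see that the white space between elements shows up as text-node children alongside the elements themselves.

```cpp
#include <cstddef>
#include <memory>
#include <string>
#include <vector>

// Toy stand-ins for the DOM node kinds named in the text. These are NOT
// the Xerces classes (DOMElement, DOMText, DOMComment); they only mimic
// the shape of the tree.
enum class NodeKind { Element , Text , Comment } ;

struct ToyNode {
  NodeKind kind ;
  std::string value ;  // tag name for elements, content otherwise
  std::vector< std::unique_ptr< ToyNode > > children ;

  ToyNode( NodeKind k , const std::string& v ) : kind( k ) , value( v ) {}

  // Append a child and return a reference to it, to allow chaining.
  ToyNode& add( NodeKind k , const std::string& v ){
    children.push_back( std::unique_ptr< ToyNode >( new ToyNode( k , v ) ) ) ;
    return *children.back() ;
  }
} ;

// True if the node is a text node made up entirely of white space.
bool isWhitespaceText( const ToyNode& node ){
  return node.kind == NodeKind::Text
    && node.value.find_first_not_of( " \t\r\n" ) == std::string::npos ;
}

// Count the direct children of the given kind.
std::size_t countChildren( const ToyNode& parent , NodeKind kind ){
  std::size_t total = 0 ;
  for( const auto& child : parent.children ){
    if( child->kind == kind ){ ++total ; }
  }
  return total ;
}

// Hand-build the tree for an <airport> element that holds a <location>
// and a <comment>, each on its own indented line.
ToyNode buildAirport(){
  ToyNode airport( NodeKind::Element , "airport" ) ;
  airport.add( NodeKind::Text , "\n  " ) ;           // white space
  airport.add( NodeKind::Element , "location" ) ;
  airport.add( NodeKind::Text , "\n  " ) ;           // white space
  airport.add( NodeKind::Element , "comment" )
         .add( NodeKind::Text , "Terminal 1 has a very 1970's sci-fi decor." ) ;
  airport.add( NodeKind::Text , "\n" ) ;             // white space
  return airport ;
}
```

Calling `countChildren( buildAirport() , NodeKind::Text )` reports more text children than you might guess from reading the markup, which is why code that walks a DOM tree usually has to check each child's node type rather than assume every child is an element.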

You can create, change, or remove nodes on this object representation of your document, then write the whole thing--comments included--back to disk as well-formed XML.

DOM requires that the parser load the entire document into memory at once, which can make handling large documents very memory intensive. For small to midsize XML documents, though, DOM offers portable read/modify/write capabilities for structured data when a full relational database (such as PostgreSQL or MySQL) is overkill.

First Look at Xerces Code

I prefer to explain this with source code. I will share some code excerpts inline, but as always, the complete source code for the examples is available for download.

The program step1 represents a portion of a fictitious report viewer. The config file tracks the time of its most recent modification, the user's login and password to the report system, and the last reports the user ran. Here's a sample of the config file:

<config lastupdate="1114600280">

  <login user="some name" password="cleartext" />

    <report tab="1" name="Report One" />
    <report tab="2" name="Report Two" />
    <report tab="3" name="Third Report" />
    <report tab="4" name="Fourth Report" />
    <report tab="5" name="Fifth Report" />

</config>

(Xerces also supports XML namespaces, though the sample code doesn't use them.)

The first thing to notice about step1 is the number of #included headers. Xerces has several header files, roughly one per class or concept. Some such projects have one master header file that includes the others. You could write one yourself, but including just the headers you need may speed up your build process.

Most Xerces constructs exist under the xercesc C++ namespace. You're certainly welcome to put using namespace directives in your code; but following good C++ form, the sample code explicitly states the namespace where needed.

main() calls routines to initialize and tear down the Xerces library:


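try{
  xercesc::XMLPlatformUtils::Initialize() ;
}catch( xercesc::XMLException& e ){
  // ... report the error ...
}

// ... regular program ...

xercesc::XMLPlatformUtils::Terminate() ;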


Your code must call Initialize() before using any Xerces classes. In turn, attempts to use Xerces classes after the call to Terminate() will typically yield a segmentation fault. Initialize() may throw an exception, so I've wrapped it in a try/catch block. Notice the call to XMLString::transcode() in the catch section:

}catch( xercesc::XMLException& e ){

  char* message = xercesc::XMLString::transcode( e.getMessage() ) ;

  std::cerr << "XML toolkit initialization error: "
        << message
        << std::endl ;

  xercesc::XMLString::release( &message ) ;

}
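Because the setup and teardown calls must always come as a pair, one common C++ approach is to tie them to an object's lifetime. The following is only a sketch of that idea, not code from the sample download. So that it compiles without Xerces installed, it uses a hypothetical `fakelib` namespace whose `initialize()` and `terminate()` functions stand in for the real calls; `LibraryGuard` and `runProgram()` are likewise invented names.

```cpp
#include <stdexcept>

// Hypothetical stand-ins for a library's paired setup/teardown calls
// (with Xerces these would be Initialize() and Terminate()). They only
// track state, so the sketch builds without the toolkit.
namespace fakelib {
  int activeCount = 0 ;
  void initialize(){ ++activeCount ; }
  void terminate(){ --activeCount ; }
}

// Calls initialize() on construction and terminate() on destruction,
// so teardown runs on every exit path, including exceptions.
class LibraryGuard {
  public:
  LibraryGuard(){ fakelib::initialize() ; }
  ~LibraryGuard(){ fakelib::terminate() ; }

  // Non-copyable: a copy would run terminate() a second time.
  LibraryGuard( const LibraryGuard& ) = delete ;
  LibraryGuard& operator=( const LibraryGuard& ) = delete ;
} ;

// Simulates a "regular program" body that may fail part-way through.
// Returns true if the work completed, false if it threw.
bool runProgram( bool shouldFail ){
  try{
    LibraryGuard guard ;  // setup happens here

    if( shouldFail ){
      throw std::runtime_error( "simulated failure" ) ;
    }

    return true ;         // guard's destructor runs the teardown
  }catch( const std::exception& ){
    return false ;        // teardown already ran during stack unwinding
  }
}
```

If the constructor itself throws, the destructor never runs, which matches the pairing rule: a failed setup needs no teardown. In a real program you would declare the guard before any object that uses the toolkit, since C++ destroys local objects in reverse order of construction and the guard must therefore outlive them.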
