Splitting up XML with XSLT

I recently had to process some XML from a vendor. They had mashed approximately 750 individual entries into a single XML file, and I needed to split it up to make processing easier.

The XML

The XML looked something like this:

<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE article SYSTEM "quiz_syndication_xml.dtd">
<quiz artclid="403777" capturedate="2006-06-30" version="3.0" distribution="367">
  <entry tocid="246790">
    <category>Science</category>
    <question>What is the airspeed velocity of an unladen swallow?</question>
    <option>I don't know that!</option>
    <option correct="t">11 meters per second, or 24 miles per hour</option>
    <option>What do you mean, an African or European swallow?</option>
    <link>
      <xref refid="1648205" ty="1" artclid="67844" tocid="" reftitle="bird"/>
    </link>
  </entry>
  <entry tocid="123">
      …

Strategy

The entries are clearly delimited, so all we need to do is output a new document that keeps the quiz element but contains only one entry element. This isn’t hard at all; we’ll pass the entry we want as a parameter to an XSLT stylesheet.

The Template

I saved this as dump_entry.xsl. It accepts a parameter, entry, containing the tocid of the entry to dump, and outputs a copy of the document in which every entry that doesn’t have that tocid is suppressed.
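
A minimal sketch of such a stylesheet, assuming the structure shown in the sample above (a quiz root whose entry children carry a tocid attribute). The parameter name entry matches the --stringparam call used in this post; everything else is one plausible way to write it, and the original file may have differed:

```xml
<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- tocid of the entry to keep, supplied via: xsltproc --stringparam entry ... -->
  <xsl:param name="entry"/>

  <xsl:output method="xml" encoding="utf-8"/>

  <!-- Identity transform: copy every node and attribute through unchanged. -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- Copy an entry only when its tocid equals the parameter; otherwise emit nothing.
       The comparison sits in xsl:if because XSLT 1.0 does not allow variable
       references inside match patterns. -->
  <xsl:template match="entry">
    <xsl:if test="@tocid = $entry">
      <xsl:copy>
        <xsl:apply-templates select="@*|node()"/>
      </xsl:copy>
    </xsl:if>
  </xsl:template>

</xsl:stylesheet>
```

Note that an identity transform like this copies only nodes, so the DOCTYPE declaration from the source file is not carried into the output.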

It can be run like so:

$ xsltproc --stringparam entry 246790 dump_entry.xsl quizzes.xml | xmllint --format -

We pipe the output through xmllint --format, which re-indents it and drops the leftover whitespace so it looks nice.

Extracting the TOCIDs

The next step is to get a list of all the TOCIDs from the master document, so we can loop over them and extract the entries one at a time. We use XSLT for this, too. I saved this as get_tocids.xsl.
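
Again as a hedged sketch rather than the original file: a stylesheet along these lines would do the job, assuming every entry is a direct child of the root quiz element as in the sample above.

```xml
<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Plain-text output: no XML declaration, no markup. -->
  <xsl:output method="text" encoding="utf-8"/>

  <!-- Walk the entries in document order and print each tocid on its own line. -->
  <xsl:template match="/">
    <xsl:for-each select="quiz/entry">
      <xsl:value-of select="@tocid"/>
      <xsl:text>&#10;</xsl:text>
    </xsl:for-each>
  </xsl:template>

</xsl:stylesheet>
```

Matching on the document root and selecting quiz/entry explicitly keeps the built-in templates from leaking the question and option text into the output.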

This is run like so:

$ xsltproc get_tocids.xsl quizzes.xml

The result is a list of TOCIDs, one per line.

Tying It All Together

So now we can get a list of TOCIDs, and extract the entry for a specific TOCID. A dash of shell scripting is all we need now.

$ for tocid in $(xsltproc get_tocids.xsl quizzes.xml); do
>     xsltproc --stringparam entry "$tocid" dump_entry.xsl quizzes.xml \
>         | xmllint --format - > "quiz_$tocid.xml"
> done

And that’s it: one quiz_<tocid>.xml file per entry.

2006/12/20