Splitting up XML with XSLT
I recently had to process some XML from a vendor. They had mashed approximately 750 individual entries into a single XML file, and I needed to split it up to make processing easier.
The XML
The XML looked something like this:
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE article SYSTEM "quiz_syndication_xml.dtd">
<quiz artclid="403777" capturedate="2006-06-30" version="3.0" distribution="367">
  <entry tocid="246790">
    <category>Science</category>
    <question>What is the airspeed velocity of an unladen swallow?</question>
    <option>I don't know that!</option>
    <option correct="t">11 meters per second, or 24 miles per hour</option>
    <option>What do you mean, an African or European swallow?</option>
    <link>
      <xref refid="1648205" ty="1" artclid="67844" tocid="" reftitle="bird"/>
    </link>
  </entry>
  <entry tocid="123">
    …
Strategy
The entries are clearly delimited, so all we need to do is output a new document that keeps the quiz element but contains only one entry element. This isn’t hard: we’ll pass the tocid of the entry we want as a parameter to an XSLT stylesheet.
The Template
I saved the stylesheet as dump_entry.xsl. It accepts a parameter containing the tocid of the entry to dump, and outputs a copy of the document in which every entry that doesn’t have that tocid is suppressed.
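A minimal sketch of what such a stylesheet could look like is shown here. It is a reconstruction under assumptions, not necessarily the exact stylesheet used: it assumes XSLT 1.0, a parameter named entry (to match the --stringparam name used with xsltproc), and an identity transform that copies everything except entries whose tocid differs from the parameter.

```xml
<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- tocid of the entry to keep; supplied on the command line
       (parameter name "entry" is assumed to match the xsltproc call) -->
  <xsl:param name="entry"/>

  <xsl:output method="xml" encoding="utf-8"/>

  <!-- Identity transform: by default, copy every node and attribute unchanged -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- Copy an entry only when its tocid equals the parameter;
       every other entry produces no output at all -->
  <xsl:template match="entry">
    <xsl:if test="@tocid = $entry">
      <xsl:copy>
        <xsl:apply-templates select="@*|node()"/>
      </xsl:copy>
    </xsl:if>
  </xsl:template>

</xsl:stylesheet>
```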
It can be run like so:
$ xsltproc --stringparam entry 246790 dump_entry.xsl quizzes.xml | xmllint --format -
We pipe the output through xmllint to remove the extra whitespace left behind and re-indent the result so it looks nice.
Extracting the TOCIDs
The next step is to get a list of all the TOCIDs from the master document, so we can loop over them and extract the entries one at a time. We use XSLT for this, too. I saved this stylesheet as get_tocids.xsl.
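As a rough sketch, a stylesheet along these lines would do the job. It is an assumption about the approach rather than the exact file: it uses XSLT 1.0 with text output, and assumes the entry elements sit directly under the quiz root as in the sample above.

```xml
<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Plain text output: no XML declaration and no markup -->
  <xsl:output method="text" encoding="utf-8"/>

  <!-- Walk every entry under the quiz root and print its tocid,
       followed by a newline (&#10;) so each ID lands on its own line -->
  <xsl:template match="/">
    <xsl:for-each select="quiz/entry">
      <xsl:value-of select="@tocid"/>
      <xsl:text>&#10;</xsl:text>
    </xsl:for-each>
  </xsl:template>

</xsl:stylesheet>
```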
This is run like so:
$ xsltproc get_tocids.xsl quizzes.xml
The result is a list of TOCIDs, one per line.
Tying it all Together
So now we can get a list of TOCIDs, and extract a specific TOCID. A dash of shell scripting is all we need now.
$ for tocid in `xsltproc get_tocids.xsl quizzes.xml`; do \
> xsltproc --stringparam entry $tocid dump_entry.xsl quizzes.xml \
> | xmllint --format - > quiz_$tocid.xml;
> done
And that’s it.
