Friday, December 29, 2006

XML Schema Parsing

You might have heard that XML Schema is dead. I read that, and the article it links to as proof, a couple of weeks ago, and was not overly impressed. Not that I don't agree that RELAX is generally easier, but the proof that "everybody with a choice picks RELAX" just does not cut it as a logical argument. It is a classic appeal to popularity, a close cousin of the appeal-to-authority fallacy, where the authority here is the collective unconscious of the technology world.

Recently I had a surprisingly difficult problem to solve. I had an input document that was very simple. It looked like this:

<doc>
<param1>val1</param1>
<param2>val2</param2>
...
</doc>

Then I had metadata that mapped param1 to an XPath for a target document. For example, param1 -> /foo/schmoo/do/wop/dee , etc. I needed to generate an XQuery that would take my input document and produce the target document with the correct values (val1, val2, etc.) inserted into it.

Seems pretty easy. It gets a little harder because this is also possible for the input document:

<doc>
<param1>val1</param1>
<param1>val1a</param1>
<param1>val1b</param1>
...
</doc>

I still have the same XPath for param1, i.e. something like /foo/schmoo/do/wop/dee. Obviously there are some repeating elements in my target document, but where does the repeating occur? From the above information, you could have a target doc like:

<foo>
  <schmoo>
    <do>
      <wop>
        <dee>val1</dee>
        <dee>val1a</dee>
        <dee>val1b</dee>
        ...

Or:

<foo>
  <schmoo>
    <do>
      <wop>
        <dee>val1</dee>
      </wop>
      <wop>
        <dee>val1a</dee>
      </wop>
      <wop>
        <dee>val1b</dee>
      </wop>
      ...

Obviously there are even more possibilities than this, but you get the point. There's no way to be sure. But I was armed with two more bits of information.

First, and this is big, I knew that there could only be one repeating element in a given XPath. Of course this is not true in the abstract, but for my particular problem it was true. So either wop repeated or dee repeated, but not both. Next, I knew the XML Schema of the target document. So I could figure out if it was wop or dee that repeated. That is enough information to create the XQuery.
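To make the "enough information" claim concrete, here is a minimal sketch of the generation step once the repeating element is known. The class and method names (XQuerySketch, buildQuery) are hypothetical, and the sketch assumes the input root is /doc, that exactly one step repeats, and that the repeating step is never the root. It is an illustration of the idea, not the code I actually used.

```java
import java.util.List;

class XQuerySketch {
    /**
     * Builds an XQuery expression for one parameter (hypothetical helper).
     * Steps above the repeating element are emitted once; the repeating
     * element and everything below it are emitted once per input value.
     */
    static String buildQuery(String param, List<String> steps, String repeating) {
        int split = steps.indexOf(repeating);
        if (split <= 0) {
            // Assumption: the repeating step exists and is not the root.
            throw new IllegalArgumentException(repeating + " is not a non-root step in " + steps);
        }
        StringBuilder q = new StringBuilder();
        // Fixed ancestors: opened once, outside the loop.
        for (String s : steps.subList(0, split)) {
            q.append('<').append(s).append('>');
        }
        q.append("{ for $v in /doc/").append(param).append(" return ");
        // Repeating element and its descendants: opened once per value.
        for (String s : steps.subList(split, steps.size())) {
            q.append('<').append(s).append('>');
        }
        q.append("{ data($v) }");
        for (int i = steps.size() - 1; i >= split; i--) {
            q.append("</").append(steps.get(i)).append('>');
        }
        q.append(" }");
        // Close the fixed ancestors in reverse order.
        for (int i = split - 1; i >= 0; i--) {
            q.append("</").append(steps.get(i)).append('>');
        }
        return q.toString();
    }
}
```

Calling buildQuery("param1", steps, "wop") with the five steps of /foo/schmoo/do/wop/dee puts the loop inside do, so each value gets its own wop; passing "dee" instead moves the loop inside wop.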

Now all I had to do was parse the XML Schema for the target document. No problem, right? There are a lot of Java XML Schema parsers out there. I was already using Sun's JAXB Reference Implementation, and it included Sun's XSOM XML Schema Parser. So I decided to give that a try.

XSOM is a pretty pure implementation of the XML Schema specification. Its objects correspond directly to the ones you see in the spec, as do the relationships between these objects. And there's the problem. The spec is very complex. It has the smell of a classic "design by committee" spec.

In a design by committee, everybody has their own ideas on how to do things. So the design winds up being "all-inclusive" so that everyone's ideas are present, and nobody gets their feelings hurt. Such all-inclusive designs are complete Frankensteins. XML Schema is definitely a Frankenstein.

My task was really simple. I knew the name of an element from the XPath. I just needed to get the cardinality of this element. First things first though, I had to parse the schema. This wound up being trickier than I expected. The schema I needed to parse referenced two other schemas. Here are its references:

<xs:include schemaLocation="includedSchema.xsd"/>
<xs:import namespace="urn:my:namespace" schemaLocation="importedSchema.xsd"/>

The schemas were sitting in the same package as the schema I was parsing. I knew that this was going to be an issue, so I wrote a simple EntityResolver that would use a ClassLoader to load these schemas. Only that did not work. My EntityResolver would not get invoked at all. Instead, XSOM would throw an exception complaining about relative URLs being used with no base URL being set.

This was aggravating. If it would only give the EntityResolver chain a chance, then all would be resolved. But no. So I had to change the include and import statements:

<xs:include schemaLocation="http://complete.garbage/includedSchema.xsd"/>
<xs:import namespace="urn:my:namespace" schemaLocation="http://complete.garbage/importedSchema.xsd"/>

Amazingly this worked. Now it invoked the EntityResolver chain, and my custom resolver loaded the referenced schemas. Here's my parsing code:

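import org.xml.sax.EntityResolver;
import com.sun.xml.xsom.XSSchema;
import com.sun.xml.xsom.XSSchemaSet;
import com.sun.xml.xsom.parser.XSOMParser;

EntityResolver resolver = new MyCustomResolver();
XSOMParser parser = new XSOMParser();
parser.setEntityResolver(resolver);
// load the initial schema into an InputStream
parser.parse(myInputStream);
XSSchemaSet schemaSet = parser.getResult();
// look the schema up by its target namespace on the result set, not on the parser
XSSchema mySchema = schemaSet.getSchema("myNamespace");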
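MyCustomResolver itself is not shown above. A minimal sketch of a classpath-based resolver of that kind might look like the following. The class name ClasspathSchemaResolver, the fileName helper, and the packagePath argument are all assumptions for illustration; only the standard SAX EntityResolver and InputSource types are used.

```java
import java.io.InputStream;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;

class ClasspathSchemaResolver implements EntityResolver {
    private final ClassLoader loader;
    private final String packagePath; // e.g. "com/example/schemas/" (hypothetical)

    ClasspathSchemaResolver(ClassLoader loader, String packagePath) {
        this.loader = loader;
        this.packagePath = packagePath;
    }

    /** Strips everything up to the last slash, leaving just the file name. */
    static String fileName(String systemId) {
        return systemId.substring(systemId.lastIndexOf('/') + 1);
    }

    @Override
    public InputSource resolveEntity(String publicId, String systemId) {
        if (systemId == null) {
            return null; // nothing to resolve; let the parser use its default
        }
        // Ignore the (bogus) host and look the file up next to the main schema.
        InputStream in = loader.getResourceAsStream(packagePath + fileName(systemId));
        if (in == null) {
            return null; // not on the classpath; fall back to default resolution
        }
        InputSource source = new InputSource(in);
        source.setSystemId(systemId); // keep the ID so nested references have a base
        return source;
    }
}
```

With a resolver like this, the host name in the rewritten schemaLocation really is garbage: only the trailing file name is ever used.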

Now back to finding the cardinality of the element. Finding the root element turned out to be very easy:

XSElementDecl rootElement = mySchema.getElementDecl("root_from_xpath");

That's easy. No need to check the cardinality of the root element, as it cannot be greater than one. It turns out that even if I wanted to check it, I could not. Why? Because in XSD, Elements do not have cardinality. Only Particles have cardinality. A Particle can be an Element, but the relationship is not symmetric, i.e. an Element cannot be a Particle.

Unless you are already familiar with XSD, you're probably thinking what I thought: WTF!?!? What is the point of this asymmetry? Every Element except the root one is in fact a Particle. So by disallowing a symmetric relationship, you've only made sure that you can't get a Particle for the root Element. That's all you've bought. And what have you suffered? Well, instead of finding an Element and getting its Particle to get its cardinality, you must find a Particle that happens to be the Element that you want, and check that Particle's cardinality. It's like you descend to the lowest level of a tree to check some metadata, then go back up a level to get the other metadata that you need. I did find a nice diagram that shows this relationship, though it doesn't make it look as diabolical as it really is:



So back to the problem. As I descend an XPath, I need to find the Particle in the tree corresponding to the Element whose name matches the name in the XPath. So I need the ComplexType of the current parent Element. Obviously some recursion is needed, so I defined an XSComplexType variable called current to use for the recursion:

XSComplexType current = rootElement.getType().asComplexType(); // initial value

Then at each step in the XPath, I had a string localName that was the element name:

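XSContentType content = current.getContentType();
XSParticle particle = this.findParticleForElement(content.asParticle(), localName);
if (particle == null) {
    // findParticleForElement returns null when no matching element is found
    throw new IllegalArgumentException("No element named " + localName);
}
XSElementDecl element = particle.getTerm().asElementDecl();
if (particle.isRepeated()) {
    return element.getName();
} else {
    current = element.getType().asComplexType();
}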

The key is the findParticleForElement method:

private XSParticle findParticleForElement(XSParticle particle, String elementName) {
    XSTerm term = particle.getTerm();
    if (term.isElementDecl() &&
            term.asElementDecl().getName().equals(elementName)) {
        return particle;
    } else if (term.isModelGroup()) {
        for (XSParticle p : term.asModelGroup().getChildren()) {
            XSParticle temp = findParticleForElement(p, elementName);
            if (temp != null) {
                return temp;
            }
        }
    } else if (term.isModelGroupDecl()) {
        for (XSParticle p : term.asModelGroupDecl().getModelGroup().getChildren()) {
            XSParticle temp = findParticleForElement(p, elementName);
            if (temp != null) {
                return temp;
            }
        }
    }
    return null;
}

You might wonder why the extra recursion in the find method. Well, XSD content models nest, and this is what causes the extra recursion. An element's type can have a ModelGroup as its content, and the children of that ModelGroup can be ModelGroups themselves (nested sequences and choices, named groups, and the extra layer that type inheritance adds). So you might have to descend several ModelGroups before finding the target element, even though it is a direct child element in the document.
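Since the snippets above need XSOM and a real schema to run, here is a self-contained toy version of the same walk. Particle, Term, and RepeatFinder are stripped-down stand-ins invented for this sketch, not XSOM's classes, and the isRepeated rule (maxOccurs is neither 0 nor 1, with -1 meaning unbounded) is an assumption modeled on how I understand XSOM to behave. Each localName is simply one step of the XPath below the root.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

class RepeatFinder {
    static final int UNBOUNDED = -1; // toy constant for maxOccurs="unbounded"

    /** Either an element (elementName set) or a model group (elementName null). */
    static class Term {
        final String elementName;
        final Particle content;        // an element's content model; null if simple
        final List<Particle> children; // a model group's members; empty for elements

        Term(String elementName, Particle content, List<Particle> children) {
            this.elementName = elementName;
            this.content = content;
            this.children = children;
        }
    }

    /** Cardinality lives here, not on the element. */
    static class Particle {
        final Term term;
        final int maxOccurs;

        Particle(Term term, int maxOccurs) {
            this.term = term;
            this.maxOccurs = maxOccurs;
        }

        boolean isRepeated() {
            return maxOccurs != 0 && maxOccurs != 1;
        }
    }

    static Particle element(String name, int maxOccurs, Particle content) {
        return new Particle(new Term(name, content, Collections.<Particle>emptyList()), maxOccurs);
    }

    static Particle group(Particle... children) {
        return new Particle(new Term(null, null, Arrays.asList(children)), 1);
    }

    /** Descends through nested model groups until an element with the name is found. */
    static Particle find(Particle p, String name) {
        if (p == null) {
            return null;
        }
        if (p.term.elementName != null) {
            return p.term.elementName.equals(name) ? p : null;
        }
        for (Particle child : p.term.children) {
            Particle hit = find(child, name);
            if (hit != null) {
                return hit;
            }
        }
        return null;
    }

    /** Returns the first repeated step below the root, or null if none repeats. */
    static String repeatingStep(Particle rootContent, List<String> stepsBelowRoot) {
        Particle content = rootContent;
        for (String localName : stepsBelowRoot) {
            Particle p = find(content, localName);
            if (p == null) {
                throw new IllegalArgumentException("No element named " + localName);
            }
            if (p.isRepeated()) {
                return localName;
            }
            content = p.term.content; // move down one level
        }
        return null;
    }
}
```

For /foo/schmoo/do/wop/dee, the steps below the root are schmoo, do, wop, and dee. If the toy schema declares wop with UNBOUNDED, repeatingStep stops at wop without ever looking at dee.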

3 comments:

Unknown said...

What should be the value of localName here?

Anonymous said...

I need to get minOccurs & maxOccurs for a simple type. I am under the assumption that I need to access the XSParticle, but I'm not sure how to get the XSParticle from an XSSimpleType. Any thoughts?

Varadharajan said...

Is it possible for you to post the complete code?