Semi-Structured Data and the Web – Day 5

It’s been an enlightening week on the homework front.

Having had some experience with XML before, I know how easy it is to mess up writing XML, particularly if you’re doing it by hand. Nesting wrong here, a tag misspelled there… although XML is, technically speaking, ‘human readable’, it’s not exactly human-friendly. It’s extremely precise, tends to be very verbose, and has newlines, tabs and other whitespace mixed in which tickles the ol’ natural human intuition about structure but is structurally meaningless to the machine. (Google ‘xml human readable’ for loads of articles on the subject.)

That’s why there’s a load of tooling that helps to format these streams of structured data into something that aligns the machine and human understanding of XML documents. oXygen, XMLSpy, plugins by the bucketload for any editor you can think of. We’ve been using oXygen and it does the job. I haven’t really used anything else though, so if you’re looking for a recommendation, you may want to look elsewhere. Beware though, the comparisons I’ve found date back to 2004 and I can say they’re just plain wrong for oXygen these days. Maybe the best plan is to try the free trials…

So anyway, the homework link here is that we started using Schematron, which can test aspects of the content of documents. DTD, XML Schema and RELAX NG are all different approaches to validating the content model of a document: that elements contain the correct children and have the correct attributes, with different degrees of expressiveness in each. Schematron doesn’t do that; it allows you to specify arbitrary rules about the content itself. You want to make sure that element X is present if element Y is present? Or that a date is in the past? Or that a number of elements declaring percentage values add up to 100%? If you don’t want to write code to do that yourself, you want to get to know Schematron.
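To give a feel for what ‘writing code to do that yourself’ looks like, here’s a rough Python sketch of those three kinds of check done by hand. This is not Schematron, just the sort of thing it saves you from writing. The element names (invoice, issued, discount, reason, share) and the check_rules function are made up for illustration.

```python
import xml.etree.ElementTree as ET
from datetime import date


def check_rules(xml_text, today=None):
    """Return a list of rule violations (empty if everything passes).

    All element and attribute names below are hypothetical examples.
    """
    today = today or date.today()
    root = ET.fromstring(xml_text)
    problems = []

    # Rule 1: if <discount> is present, <reason> must be present too.
    if root.find("discount") is not None and root.find("reason") is None:
        problems.append("discount given without a reason")

    # Rule 2: the <issued> date (ISO format, YYYY-MM-DD) must be in the past.
    issued = root.find("issued")
    if issued is not None and date.fromisoformat(issued.text.strip()) >= today:
        problems.append("issued date is not in the past")

    # Rule 3: the percent attributes of all <share> elements must add up
    # to 100 (a document with no shares at all is also flagged).
    total = sum(float(s.get("percent")) for s in root.iter("share"))
    if abs(total - 100) > 1e-9:
        problems.append(f"shares add up to {total:g}, not 100")

    return problems


SAMPLE = """
<invoice>
  <issued>2001-02-03</issued>
  <discount>10</discount>
  <share percent="60"/>
  <share percent="30"/>
</invoice>
"""

if __name__ == "__main__":
    for problem in check_rules(SAMPLE, today=date(2024, 1, 1)):
        print(problem)
```

The sample document breaks the first and third rules, so those are the two that get reported. Every new rule means more code like this, which is exactly the part a declarative rule language takes off your hands.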

The other thing we were looking at is something called XSugar. This thing blew me away. You can declaratively convert from pretty much anything to XML and back again, without writing code. In doing so, the tool can analyze the ambiguity and reversibility of the transformation. It’s a lot more interesting than it sounds. Here’s an example from the project’s examples page, showing a conversion between an XML format and a more human-friendly syntax:

Command line, converting a human-readable format into XML:

java -jar xsugar-all.jar students.xsg students.txt


  
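<students>
  <student sid="19701234">
    <name>John Doe</name>
    <email>john_doe@notmail.org</email>
  </student>
  <student sid="19785678">
    <name>Jane Dow</name>
    <email>dow@bmail.org</email>
  </student>
</students>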

Command line, converting XML above into human-readable format:

java -jar xsugar-all.jar -r students.xsg students.xml

John Doe (john_doe@notmail.org) 19701234
Jane Dow (dow@bmail.org) 19785678

Which version would you rather read and write? (If you’d rather write the XML, I think it’s fair to say that your love of typing is quite unusual.) Using a solution like this, it’s pretty easy to convert from a nice human-readable format to an XML format, with confidence that the data payload survives the transform intact. Remember too that you could effectively bring well-formedness and schema languages to bear on your non-XML data through this tool. You just create the .xsg file and off you go. It’s kind of like an EBNF grammar, except that each production has dual rules, one for to-XML and the other for from-XML, and it’s easy enough that even I could do it…

It’d be perfectly possible to do this in code, but using XSugar, you just need to create a document that defines the transformation between the two formats. It’s always good if someone else writes (tests, documents, evangelises) the code!
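For comparison, here’s roughly what the do-it-in-code version might look like in Python for the students example. It’s a sketch, not anything XSugar produces: the element and attribute names (students, student, name, email, sid) are my assumptions about the XML side, and text_to_xml and xml_to_text are just names I picked.

```python
import re
import xml.etree.ElementTree as ET

# One student per line: "Name (email) id". The pattern is an assumption
# based on the two sample lines, not a full grammar.
LINE = re.compile(r"^(?P<name>.+?) \((?P<email>[^()]+)\) (?P<sid>\d+)$")


def text_to_xml(text):
    """Convert the human-readable student list to an XML string."""
    root = ET.Element("students")  # hypothetical element names throughout
    for line in text.strip().splitlines():
        match = LINE.match(line.strip())
        if match is None:
            raise ValueError(f"cannot parse line: {line!r}")
        student = ET.SubElement(root, "student", sid=match["sid"])
        ET.SubElement(student, "name").text = match["name"]
        ET.SubElement(student, "email").text = match["email"]
    return ET.tostring(root, encoding="unicode")


def xml_to_text(xml_text):
    """Convert the XML string back to the human-readable student list."""
    root = ET.fromstring(xml_text)
    return "\n".join(
        f"{s.findtext('name')} ({s.findtext('email')}) {s.get('sid')}"
        for s in root.iter("student")
    )


STUDENTS = (
    "John Doe (john_doe@notmail.org) 19701234\n"
    "Jane Dow (dow@bmail.org) 19785678"
)

if __name__ == "__main__":
    xml_text = text_to_xml(STUDENTS)
    print(xml_text)
    print(xml_to_text(xml_text))
```

Notice that the two directions are written separately and nothing checks that they really are inverses of each other. You’d have to test that yourself, whereas with XSugar both directions come from the one description and the tool does the analysis.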

So the homework was pretty neat. Today’s lectures were wrapping up the material on validation algorithms, and looking at some applications that make use of XML, like OWL/XML.

We also have one last set of homework assignments, including a free assignment to re-submit any previous assignment. That’s a little unusual, and has been done because we’ve been using an online learning solution called Blackboard. When I say ‘using’, I mean ‘being hindered by’. Apparently, Blackboard and its administrative issues are largely the reason that we’re on week 5 and have no marks back for any of the dozen or so homework assignments and assessments. It’s not that great to use from the student’s point of view either. No more on that today; I might post more about what it is and what it got right and wrong from my studenty point of view some other time.

Other homework assignments: Perform a transform using XSLT, and manually execute a run of a tree against a tree grammar.

I’ll be using my resubmit to get even with the XQuery assignment that defeated me in week 3.

This time it’s personal.
