XSL to get text from Apple Pages documents

Pages is the name of Apple’s basic word processor program that comes with their iWork suite of applications. It’s not a bad program, but a number of months ago I needed to switch up to MS Word for the Mac.

Well, this morning I was looking through some old files and found a text document I wanted to print that I had done using Pages. Unfortunately, I had removed iWork from my Mac, so I no longer had the software to open the Pages document.

After a cursory search on the Internet for a program that would let me open Pages docs without having the program itself, I came up empty-handed.

So, I inspected the Pages document and realized it was a package. (Right-click the document icon and choose Show Package Contents.) The package contained an index.xml.gz file, which I unzipped. Inside it I found the body of my document amidst a whole bunch of XML code.
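If you would rather script the unzip step than do it by hand, here is a minimal Python sketch. It assumes the package is an ordinary folder with index.xml.gz at its top level, as mine was. The function name extract_index_xml and the example path are my own inventions, not anything Pages provides.

```python
import gzip
import shutil
from pathlib import Path


def extract_index_xml(package_dir, out_path=None):
    """Decompress index.xml.gz from a Pages package into a plain index.xml.

    Assumes the package is a directory with index.xml.gz at its top level.
    """
    package = Path(package_dir)
    source = package / "index.xml.gz"
    target = Path(out_path) if out_path else package / "index.xml"
    # Stream the gzip data to disk so a large document is not held in memory.
    with gzip.open(source, "rb") as compressed, open(target, "wb") as plain:
        shutil.copyfileobj(compressed, plain)
    return target
```

Calling extract_index_xml("MyDocument.pages") (a hypothetical file name) should leave an index.xml next to the compressed file, ready for the transformation.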

I momentarily considered reconstructing the text in TextWrangler, but thought it might be fun to write an XSLT file to do the work.

Please note that this is a 1st draft meant to retrieve the text from my document. It will not handle anything fancy, just text. Plus, it will only try to make each chunk of text into a plain-text paragraph in HTML, suitable for copying and pasting out of a browser window. Use at your own risk. :-)

Ok, here’s the textFromPages.xsl file.

Others may take this initial XSL file and do what they will with it. I hope that if you take this and make it better, you’ll comment on this post to let me (and others) know.

To have it be useful to you, you’ll need to know how to apply an XSL transformation to a source XML file (specifically the index.xml from Pages).

Hint: Firefox will do the transformation for you if you include the proper xml-stylesheet directive right after the XML prologue in the source XML file. It looks like this: <?xml-stylesheet href="textFromPages.xsl" type="text/xsl" ?>
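Adding that directive by hand is easy enough, but for the sake of illustration here is a rough Python sketch that does it. It works on the XML as a plain string and assumes the prologue, if present, is the very first thing in the file. The function name add_stylesheet_directive is hypothetical.

```python
import re

STYLESHEET_PI = '<?xml-stylesheet href="textFromPages.xsl" type="text/xsl" ?>'


def add_stylesheet_directive(xml_text, directive=STYLESHEET_PI):
    """Return xml_text with the stylesheet directive placed after the XML prologue."""
    # Match a prologue such as <?xml version="1.0" encoding="UTF-8"?> at the start.
    # The required whitespace after "xml" keeps this from matching <?xml-stylesheet.
    prologue = re.match(r"<\?xml\s[^>]*\?>", xml_text)
    if prologue is None:
        # No prologue found: the directive simply goes first.
        return directive + "\n" + xml_text
    end = prologue.end()
    return xml_text[:end] + "\n" + directive + xml_text[end:]
```

Read index.xml into a string, pass it through this function, write the result back out, and then open the file in Firefox with textFromPages.xsl sitting in the same folder.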

HTML form fields that, when not selected, do not even send a field name upon submit

Checkboxes and radio buttons that have not been checked, and multiple select lists that have no selection, send nothing when the form is submitted. It’s as though they aren’t even there.

At first, this may seem obvious (Well, yeah, you didn’t select them, dummy!), except that it runs counter to every other form field.

If you have a text field named “surname” and you submit the form with no value in “surname”, the submission still includes the variable name “surname” but it has no corresponding value. You have the key with an empty value.

It’s the same with textareas, any other type of input element, and select lists (where you are limited to a single selection). Even a named button submits its value when it is the one used to submit the form.

So, the stealthy culprits:

  • input type="radio"
  • input type="checkbox"
  • select multiple="multiple"

Adam and I learned this in the midst of discussing and testing code this evening.
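To make the behavior concrete, here is a small Python sketch that imitates how a browser decides which fields go into a submission. It is a simplified model, not a real browser or any library's API: the field dictionaries, the "select-multiple" type label, and the function name build_submission are all made up for this example.

```python
from urllib.parse import urlencode


def build_submission(fields):
    """Build a query string from a simplified description of form controls.

    Each field is a dict with "name" and "type", plus "value"/"checked" for
    inputs or "selected" (a list of chosen option values) for multiple selects.
    """
    pairs = []
    for field in fields:
        kind = field["type"]
        if kind in ("checkbox", "radio"):
            # Unchecked boxes and radios contribute nothing, not even their name.
            if field.get("checked"):
                pairs.append((field["name"], field.get("value", "on")))
        elif kind == "select-multiple":
            # One pair per selected option; no selection means no pairs at all.
            for option in field.get("selected", []):
                pairs.append((field["name"], option))
        else:
            # Text-like controls always send their name, even with an empty value.
            pairs.append((field["name"], field.get("value", "")))
    return urlencode(pairs)


form = [
    {"name": "surname", "type": "text", "value": ""},
    {"name": "newsletter", "type": "checkbox", "checked": False},
    {"name": "size", "type": "radio", "value": "large", "checked": False},
    {"name": "toppings", "type": "select-multiple", "selected": []},
]
print(build_submission(form))  # prints: surname=
```

Only "surname" shows up in the result. On the receiving end, that means code should check whether a checkbox, radio, or multiple select key exists at all before trying to read its value.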