XSL to get text from Apple Pages documents


Pages is the name of Apple’s basic word processor program that comes with their iWork suite of applications. It’s not a bad program, but a number of months ago I needed to switch up to MS Word for the Mac.

Well, this morning I was looking through some old files and found a text document I wanted to print that I had done using Pages. Unfortunately, I had removed iWork from my Mac, so I no longer had the software to open the Pages document.

After a cursory search on the Internet for a program that would let me open Pages docs without having the program itself, I came up empty-handed.

So, I inspected the Pages document and realized it was a package. (Right click on the document icon and Show Package Contents.) The package contained an index.xml.gz file, which I unzipped and found within the body of my document amidst a whole bunch of XML code.

I momentarily considered reconstructing the text in TextWrangler, but thought it might be fun to write an XSLT file to do the work.

Please note that this is a 1st draft meant to retrieve the text from my document. It will not handle anything fancy, just text. Plus, it will only try to make each chunk of text into a plain-text paragraph in HTML, suitable for copying and pasting out of a browser window. Use at your own risk. 🙂

Ok, here’s the textFromPages.xsl file.

Others may take this initial XSL file and do what they will with it. I hope that if you take this and make it better, you’ll comment on this post to let me (and others) know.

To have it be useful to you, you’ll need to know how to apply an XSL transformation to a source XML file (specifically the index.xml from Pages).

Hint: Firefox will do the transformation for you if you include the proper xml-stylesheet directive right after the XML prologue in the source XML file. It looks like this: <?xml-stylesheet href="textFromPages.xsl" type="text/xsl" ?>


One response to “XSL to get text from Apple Pages documents”

Leave a Reply

Your email address will not be published. Required fields are marked *