XSL to get text from Apple Pages documents

Pages is Apple’s basic word processor, part of their iWork suite of applications. It’s not a bad program, but a number of months ago I needed to switch over to MS Word for the Mac.

Well, this morning I was looking through some old files and found a text document, written in Pages, that I wanted to print. Unfortunately, I had removed iWork from my Mac, so I no longer had the software to open the Pages document.

After a cursory search on the Internet for a program that would let me open Pages docs without having the program itself, I came up empty-handed.

So, I inspected the Pages document and realized it was a package. (Right click on the document icon and choose Show Package Contents.) The package contained an index.xml.gz file, which I unzipped. Inside, I found the body of my document amidst a whole bunch of XML code.
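For anyone who would rather script the unzip step than do it by hand, here is a minimal Python sketch. The only detail taken from above is the index.xml.gz file name; the function name and the paths are made up for illustration.

```python
import gzip
import shutil
from pathlib import Path


def unpack_index(package_dir, out_path):
    """Decompress index.xml.gz from a Pages package into a plain XML file."""
    source = Path(package_dir) / "index.xml.gz"
    # Stream the bytes across so large documents are not held in memory.
    with gzip.open(source, "rb") as compressed, open(out_path, "wb") as plain:
        shutil.copyfileobj(compressed, plain)
    return Path(out_path)
```

Calling unpack_index("MyDocument.pages", "index.xml") (a hypothetical document name) would leave a readable index.xml next to your script.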

I momentarily considered reconstructing the text in TextWrangler, but thought it might be fun to write an XSLT file to do the work.

Please note that this is a first draft meant to retrieve the text from my document. It will not handle anything fancy, just text. Plus, it will only try to make each chunk of text into a plain-text paragraph in HTML, suitable for copying and pasting out of a browser window. Use at your own risk. :-)
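To give a feel for the idea (this is not the XSL file itself, just the same approach sketched in Python), the snippet below turns every paragraph-like element into an HTML paragraph. The element name "p" is an assumption about the Pages markup, so it is left as a parameter; check your own index.xml before relying on it.

```python
import html
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip a '{namespace}' prefix from an ElementTree tag name."""
    return tag.rsplit("}", 1)[-1]


def paragraphs_to_html(xml_text, paragraph_tag="p"):
    """Return an HTML page with one <p> per paragraph-like element.

    paragraph_tag is an assumption about the source markup, not a
    documented part of the Pages format.
    """
    root = ET.fromstring(xml_text)
    chunks = []
    for element in root.iter():
        if not isinstance(element.tag, str):
            continue  # skip anything that is not a regular element
        if local_name(element.tag) != paragraph_tag:
            continue
        # Gather all nested text and collapse runs of whitespace.
        text = " ".join("".join(element.itertext()).split())
        if text:
            chunks.append("<p>" + html.escape(text) + "</p>")
    return "<html><body>\n" + "\n".join(chunks) + "\n</body></html>"
```

Like the draft described above, it ignores styling, images, and tables, and it would repeat text if paragraph elements were ever nested inside one another.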

Ok, here’s the textFromPages.xsl file.

Others may take this initial XSL file and do what they will with it. I hope that if you take this and make it better, you’ll comment on this post to let me (and others) know.

To make use of it, you’ll need to know how to apply an XSL transformation to a source XML file (specifically the index.xml from Pages).

Hint: Firefox will do the transformation for you if you include the proper xml-stylesheet processing instruction right after the XML declaration at the top of the source XML file. It looks like this: <?xml-stylesheet href="textFromPages.xsl" type="text/xsl" ?>
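If you would rather not edit index.xml by hand, a small Python sketch like the following could add that line for you. The function name is hypothetical, and it assumes the file starts with an ordinary <?xml ... ?> declaration (if it does not, the line simply goes first).

```python
import re

# The same line quoted in the hint above.
STYLESHEET_PI = '<?xml-stylesheet href="textFromPages.xsl" type="text/xsl" ?>'


def add_stylesheet(xml_text, directive=STYLESHEET_PI):
    """Insert the stylesheet line right after the XML declaration, if any."""
    # "\s" after "xml" keeps this from matching "<?xml-stylesheet".
    match = re.match(r"<\?xml\s[^?]*\?>", xml_text)
    if match:
        end = match.end()
        return xml_text[:end] + "\n" + directive + xml_text[end:]
    # No declaration found: put the line at the very top.
    return directive + "\n" + xml_text
```

Save the result, keep textFromPages.xsl in the same folder, and open the XML file in the browser.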

XML file of shooting ranges in Michigan

As another small step in this process of manipulating a data set to upload to Google Maps, I took the cleaned XHTML I had from a few days ago and used TextWrangler to run some quick search-and-replace passes on the source code in order to produce this XML file.
ranges-data.xml

Next, I think, I’ll load this XML file into PHP using the SimpleXML extension, which will make it easy to run the data through a PHP-based geocoding processor that I’m sure I can dig up. The goal is to convert the addresses of the ranges into latitude/longitude points, which seem to be required for the KML file I’m trying to piece together.

I may at the same time output the whole thing into KML format, since I’ll be in there with the data nodes anyway.
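The plan above is PHP with SimpleXML; purely as an illustration of the same loop, here is a Python sketch. The <range>, <name>, and <address> element names are guesses at how ranges-data.xml might be laid out, and the geocoder is whatever function you pass in, so no particular geocoding service is implied.

```python
import xml.etree.ElementTree as ET


def geocode_ranges(xml_text, geocode):
    """Pair each range with coordinates from the supplied geocode callable.

    geocode(address) should return (longitude, latitude), or None when
    the address cannot be resolved. Element names here are hypothetical.
    """
    located, failed = [], []
    for node in ET.fromstring(xml_text).iter("range"):
        name = (node.findtext("name") or "").strip()
        address = (node.findtext("address") or "").strip()
        point = geocode(address) if address else None
        if point is None:
            failed.append(name)  # keep a list to fix up by hand later
        else:
            located.append(
                {"name": name, "address": address, "lon": point[0], "lat": point[1]}
            )
    return located, failed
```

Keeping the failures in their own list makes it easy to see which addresses need manual attention before anything is written out.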

Sample KML structure for the shooting ranges data

And here is a sample of what the intended shooting ranges KML feed will look like.

A couple of notes:

  • the Placemark node will repeat for every shooting range
  • I’ll have to find a way to process the address information and generate latitude/longitude points. The geocoder is bound to have trouble parsing some addresses, though I’ve dealt with this before on a prior Web development project
<?xml version="1.0" encoding="UTF-8"?>
<kml xmlns="http://earth.google.com/kml/2.1">
  <Document>
    <name>Shooting ranges in Michigan</name>
    <description><![CDATA[Places to shoot in Michigan: Public/DNR ranges, shooting clubs, and businesses with firing ranges available.]]></description>

    <Placemark>
      <name>Flushing Rifle &amp; Pistol Club</name>
      <description><![CDATA[165 Industrial Dr., Flushing, MI 48433<br>http://www.flushingrifleandpistol.com/<br>]]></description>
      <Point>
        <coordinates>-83.866898,43.068909,0.000000</coordinates>
      </Point>
    </Placemark>

    <!-- Repeat Placemark for each range -->

  </Document>
</kml>
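
As a rough sketch of how the repeating Placemark nodes could be generated (in Python here, not the eventual script), the function below builds the same structure as the sample. The dictionary keys are made up for illustration, and ElementTree escapes the description text instead of wrapping it in CDATA, which an XML parser reads the same way.

```python
import xml.etree.ElementTree as ET

KML_NS = "http://earth.google.com/kml/2.1"


def build_kml(title, ranges):
    """Build a KML string with one Placemark per entry in ranges.

    Each entry is a dict with hypothetical keys: name, description,
    lon, and lat.
    """
    ET.register_namespace("", KML_NS)  # serialize with a default xmlns

    def tag(name):
        return "{%s}%s" % (KML_NS, name)

    kml = ET.Element(tag("kml"))
    document = ET.SubElement(kml, tag("Document"))
    ET.SubElement(document, tag("name")).text = title
    for item in ranges:
        placemark = ET.SubElement(document, tag("Placemark"))
        ET.SubElement(placemark, tag("name")).text = item["name"]
        ET.SubElement(placemark, tag("description")).text = item["description"]
        point = ET.SubElement(placemark, tag("Point"))
        # KML order is longitude, latitude, altitude.
        ET.SubElement(point, tag("coordinates")).text = "%f,%f,%f" % (
            item["lon"],
            item["lat"],
            0.0,
        )
    return ET.tostring(kml, encoding="unicode")
```

The returned string has no XML declaration, so add the <?xml version="1.0" encoding="UTF-8"?> line yourself when writing the file.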