XSL to get text from Apple Pages documents

Pages is Apple’s basic word processor, part of their iWork suite of applications. It’s not a bad program, but a number of months ago I needed to switch over to MS Word for the Mac.

Well, this morning I was looking through some old files and found a text document I wanted to print that I had done using Pages. Unfortunately, I had removed iWork from my Mac, so I no longer had the software to open the Pages document.

After a cursory search on the Internet for a program that would let me open Pages docs without having the program itself, I came up empty-handed.

So, I inspected the Pages document and realized it was a package. (Right-click on the document icon and choose Show Package Contents.) The package contained an index.xml.gz file. I unzipped it and found the body of my document inside, amidst a whole bunch of XML code.

I momentarily considered reconstructing the text in TextWrangler, but thought it might be fun to write an XSLT file to do the work.

Please note that this is a first draft meant to retrieve the text from my document. It will not handle anything fancy, just text. It only tries to turn each chunk of text into a plain-text paragraph in HTML, suitable for copying and pasting out of a browser window. Use at your own risk. 🙂
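For anyone without an XSLT processor handy, here is a rough Python sketch of the same idea: gunzip index.xml.gz from the package, collect the text of each paragraph element, and wrap each one in an HTML paragraph. It is not the stylesheet itself. The function names are made up for illustration, and the assumption that paragraphs live in elements whose local name is `p` is mine; check your own index.xml and adjust `para_name` if it differs.

```python
import gzip
import html
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip any '{namespace}' prefix from an ElementTree tag."""
    return tag.rsplit("}", 1)[-1]


def paragraphs_from_index(xml_bytes, para_name="p"):
    """Return the text of every element whose local name is para_name.

    Assumption: paragraph elements are not nested inside each other.
    """
    root = ET.fromstring(xml_bytes)
    paragraphs = []
    for elem in root.iter():
        if local_name(elem.tag) == para_name:
            text = "".join(elem.itertext()).strip()
            if text:
                paragraphs.append(text)
    return paragraphs


def to_html(paragraphs):
    """Wrap each chunk of text in a plain <p> element."""
    body = "\n".join("<p>%s</p>" % html.escape(p) for p in paragraphs)
    return "<html><body>\n%s\n</body></html>" % body


def text_from_pages(package_path):
    """Read index.xml.gz from inside a .pages package directory."""
    with gzip.open(package_path + "/index.xml.gz", "rb") as fh:
        return to_html(paragraphs_from_index(fh.read()))
```

Calling `text_from_pages("MyDocument.pages")` (a hypothetical path) returns an HTML string you can save and open in a browser.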

Ok, here’s the textFromPages.xsl file.

Others may take this initial XSL file and do what they will with it. I hope that if you take this and make it better, you’ll comment on this post to let me (and others) know.

To have it be useful to you, you’ll need to know how to apply an XSL transformation to a source XML file (specifically the index.xml from Pages).

Hint: Firefox will do the transformation for you if you include the proper xml-stylesheet directive right after the XML declaration (the first line) in the source XML file. It looks like this: <?xml-stylesheet href="textFromPages.xsl" type="text/xsl" ?>
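If you’d rather not edit index.xml by hand, a small script can add the directive. This is only a sketch; `add_stylesheet` is a name I made up, and it assumes the file is small enough to read into a string.

```python
import re


def add_stylesheet(xml_text, href="textFromPages.xsl"):
    """Insert an xml-stylesheet directive after the XML declaration.

    If the text has no XML declaration, the directive goes first.
    """
    pi = '<?xml-stylesheet href="%s" type="text/xsl" ?>' % href
    # Matches '<?xml version=... ?>' but not '<?xml-stylesheet ...?>'.
    match = re.match(r"\s*<\?xml\s[^?]*\?>", xml_text)
    if match:
        end = match.end()
        return xml_text[:end] + "\n" + pi + xml_text[end:]
    return pi + "\n" + xml_text
```

Save the result next to textFromPages.xsl and open it in Firefox.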

XML file of shooting ranges in Michigan

As another small step in this process of manipulating a data set to upload to Google Maps, I took the cleaned XHTML I had from a few days ago and used TextWrangler to run some quick search-and-replace passes on the source code to produce this XML file.
ranges-data.xml

Next, I think, I’ll load this XML file into PHP using its SimpleXML features, which should make it easy to run the data through a PHP-based geocoding processor that I’m sure I can dig up. The goal is to convert the addresses of the ranges into latitude/longitude points, which seem to be required pieces of data for the KML file I’m trying to piece together.

I may at the same time output the whole thing into KML format, since I’ll be in there with the data nodes anyway.
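To make the plan concrete, here is a rough sketch of the geocoding pass, written in Python rather than PHP. Everything about the data shape is an assumption: I’m pretending the file has `range` elements with `name` and `address` children, which may not match the real ranges-data.xml. The `geocode` argument is a stand-in for whatever geocoding service ends up being used; it takes an address string and returns a (latitude, longitude) pair, or None when it can’t parse the address.

```python
import xml.etree.ElementTree as ET


def geocode_ranges(xml_text, geocode):
    """Split ranges into those with coordinates and those that failed.

    Assumes hypothetical <range><name/><address/></range> elements.
    geocode: callable(address) -> (lat, lon) tuple, or None on failure.
    """
    root = ET.fromstring(xml_text)
    located, failed = [], []
    for node in root.iter("range"):
        name = (node.findtext("name") or "").strip()
        address = (node.findtext("address") or "").strip()
        point = geocode(address) if address else None
        if point is None:
            # Keep the name so the address can be fixed by hand later.
            failed.append(name)
        else:
            lat, lon = point
            located.append(
                {"name": name, "address": address, "lat": lat, "lon": lon}
            )
    return located, failed
```

Collecting the failures separately means one bad address doesn’t stop the whole run.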

Sample KML structure for the shooting ranges data

And here is a sample of what the intended shooting ranges KML feed will look like.

A couple notes:

  • the Placemark node will repeat for every shooting range
  • I’ll have to find a way to process the address information and generate latitude/longitude points. There are bound to be cases where the geocoder has trouble parsing an address, though I’ve been through this before on a prior Web development project
<?xml version="1.0" encoding="UTF-8"?>
<kml xmlns="http://earth.google.com/kml/2.1">
<Document>
<name>Shooting ranges in Michigan</name>
<description><![CDATA[Places to shoot in Michigan: Public/DNR ranges, shooting clubs, and businesses with firing ranges available.]]></description>

<Placemark>
<name>Flushing Rifle &amp; Pistol Club</name>
<description><![CDATA[165 Industrial Dr., Flushing, MI 48433<br>http://www.flushingrifleandpistol.com/<br>]]></description>
<Point>
<coordinates>-83.866898,43.068909,0.000000</coordinates>
</Point>
</Placemark>

<!-- Repeat Placemark for each range -->

</Document>
</kml>
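Here is a minimal sketch of how that structure could be generated from a list of ranges, using Python’s standard library. The `build_kml` function and the dictionary keys (`name`, `address`, `lat`, `lon`) are my own invention for illustration. Note that KML coordinates are written longitude first, then latitude. Unlike the hand-written sample, this sketch puts only the plain address in the description and lets the library escape it, instead of using a CDATA section with HTML.

```python
import xml.etree.ElementTree as ET

KML_NS = "http://earth.google.com/kml/2.1"


def build_kml(title, ranges):
    """Return a KML string with one Placemark per range.

    ranges: list of dicts with hypothetical keys
            'name', 'address', 'lat', 'lon'.
    """
    # Serialize the KML namespace as the default (unprefixed) namespace.
    ET.register_namespace("", KML_NS)

    def tag(name):
        return "{%s}%s" % (KML_NS, name)

    kml = ET.Element(tag("kml"))
    doc = ET.SubElement(kml, tag("Document"))
    ET.SubElement(doc, tag("name")).text = title
    for r in ranges:
        placemark = ET.SubElement(doc, tag("Placemark"))
        ET.SubElement(placemark, tag("name")).text = r["name"]
        ET.SubElement(placemark, tag("description")).text = r["address"]
        point = ET.SubElement(placemark, tag("Point"))
        # KML order is longitude,latitude,altitude.
        coords = "%f,%f,0" % (r["lon"], r["lat"])
        ET.SubElement(point, tag("coordinates")).text = coords
    return ET.tostring(kml, encoding="unicode")
```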

Clean XHTML of shooting ranges data

My goal is to upload a comprehensive list of shooting ranges to Google Maps (see prior posting).

Why? I just think it would be cool to visualize places to shoot in Michigan.

Plus, once they are in there, I can see next steps, like creating a custom map of just the ranges that host matches for the Central Michigan Rifle and Pistol League shoots.

So, to accomplish this, here are the steps I’ve thought of.

  1. Clean the source code from the NRA page of ranges in Michigan into a valid codebase that can be more easily parsed
  2. Create a prototype of the form that data needs to take to be uploaded to a Google Map (looks like a KML file will do)
  3. Write an XSL document to use to transform the cleaned code (#1) to match the structure for the KML doc (#2)
  4. Run the XSL transformation and then upload the resulting KML document to Google Maps

Just for the record, here’s the cleaned source code (#1): 2007.12.16-shooting-ranges.html

We need a “credit” attribute in XHTML

The XHTML 2.0 draft document by the W3C includes some promising attributes for elements. For instance, a navigation list could have a role with a value of sitemap. I.e.: <nl role="sitemap">

That’s cool. Think on that a bit, o ye of semantic persuasion. The potential benefits of this type of specificity in standard markup are great.

Now, that said, I was working on a site that I hope to launch tomorrow, and I would have loved to use an attribute like credit for image elements. It would be used to specify photo credits for a couple images I’m using, plus on some banners, I could have credited the designer who put them together.

It would look something like this: <img src="cool.jpg" alt="Illustration of a calico cat in a beret playing the saxophone." credit="J. Smith, Illustrator for Cool Colors, Inc." />

We could throw this information into the alt text, but it doesn’t really belong there, since the alt text is supposed to describe the contents of the image. We could also use the title attribute, but it would be nice to reserve that for slightly more pertinent information.

Today, I just added credit information in as comments in the markup. It was an adequate solution, I think, but will never be picked up by any user-agent.

xsl to transform xhtml pages

I don’t know why it took me this long to realize this. I’ve been writing xhtml for a couple years now, and around the same time I started playing with xsl stylesheets, but it just occurred to me in a real way that I can probably use xslt to transform my xhtml pages (at my business site, for instance) into forms more useful to other devices. Cell phones and PDAs, for instance.

It probably took me so long because XHTML looks so much like HTML to me, that it didn’t completely sink in that it is truly XML. Yet it is, namespaces and all.

Now that I realize this, I appreciate even more its role as an intermediary between html and xml. XHTML doesn’t need xsl to transform it or style it. It is so close to real html that even older browsers can handle it fine, and it works very smoothly with css as is.

So, this realization basically just means that making my site more available on handhelds is even easier than I first thought. Granted, I haven’t gotten into the sticky details of it all yet…
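As a small demonstration that XHTML really is XML, here is a sketch that feeds a page to an ordinary XML parser (Python’s standard library, not XSLT) and pulls out the headings and links, the sort of stripped-down view a handheld might want. The `outline` function is hypothetical. One caveat I’m aware of: this parser does not load the XHTML DTD, so a page that uses named entities such as `&nbsp;` will fail to parse unless they are replaced first.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"


def outline(xhtml_text):
    """Return (headings, links) from a well-formed XHTML page.

    headings: text of every h1 element.
    links: (href, link text) pairs for every a element.
    """
    root = ET.fromstring(xhtml_text)
    # Elements must be looked up by namespace, as in any XML document.
    ns = {"x": XHTML_NS}
    headings = [
        "".join(h.itertext()).strip() for h in root.iterfind(".//x:h1", ns)
    ]
    links = [
        (a.get("href"), "".join(a.itertext()).strip())
        for a in root.iterfind(".//x:a", ns)
    ]
    return headings, links
```

The namespace lookup is the “namespaces and all” part: searching for a bare `h1` without the XHTML namespace would find nothing.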