Display Microsoft Word documents in a web page as images programmatically


This article explains how to programmatically display the first page of your Microsoft Word documents (e.g. .doc, .docx) as images in a web page. I scoured the web for a way to do this without success. The goal is to produce something similar to this list of resume templates: http://office.microsoft.com/en-us/templates/CT010144894.aspx

A working example of this article can be found here: http://www.patsmitty.com/gview/word_image.php
The ENTIRE source code is attached as a .zip file below.

Disclaimer: This article is somewhat advanced and does not cover or explain the HTML, CSS, PHP, and jQuery functions I used here. It explains how to obtain the URL to use as the “src” attribute of an image for each of your msword documents. If you have any questions, please comment here, as I would love to help answer them.

Please note that this solution is a hack and relies on Google Docs. If Google Docs changes or disappears, so does this solution! It is also quite complex, so allow about an hour to digest it. All my source files are attached at the bottom.

The solution starts with the Google Docs application. Let’s say you have an msword document at http://www.myserver.com/test.docx. If you navigate to http://docs.google.com/gview?url=http://www.myserver.com/test.docx, you can view this document in the “bulky” Google Docs viewer. I say “bulky” because all I want is an image of the document; I’m not interested in zooming in or out, paging through the document, and so on.

If we look more closely at the Google Docs viewer and right-click anywhere in the document, a context menu appears. From there, we can see that Google Docs is actually generating an image of each page in the msword document. This is exactly what I am looking for! When we select “show image” from this context menu, we see the image, with its URL in the address bar. Let’s look at the parameters in that URL:

url – this is the actual url of the msword document

docid – this is a generated id of the image

a – I don’t know what this is, but it always has the same value (as far as I’ve tested …)

page number – the page number of the msword document (in this tutorial it will always be 1 …)

w – the width of the image in pixels

Here it is! This is the URL that we need to generate programmatically. You might be wondering how we’re going to accomplish this if Google Docs randomly generates the docid (we already have the document URL, page number, width, and whatever the ‘a’ is, so all we need is the docid). The answer lies in scraping web pages. The rest of this article explains in detail how we’re going to get this URL programmatically for the first page of our msword documents.
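To make the structure of that URL concrete, here is a minimal PHP sketch that assembles it from the parameters listed above. The function name buildPageImageUrl, the $viewer_base argument, and the 'pagenumber' key are assumptions made for illustration, not something confirmed by Google; copy the real base address, key names, and ‘a’ value from the address bar of an image you opened by hand.

```php
<?php
// Hypothetical helper: builds an image URL from the parameters described above.
// ASSUMPTIONS: $viewer_base, the 'pagenumber' key and the $a value are placeholders;
// take the real ones from a URL generated by the viewer itself.
function buildPageImageUrl($viewer_base, $doc_url, $docid, $a, $page, $width) {
    $params = array(
        'url'        => $doc_url, // address of the msword document
        'docid'      => $docid,   // id generated by the viewer (scraped later)
        'a'          => $a,       // constant of unknown meaning
        'pagenumber' => $page,    // always 1 in this tutorial
        'w'          => $width,   // image width in pixels
    );
    // http_build_query URL-encodes each value; '&' is forced as the separator.
    return $viewer_base . '?' . http_build_query($params, '', '&');
}
```

The docid is the only value we cannot supply ourselves, which is why the rest of the article focuses on scraping it.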

For the purposes of this article, I have several msword documents in a directory on my website located at http://www.patsmitty.com/gview/word_documents/. It doesn’t matter what you call your documents because we’ll get them programmatically through PHP.
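As a rough idea of what “getting them programmatically” can look like, the sketch below lists every .doc and .docx file in a folder and turns each file name into a public URL. The function name listWordDocuments and both of its arguments are hypothetical; the code in the attached source may do this differently.

```php
<?php
// Hypothetical helper: returns a public URL for every .doc/.docx file in $dir.
// $dir is the folder on disk; $base_url is the web address that maps to it.
function listWordDocuments($dir, $base_url) {
    $urls = array();
    // scandir() returns the entries sorted alphabetically
    foreach (scandir($dir) as $name) {
        $ext = strtolower(pathinfo($name, PATHINFO_EXTENSION));
        if ($ext === 'doc' || $ext === 'docx') {
            // rawurlencode() keeps file names containing spaces valid in a URL
            $urls[] = rtrim($base_url, '/') . '/' . rawurlencode($name);
        }
    }
    return $urls;
}
```

Each returned URL can then be handed to the viewer as the url parameter.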

One more thing before we start: my example also includes a JS Prototype progress bar and a frame for the images, so there are “extra” source files and “foreign” code.

1. Scrape docs.google.com for the docid parameter

Download the PHP Simple HTML DOM parser here: http://sourceforge.net/projects/simplehtmldom/files/simplehtmldom/1.5/simplehtmldom_1_5.zip/download

Create a blank PHP file and name it “scrapeIt.php”.
include() or require() the DOM parser.
Create a function called getImageURL which takes 2 parameters: $file_url and $thumb_width.
This function contains 2 lines of code that return the URL of the image we need.

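function getImageURL($file_url, $thumb_width) {
	// Load the Google Docs viewer page for the document
	$html = file_get_html('http://docs.google.com/gview?url=' . $file_url);
	// Cut the image URL out of the page source, then apply the thumbnail width
	return doUrl(html_to_URL($html, "{svUrl:'", "46chan"), $thumb_width);
}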



The first line loads the HTML of the Google Docs page that displays your msword document as an image. The second line extracts the docid. Unfortunately, we cannot simply use jQuery or this parser to read the src of the image: the image is generated programmatically by Google Docs, so it does not show up in the page source. The information we need does, however, appear in a JavaScript variable in the source. It sounds complicated, but all we do is run some string manipulation on the URL and keep the relevant parts (the docid).
Next, create the function that the second line of getImageURL calls: html_to_URL, which takes $string, $start and $end as parameters. The $string parameter is the long URL generated by Google Docs, which contains superfluous elements that we remove using the positions of $start and $end, which are “{svUrl:'” and “46chan”. These 2 strings are used in a JS function called by Google’s gview application. Below is the code for the JS function from Google with the URL of my document: