Ian Dennis Miller, PhD - Archiving a website for offline viewing

This weekend, I needed to make an offline snapshot of an entire website. My web browser offers a “Save Page As” feature, but that only handles one page at a time, so it’s not going to cut it for this job. Here are a few of my considerations:

I wanted to grab an archive of a single forum thread, spanning dozens of pages.
The thread refers to images and files hosted on a dozen different servers, so I needed to grab all of the external files too.
The offline archive needs to be a perfect replica of the online one, but it has to be entirely self-contained without online dependencies.
Most importantly, it needs to be really easy to browse the offline website using a normal web browser.

The perfect tool didn’t exist yet, so I wrote it. Introducing: offline-pages.

The tool does exactly what it says: you get web pages offline. It’s the multi-page version of “Save Page As”. You give it some URLs to archive, and it puts everything into a single archive file. Read on for a quick demonstration (taken from the project readme file).

Installation

git clone https://github.com/iandennismiller/offline-pages.git
cd offline-pages
sudo make install

Behind the scenes, make install uses setuptools to install a Python library and scripts. Tested on OS X Mountain Lion. Other *NIXes are likely to work as well.
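
A quick way to confirm the install worked is to check that the two scripts used later in this post are on your PATH. This is only a sketch: check_tools is a helper name made up for this example, and it says nothing about where setuptools actually puts the scripts on your system.

```shell
# check_tools is a hypothetical helper, not part of offline-pages.
# It reports whether each script name used in this post is reachable on the PATH.
check_tools() {
    for tool in offline-create offline-browse; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "$tool: found"
        else
            echo "$tool: missing"
        fi
    done
}

check_tools
```

If either line says "missing", the install directory is probably not on your PATH, or make install did not finish.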

Usage

Let’s say you want to mirror the Wikipedia article for “Webarchive”.

1. Create a file containing target URLs

echo http://en.wikipedia.org/wiki/Webarchive > urls.txt

2. Use this file as input to the offline-create program

offline-create ./urls.txt wikipage

3. View the results

offline-browse wikipage.archive.tgz
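
Step 1 above uses a single URL, but my original problem was a forum thread spanning dozens of pages. In that case, urls.txt can be generated with a loop instead of typed by hand. Here is a minimal sketch, assuming offline-create reads one URL per line (the readme excerpt above doesn’t spell out the format). The forum address, the ?page= parameter, and the page count are all made up for illustration.

```shell
# Hypothetical thread URL and page count; substitute your own.
base_url="http://forum.example.com/thread/1234"
last_page=3

# Write one URL per line (assumed input format), replacing any existing urls.txt.
for page in $(seq 1 "$last_page"); do
    echo "${base_url}?page=${page}"
done > urls.txt
```

From there, steps 2 and 3 are unchanged: pass ./urls.txt to offline-create with whatever archive name you like, then open the resulting file with offline-browse.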

Conclusion

So there you have it! Let me know if you come up with a cool way to use offline-pages.