TheArchivist
Copyright 2001-2004, Brian Johnson, Department of Architecture, University of Washington, Seattle
What it does
This script acts as a web crawler, or spider. Starting from a URL you specify, it retrieves the web page, scans it for links, and then attempts to retrieve every file the page links to. The same process repeats for each file retrieved and continues until one of several stopping criteria is reached.
Stopping tests:
- Number of hops (links) from the start page.
- Links go "up" rather than "down".
- Links go to disallowed servers.
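The crawl and its stopping tests can be sketched in Python as follows. This is only an illustration, not TheArchivist's own code: the names `crawl`, `is_allowed`, `max_hops` and `allowed_hosts` are hypothetical, the `fetch` argument stands in for whatever actually downloads a file, and the sketch assumes that a link goes "up" when it points outside the directory that holds the start page.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlsplit


class LinkCollector(HTMLParser):
    """Collect the values of href and src attributes found in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.links.append(value)


def is_allowed(url, start_url, allowed_hosts):
    """Apply the server test and the up/down test to a single URL."""
    target, start = urlsplit(url), urlsplit(start_url)
    if target.scheme not in ("http", "https"):
        return False
    if target.hostname not in allowed_hosts:
        return False  # link goes to a disallowed server
    if target.hostname == start.hostname:
        # Assumption: "down" means inside the start page's directory.
        base_dir = start.path.rsplit("/", 1)[0] + "/"
        if not (target.path or "/").startswith(base_dir):
            return False  # link goes "up"
    return True


def crawl(start_url, fetch, max_hops=2, allowed_hosts=None):
    """Breadth-first crawl; fetch(url) returns page text, or None on failure."""
    if allowed_hosts is None:
        allowed_hosts = {urlsplit(start_url).hostname}
    seen = {start_url}
    queue = deque([(start_url, 0)])
    retrieved = []
    while queue:
        url, hops = queue.popleft()
        body = fetch(url)
        if body is None:
            continue
        retrieved.append(url)
        if hops >= max_hops:
            continue  # hop limit reached: keep the file, ignore its links
        collector = LinkCollector()
        collector.feed(body)
        for link in collector.links:
            absolute = urldefrag(urljoin(url, link)).url
            if absolute not in seen and is_allowed(absolute, start_url, allowed_hosts):
                seen.add(absolute)
                queue.append((absolute, hops + 1))
    return retrieved
```

Passing `fetch` in as a parameter keeps the sketch independent of any particular download method, so the stopping logic can be tried against a small in-memory set of pages before it is pointed at a real server.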
If desired, the application will rewrite absolute URLs relative to the download hierarchy, producing a completely self-sufficient archive.
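One way to picture the rewriting step is the Python sketch below. It is an assumption-laden illustration rather than the script's actual method: it supposes the archive stores each file under a folder named after its server, and the names `local_path`, `relativize`, `archived` and the `index.html` default are hypothetical.

```python
import posixpath
from urllib.parse import urlsplit


def local_path(url, default_name="index.html"):
    """Map a URL to a rooted path inside the download hierarchy.

    Assumption: files are stored as /<server>/<path on server>, and a URL
    that ends in "/" is saved under a default file name.
    """
    parts = urlsplit(url)
    path = parts.path or "/"
    if path.endswith("/"):
        path += default_name
    return "/" + parts.hostname + path


def relativize(link_url, page_url, archived):
    """Return link_url rewritten relative to the page that contains it."""
    if link_url not in archived:
        return link_url  # not in the archive: leave the absolute URL alone
    page_dir = posixpath.dirname(local_path(page_url))
    return posixpath.relpath(local_path(link_url), start=page_dir)
```

For example, under these assumptions a link to `http://example.edu/course/pic.gif` found in the page `http://example.edu/course/notes/a.html` would be rewritten as `../pic.gif`, so it still resolves when the archive is opened from a local disk.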
[Please note: At this time, files referenced by plug-ins (e.g., Flash), Java, and some JavaScript may not be identified. You should review the archive carefully for completeness when TheArchivist is done.]
Easy to use!
TheArchivist should be quite easy to install and use in a variety of ways on your desktop Mac. Just download and unStuff the archive. On OS-9, drop the Dialog Director (0.7) and Tanaka's (1.3) OSAXen into your System Folder if you don't already have them. Finally, double-click TheArchivist to start.
Change History
- January 8, 2005 - version 2.1x (b10)
Corrected parsing of JavaScript function references. Added php and asp to the recognized "html" file extensions. Fixed a bug that truncated crawls done without the "legal servers" restriction.
- February 10, 2004 - version 2.1x
Rewritten for OS-X. Better speed, better crawl status display, better handling of some legal-but-oddly-formed URLs.
- January 8, 2003 - version 2.1
Fixed problems when processing complex relative links, and a problem with quotes.
- April 20, 2002 - version 2.0
Improved handling of JavaScript references and better archive log formatting. New PDF user document.
- March 21, 2001 - version 1.6
Corrects handling of URLs containing single quotes. Provides a dialog box for specifying the default file name for walkable servers. Fixes a bug when 'localizing' very short URLs. Displays status info while processing in the foreground.
- January, 2001 - version 1.5
Initial release.
License
This software is provided as "postcardware". You may download and use the software without cost so long as you register your use. You may register by email to brj@u.washington.edu. You may not redistribute this software, nor include it in any collection. All installations should be made using a fresh download from http://www.caup.washington.edu/software/.
This software is provided as-is. By downloading and installing it you indicate that you accept all risks associated with use of the software and agree not to hold the author or the University of Washington liable in any fashion.
Download
Last updated: January 8, 2005