Long-term Document Archival

By Dick Weisinger

Technology moves ever faster, and that means change. Over just the last 20 years we’ve seen digital storage media shrink physically while their capacities have exploded. Consider that PC World this week published its review of terabyte hard drives in the sub-$1000 range. If there aren’t already Moore’s law equivalents for storage, there should be: one that explains the ever-increasing growth of media capacities, and another that explains how people manage to consume any new storage capacity that’s created.

But the dark side of this ever-improving technology is the many orphaned hardware ancestors of today’s storage units. Backwards compatibility is rarely preserved. Whenever you upgrade storage, data needs to be migrated forward, away from the media that has become obsolete.

It is not only the media used to store the data that become obsolete. The formats used to represent and assign meaning to the data follow the same rapid cycles. Applications rise and fall in popularity, and since most application data formats are proprietary, being able to read and understand data created by an application even five or ten years ago can be a problem.

Saving data files in a format that is non-proprietary and publicly documented is the safest way to go. ASCII or Unicode text files are top choices for safe long-term storage, but realistically, most electronically stored data that needs to be archived for long periods of time is much more complex than what can be represented by just text. XML comes to mind as a possible solution, but while XML can add some structure to the text, it doesn’t by itself provide any graphical or formatting capabilities.

One file format that has become universally accepted is Adobe’s PDF (Portable Document Format). PDF has achieved acceptance in a very short period of time. Its roots go back to PostScript, which has been around since the ’80s, but PDF itself has only been around since the early ’90s.

Standard PDF files, though, are not necessarily self-contained. A PDF file can depend on fonts and other data that are external to the file. Trying to view such a file years later on a system where that external data no longer exists could mean that the file won’t render correctly.
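As a rough illustration of that dependency, the following Python sketch counts font descriptors in a PDF and compares them with the number of embedded font programs they reference. It is a heuristic under stated assumptions, not a real PDF parser: it assumes an unencrypted, PDF 1.4-style file whose dictionaries are readable as plain bytes, and the function names are hypothetical. The keys it looks for (/FontDescriptor, /FontFile, /FontFile2, /FontFile3) come from the PDF specification.

```python
import re

# Heuristic only: assumes an unencrypted PDF 1.4-style file whose
# dictionaries are visible as plain bytes (no compressed object streams).
DESCRIPTOR = re.compile(rb"/Type\s*/FontDescriptor\b")
# An embedded font program is referenced by one of these descriptor keys.
FONT_FILE = re.compile(rb"/FontFile[23]?\b")


def count_font_embedding(pdf_bytes: bytes) -> tuple[int, int]:
    """Return (font descriptors found, embedded font programs referenced)."""
    descriptors = len(DESCRIPTOR.findall(pdf_bytes))
    embedded = len(FONT_FILE.findall(pdf_bytes))
    return descriptors, embedded


def probably_has_unembedded_fonts(pdf_bytes: bytes) -> bool:
    """True when at least one descriptor seems to lack an embedded font."""
    descriptors, embedded = count_font_embedding(pdf_bytes)
    return embedded < descriptors
```

A call such as probably_has_unembedded_fonts(open("report.pdf", "rb").read()) would flag a file worth a closer look (the file name is only an example). Fonts that are declared without any descriptor at all, as the standard built-in fonts often are, would not be caught by this check, so a dedicated validator is the safer choice for real archives.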

The Association for Information and Image Management (AIIM), along with NPES, the Association for Suppliers of Printing, Publishing and Converting Technologies, and with direction from Adobe, has pushed through a standard known as ISO 19005-1, also known as PDF/A-1. The standard was approved in late 2005.

PDF/A is based on the PDF version 1.4 specification, the version of PDF that was used in Adobe Acrobat version 5. (Currently Adobe’s Acrobat family of products is on version 7 and version 8 is expected soon. Version 5 is no longer supported by Adobe.)

The /A of PDF/A stands for archival. PDF/A specifies requirements for file self-containment, embedded fonts, device-independent color, XMP metadata, and tagging.
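The XMP metadata is also where a file declares itself to be PDF/A, through the pdfaid:part and pdfaid:conformance properties. The Python sketch below reads that declaration with a simple byte scan. It assumes the XMP packet is stored as plain, unfiltered text, and the function name is hypothetical. It reports only what the file claims about itself and does not check that the file actually meets the standard.

```python
import re

# The property may appear as an XML attribute (pdfaid:part="1")
# or as an element (<pdfaid:part>1</pdfaid:part>); accept both.
PART = re.compile(rb"pdfaid:part\s*(?:=\s*[\"'](\d)[\"']|>\s*(\d)\s*<)")
CONFORMANCE = re.compile(
    rb"pdfaid:conformance\s*(?:=\s*[\"']([A-Za-z])[\"']|>\s*([A-Za-z])\s*<)"
)


def claimed_pdfa_level(pdf_bytes: bytes):
    """Return a label such as 'PDF/A-1b', or None if no claim is found."""
    part = PART.search(pdf_bytes)
    if part is None:
        return None
    part_value = (part.group(1) or part.group(2)).decode("ascii")
    conformance = CONFORMANCE.search(pdf_bytes)
    level = ""
    if conformance is not None:
        raw = conformance.group(1) or conformance.group(2)
        level = raw.decode("ascii").lower()
    return f"PDF/A-{part_value}{level}"
```

A file that returns None here makes no PDF/A claim in readable metadata. A file that returns a label still needs a proper validator before it is trusted as archival.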

The current widespread acceptance of PDF, its flexibility for specifying very rich content, and the standardization of PDF/A together seem like a good step toward a viable long-term document archival strategy.

The problem is that technology keeps changing. PDF has been with us only for about 15 years and the PDF/A standard approved in 2005 is based on technology circa 2000. The standardization committee is already hard at work on upgrading the specification to versions PDF/A-2 and PDF/A-3. Things that are to be added include JPEG 2000 support, digital signatures, 3D graphics, audio, video, … We’ll assume that backwards compatibility will be a top priority for the new versions.

So there is some hope that documents stored with PDF/A will still be readable a century from now.

(Formtek is a partner member of the Adobe Solutions Network.)