Data Extinction: The Problem of Digital Obsolescence

Dinosaur PCB Graphic illustrating Digital ObsolescenceI suspect that many of us, as computer users, have had the experience of searching for some computer file that we know we saved somewhere, but can’t seem to find. Even more frustrating is the situation where, having spent time looking for the file and having found it, we discover either that the file has been corrupted, or is in a format that our software can no longer read. This is perhaps most likely to happen with digital photographs or videos, but it can also happen with text files, or even programs themselves. This is the problem of Digital Obsolescence.

In an earlier post, I mentioned a vector graphics file format called SVG, and I showed how you can use a text editor to open SVG files and view the individual drawing instructions in the file. I didn’t discuss the reason why it’s possible to do that with SVG files, but not with some other file types. For example, if you try to open an older Microsoft Word file (with a .doc extension) with a text editor, all you’ll see are what appear to be reams of apparently random characters. Some file types, such as SVG, are “text encoded”, whereas other types, such as Word .doc files, are “binary encoded”.

Within the computer industry, there has come to be an increasing acceptance of the desirability of using text-encoded file formats for many applications. The reason for this is the recognition of a serious problem, whereby data that has been stored in a particular binary format eventually becomes unreadable because software is no longer available to support that format. In some cases, the specification defining the data structure is no longer available, so the data can no longer be decoded.

The general problem is one of “data retention”, and it has several major aspects:

  • Storing data on physical media that will remain accessible and readable for as long as required,
  • Storing data in formats that will continue to be readable for as long as required.
  • Where files are encrypted or otherwise secured, ensuring that passwords and keys are kept in some separate but secure location where they can be retrieved when necessary.

Most people who have used computers for a few years are aware of the first problem, as storage methods have evolved from magnetic tapes to optical disks, and so on. However, fewer people consider the second and third problems, which is what I want to discuss in this article.

Digital Obsolescence: The Cost of Storage and XML

In the early days of computers, device storage capacities were very low, and the memory itself was expensive. Thus, it was important to make the most efficient use of all available memory. For that reason, binary-encoded files tended to be preferred over text-encoded files, because binary encoding was generally more efficient.

However, those days are over, and immense quantities of memory are available very cheaply. Thus, even if text-encoding is less efficient than binary-encoding, that’s no longer a relevant concern in most cases.

Many modern text-encoding formats (including SVG and XHTML) are based on XML (eXtensible Markup Language). XML provides a basic structure for the creation of “self-describing data”. Such data can have a very wide range of applications, so, to support particular purposes, most XML files use document models, called Document Type Definitions (DTDs) or schemas. Many XML schemas have now been published, including, for example, Microsoft’s WordML, which is the schema that defines the structure of the content of newer Word files (those with a .docx extension).

XML is a huge subject in its own right, and many books have been written about it, even without considering the large number of schemas that have been created for it. I’ll have more to say about aspects of XML in future posts.

Digital Obsolescence: Long Term vs. Short Term Retention

Let’s be clear that the kind of “data retention” I’m talking about here refers to cases where you want to keep your data for the long term, and ensure that your files will still be readable or viewable many years in the future. For example, you may have a large collection of digital family photos, which you’d like your children to be able to view when they have grown up. Similarly, you may have a diary that you’ve been keeping for a long time, and you’ll want to be able to read your diary entries many years from now.

This is a very different problem from short-term data retention, which is a problem commonly faced by businesses. Businesses need to store all kinds of customer and financial information (and are legally required to do so in many cases), but the data only needs to be accessible for a limited period, such as a few years. Much of it becomes outdated very quickly in any case, so very old data is effectively useless.

There are some organizations out there who will be happy to sell you a “solution” to long-term data retention that’s actually useful only for short-term needs, so it’s important to be aware of this distinction.

Digital Obsolescence: Examples from my Personal Experience

In the early “pre Windows” days of DOS computers, several manufacturers created graphical user interfaces that could be launched from DOS. One of these was the “Graphical Environment Manager” (GEM), created by Digital Research. I began using GEM myself, largely because my employer at the time was using it. One facet of GEM was the “GEM Draw” program, which was (by modern standards) a very crude vector drawing program. I produced many diagrams and saved them in files with the .GEM extension.

A few years later, I wanted to reuse one of those GEM drawing files, but I’d switched to Windows, and GEM was neither installed on my computer nor even available to buy. I soon discovered that there was simply no way to open a GEM drawing file, so the content of those files had become “extinct”.

Similarly, during the 1990s, before high-quality digital cameras became available, I took many photographs on 35mm film, but had the negatives copied to Kodak Photo-CDs. The Photo-CD standard provided excellent digital versions of the photos (by contemporary standards), with each image stored in a PCD file in 5 resolutions. Again, years later, when I tried to open a PCD file with a recent version of Corel Draw, I discovered that the PCD format was no longer supported. Fortunately, in this case, I was able to use an older version of Corel Draw to batch-convert every PCD file to another more modern format, so I was able to save all my pictures.

Digital Obsolescence: Obsolete Data vs. Obsolete Media

As mentioned above, the problem I’m describing here doesn’t relate to the obsolescence of the media that contain the files you want to preserve. For example, there can’t be many operational computers still around that have working drive units for 5.25” floppy disks (or even 3.5” floppy disks), but those small disks were never particularly reliable storage media in any case, so presumably anyone who wanted to preserve files would have moved their data to more modern and robust devices anyway.

I’ll discuss some aspects of media obsolescence further in a future post.

Digital Obsolescence: Survival Practices

So what can you do to ensure that your data won’t go extinct? There are several “best practices”, but unfortunately some of these involve some form of tradeoff, whereby you trade data survivability for sophisticated formatting features.

  • Never rely on “cloud” storage for the long term. Cloud storage is very convenient for short-term data retention, or to make data available from multiple locations, but it’s a terrible idea for long-term retention. All kinds of bad things could happen to your data over long periods of time: the company hosting the data could have its servers hacked, or it could go out of business, or else you could simply forget where you stored the data, or the passwords you need to access it!
  • Prefer open data formats to proprietary formats.
  • Prefer XML-based formats to binary formats.
  • Try to avoid saving data in encrypted or password-protected forms. If it must be stored securely, ensure that all passwords and encryption keys exist in written form, and that you’ll be able to access that information when you need it! (That is, ensure that the format of the key storage file won’t itself become extinct.)
  • Expect formats to become obsolete, requiring you to convert files to newer formats every few years.
  • Copy all the files to new media every few years, and try opening some of the copied files when you do this. This reduces the danger that the media will become unreadable, either because of corruption or because physical readers are no longer available.

Sometimes you’ll see recommendations for more drastic formatting restrictions, such as storing text files as plain-text only. Personally, I don’t recommend following such practices, unless the data content is extremely critical, and you can live within the restrictions. If you follow the rules above consistently, you should be relatively safe from “data extinction”.

A Trick of the Light: Exploiting the Limitations of Human Color Perception

Boulton Paul P.111A Aircraft at Baginton Airport
Boulton Paul P.111A Aircraft at Baginton Airport

I snapped the image below many years ago during a visit to the Midlands Air Museum, near Coventry (England). It depicts the one-and-only Boulton-Paul P.111A research aircraft, which, due to its dangerous flight characteristics, was nicknamed the “Yellow Peril”.

I’ve posted this image now not to discuss the aerodynamics of the P.111A, but to consider its color. If you’re looking at the image on a computer monitor (including the screen of your phone, tablet or any similar digital device), you’re presumably seeing the plane’s color as bright yellow.

No Yellow Light Here

That, however, is an illusion, because no yellow light is entering your eyes from this image. What you are actually seeing is a mixture of red and green light, which, thanks to the limitations of the human visual system, fools your brain into thinking that you’re seeing yellow.

The Visible Spectrum

Most of us learn in school that what we see as visible light consists of a limited range of electromagnetic waves, having specific frequencies. Within the frequency range of visible light, most humans can distinguish a continuous spectrum of colors, as shown below.

(The wavelength range is in nanometers; nm)

Visible Spectrum of Light, with Wavelengths

We’re also taught that “white” light does not exist per se, but is instead a mixture of all the colors of light in the visible spectrum.

That’s a significant limitation of the human visual system; we can only see light whose frequency falls within a limited range. There are vast ranges of “colors” of light that aren’t visible to us. Presumably, our visual systems evolved to respond to the frequencies of light that were most useful to our ancestors in their own environment.

How do we see Color?

That leads to the question of how we can determine which light frequencies we’re seeing. Do our eyes contain some kind of detector cell that can measure the frequency of a ray of light? In fact, the system that evolution has bequeathed to us is a little more complex. Our eyes contain several different types of detector cell, each of which responds most strongly to light within a narrow frequency range.

There is one cell type called “rods” (because of their shape) that detect a relatively broad spectrum of light, but which are most sensitive to blue-green light. When the ambient light is low, the rods do most of the work for us, giving us monochrome vision.

There are also three types of cell called “cones”, the detection ranges of which overlap that of the rods. One type of cone is most sensitive to blue light, the second to green light, and the third to red. Given that blue light has the shortest visible wavelength, and red light the longest, the three types of cones are respectively called L (Long = Red), M (Medium = Green), and S (Short = Blue).

The diagram below shows the relative sensitivity of the rods (R) and cones (S, M, L) with respect to wavelength. I’ve also superimposed the color rainbow for convenience.

Sensitivity of Human Retina to visible light spectrum

It’s thanks to the existence of the cone cells that we have color vision. The brain actually combines the information from all the detector cells, and determines from that which color we’re actually looking at.

Fooling the Brain

The nature of our visual system makes it possible to fool our brains into thinking that we’re seeing colors that are not actually present, by combining red, green and blue light in varying intensities.

Technology takes advantage of this limitation to provide what seem to be full-color images (or videos) that use only three color channels: one each for red, green and blue (hence the RGB acronym). Such color systems are known as “Additive Color”.

Conversely, printed color images are created using a three-color (or four-color, if black is added) system that is “Subtractive”. Subtractive systems use Cyan, Magenta, Yellow and optionally Black as their “primary” colors, leading to acronym CMYK. A continuous spectrum of color is achieved by overlaying translucent layers of CMYK in varying proportions. Subtractive color systems are a complex topic in themselves, so I don’t plan to go into further detail about them in this article.

Here’s a comparison of the features of additive versus subtractive color.

Additive Color

In an additive color system, you add colored lights together to create the illusion of a continuous spectrum.

The human brain creates the illusion of a continuous spectrum of colors.

It’s important to realize that, in an additive system, the colors do not somehow combine in space to create a color that’s not there. Instead, our brains combine the responses of the three types of cone in our eyes.

Subtractive Color

In a subtractive color system, you start with white light (or paper), then subtract particular colors from white.

In a printed image, overlaid color dyes block some light wavelengths, so the remaining wavelengths that pass through create the final color.

This is not an “optical illusion” in the same way as additive color.

Viewing a Color Image

Television and computer screens use additive color. The screen you’re viewing is comprised of a large matrix of red, green and blue dots. By varying the intensity of light from each of the dots across the screen, your brain is tricked into thinking that it’s seeing a continuous spectrum of color from blue to red. The size of each dot is so small that your brain merges the light from neighboring dots together.

Are You Looking at a Print, a Slide or a Screen?

So, if you’re reading this article on a computer screen, the image of the Boulton Paul P.111A that you can see isn’t actually shining any yellow light into your eyes, but is using red and green light to fool your brain into thinking that you’re seeing yellow.

Having said all that, my image of the P.111A was originally a Kodachrome 25 color slide. Kodachrome slides create colors using a subtractive process, overlaying translucent layers of secondary colors. Thus, if you could view the original slide, you would indeed see yellow light, created by subtracting blue light from a white source shining through the slide.

If this all seems very complex at first, don’t be deterred, because I know from personal experience that it’s easy to become confused even when you’ve been working with these principles for a long time. When designing digital video equipment years ago, I sometimes found myself forgetting that, in nature, yellow light really is yellow, and is not a mixture of red and green light!