Stranded
We're getting buried in data here. No, I'm not talking about my work, I'm talking about everyone, everywhere. According to a recent article in the Atlantic Monthly (they've got a nice, thought-provoking tech column every month) the Internet collectively produces the equivalent of 1 Library of Congress worth of information every 15 minutes.
Now, of course, the vast bulk of that is pure junk. But that's not the point. The point here is that even as individuals, we're getting buried in our own computer files, from digital photos to ripped MP3s of old concerts, to word processor documents. How to keep it all safe? How to keep it all organized?
I was recently talking with my mother and my sister (both non-tech people) about their digital photos. Did they print them all off and put them in photo albums? No. Did they back them up in any way? Well, my sister dumps hers to a CD occasionally. But mostly, they just leave the photos floating around on their hard drives. I'm sure this is the case for millions of others out there. Even as digital photos spread virally through sites like Flickr and through simple emails, they'll be lost by the millions as hard drives crash or as that old directory with all those photos from 20 years ago is forgotten.
We haven't yet come to grips, as a society, with what it means to store so much of our vital stuff as zeros and ones on disks and flash cards here and there. Such stuff isn't really real, is it? It's easily lost, or even worse, stranded.
Think of your grandfather's slide projector. Remember sitting there, watching as oddly elongated and out of focus images of birthday parties and vacations to Mt. Rushmore were projected onto a bedsheet tacked to the wall? Who has a slide projector now? There are millions of boxes of, probably, billions of slides sitting in attics all around the world that no one has a means of viewing.
Except that they actually CAN be viewed. Sure, no one has an old-fashioned slide projector anymore, but you can hold a slide up to the light and see it reasonably well. You can still get slides printed as photographs if you want. Even if this technology goes away completely, a slide is still a real object with a real image imprinted on it. Some clever mechanic could build a slide projector with a box, a couple of lenses and a light bulb (or even a candle if it comes to that).
But what of those digital photos on the CD or the DVD or the hard drive? If computers go away completely, no one will ever be able to extract your vacation photos of yore.
It's unlikely that computers will go away altogether. Data stored and backed up on remote hard drives is probably as safe as anything for quite some time. But even if the data is out there, will you be able to read it?
I've begun to think about this in terms of my own personal writing. Like every other writer in the world, I use a word processor. But as I think about it, this is a risky thing to do. If, 20 or 30 years from now, I decide to go back and work on a story I started yesterday, would I be able to even view the electronic file? Will word processors 30 years from now be backward compatible with MS Word 2003? Unlikely!
So here's what I'm thinking. Certain electronic formats are so raw and basic that they are unlikely ever to change. I'm talking about the WAVE file, the JPEG, the bitmap, and the ASCII text file.
How crazy would I be if I started doing all of my writing in a plain text editor? Actually, for almost all purposes, a modern word processor is overkill. Your story (or school paper) does not need a variable width font, bolded text, and 30 point headings. You can do everything you really need (write words, make paragraph breaks, save your work) in a text editor. Well, OK, spellcheck might come in handy.
If I took this approach, I'd have other advantages too. I use source control for all my writing. If I saved everything as plain text, I'd be able to do diffs on my writing files! Imagine how handy that would be.
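To make the diff idea concrete, here is a minimal sketch using Python's standard difflib module. The file names and the sample sentences are made up for illustration; any source control tool would give you something similar on plain text files.

```python
import difflib


def diff_drafts(old_text, new_text):
    """Return a unified diff between two plain-text drafts, line by line."""
    old_lines = old_text.splitlines(keepends=True)
    new_lines = new_text.splitlines(keepends=True)
    # fromfile/tofile are only labels in the diff header (hypothetical names).
    return "".join(
        difflib.unified_diff(
            old_lines, new_lines, fromfile="draft_v1.txt", tofile="draft_v2.txt"
        )
    )


old = "It was a dark night.\nThe rain fell.\n"
new = "It was a dark and stormy night.\nThe rain fell.\n"
print(diff_drafts(old, new))
```

Lines starting with "-" were removed and lines starting with "+" were added. Because a diff works line by line, a paragraph typed as one long line shows up as a single changed line, so this is most useful if you break lines fairly often.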
I seem to have drifted into the subject of writing now, so I'll wrap it up. Someone tell me, am I crazy? How worried do we need to be about our data being stranded by shifting tides of technology?
The Silver Lining
Yes, there may one day be vast digital graveyards of homeless, sourceless data out there in the world. But at least it won't clog up landfills or get dumped in lakes or something....
In all seriousness, your point is well-taken, Rob. I'd even go a step further to opine that even those hardy binary formats will fall, w/in a few decades, before the mighty quantum computing juggernaut. The nanobots will decide for themselves how to handle the bits and bytes -- we'll just have to instruct them to keep it handy for us.
But it is somewhat distressing to be part of this generation that's watching the physical world fade into the digital one.
Are Word Processors "Stupid"?
Well, here's someone who thinks so.
In the article, "Word Processors: Stupid and Inefficient", a fellow named Allin Cottrell posits that word processors cause problems because they distract from the primary task (writing) with a secondary and thoroughly superficial task (typesetting). He's got an excellent point and the article is worth a read.
This guy advocates using a text editor (he recommends Emacs) for all computer-based writing. He also suggests embedding TeX annotations in your text for any typesetting needs, which to me still does not divorce typesetting and writing, just makes it more complex to perform, but at least this approach results in ASCII (or Unicode) only output.
Still, the more I think about it, Allin and I are on to something here. Your average school paper or letter to your grandmother needs advanced typesetting features about as much as your family mini-van needs a NOX injector. But this sort of wasteful use of technology is a pervasive problem in the land of computers. Somehow a nifty new thing becomes the absolute baseline of what's acceptable. Try circulating a document in 12 pt Courier New at your company and you'll see what I mean.
In the case of word processing, though, there are serious consequences, namely, having your documents either stranded and unusable in the future, or unreadable by people you might want to share the document with. The latter is, in fact, quite common. If you're using a more recent version of Word than I am, I can't work on your documents unless you do a backward-compatible save.
Oddly, Visio gets around this. There is, in fact, a text/xml only file format (the .vdx extension) that you can use for flowchart files. (I'm not sure if this is compatible with other systems.) Why not have something similar for word processing files?
What I envision is a simple text file in two sections. Section one would be the raw Unicode text, including tabs and paragraph breaks. Section two would be an XML descriptor of all of the formatting. If, somehow, you got stranded or you did not have software capable of interpreting part 2 of the file, nothing would stop you from opening the file and reading all of the contents in part 1. And after all, all of the communication happens in the words on the page. Fancy formatting is nothing more than bells and whistles.
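Here is a rough sketch of how a reader for such a two-section file might work. Everything specific in it is an assumption made up for illustration: the separator line, the `formatting` and `span` element names, and the idea of describing formatting as character offsets into the text. No real word processor uses this layout as far as I know.

```python
import xml.etree.ElementTree as ET

# Hypothetical marker between the two sections; not taken from any real format.
SEPARATOR = "\n=====FORMATTING=====\n"


def split_document(raw):
    """Split a file into (plain text, formatting XML or None)."""
    text, sep, formatting = raw.partition(SEPARATOR)
    return text, (formatting if sep else None)


def parse_spans(formatting_xml):
    """Read (start, end, style) triples from the hypothetical XML section."""
    root = ET.fromstring(formatting_xml)
    return [
        (int(span.get("start")), int(span.get("end")), span.get("style"))
        for span in root.iter("span")
    ]


sample = (
    "Stranded\n\tWe're getting buried in data here."
    + SEPARATOR
    + '<formatting><span start="0" end="8" style="bold"/></formatting>'
)

text, formatting = split_document(sample)
print(text)  # section one is readable with no help from section two
if formatting is not None:
    for start, end, style in parse_spans(formatting):
        print(style, repr(text[start:end]))
```

The point of the design is in `split_document`: if the formatting section is missing, or written in a dialect nobody can parse anymore, the text before the separator is still an ordinary text file.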
Anyway, I'm going to download Emacs and see how I like it as a writing editor. I've already tried JEdit for this purpose, and it has one feature which makes its use for writing impossible--namely that if you try to insert a tab at the beginning of a line (paragraph indention, in other words) and you have word-wrap on, the entire paragraph gets indented instead of just the first line.
If anyone would like to suggest a text editor that would make a nice word processor substitute, your input would be appreciated.
Ways Not to Get Stranded
Actually, I've discovered that the free word processor Abiword uses a plain text/xml file format. They don't quite do it exactly as I asked--the text and formatting are mixed together instead of the text being all together at the top. Still, this is a great stride forward in not leaving you stranded.
Also, I took a look at some MS Word .doc files using a text editor. After a bunch of crazy binary gunk you can actually view the text of your document in plain text, albeit in a crazy format. Every letter is followed by a space, and there are odd binary characters mixed in to replace your paragraph breaks and other formatting characters. Still, at least it would be possible if, say, someone discovered a readable disk with a bunch of lost stories by some Nobel-Prize winning author, to extract the text from the binary junk. That's something, anyway.
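For what it's worth, here is a sketch of how that rescue might go. It rests on a guess: that the "space" after every letter is really a zero byte, which is what plain English text looks like when it is stored as 16-bit little-endian characters (UTF-16LE). If that guess is wrong for a given file, this will find nothing. It is a crude scavenger, not a parser for the actual .doc format.

```python
import re


def extract_utf16_runs(data, min_chars=4):
    """Find runs of printable ASCII characters stored as UTF-16LE in raw bytes.

    Assumes each character is one printable byte followed by a zero byte.
    Runs shorter than min_chars are ignored to cut down on binary noise.
    """
    pattern = re.compile(rb"(?:[\t\n\r\x20-\x7e]\x00){%d,}" % min_chars)
    return [match.group().decode("utf-16-le") for match in pattern.finditer(data)]


# A made-up blob standing in for a .doc file: binary junk around two text runs.
blob = (
    b"\xd0\xcf\x11\xe0"
    + "Once upon a time".encode("utf-16-le")
    + b"\x07\x00\xff\xfe"
    + "The end".encode("utf-16-le")
)
print(extract_utf16_runs(blob))
```

To try it on a real file you would read the bytes with `open(path, "rb").read()` and pass them in. Expect some garbage among the results, since font names and other internal strings would match the same pattern.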
StarOffice gets the booby prize, though. No readable text in the .sxw files at all.
The readable text is there
Take a closer look at the star office files. The readable text has to be there somehow on the basis of pure logic and information theory alone.
I doubt it is encrypted. No, it's "munged". Perhaps high endian bit order is reversed.
It is legal to "decompile" Word and star office formats simply by figuring out the format step by step, while writing the demunger tool. Princeton University's engineering library had, in 1990, a reference book for all extant file formats for major packages in use at the time.
In fact, retro, reconstructive computing will be a growth business as the world goes to hell and gone in my own declining years. This will I predict include the commercial viability of comprehensive simulators for older machines, because these can be useful in reading old file formats.
Indeed, I predict, fancifully and from my vacation in Chiang Mai, Thailand, that a group of Buddhist monks will have accumulated by 2100 AD all electronic resources available, including 100 years of postings to usenet (postings terminated by the Great Power Outage of 2049, followed by the Ten Years of Darkness and Chaos).
A computer based on the old 1960s technology of "fluidics" using water buffalo will operate a printer to print out The Great Text consisting of all that was written between about 1980 and 2049, and volumes of this Text will be stored at a Wat here on the main drag.
Simulators of the Obsolete
Edward wrote:
This will I predict include the commercial viability of comprehensive simulators for older machines, because these can be useful in reading old file formats.
Now that's something I hadn't thought of in relation to this discussion. Fascinating. It's an indirection layer to solve the file format problem.
At the risk of trivializing, it reminds me of the trend I see of the increasing popularity of "tribute bands" which mimic near exactly the musical experience of a rock band (usually) that no longer exists. With the passage of time, people who missed the original experience, or who long for a nostalgic return, can go see a tribute band and re-experience Led Zeppelin, for instance, or anyone else you can think of.
The idea that all computers that ever existed will be available in the future as machine simulators is a science-fiction-like vision. I wonder what traditionally strict companies like Microsoft will allow in terms of licensing and copying for people who want to, for instance, develop a simulator for a twenty year old version of Windows that runs an eighteen year old version of Word...
Switching back to the file format part of the discussion, what about RTF as a time-proof file format? It's an open standard and uses plain text (though with formatting info embedded inline, failing Rob's decoupling test).
Dan
Save as .txt
In the case of Word docs, you can just do "Save As..." and choose .txt which will then bring up a dialog with several options:
1.) Encoding (UTF-8 is an option)
2.) line-feed formats
3.) Allow replace characters (for example, replacing proprietary bullets with asterisks)
Wouldn't that allow you to use the tool at hand and then strip out the formatting as you please? You could use the text files as a content backup and the .doc files to tweak your presentation.
For photos and such, my wife and I burn those to CD once we have a CD's worth and then delete them from the computer. We write the from: and to: dates on the CD and then put them in one of those CD binders. If the .jpg graphic format ever goes extinct we might be in trouble, but that's not going to happen for a LONG time.
Limited Shelf Life of CD-Rs
Maybe things have gotten better with the default, off-the-shelf CD-Rs coming out today, but it used to be the case that if you didn't specifically purchase archive-quality CD-Rs, the dye would start to degrade after a few years and data loss would occur. I've had this happen already with many early CD-Rs that I bought in the late 90's. Since learning about this, I only buy Mitsui brand, which I believe is guaranteed for 100 years. The Mitsui Gold line is guaranteed for even longer, I think.
Dan
Quantum Computers and .txt
Andy's point (and Edward's amplification) about quantum computers obviating even the binary format of today's data is one well taken. Who knows what the future holds? Quite possibly, 20 years from now a new generation of computers will emerge and everything anyone's done on computers up to this point will be stranded, or individuals will have to go through a migration process of some sort (which 99.999% will skip--how many of us burned our vinyl to CDs?).
I don't like the "Save as .txt" solution because, as George Costanza might have said, that's not a system! That's a complete breakdown of the system! It's a goofy manual backup system that requires me to jump through extra hoops by simultaneously saving in two formats. Can't someone else do it?
Actually, I found that you can edit text docs directly in Word. This presents its own problem though. For one, Word continually prompts you to save your file in its preferred .doc format.
Don't trust those CDs for backup, Trevor. It is a pretty danged unstable format. Not that I can think of a better solution other than renting 100 GB of offsite storage and running backups of your hard drive every night.
(Actually, here's a neat idea. What about peer-to-peer backups? I back up my data onto your computer, you back up yours onto mine. That way if one of us loses our computer in a catastrophic flood, well, at least the family photos are safe, and at no cost! As it happens, I have a reasonably easy-to-use backup utility out there called SyncJammer that could be used for this purpose, if FTP server software is running in both locations.)
Basically, what this discussion has shown me is that all of the systems out there for protecting your data are still very ugly and kludgey. Hopefully, once we get used to the fact that our precious things are only stored on unstable hardware, a decent, easy-to-integrate-into-your-life system will emerge. Something that grows out of the return to the "the computer is the network" concept that we see reflected in sites like Flickr and the new Google spreadsheet might be a promising solution to this dilemma, if it grows in the right direction and if people get used to the idea that their stuff is actually safer if it "really" lives on a networked computer hundreds of miles away than if it lives on their own personal hard drive.
StarOffice files
The word you're looking for is "compressed". According to the OpenOffice FAQ http://xml.openoffice.org/faq.html#3 it's just zipped XML.
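If it really is just zipped XML, getting the words back out takes very little code. The sketch below assumes the document body lives in an archive member called content.xml and that paragraphs are elements whose local name is "p"; I believe that matches the OpenOffice layout, but treat both as assumptions and check against the FAQ. The demo builds a tiny stand-in archive in memory (with a made-up namespace) rather than using a real .sxw file.

```python
import io
import zipfile
import xml.etree.ElementTree as ET


def extract_text(source):
    """Pull paragraph text out of a zipped-XML document.

    source may be a file path or a file-like object.
    Assumes the body is stored in a member named content.xml.
    """
    with zipfile.ZipFile(source) as archive:
        xml_bytes = archive.read("content.xml")
    root = ET.fromstring(xml_bytes)
    paragraphs = []
    for element in root.iter():
        # Compare local names only, so no exact namespace URI is needed.
        if element.tag.rsplit("}", 1)[-1] == "p":
            paragraphs.append("".join(element.itertext()))
    return "\n".join(paragraphs)


# Build a stand-in archive in memory; the namespace URI here is invented.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr(
        "content.xml",
        '<doc xmlns:text="urn:example:text">'
        "<text:p>First paragraph.</text:p>"
        "<text:p>Second <text:span>one</text:span>.</text:p>"
        "</doc>",
    )
buffer.seek(0)
print(extract_text(buffer))
```

So the text was never munged, just compressed, and any unzip tool plus a text editor would get you most of the way there even without a script.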


Too true
I have thought about this before now. It was provoked by finding a website I wrote, whilst at high school aged 15, still alive on the web. It embarrasses me but if you must http://chris.gaskell.8m.com/
What is going to happen to all this ‘stagnant’ data? It’ll never be brought up-to-date or saved in new formats. Surely in the long term technologies will move on and applications using this data will be no more.
Should we be trying to make the web backwards compatible?
(At least my animated gifs still spin :-p)
Chris Gaskell, .NET & Web development Enthusiast