To future historians—not just of computing, but of humanity—the current period will be a dark age.

How was Facebook used by students in the 2010s? We cannot show you; that version of Facebook is not hosted anywhere.

How did MySpace look around 2009? We don't really know; the Wayback Machine only shows a limited amount of static content, and there may only be a few surviving screenshots.

What correspondence did Vint Cerf have, as president of the ACM, with other luminaries of the computing industry and research? We do not know; Google will not publish his emails.

What was it like playing Angry Birds on an iPhone 3G? We do not know; Apple is no longer distributing signed receipts for that binary.

What did the British cabinet discuss when they first learned of the Coronavirus pandemic? We do not know; they chatted on a private WhatsApp group.

What books were published analysing the aftermath of the Maidan coup in Ukraine? We do not know; we do not have the keys for the Digital Editions DRM.

How was the coup covered in televised news? We do not know; the broadcasters used RealVideo and Windows Media Encoder and we cannot read those files.

We have to ask ourselves how we are going to preserve and transmit knowledge about our age to the next generations. Knowledge about an age where information is produced, consumed and discarded within hours, days or months; where it is stored only in the server rooms of a handful of corporations, with no guarantee that those businesses will exist in the future; and where that information cannot be accessed unless a certain set of regulatory, hardware and software pre-conditions are met.

That's why projects like the Internet Archive deserve more recognition and funding. That's why web scraping should be not only a civic right, but a civic duty to the next generations. Otherwise all the knowledge about the great age of information will be transmitted orally, with all the distortions that such transmission implies.

@blacklight This seems overly pessimistic. Even if 99% of the data is lost, I suspect they will have lots of data to work with from second-hand sources, including reviews, screenshots, video walkthroughs and the like. It's not quite the same, but still an abundance compared to what historians have to work with from previous eras?

@skybrian @blacklight I agree. What is often said about police investigations likely applies here too: nowadays we have a *lot* more data available in general, and tons of additional data is being produced all the time.

I do agree on the issues mentioned, though: how long that data can be stored is a real question, and proprietary formats and closed platforms strongly limit that possibility…

@skybrian you are assuming that the web services that serve screenshots, reviews and "indirect" content today will still be available in a few decades; that the content they serve will still be reachable at the same URLs, or at least still be stored on hardware that will be easy to access (think how hard it is today to find a floppy drive, and think of a puzzled historian in 100 years dealing with a SATA hard drive); and that there will still be clients that can render that content properly (think of how much Flash content from 10 years ago is unprocessable on today's computers, and how much more will be unprocessable in 50 or 100 years).

All in all, tomorrow's historians may still manage to access content from our age, but their job will be much, much harder than that of today's historians.

@blacklight taking screenshots as an act of resistance to protect perishable sources

@KAR0LINGUS screenshots are still hard to parse and search. My approach is to scrape any interesting webpage I bump into to text, store it on my server, and share the scraped link instead of the original one behind a paywall, metered access or authentication.
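The scrape-to-text step can be sketched with just Python's standard library. This is a minimal illustration of the idea, not my actual pipeline, and the class and function names are made up:

```python
from html.parser import HTMLParser


class TextScraper(HTMLParser):
    """Collect the visible text of an HTML page, skipping scripts and styles."""

    SKIP = {"script", "style", "noscript"}

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0  # >0 while inside a <script>/<style> element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty visible text fragments.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def scrape_to_text(html: str) -> str:
    """Reduce an HTML document to its visible text, one fragment per line."""
    parser = TextScraper()
    parser.feed(html)
    return "\n".join(parser.chunks)
```

A real setup would add a fetch step (e.g. with `urllib.request`), a readability-style content extractor, and a proper HTML-to-Markdown converter; but the core idea, strip the markup and keep the text, is the same.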

@blacklight printing screenshots, or analyzing them in papers and books, is a more effective strategy in the long run. You are doing a great job, btw, and historians will thank you, but paper is still less perishable than any server-run archive.

@KAR0LINGUS @blacklight a PDF may be easier to search, and so on.

@blacklight Recently I have been thinking about this often, particularly in terms of things like communications in governments, but also archiving personal communications across platforms. There's so much that's easily lost if active steps are not taken to preserve it; much more planning is needed than when communications were on paper that could be put aside and would generally persist. Can preservation be better built into our digital lives? Should it be? On the other hand, storing all the data costs huge amounts in resources.

@kat we actually *need* to make digital preservation expensive for information gatekeepers, because preserving information is an expensive but much-needed process, and Big Tech has so far ignored its responsibilities in this regard.

Paper archives also cost a lot to maintain and preserve, and when we started this digital revolution we naively thought that archiving and preserving information was a thing of the past. Now we just need to take those costs back into account.

In an ideal world with ideal trade-offs, information of public interest (be it a news article, a video, a piece of music, or correspondence between public figures) should be allowed to be stored on private platforms for a limited amount of time (e.g. from 5 to 25 years). During that time, the platform or the author has the right to profit from that content and decide how it should be accessed, as long as: 1. they don't erect barriers against researchers (information should *always* be transparent to scientists and historians); and 2. they accept that with the right to profit comes the duty to preserve: if you don't allow your content to be exported in a downloadable static format, and you don't allow anybody else to scrape or copy it either, then you need to make sure that the information remains accessible, that URLs don't break, that a disaster in a data center doesn't destroy all the information, and so on.

All these things should be explicitly codified into law. We need to make sure that information gatekeepers understand their responsibilities. The additional burden of data preservation required when you are the only way to access some information may hopefully push some of them to release more data publicly. The law should also require businesses that go bust to ensure that data of public interest is not lost once their shop is closed.

After the "profitability and ownership period", all content should be made publicly accessible in a static format for archiving purposes. Appropriate storage media should be used for the backup (no magnetic disks; optical media are best, because they can be preserved for 100-200 years), and all the static content should be publicly accessible on the Internet Archive. GitHub's Arctic Code Vault project a couple of years ago raised some awareness of the importance of preserving today's source code, and it was also a nice marketing stunt for the company, but things like it should be the norm, not the exception.

The music industry has similar mechanisms. Authors, copyright holders and their heirs are entitled to the revenue from their creations for about 50 years (this should definitely be shorter in the case of the IT industry). During that time, record labels have to ensure that the information is preserved, i.e. that the master files or tapes are properly stored and accessible; damage that makes the original information unavailable can lead to the business being sued. After that time, the art enters the public domain, and copying and reproducing it does not incur any sanctions.

I'm wondering why the Internet industry, which is far larger and more valuable than the music industry, doesn't have any such legal framework yet.

Until we have such a legal framework in place, I try to play my part by scraping the web into Markdown one URL at a time, and donating to the Internet Archive.

@blacklight even with all that effort, i wonder how much of our culture would survive a major disruption to the global supply chain.

i'm not aware of any accessible way to store even a gigabyte of data for 50 years without active intervention.

punched cards have a surprisingly long lifetime, and the advantage of being decodable without a computer, but absolutely dismal data density.

pressed CDs can hold a significant amount of data quite stably, but are only really viable for mass produced media like music.

of course, most of this is theoretical, as these technologies haven't really been around long enough to be tested in this aspect.

@binarycat so far optical discs are actually the most viable way. If well preserved they should last at least two centuries.

The best idea is probably something like what GitHub has done: convert all the stored source code into an optical QR-like format, print it on film, and store it in a giant vault in a cave well above the Arctic Circle. But it's obviously also the most expensive approach, especially if done at a greater scale than "just" a one-shot backup of GitHub's repos.

I do agree with the online services part, but

> What was it like playing Angry Birds on an iPhone 3G? We do not know; Apple is no longer distributing signed receipts for that binary.

The iPhone 3G is very jailbreakable.

> the broadcasters used RealVideo and Windows Media Encoder and we cannot read those files.

It's not like ffmpeg is going to poof out of existence in the foreseeable eternity.


> The iPhone 3G is very jailbreakable.

Then let's just hope that future historians will also manage to get their hands on a guide on how to jailbreak an iPhone 3G.

> It's not like ffmpeg is going to poof out of existence in the foreseeable eternity.

Long live ffmpeg, and may none of its codecs be lost by takedown requests.

@blacklight @grishka

Emulation comes to mind. We are able to emulate computers that were simply impossible to emulate when they came out.

@yuliyan @grishka that assumes the software (either as source code or as a binary image) and/or the hardware specs remain available.

And emulation only solves one part of the problem: in the case of online services, the data may simply no longer exist or be accessible, even if you manage to emulate the exact LAMP stack of Facebook in 2010.

In the case of FB: static HTML copies are all there is. They are very robust, since they only contain the rendered state of the page. I still have HTML snapshots of my FB timeline from 2012 which I can open today. Fun stuff.

@blacklight Print > digital. That's the best takeaway I have here.

Also, it's not like digital sources are all we have. You mentioned oral transmission, but oral history is a thing, so with proper methodologies that's not a big issue, in terms of reliability.

And yes, there are still print sources like newspapers, and government documents are still printed physically anyway. Especially true in much of the developing world.

@adgaps we can't possibly print everything, especially when it comes to audio and video content. Plus, if we wanted to print the whole internet, we probably wouldn't have enough trees on earth to make the paper. And even if we put our best efforts into it, we'd still need somebody to archive all this paper in a searchable format (and archivist and librarian have been professions in decline for the past couple of decades).

I think that the solution has to be digital in order to be scalable and environmentally sustainable, but there's almost nobody out there thinking of how digital content should be archived in such a way that it will still be available in 50 or 100 years.

Oral transmission is surely an option, but it saddens me to think that we've achieved so much in transmitting information from one end of the world to the other within the blink of an eye, only for all that digital information to eventually be lost in the data center of a private business that will eventually shut down and that is not obliged to preserve its content. In the end we'd have to mostly resort to oral transmission (which comes with plenty of constraints and distortions, as tales mutate from one narrator to the next), just like we did before the people of Mesopotamia figured out how to write, thousands of years ago.

@blacklight yes, that has been a lot on my mind lately. It's viable to rewrite every history and fact and control populaces with manufactured narratives as well. Information should be protected and decentralized.

@blacklight You are so right about that! I never thought of it. 😢

@blacklight Indeed, we live through a highly volatile age. Great article!

@blacklight How about burning bodies instead of using graves? People of the future will not know about today's people. Will Rockefeller, Hasidic Jews, Kolomoisky, Musk, Soros, Rothschild, Gates and the other richest criminals burn their own bodies and the bodies of their families? I do not think so.

@blacklight Also, let's look at what people own. Money? No. It is a piece of plastic and a web interface to digital assets that do not exist at all.
Video? Music? Pictures? No. They are on the corporation's servers.
Private files on cloud servers? They have the right to block you from accessing your property on their servers.
Lands and resources? No. They own lands and resources in almost every country in the world. And we must pay them. Their companies hire us. Is it freedom? No. It is the same slavery.

@blacklight it's not just big tech and social network silos.

In the '80s/'90s, before the internet reached people's homes, there was an entire culture around BBSes: self-hosted systems that you'd dial into directly with your modem, hosting files, chat, a form of email, articles, games, etc. Some were networked to each other; many were isolated.

The culture and stories of this subculture are now almost gone, except for some documentaries and ANSI art archives. Most people are not even aware it ever existed.

This one isn't due to any big company; we, the users who both ran and used the systems, just didn't think to archive any of it until it was already gone. Thankfully there are still some logs and remains of systems we could set up for later research, but not much.
