Make Digital Preservation Easier

If major companies think it’s too hard or costly to leave up user-generated content, perhaps we need to change the motivations.

Today in Tedium: Whether the source of content is Yahoo! Answers, Geocities, 12-year-old images from Imgur, or a random YouTube channel full of forgotten videos, the easiest-to-remove parts of the internet are often those that aren’t getting noticed. On the one hand, if you go into any library, most books are just sitting on shelves, actively being unused. But on the other, the internet is simply designed to be screwed with. Websites do not stay stationary, encased in amber, and there is significant financial motivation for large companies to only play the hits. After all, it’s why Top 40 radio isn’t all Dishwalla, all the time. With the news that Google is about to implement a rule letting the company remove content from accounts that haven’t been accessed in two years, I’m left to wonder if the problem is that the motivations for maintaining sites built around user-generated content simply do not favor preservation, and never will without outside influence. How can we change that motivation? Today’s Tedium, in the spirit of a 2019 piece we wrote about digital preservation, attempts to see the corporate point of view on mass content removal. — Ernie @ Tedium

Join over 4 million people who read Morning Brew—the daily email covering the latest news across business, finance, and tech. It’s free & only takes 5 minutes to read so that you can get all of the most relevant updates and move on with your day.

“I understand your usage of groups is different from the majority of our users, and we understand your frustration. However, the resources needed to maintain historical content from Yahoo Groups pages is cost-prohibitive, as they’re largely unused.”

A statement sent to an archivist in 2019 as Verizon took steps to shut down the vast majority of the existing Yahoo Groups. (Verizon has since sold most of its stake in Yahoo to Apollo Global Management, an investment firm.) It’s worth keeping in mind that for most large companies, the costs of continuing to host such content would have been relatively minimal—especially when, as with Yahoo, the company was part-owned by a corporation with its own distribution network.

(Denny Muller/Unsplash)

The problem with corporate motivations is that they aren’t the same as the user’s, even when the user made the content

Whether it’s Google, Verizon, Disney, Nintendo, or Sony, the corporate motivations for keeping content available online for long periods differ greatly from the motivations that drive external visitors.

Users very much carry an expectation of permanence, just as they did with physical media. But in the context of online distribution, these companies have competing interests driving their decision-making that discourage them from taking steps to protect historic or vintage content.

And in the case of user-generated content, there might be outside considerations at play. Perhaps they are concerned that something within an old user agreement might come back to bite them if they leave a website online past its sell-by date, opening them up to liability. Perhaps the concern is old, outdated code that may look harmless on the outside but is effectively a potential attack surface in the wrong hands. After all, if they’re not keeping an eye on it, who’s to say someone can’t take advantage of that?

(That’s the case Google is making around its move to shut off old accounts, noting that such accounts are significantly less likely to have two-factor authentication turned on, meaning they can be at risk of compromise.)

And then there are reasons that are a little more consumer-hostile. In 2021, Nintendo ended sales of a bunch of old Mario content in both digital and physical form. It evokes the gating of home video releases that Disney used to do in an effort to keep its back catalog feeling fresh and to make more money from it later.

When it comes to websites, though, much of that content is user-generated, even if a technology company technically maintains it. I have to imagine the expectation is that a company can only absorb so much in maintenance costs, and that its motivation to keep absorbing them is limited.

But on the other hand, as digital preservationist David Rosenthal has pointed out, in the grand scheme, preservation is not really all that expensive. The Internet Archive runs on a budget—soup to nuts—of around $20 million or less per year, roughly half of which goes to staff salaries. And while it doesn’t capture everything (in part because it can’t!), it covers a significant portion of the entire internet, literally millions of websites. It runs a fairly complex infrastructure, with some of its 750 servers online for a decade or longer and storage capacity in the hundreds of petabytes. Given that it is trying to store decades’ worth of digitized content—including entire websites that were forgotten long ago—it’s pretty impressive!

So the case that it costs too much to keep publicly hosting a site containing years of historically relevant user-generated content is bunk to me. It feels like a way of saying, “we don’t want to shoulder the maintenance costs of this old machine,” as if content generated by users could be upgraded the way a decade-old computer can.

One thought I have is that this issue repeatedly comes up because corporate motivations naturally lean in favor of closure once the financial motivation has dried up. Legislation could be one way to right the axis in favor of preservation—but such a law could be difficult to pass. (This was the crux of my case for trying to make the legislation behind the National Register of Historic Places apply to websites.)

I would love to see legislation that favors the preservation of public-facing content where possible, even when liability is a concern. But I’m a realist—a law like that would have many moving parts and may be a tough sell. So, if we can’t get a law, maybe we need to build strategies that make maintaining a historic website a lighter lift.

2012

The year that the genealogy platform Ancestry.com launched a new site, Newspapers.com, to offer paid archives of newspapers to interested parties. The company, which charges about $150 per year for access to the archive, has helped maintain access to the historic record for researchers who need it. (I’m a subscriber, and it’s worth it.) With the exception of paid Usenet services like Giganews, this model has not really been tried for vintage digital-only content, which seems like a major missed opportunity for companies that raise concerns about the financial costs of maintaining old platforms, like Yahoo/Verizon. Certainly I would prefer it to be free, but given a choice between paid access and no access at all, I’d pay money to get at old content. Just throwing that out there.

(Ethan Hoover/Unsplash)

A middle ground: An “analog nightlight” mode for websites

In some ways, I think part of the motivation for taking down old or outdated websites is the assumption that keeping them up means keeping all of their internal systems online, too.

But I think archivists and historians would be more than happy if public-facing content—that is, content that appeared on search engines, or was part of the main experience when logged in at a basic level—was prioritized and protected in some way. That would at least keep the information alive, even if its functionality was limited.

There’s something of a comparison here that I’d make: When the U.S. dropped the vast majority of its analog TV signals in favor of digital tuning, it led to something called the “analog nightlight,” in which very minimal, basic information was presented on analog stations during the period before they were switched off for good. A TV host relayed basic information to viewers about the transition, and told them what to do next. It didn’t entirely work—TV stations in smaller markets didn’t actually air the analog nightlight—but it helped give a sense of continuity as a new medium found its footing.

This approach, to me, feels like a path forward that could minimize the crushing pain of losing historic content while taking away much of the risk that comes with continuing to host a site that may no longer be popular but still has value in a long-tail sense.

In the case of an “analog nightlight” equivalent for websites, the goal would essentially be to shut down any sort of attack surface through good design and planning. Before the site is taken offline in its original form, users are given the chance to download their old content or remove it from the website over a period of, say, 60 days. This is not too dissimilar from the warnings that site operators already offer when they shut things down today.

But once the deadline is hit, the site operators launch a minimal version of the original platform, with no way to log in or comment. If it’s an account that’s getting shuttered, such as a forgotten YouTube account, it remains in a preservation mode, with no changes allowed. The information is static, and there’s no directly accessible backend. That’s actually the important part of this—the site needs to be untethered from its original content-management system so no new content can be added. Instead, the content would be served up as a barebones static site (perhaps with advertising, if they roll that way), so as to minimize the “attack surface” left by a site that is not actively being maintained.
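To make that concrete, here’s a rough sketch of what the freeze step might look like—a hypothetical Python crawler (the site name is made up, and it assumes the common third-party requests and beautifulsoup4 packages) that walks the public pages and strips out anything implying a live backend before writing flat HTML to disk:

```python
# A minimal sketch of the "freeze" step, assuming the third-party
# requests and beautifulsoup4 packages. The site name is hypothetical.
from pathlib import Path
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

START_URL = "https://example-community.invalid/"  # stand-in for the site being frozen
OUT_DIR = Path("frozen_site")


def freeze(start_url: str, out_dir: Path, limit: int = 500) -> None:
    """Crawl same-host public pages and write them out as static HTML."""
    host = urlparse(start_url).netloc
    queue, seen = [start_url], set()
    while queue and len(seen) < limit:
        url = queue.pop()
        if url in seen:
            continue
        seen.add(url)
        resp = requests.get(url, timeout=10)
        if "text/html" not in resp.headers.get("Content-Type", ""):
            continue
        soup = BeautifulSoup(resp.text, "html.parser")
        # Strip anything that implies a live backend: scripts, forms,
        # and embedded frames that might point at login or comment systems.
        for tag in soup(["script", "form", "iframe"]):
            tag.decompose()
        # Follow same-host links so the whole public surface is captured.
        for link in soup.find_all("a", href=True):
            target = urljoin(url, link["href"]).split("#")[0]
            if urlparse(target).netloc == host:
                queue.append(target)
        # Map the URL path to a flat file on disk.
        path = urlparse(url).path.strip("/") or "index"
        if not path.endswith(".html"):
            path += ".html"
        dest = out_dir / path
        dest.parent.mkdir(parents=True, exist_ok=True)
        dest.write_text(str(soup), encoding="utf-8")


if __name__ == "__main__":
    freeze(START_URL, OUT_DIR)
```

The output is just a folder of flat files. It can be served from a cheap object-storage bucket, with none of the original CMS code running anywhere.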

This reflects relatively recent best practice in the content-management space. Platforms like Netlify have gained popularity in recent years because they actively separate the means of distribution from the means of production, which minimizes security risks. That’s a great approach for live-production sites, but for sites intentionally meant to stay static, it removes one of the biggest risk factors that might discourage a content owner from continuing to host the work.
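And a frozen site needs almost nothing to serve it. As a toy illustration—Python’s standard library only, pointed at the hypothetical frozen_site/ folder from the sketch above—the entire “backend” of a read-only archive can look like this:

```python
# A toy read-only server for the frozen_site/ folder from the sketch
# above, using only Python's standard library.
from functools import partial
from http.server import HTTPServer, SimpleHTTPRequestHandler


class ReadOnlyHandler(SimpleHTTPRequestHandler):
    # SimpleHTTPRequestHandler only knows GET and HEAD; anything else
    # already fails with a 501. Rejecting POST explicitly makes the
    # read-only intent unmistakable.
    def do_POST(self):
        self.send_error(405, "This archive is read-only")


if __name__ == "__main__":
    handler = partial(ReadOnlyHandler, directory="frozen_site")
    HTTPServer(("0.0.0.0", 8000), handler).serve_forever()
```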

As far as liability concerns go, language could be included on the page to allow users to remove old content if they so choose, along the lines of the “right to be forgotten” measure in the European Union’s General Data Protection Regulation (GDPR)—though that measure includes a carve-out for historical research, which an archived version of a website would presumably fall under. And sites driven by user-generated content are generally protected by Section 230 in the United States anyway, so liability for the content itself largely falls on the end user who posted it.

And if, even after these steps, a company still feels uncomfortable about hosting a dead website, they should reach out to librarians and archivists to donate the collection for maintenance purposes—perhaps with a corresponding donation to said nonprofit so they can cover the hosting costs. The Internet Archive actually offers a service like this!

The one site that makes me think a model like this could work is Gawker. Before its 2021 relaunch (sadly, since shuttered), the site remained encased in amber. Comments were closed and not visible to end users—a true shame, as those comments often fed into the writing. But the content, the part that was truly valuable and important, was there, accessible and readable, even if you couldn’t do anything with it other than read it.

While Gawker lay dormant, this kept its memory alive until someone cared enough to revive it. This feels like a model that could work for the rest of the internet.

Look, I’m going to be the first to fully admit it: right now, the motivation to protect publicly accessible user-generated content exists only if the company hosting it feels “nice” about it.

And even then it feels like a bit of a surprise.

It’s still online, but it moved.

A while back, Warner Bros. got a little bit of flak for replacing its long-online Space Jam website, which dated back a quarter-century in its original form, with a site for the sequel. But I think what the company did was actually shockingly noble. It not only left the old site online, but made it accessible from the new one. The work was not perfect—I think they should do archivists a solid by putting 301 redirects on the vintage site’s old URLs, so they point to its new home—but the fact that they showed the initiative at all is incredibly impressive given what we’ve seen of corporate motivations when it comes to preservation.
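For what it’s worth, that redirect fix is a tiny one. In practice it would be a couple of lines of web-server configuration; here’s the same idea as a hypothetical Python shim, with made-up paths standing in for the vintage site’s actual URL scheme:

```python
# A hypothetical 301 shim in Python's standard library. The paths are
# made up; only the pattern (old URL -> permanent redirect) matters.
from http.server import BaseHTTPRequestHandler, HTTPServer

# Illustrative mapping from vintage paths to the archive's new home.
REDIRECTS = {
    "/cmp/jam.html": "/1996/cmp/jam.html",
    "/cmp/sitemap.html": "/1996/cmp/sitemap.html",
}


class RedirectHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        target = REDIRECTS.get(self.path)
        if target is None:
            self.send_error(404)
            return
        # A 301 tells browsers, search engines, and archival crawlers
        # alike that the move is permanent, so old links keep working.
        self.send_response(301)
        self.send_header("Location", target)
        self.end_headers()


if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8000), RedirectHandler).serve_forever()
```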

Honestly, part of this was the result of people associated with the website’s creation still being at the company years later and being willing to speak up for preserving it. A 2015 Rolling Stone article explains that the site was actually taken down briefly after it went viral in 2010—some executive made the call to shut it down—only for employees involved in its creation (by then in leadership roles at the company) to swoop in and save it.

“If we had left the company, the site probably would not exist today,” said Andrew Stachler, one of the employees involved in saving it. “It would’ve gone down for good at that time.”

But imagine if they weren’t there. We’d be telling a different story right now.

And perhaps that’s what many companies need—someone who is willing to go to bat for the purposes of archival and protection of historic content.

In the digital age, preservation is the act of doing nothing but minimal upkeep and being comfortable with that fact. As proven time and time again, companies are more than comfortable with killing services entirely rather than leaving well enough alone.

Perhaps the way to save user-generated content is by making it as painless as possible to keep the status quo.

--

Find this one an interesting read? Share it with a pal!

And thanks again to Morning Brew for sponsoring.