In The Age Of Culling

Discussing the dumb thing CNET did in an effort to please the Google Gods: Don’t cull old news content to improve your SEO ranking. That’s your history!

Today in Tedium: In the past, I’ve been effusive in my praise of CNET, a news outlet that (along with Wired) pioneered digital journalism, for one specific reason: Its archives have been kept safe from meddling. It is one of the most in-depth archives of news that we have from the early years of the internet, and it is arguably of the most important flavor—day-to-day, standard-issue news content. But its ownership picture has changed in recent years, with the media holding company Red Ventures purchasing it (and its subsidiaries, most notably ZDNet, which hasn’t been attached to Ziff Davis for more than 20 years) for $500 million back in 2020. Red Ventures has made some controversial moves with the CNET property, most infamously bringing AI into the mix, but the latest move was like a dagger to the heart: Gizmodo revealed that the company was actively culling its utterly massive archives for search-engine optimization reasons. I’ve talked about killing sites before, but in today’s Tedium, let’s talk culling, and why it’s often just as bad. — Ernie @ Tedium

The Smithee Letter is a sales letter meets David Lynch meets Cormac McCarthy meets Harold Pinter meets Sarah Ruhle. The winding, dark, strange character study is fictional but the products and brands are very real. So, do your part and save "Smithee" today by subscribing and clicking every email like their life depends on it, because it does. Save "Smithee" Now

“Content pruning removes, redirects, or refreshes URLs that are no longer relevant or helpful to our audience, including outdated or redundant content that may confuse or frustrate users looking for up-to-date information. (For example, a 1996 article about available AOL service tiers is no longer relevant or useful.) Also called content deprecation, it sends a signal to Google that says CNET is fresh, relevant and worthy of being placed higher than our competitors in search results.”

— A description of “content pruning,” the process CNET’s archive is undergoing, in an internal memo acquired by Gizmodo reporter Thomas Germain ahead of the scoop he wrote about CNET earlier this week. Despite CNET’s claims, Google Search Liaison Danny Sullivan directly refuted the idea that this is an effective strategy in a tweet he posted just ahead of Germain’s story.
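To make the memo’s jargon concrete: at the HTTP level, a “pruned” URL ends up in one of three states. It still resolves, it redirects somewhere newer, or it is gone entirely. Here is a minimal sketch, and emphatically not CNET’s actual tooling, of how you might audit what happened to a batch of old article URLs; the domain and paths are placeholders.

```python
# A minimal sketch (not CNET's actual process): given a list of old article
# URLs, record whether each one still resolves, redirects, or has been removed.
# The domain and paths below are hypothetical placeholders.
import requests

OLD_URLS = [
    "https://www.example.com/news/1996-aol-service-tiers/",
    "https://www.example.com/news/2003-some-gadget-review/",
]

for url in OLD_URLS:
    try:
        # Some servers reject HEAD requests; swap in requests.get() if needed.
        resp = requests.head(url, allow_redirects=False, timeout=10)
    except requests.RequestException as exc:
        print(f"{url} -> request failed ({exc})")
        continue

    if resp.is_redirect:
        print(f"{url} -> redirected to {resp.headers.get('Location')}")
    elif resp.status_code in (404, 410):
        print(f"{url} -> removed ({resp.status_code})")
    else:
        print(f"{url} -> still live ({resp.status_code})")
```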

This is not a news website. (Devin H/Unsplash)

A news website is not a bonsai tree. Often, it’s a library and an archive. You shouldn’t prune a news archive.

A few years ago, I wrote a piece that described the domain name system (DNS) as a system of branches built from a series of top-level domains. I think that comparison largely stands, but I do think that when you get to the website level, the comparison falls apart to some degree.

Oh sure, there are folders and directories, and those branch off in a manner akin to the world’s most complex flow chart, but in some ways, the information architecture starts to matter a little less. If you look at the domain system as a tree of branches, by the time you finally reach the leaves, you are going to see those leaves as disposable.

The problem is, those leaves are often not disposable. Much the opposite. They do not fall off the tree every fall after growing through the spring and summer. In many cases, they just stay there, maybe less visible than they once were, but still central to what the site represents. If they fall off, the tree is not doing its job. It’s the point where the tree metaphor for the internet totally breaks down, if you get my drift.

If we did look at websites as trees with branches full of meaningless tiny leaves, it would be all too easy to see every website as a bonsai tree in waiting.

Yes, bonsai, the Japanese art of growing miniature trees and pruning them perfectly, makes some sense as a metaphor for websites that can be redesigned and restructured as necessary. But when it comes to the information architecture, the metaphor rings hollow over time, because it does not show proper deference to the information.

Now, it’s true that not every website is going to have long-term value. The term ephemera comes to mind, and sometimes, ephemeral things should be allowed to fade away with the passage of time. (Unless you’re Ted Nelson and you keep those things.) Maybe you decide that your marketing blog shouldn’t live forever, that you would prefer to class up the joint and aim for a different approach.

But there’s a real risk of looking at a website as a tree you should prune, rather than a library or archive of information intended to be analyzed and studied over time. The value of the tool is that it has been allowed to grow and build into a robust archive of information. If the leaves are allowed to fall off the tree, following the metaphor, the tree grows significantly less valuable.

Despite my well-known opinions on backlink begging, I don’t think search engine optimization is necessarily evil, or even a necessary evil. I think it’s something that can make a lot of sense in a lot of contexts. The problem is, when you are building a resource to stand the test of time, SEO can feel diametrically opposed to the reasons you created the resource in the first place.

Can you believe that people used to laminate leaves in books for safekeeping? (Jeremy Thomas/Unsplash)

But to be clear, ignore what the SEO experts say: News sites should never be trimmed with a tiny pair of clippers, because all their value is in the leaves. The branches are just a conduit.

You wanna keep the leaf metaphor? How about this: A news site is a collection of leaves laminated in books for safekeeping. If we fail to keep them safe, what was the point of collecting the leaves?

CNET, for years, was doing this right. But now, with a change in ownership, it’s throwing out its laminated leaves. And that’s a damn shame.

“This is an industry-wide best practice for large sites like ours that are primarily driven by SEO traffic.”

— CNET spokesperson Taylor Canada, speaking in the Gizmodo piece. You can see why CNET was fooled—just do a search for “content pruning” on Google, and you’ll find that basically every single SEO expert under the sun recommends content pruning as an effective strategy for removing dead weight, essentially based on perceived signals from Google. So for a top Google search expert to literally go out and say that “no, you shouldn’t do it,” as Danny Sullivan did this week, is basically throwing years of SEO “best practices” out the window with a single tweet. It’s kind of hilarious in hindsight.

A shot of CNET’s studios, circa 2014. (Nan Palmero/Flickr)

CNET pruning its content is a harbinger of something bigger

To be clear, I am not a fan of CNET removing its content from its website in this way. It favors the present over the past, and removes important dots of context that future generations of researchers and journalists can use to tell important stories. And CNET owns perhaps the most complete archive of an important period of technology—the 1995 to 2005 period when the internet first took off, disrupted industries, and reshaped our relationship with the world.

But let’s be clear here. CNET is not a unique case. I am going to tell you right now that CNET is not the first website that has removed or pruned its archives, or decided to underplay them, or make them hard to access. Far from it.

What I have noticed over the years is that websites will often go through major redesigns and decide not to pull certain types of content over. A lot gets lost in the process, and there are many reasons not to put in the work, including time, complexity, and resources.

Or maybe they will just not put in the work to make that information accessible without a lot of legwork.

To offer a few examples:

  1. USA Today is 40 years old, and has been online for 27 of those years. Its sitemap only goes back to October of 2012, when it conducted a major redesign. While its content is archived in print elsewhere, that means essentially half of its digital history is just not there. Older content, when accessible, appears to be buried at the subdomain usatoday30.usatoday.com, but that redirects to the main page and is hard to search via Google. (A sketch below shows one way to check how far back a site’s sitemap actually reaches.)
  2. Newsweek is prominently celebrating its 90th anniversary this year, but its archive page only covers issues dating back to 2013, the year IBT Media bought the company from IAC, the parent company of The Daily Beast. The content from its legacy years (most famously this piece) is still there and can be searched, but it is not obviously organized.
  3. PC World, unlike its longtime competitor CNET, does not offer an easily accessible sitemap for its content, but a little digging around uncovers the fact that it once did, and many of the articles in that sitemap are no longer accessible. The oldest article in the PC World archives is from 1997, and odds are it was simply missed in the pruning process, because it is one of only three articles currently on the site from the 1990s.
  4. People.com, while its archive appears to be online, does not make it easy to access said archive, because of the way the site is designed. There is no user-accessible sitemap or easy-to-use pagination, and its search results show no visible dates on the articles.
  5. Pitchfork has strong historic archives at this point, in part because search is a key reason why the site works, but it has been known to remove reviews in the past, including old reviews that didn’t reflect its later editorial sensibility. It notably faced a particularly odd situation around the time of its purchase by Condé Nast: A onetime reviewer became a very loud critic of the site, at one point making legal threats, and the site responded by removing that writer’s entire body of work.

Of these examples, the most egregious is easily USA Today, which, as a national newspaper, fits under the “paper of record” category. One could argue that the company has a backstop in the form of online archives, but the company has left literally decades of content sitting behind broken links, inaccessible online. Poof.
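If you want to gauge how shallow a site’s public archive has become, the sitemap is usually the quickest tell, as the USA Today and PC World examples above suggest. Here’s a rough sketch that tallies one sitemap file’s entries by year; the sitemap URL is a placeholder, and real sites typically split their archives across many files listed in a sitemap index, so treat this as a spot check rather than a census.

```python
# A rough sketch: fetch a single sitemap file and tally its <lastmod> entries
# by year to see how far back the publicly listed archive reaches.
# The sitemap URL is a hypothetical placeholder.
from collections import Counter
from xml.etree import ElementTree

import requests

SITEMAP_URL = "https://www.example.com/sitemap-archive.xml"
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

resp = requests.get(SITEMAP_URL, timeout=30)
resp.raise_for_status()
root = ElementTree.fromstring(resp.content)

# Count URLs by the year in their <lastmod> date (entries without one are skipped).
years = Counter(
    lastmod.text[:4]
    for lastmod in root.findall(".//sm:url/sm:lastmod", NS)
    if lastmod.text
)

for year, count in sorted(years.items()):
    print(f"{year}: {count} URLs")
```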

Now, to be clear, I do not blame small teams struggling to keep up with the huge tide of content for being unable to maintain the past. It’s a lot of work, even in the best of circumstances. Moving from one CMS to another takes time and money, both of which are sometimes in short supply.

(Alan Levine/Flickr)

But I only bring up these failings to point out that, while CNET is clearly making a mistake here, it is one that many other sites make in ways big and small every single day. The difference in the case of CNET is that it appears to be intentional and not just an oversight or the result of institutional history going away. CNET intentionally put in the work to protect its historic content in the first place, and now it’s putting in the work to bury it.

CNET’s editorial practices are understandably under a laser focus right now, in no small part because it was one of the first major outlets to quietly publish AI-generated articles. But let’s not give the rest of the media world a pass because CNET was the one in the headline.

These articles help expose the past; they represent the hard work of people who spent a lot of time building them.

Five news sites that do their archives exceptionally well

  1. The New York Times. I have my problems with them in part because of their handling of trans issues, but they really set the standard that the rest of the industry has followed.
  2. The Washington Post. Not only are many of their articles accessible digitally (I have linked stories from them from the 1970s), but they offer their subscribers access to their print archives via ProQuest, making their content accessible at a level familiar to many academics.
  3. Time Magazine. Nearly every Time Magazine cover from the start of its history until today is accessible from its vault, and many of its articles are also freely accessible.
  4. The New Yorker. It not only offers full scans of its magazine archive to subscribers, it also takes time to resurface its archive content on the website.
  5. The Los Angeles Times. While the newspaper’s archive site shunts off the newspapers themselves to Newspapers.com, it has a strong historic sitemap that includes full articles from 1985 on and a selection of pieces from prior years.

I think an important distinction to make here, as I close this piece out, is that we should not expect the Internet Archive’s Wayback Machine to save us. Sure, it certainly can, and its affiliated Archive-It service has collected much of CNET’s history.

But we should not expect it. The Internet Archive is an important resource, but it is a fallible one and an overworked one. These companies have the money and resources. They should be the ones protecting their own work. The Internet Archive is there to clean up your mess, not excuse your failings.
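If you do want to know whether the Wayback Machine has a given page covered, the Internet Archive exposes a public availability endpoint you can query. Here’s a small example; again, this is a spot check, not a preservation plan.

```python
# Check whether the Wayback Machine has any capture of a URL, using the
# Internet Archive's public availability API.
import requests

def latest_snapshot(url: str) -> str | None:
    """Return the closest archived snapshot URL, or None if nothing is captured."""
    resp = requests.get(
        "https://archive.org/wayback/available",
        params={"url": url},
        timeout=30,
    )
    resp.raise_for_status()
    closest = resp.json().get("archived_snapshots", {}).get("closest")
    return closest["url"] if closest else None

print(latest_snapshot("https://www.cnet.com/"))  # prints a web.archive.org URL, or None
```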

Ultimately, we should encourage better standards from publishers and other players in the content ecosystem. The reason why this content dies is because not enough work has been done to properly value it.

There are at least three distinct ways I could see other sites avoiding the CNET trap in the digital era:

I think CNET is making a mistake, but they’ve also identified a serious problem for the digital news ecosystem: Old content is a burden to carry, and intentionally or not, Google isn’t making that lift any easier.

I think as researchers, journalists, and SEO heads, we need to put our heads together to figure out ways to make maintaining historic content look less like a burden and more like an opportunity.

Our history is at risk if we can’t.

--

Find this one an interesting read? Share it with a pal!

And thanks again to The Smithee Letter for sponsoring.