Words, Links, and Centrality: Evaluating 17 Years of Gadgetopia Content
I’ve been blogging at Gadgetopia for almost 17 years. Here’s my first post from 2002.
I’ve decided to do some clean-up.
Some history –
Before Twitter, blogs and RSS were Twitter, so a lot of my posts from 2003-2005 were very tweet-like – they were very short, often less than 100 words, and just contained a single link. They were admittedly low-value. I was often just throwing out a link that my audience might find interesting, and adding some context around it (and, okay, I would often throw in an opinion or…12).
And lest you think the word “audience” is overstating things, know that back in the heady “blogosphere” days in the early Naughts, we did have defined audiences. RSS, comments, and pingbacks coalesced to form loose communities, much like social networks do now. I still know and interact with people who I first “met” in Gadgetopia comments 10-15 years ago. I’ve met several of them in person.
A big problem for Gadgetopia is that many of those old posts have broken links, and many times, the link was the entire point of the post. This means that they’re a segment in a chain of link rot. Without the link, the post doesn’t make sense, and I’m just cluttering up the internet.
So, it’s time for a purge. I’m determined to clean up old posts by removing them intelligently. I won’t “404 Not Found” them (yes, I used that as a verb) – that status code means “Nothing was ever here.” Rather, I’ll “410 Gone” them, which means, “There was something here, but it has been removed.” Additionally, I’ll provide contact information so if someone really wants something, they can get in touch, and I’ll dig it out of the archives just for them.
But here’s the thing –
When inventorying content, you cross an important line when you can no longer look at each piece of content individually. At some point, you can’t come to a determination through nuanced perception and analysis, because there’s just too much content.
Gadgetopia has over 7,000 posts. Even if I reviewed 100 blog posts a day – which is harder than you think – it would still take me almost four straight months to sort through everything.
So, what do you do when there’s too much?
Content Evaluation Rules
David is a consultant who specializes in migrations and large-scale site transformations. It was on this first consulting project together that he brought up the idea of “rules.” David had the idea that we would score each piece of content according to a set of metrics, and our first pass of evaluation would be based on that numeric value.
I remember thinking this was really…impersonal. I was convinced nothing could substitute for human-powered, individual evaluation. For me, a content inventory was an artisanal experience.
Honestly, I don’t remember if we actually implemented rules on that project, but a few years later, David gave a great conference session on rule processing. He had a case study where he had done this for a client at scale. What he described there made more sense to me, because the client had a lot of content. I started to come around, and the idea of a rule-based approach has stuck with me.
David has gone on to create a “content exploration” tool called Content Chimera – currently in beta – which allows for the slicing, dicing, scoring, and mass evaluation of content:
Content Exploration and Decision-Making At Scale
That tagline is telling, and we’ll revisit it below.
But, back to Gadgetopia – the only way I can hope to get through all this content is to evaluate posts with the help of some set of rules or metrics. I’m still in the middle of this process, but I thought I’d share with you some of what I’ve learned, and some of the data I’m planning to use when scoring and evaluating Gadgetopia posts.
What Are The Goals of Your Content?
When you evaluate content, you invariably run into the question: what is the overall point of the content?
There’s just no way to figure out whether something stays or goes without evaluating it against some yardstick of value. “Is this worth keeping?” can only be answered with some measure of “worth” in mind.
For Gadgetopia, there are two goals to the content:
To further the CMS industry, at whatever level is possible. I like to think that posts I’ve published have sparked discussions, decisions, or otherwise influenced behavior. (ex: someone told me the other day that their framing of “content operations” came from this post, which they credited as the seminal definition of the term – I don’t know if this is actually true, but it made me feel pretty good)
To provide marketing value for my company, Blend Interactive. Second only to our body of work itself, Gadgetopia is the largest collection of public evidence that we know what we’re doing (yes, even more than my book). I often refer to Gadgetopia posts in proposals, and I send links to people to support a sales effort.
(Note that these are the goals of the content today. As I’ve been going through old content, I was struck by how the goals of a site and its content change over time. Posts that seemed valuable a decade ago seem silly now, and that’s okay. We all change. Even websites.)
In articulating those goals, I’ve been careful to give myself permission to make decisions in flagrant violation of the rules when following them would impede one of the two goals above. (Which raises the question, why even have rules then? We’ll talk more about that in the conclusion below.)
More helpfully, my goals help me define what I’m looking to keep: I want to keep content that somehow benefits the CMS industry with its availability, or helps market my company by demonstrating our competence and breadth of experience.
Rules are helpful to me only to the extent that they assist me in identifying that content.
False Positives and Negatives
Here’s what I originally feared when David proposed using evaluation rules: a rule might make a mistake – it will somehow be applied to some outlier content that fits outside the norm, and that content will either be:
Kept when it doesn’t advance a goal
Discarded when it does advance a goal, or there’s some other intangible or tacit reason why it should be kept
The latter freaks me out more than the former. A pointless post taking up space at some URL doesn’t bother me so much, but a good post that can’t be found anymore is specifically what I’m trying to avoid.
Rules aren’t perfect, and they’re going to make mistakes. The two fears aren’t reasons to discard their use entirely, but we’ll talk more later about how to minimize the chance they’ll do real damage.
David points out that your fear of rules is directly proportional to how permanent your changes are going to be. If you’re actually deleting, well, that can be scary. But if you’re just archiving at some level, or perhaps just removing content from indices, then the ramifications of rules become a lot easier to live with.
Finally, when looking at a content repository as large as this one, you need to ask: would a rule make more or fewer mistakes than a manual review? Without assistance, I’d probably make a ton of mistakes on my own. Worse, without rules to make it easier, I would never even do the evaluation, which might be the biggest mistake of them all.
The first, and most obvious metric might be raw traffic, in the form of page views.
I managed to pull the number of page views each post had gotten since January 1. I’d like to say this was graceful, but I just pulled a CSV from Google Analytics, and parsed it. I snapshotted this at a particular moment in time, but since it’s representative of a larger period, it’s a valid metric.
Of course, some posts were published during this time period, so they would have less traffic, and other posts were published within the last year, so they have had less of a chance to develop a “Google presence.” However, this doesn’t matter, because all posts from the last couple of years will likely be retained, so the odds of a false positive based on low traffic are slim.
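For illustration, the CSV step could look something like the Python sketch below. The column names (`Page`, `Pageviews`), the sample rows, and the helper names are assumptions made for the example, not the actual Google Analytics export or the script used for Gadgetopia. Any post that doesn't appear in the export at all is treated as having zero views.

```python
import csv
import io
from collections import defaultdict


def page_views_by_path(csv_text):
    """Sum page views per URL path from CSV text with 'Page' and 'Pageviews' columns."""
    totals = defaultdict(int)
    for row in csv.DictReader(io.StringIO(csv_text)):
        path = row["Page"].split("?")[0]                 # fold query-string variants together
        views = int(row["Pageviews"].replace(",", ""))   # "1,204" -> 1204
        totals[path] += views
    return dict(totals)


def zero_view_posts(all_paths, views):
    """Return the paths that never show up in the traffic export."""
    return sorted(p for p in all_paths if views.get(p, 0) == 0)


# Hypothetical sample data, not real Gadgetopia numbers.
sample = '''Page,Pageviews
/example-post-a,"1,204"
/example-post-a?utm_source=feed,6
/example-post-b,3
'''
views = page_views_by_path(sample)
unseen = zero_view_posts(["/example-post-a", "/example-post-b", "/example-post-c"], views)
```

Comparing the export against the full list of post URLs matters, because an analytics export only lists pages that were actually viewed.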
Clearly, there’s a long tail of posts with zero page views. How long? About 90% of the site. Yes, it’s true – over 6,500 posts have had no views in the last three months. That statistic hurt a little to comprehend, but there you go.
But, let’s go back to my two goals, and put this in perspective – if a post fulfills one of those two goals, does it matter that it doesn’t have any traffic in the three-month span I was checking for? No, it doesn’t. Given my goals, even if a post has no page views, I still might send it out on a proposal once a year or so, and it’s critical for that. So lack of traffic to an otherwise good post isn’t the deciding factor.
Consider this post: Spanning the Gap from Feature to Conversion: Are We Building the Right Bridges?
That’s gotten zero traffic since January 1, but it’s a good post that makes a solid point. I wouldn’t get rid of it based purely on the work I put into it and my belief that it has some value to the broader industry. So, traffic or not, that one stays.
What about the opposite – what if an objectively “bad” post gets a ton of traffic? Should I keep it? No – it doesn’t really fulfill a goal for me and is probably just cluttering up the internet.
This post about the mathematical order of operations is consistently the highest-trafficked post on Gadgetopia due to a quirk of Google indexing (to the point where I put a note at the top directing people to the comments, where the answer to their question is probably located). Do I want to keep this? No. It has nothing to do with anything else on the site.
Also worth noting – I think I might be doing a disservice to the internet by keeping this post up. The search referrers make it obvious that it’s mainly found by kids doing homework, and there are far better resources than this to find the answer to the question. Removing this post would allow people to find the answer more directly.
I actually 410’d a bunch of posts last year. Using the Google Search Console, I found all sorts of posts that were lodged in Google for strange reasons, and were attracting pointless, no-value traffic.
For example, I had a post about some aspect of a mobile phone (I’m being intentionally vague to avoid being indexed for the phrase again). This was just a “pass-through” post – I had a couple of sentences with a link to a news article I had found.
This post got a ton of traffic from Google, year after year. And it was pointless traffic from visitors that didn’t further any goal for me, and who really just needed to click the link in the post. But, they still often didn’t find what they were looking for, so they came back to my post, and filled up the comment with complaints.
So, I wrote some code in WordPress to send back a 410 whenever the post was requested, and added this note to the page:
If you’re looking for information on a [phrase that will not be written here], here is the link you want: [the link]
Here, take a look for yourself. With the 410, Google eventually de-indexed it, and any vestigial traffic due to hard-coded links will find my note.
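The WordPress code isn't reproduced here, but the idea is small enough to sketch. What follows is a minimal Python WSGI app, not the WordPress implementation, that answers 410 Gone with a short note for posts that were deliberately removed. The path, the note text, and the `REMOVED_POSTS` name are hypothetical.

```python
# Hypothetical map of removed post paths to the note shown in their place.
REMOVED_POSTS = {
    "/example-removed-post": (
        "This post has been removed. "
        "If you really need it, get in touch and it can be dug out of the archives."
    ),
}


def app(environ, start_response):
    """Minimal WSGI app: 410 Gone for removed posts, 404 for any other path.

    A real site would also serve its live posts with a 200; that part is omitted.
    """
    path = environ.get("PATH_INFO", "")
    if path in REMOVED_POSTS:
        status, body = "410 Gone", REMOVED_POSTS[path]
    else:
        status, body = "404 Not Found", "Nothing here."
    payload = body.encode("utf-8")
    start_response(status, [
        ("Content-Type", "text/plain; charset=utf-8"),
        ("Content-Length", str(len(payload))),
    ])
    return [payload]
```

To try it locally, the standard library's `wsgiref.simple_server.make_server("", 8000, app)` can serve it. The point is only that the response carries both the 410 status and a human-readable note.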
Traffic is just not an absolute metric for me. It’s interesting to look at, and I might take it into account, but I wouldn’t remove or keep any post based solely on that number.
The second metric is content size, specifically word count. This matters because a lot of my earlier posts were really just tweets-in-disguise (Gadgetopia predates Twitter by six years). The longer a post, the more “meat” it had, and thus the greater potential for contributing toward one of my goals.
I did a search for posts of less than 100 words. Result: 2,805 posts. Most of these (about 1,700) have a single outbound link. There are so many small posts, in fact, that moving the threshold up to less than 300 words still includes 88% of the posts.
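A word-count search like that can be approximated in a few lines. This Python sketch assumes posts are available as HTML strings keyed by some ID; the function names and sample posts are made up for the example, and stripping tags with a regular expression is a rough shortcut rather than a proper HTML parse.

```python
import re
from html import unescape


def word_count(html):
    """Count words in a post body after crudely stripping tags."""
    text = unescape(re.sub(r"<[^>]+>", " ", html))
    # Treat "don't" and "tweet-like" as single words.
    return len(re.findall(r"\w+(?:['’-]\w+)*", text))


def short_posts(posts, threshold=100):
    """Return the IDs of posts with fewer than `threshold` words."""
    return sorted(pid for pid, html in posts.items() if word_count(html) < threshold)


# Hypothetical sample posts.
posts = {
    "a": "<p>Check out <a href='http://example.com'>this link</a>.</p>",
    "b": "<p>" + "word " * 150 + "</p>",
}
tiny = short_posts(posts)
```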
On the other end of the scale, here are the longest posts on Gadgetopia:
Content Management is an Emergent Skill: 6,290 words
Interview with Josh Clark: 5,070 words, but most of them not my own
The Truth About CMS Form Builders: 3,711 words
Content Personalization: A Reality Check: 3,596 words
Multi-stage Templating as Progressive Denormalization: Tied at 3,470 words
Points of Differentiation in Headless CMS: Tied at 3,470 words
Clearly, size is strongly correlated with posts I would keep.
I haven’t reviewed them all, but I can’t think of a single post less than 100 words that I would keep. On the other side, very long posts are likely to be something about CMS as that’s the topic I write most deeply about (the Josh Clark interview even relates – at the time he was running Big Medium).
A possible option here: correlate word count against some of the other, more advanced metrics below and determine the smallest size that could reasonably produce a post of marketing or thought leadership value.
Rule: Link Integrity
I’ve decided to clean up because so many links are broken. So, the third rule I’m looking at is outbound link count and integrity – what percentage of links in the post are broken?
A common pattern for me years ago was posting simply as a method to share a link, so I searched for how many posts contained just a single link (3,385) and how many of those single-link posts were broken (2,264).
That’s right: fully 67% of single-link posts are just broken segments in a chain. If someone got to them, they wouldn’t get any further.
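As a sketch of how a per-post "percentage of broken links" could be computed, the Python below collects absolute links from a post's HTML and asks a checker whether each one is broken. It's an illustration under assumptions, not the script used for the numbers above: the function names are invented, "broken" is taken to mean a network error or an HTTP error status, and the checker is passed in as a parameter so it can be swapped or stubbed. Some servers reject HEAD requests, so a real checker would likely need a GET fallback.

```python
import http.client
import urllib.request
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects absolute http(s) links from <a href="..."> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href and href.startswith(("http://", "https://")):
                self.links.append(href)


def outbound_links(html):
    parser = LinkCollector()
    parser.feed(html)
    return parser.links


def is_broken(url, timeout=10):
    """Assumed definition: any network error or HTTP error status counts as broken."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status >= 400
    except (OSError, ValueError, http.client.HTTPException):
        # urllib's URLError and HTTPError are OSError subclasses, as are timeouts.
        return True


def broken_ratio(html, check=is_broken):
    """Fraction of a post's links that the checker reports as broken (0.0 if none)."""
    links = outbound_links(html)
    if not links:
        return 0.0
    return sum(1 for url in links if check(url)) / len(links)
```

Note that this treats every absolute link as outbound; filtering out links back to the same site is left out to keep the sketch short.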
But what threshold do you set when just some of the links are broken? When does that invalidate the entire post?
Consider this post from 15 years ago where I discussed (and linked to) all the tools I use to do my job. There are 29 links in that post, and 24 of them are no longer valid. Clearly, this post needs to go.
But what about this post from 16 years ago where I discuss the difference between ephemeral and permanent content (which is ironic in the context of this discussion)? Of the 18 links, 5 of them are broken. That’s less than half, but still a lot.
That post in particular is complicated because it’s on a topic I still ponder quite often today. Even though many of the links are bad, the topic is still valid, and I make some points worth preserving. If I decide to keep it, do I somehow call out that I know a lot of the links are broken, but I’m keeping it anyway?
I actually did this with a post last year. I put a note at the top that said:
Sadly the serial number lookup system doesn’t seem to be in the same place any longer. The link no longer works, but the story and reasoning are still interesting, so I’m leaving the post up. Plus, the screencap link still works, so you can see how the system used to work.
And here’s another:
The main link in this article no longer works (as of October 2016), but the quote stands on its own, so I’m leaving this post up.
I’ll probably end up removing these two posts for other reasons (they don’t contribute to either of my two goals), but what about situations like the older post listed earlier, where a critical mass of broken links still doesn’t invalidate the main thrust of the content?
I could envision a script that periodically checked links and put a class on them that popped a message on mouseover that said, “Note: this link doesn’t work anymore,” and maybe a note at the top that warned the reader before they invested time.
So, outbound link integrity is another brush stroke, but – like the other rules – not absolute.
For years, I’ve kept track of intra-site links. When a post is published, I extract all the links to other posts on Gadgetopia. I store these in a database table, so at the bottom of every post, I can provide links to posts which link into that post.
This was an attempt to provide links to posts that were related, but didn’t exist at the time I was writing the first post. It’s worked reasonably well over the years, enabling readers to look “forward in time” to see how the post they’re reading was referenced later on.
Using this data, I attempted to capture something else I’ve been interested in for some time: “centrality.”
By centrality, I mean, how central is a post to the “network” of posts on Gadgetopia? All the posts with an outbound link to another Gadgetopia post, or an inbound link from one, combine to form a network of posts. You can envision a map with cities and highways which connect those cities. Post A might link to Post B which might link to Post C and form something of a link chain. Mathematically, this is known as a “graph”, and we can do all sorts of analysis on it (there’s an entire science to it: Graph Theory).
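The inbound-link bookkeeping described above can be sketched as a simple index. In this Python example an in-memory dictionary stands in for the database table, and the URL scheme in `SITE_PREFIX`, the post IDs, and the function names are all hypothetical.

```python
import re
from collections import defaultdict

SITE_PREFIX = "https://example.com/post/"   # hypothetical URL scheme for posts


def internal_targets(html):
    """Return the IDs of same-site posts linked from this HTML."""
    pattern = re.compile(r'href="' + re.escape(SITE_PREFIX) + r'([\w-]+)"')
    return pattern.findall(html)


def build_inbound_index(posts):
    """Map each post ID to the sorted list of post IDs that link to it."""
    inbound = defaultdict(set)
    for source_id, html in posts.items():
        for target_id in internal_targets(html):
            if target_id != source_id:          # ignore self-links
                inbound[target_id].add(source_id)
    return {target: sorted(sources) for target, sources in inbound.items()}


# Hypothetical posts: a links to b, b links to c and to an outside site.
posts = {
    "a": '<a href="https://example.com/post/b">b</a>',
    "b": '<a href="https://example.com/post/c">c</a> <a href="http://other.example/">x</a>',
    "c": "",
}
inbound = build_inbound_index(posts)
```

Rendering the "posts which link here" list at the bottom of post `c` is then a lookup of `inbound["c"]`.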
I found a graphing library (QuickGraph), and calculated paths from one post to another. There were 1,286 posts with an inbound or outbound link (so they were considered part of the larger network/graph), which resulted in about 1.6 million theoretical paths.
It turned out that only 20,944 could be calculated (if two posts were linked to each other but nothing else, they then formed a “link island” that didn’t connect to any of the other 1,284 posts). Each path was a series of “hops,” and I recorded every post that any hop passed through, then counted up the occurrences.
This, it turns out, is called “betweenness centrality.” I didn’t realize it was a known thing until a mathematician responded to my question on the Math Stack Exchange to tell me that I had re-invented an algorithm. Yay!
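The counting procedure can be sketched in plain Python. This is a simplified illustration, not the QuickGraph code: it assumes links are directed, finds one shortest path per ordered pair of posts with a breadth-first search, and counts how often each post appears as an intermediate stop. Textbook betweenness centrality weighs all shortest paths between a pair rather than just one, so libraries such as NetworkX (`betweenness_centrality`) will give somewhat different numbers.

```python
from collections import Counter, deque


def shortest_path(graph, start, goal):
    """Breadth-first search; returns one shortest path as a list, or None."""
    previous = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            path = []
            while node is not None:
                path.append(node)
                node = previous[node]
            return path[::-1]
        for neighbor in graph.get(node, ()):
            if neighbor not in previous:
                previous[neighbor] = node
                queue.append(neighbor)
    return None


def hop_counts(graph):
    """Count appearances of each post as an intermediate hop on the paths found."""
    nodes = set(graph) | {n for targets in graph.values() for n in targets}
    counts = Counter()
    paths_found = 0
    for start in nodes:
        for goal in nodes:
            if start == goal:
                continue
            path = shortest_path(graph, start, goal)
            if path is None:
                continue                    # no route, e.g. a "link island"
            paths_found += 1
            counts.update(path[1:-1])       # intermediate posts only
    return counts, paths_found


# Hypothetical link graph: a chain a -> b -> c -> d, plus an island x <-> y.
graph = {"a": ["b"], "b": ["c"], "c": ["d"], "x": ["y"], "y": ["x"]}
counts, paths_found = hop_counts(graph)
```

In the sample graph, only 8 of the 30 possible ordered pairs have a path at all, which is the same effect as the gap between theoretical and calculable paths described above.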
Here, then, are the five “most central” Gadgetopia posts:
What Makes a Content Management System: June 2007
Content Publishing Models: June 2006
The Content Tree: August 2005
The Four Disciplines of Content Management: November 2007
Middle Ground: Content Management using Static HTML: November 2005
Those are all very foundational posts. I wrote them all well over a decade ago, linked to them often in the years immediately after, and linked to posts that linked to them in the many years since.
Clearly, centrality is deeply affected by age, since the passing of time makes it more likely that a post will appear in a “link chain”…but that’s not wrong. This is a measure of how central something is, and age is naturally going to affect that.
So, centrality has value. Looking at the list of posts from most central downwards, it absolutely correlates with high-value posts which meet my goals.
The gold standard of a rule would be for it to understand what a post is about, and make a recommendation based on that.
But this is an inherently human-powered concept (-ish; don’t @ me about AI and ML…). And remember that the inability to scale human evaluation is why we’re talking about rules in the first place.
The next best option for automated processing is keyword analysis: what specific words/tokens appear together in content that fits a certain profile?
I’ve identified some words that I feel like would signify a valuable post. Here is my first, rough pass at it:
CMS content management editor publishing workflow approval scheduling expiration vendor URL
I started to write my own algorithm to find these words, when I realized I was just re-writing a scoring search engine. So I rigged up a script using Lucene.NET to index all the posts, then run a big query with all those terms concatenated with an OR operator, and then score the result, with the title of a post boosted to 1.2x.
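To make the idea of "OR all the terms together, boost the title" concrete, here is a deliberately naive Python stand-in. It is not Lucene's scoring formula and won't reproduce the scores listed in this post: it simply adds up how often each keyword appears per word of text, with title matches multiplied by 1.2. Treating "content management" as a single phrase, and the sample titles and bodies, are assumptions.

```python
import re

# Keyword list from above; splitting it into these terms is an assumption.
KEYWORDS = ["cms", "content management", "editor", "publishing", "workflow",
            "approval", "scheduling", "expiration", "vendor", "url"]
TITLE_BOOST = 1.2


def term_frequency(text, term):
    """Occurrences of a term (word or phrase) per word of text."""
    lowered = text.lower()
    words = re.findall(r"\w+", lowered)
    if not words:
        return 0.0
    hits = len(re.findall(r"\b" + re.escape(term) + r"\b", lowered))
    return hits / len(words)


def keyword_score(title, body, keywords=KEYWORDS):
    """Sum of per-term frequencies; any matching term contributes (an OR query)."""
    return sum(
        TITLE_BOOST * term_frequency(title, term) + term_frequency(body, term)
        for term in keywords
    )


# Hypothetical posts.
high = keyword_score("CMS Workflow", "A post about CMS vendor workflow approval.")
low = keyword_score("My Phone", "A link to a phone article.")
```

A real search engine also accounts for how rare each term is across the whole collection, which is a large part of why delegating the scoring to one makes sense.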
Based on the keyword list above (which is subject to change), here are the top posts:
What Makes a Content Management System: June 2007; Score: 1.53
Content Publishing Models: June 2006; Score: 1.09
Patterns in URL Redirection After CMS Migrations: August 2016; Score: 0.86
You Want Collaboration, Not Workflow: June 2014; Score: 0.79
The Why and How of CMS Vendor Partnerships: April 2009; Score: 0.76
(Hilariously, when I publish this post, it will no-doubt acquire the highest score, since it literally contains a list of every single keyword, perfectly adjacent to each other.)
This method is very promising, because I’m scanning down the list of the top 100 posts by score, and there’s maybe three posts I don’t think I would keep.
What’s also interesting is that my list of keywords keeps changing as I look through the results. It shows me posts that I absolutely know I plan to keep, and they help jog my memory of other keywords that might reveal other posts. So, this will take some time to evaluate, as I refine my list of words.
(Yes yes – this could be automated. I could find a set of posts I know I will keep, then use term vectoring to find other posts which are similar to them. And, again, don’t think I haven’t considered the AI/ML options, but I just don’t know enough about that field to make any good decisions around it)
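For the curious, the term-vector idea mentioned in that aside can be sketched in a few lines of Python: turn each post into a bag of word counts, then rank candidate posts by cosine similarity to the posts already marked as keepers. This is a bare-bones illustration with invented names and sample text. It uses raw word counts only, so common words carry too much weight; a real version would at least weight terms by rarity.

```python
import math
import re
from collections import Counter


def term_vector(text):
    """Bag-of-words vector: lowercase word -> count."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two term vectors (0.0 if either is empty)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def most_similar(keepers, candidates, top_n=5):
    """Rank candidates by their best similarity to any keeper post."""
    keeper_vectors = [term_vector(text) for text in keepers.values()]
    scored = []
    for post_id, text in candidates.items():
        vector = term_vector(text)
        best = max((cosine_similarity(vector, k) for k in keeper_vectors), default=0.0)
        scored.append((best, post_id))
    scored.sort(reverse=True)
    return [(post_id, score) for score, post_id in scored[:top_n]]


# Hypothetical posts.
keepers = {"k": "content management system workflow publishing"}
candidates = {
    "x": "workflow and publishing in a content management system",
    "y": "new mobile phone released today",
}
ranked = most_similar(keepers, candidates)
```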
After sifting through old posts for a week, and slicing and dicing them based on the metrics from above, I keep coming back to David’s tag line for Content Chimera:
Content Exploration and Decision-Making At Scale
That’s really what this is: an exploration of content.
Rules are less of a final arbiter, and more a rough tool that helps you explore a repository of content and make preliminary judgments with a relatively high degree of accuracy. I suspect I could select content (1) based purely on rules, or (2) based on pure manual evaluation, and realize that my results are 99% the same.
Additionally, the definition and development of these rules has forced me to think critically about my content and figure out what particular metric signals that a piece of content is worth keeping. Or at least worth reviewing to determine if I should keep it.
For me, I’m looking mainly for:
Longer blog posts…
With a good keyword score…
And some measure of centrality.
What I’m going to use my rules for is to rule out content that has none of the above. Looking at those three metrics, I’m pretty sure I can strike out 90% of the site with confidence that I’m not removing anything of value. If something is less than, say, 800 words, with a low keyword score and no centrality to speak of, then it’s just gonna go.
Note that I’m not going to delete it forever – I’m just going to change the status and return a 410 with my contact information. If I see a lot of 410s for a particular piece of content, I can review on a case-by-case basis to see if I made a mistake.
As for the remaining 700-800 posts, I’ll review link integrity as a second pass, then manually review from there. And as for traffic, I’ve decided not to pay attention to it, for reasons I explained above.
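The first-pass rule described above is simple enough to write down. In this Python sketch, the 800-word threshold comes from the text, while the keyword-score and centrality cutoffs, the class and function names, and the sample posts are assumptions for illustration. A post is flagged for removal only when it fails all three tests; everything else goes to the review pile.

```python
from dataclasses import dataclass


@dataclass
class PostMetrics:
    post_id: str
    word_count: int
    keyword_score: float
    centrality: int


MIN_WORDS = 800             # "say, 800 words"
MIN_KEYWORD_SCORE = 0.1     # assumed cutoff for a "low" keyword score
MIN_CENTRALITY = 1          # assumed: zero means "no centrality to speak of"


def should_remove(post):
    """True only when the post is short AND scores low AND has no centrality."""
    return (post.word_count < MIN_WORDS
            and post.keyword_score < MIN_KEYWORD_SCORE
            and post.centrality < MIN_CENTRALITY)


def triage(posts):
    """Split posts into (remove, review) lists of IDs."""
    remove = [p.post_id for p in posts if should_remove(p)]
    review = [p.post_id for p in posts if not should_remove(p)]
    return remove, review


# Hypothetical metrics.
sample = [
    PostMetrics("short-link-post", 60, 0.0, 0),
    PostMetrics("long-cms-essay", 3500, 0.9, 40),
    PostMetrics("short-but-central", 300, 0.0, 12),
]
remove, review = triage(sample)
```

Requiring all three conditions keeps the rule conservative: a short post that other posts depend on still gets a human look.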
And this gets to the zen of it all –
We need to acknowledge that content has goals and a purpose, and that purpose is different for every site and every creator. If you don’t know what the goals of your content are, you simply can’t make an informed decision about whether or not to keep it.
Rules do not take the place of goals, and that was never their intention (apologies to David for my reaction to his plan in 2009). They only make you more efficient in finding content that fulfills the goals you’ve defined.