Traffic is just not an absolute metric for me. It’s interesting to look at, and I might take it into account, but I wouldn’t remove or keep any post based solely on that number.
Rule: Size
The second metric is content size, specifically word count. This matters because a lot of my earlier posts were really just tweets-in-disguise (Gadgetopia predates Twitter by six years). The longer a post, the more “meat” it had, and thus the greater potential for contributing toward one of my goals.
I did a search for posts of less than 100 words. Result: 2,805 posts. Most of these (about 1,700) have a single outbound link. There are so many small posts, in fact, that moving the threshold up to less than 300 words still includes 88% of the posts.
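As a rough illustration of this kind of threshold search (not the query actually run against Gadgetopia), here is a minimal Python sketch. It assumes posts are available as a mapping of ID to HTML body; the sample data and the function names are hypothetical.

```python
import re


def word_count(html: str) -> int:
    """Count whitespace-delimited words after crudely stripping HTML tags."""
    plain = re.sub(r"<[^>]+>", " ", html)
    return len(plain.split())


def share_below(posts: dict, threshold: int) -> float:
    """Return the fraction of posts with fewer than `threshold` words."""
    if not posts:
        return 0.0
    short = sum(1 for body in posts.values() if word_count(body) < threshold)
    return short / len(posts)


# Hypothetical sample: one link-style post and one long essay.
sample_posts = {
    "link-post": '<p>Cool thing: <a href="http://example.com">look</a></p>',
    "essay": "word " * 500,
}

for limit in (100, 300):
    print(limit, share_below(sample_posts, limit))
```

Running the same function at several thresholds is a cheap way to see how much of the archive each cut-off would sweep up before committing to one.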
On the other end of the scale, here are the longest posts on Gadgetopia:
Size correlates strongly with the posts I would keep.
I haven’t reviewed them all, but I can’t think of a single post less than 100 words that I would keep. On the other side, very long posts are likely to be something about CMS as that’s the topic I write most deeply about (the Josh Clark interview even relates – at the time he was running Big Medium).
A possible option here: correlate word count against some of the other, more advanced metrics below and determine the smallest size that could reasonably produce a post of marketing or thought leadership value.
Rule: Link Integrity
I’ve decided to clean up because so many links are broken. So, the third rule I’m looking at is outbound link count and integrity – what percentage of links in the post are broken?
A common pattern for me years ago was posting simply as a method to share a link, so I searched for how many posts contained just a single link (3,385) and how many of those single-link posts were broken (2,264).
That’s right: fully 67% of single-link posts are just broken segments in a chain. If someone got to them, they wouldn’t get any further.
But what threshold do you set when just some of the links are broken? When does that invalidate the entire post?
Consider this post from 15 years ago where I discussed (and linked to) all the tools I use to do my job. There are 29 links in that post, and 24 of them are no longer valid. Clearly, this post needs to go.
But what about this post from 16 years ago where I discuss the difference between ephemeral and permanent content (which is ironic in the context of this discussion)? Of the 18 links, 5 of them are broken. That’s less than half, but still a lot.
That post in particular is complicated because it’s on a topic I still ponder quite often today. Even though many of the links are bad, the topic is still valid, and I make some points worth preserving. If I decide to keep it, do I somehow call out that I know a lot of the links are broken, but I’m keeping it anyway?
I actually did this with a post last year. I put a note at the top that said:
Sadly the serial number lookup system doesn’t seem to be in the same place any longer. The link no longer works, but the story and reasoning are still interesting, so I’m leaving the post up. Plus, the screencap link still works, so you can see how the system used to work.
And here’s another:
The main link in this article no longer works (as of October 2016), but the quote stands on its own, so I’m leaving this post up.
I’ll probably end up removing these two posts for other reasons (they don’t contribute to either of my two goals), but what about situations like the older post listed earlier, where a critical mass of broken links still doesn’t invalidate the main thrust of the content?
I could envision a script that periodically checked links and put a class on them that popped a message on mouseover that said, “Note: this link doesn’t work anymore,” and maybe a note at the top that warned the reader before they invested time.
So, outbound link integrity is another brush stroke, but – like the other rules – not absolute.
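To make the "what percentage of links are broken" measure concrete, here is a minimal Python sketch. It is not the tooling used for the numbers above. It assumes that only absolute http(s) links count as outbound, and it takes the liveness check as an injected function (`is_alive`, a hypothetical name) so the ratio logic can be run without touching the network. In practice that function would wrap an HTTP request.

```python
from html.parser import HTMLParser
from typing import Callable, Optional


class LinkCollector(HTMLParser):
    """Collect absolute http(s) href values from anchor tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href and href.startswith(("http://", "https://")):
                self.links.append(href)


def outbound_links(html: str) -> list:
    collector = LinkCollector()
    collector.feed(html)
    return collector.links


def broken_ratio(html: str, is_alive: Callable[[str], bool]) -> Optional[float]:
    """Fraction of outbound links that fail the check, or None if there are no links."""
    links = outbound_links(html)
    if not links:
        return None
    broken = sum(1 for url in links if not is_alive(url))
    return broken / len(links)


# Hypothetical stand-in for a real HTTP check: a fixed set of known-dead URLs.
DEAD = {"http://gone.example.com/a", "http://gone.example.com/b"}
sample_html = (
    '<a href="http://gone.example.com/a">one</a> '
    '<a href="http://gone.example.com/b">two</a> '
    '<a href="https://alive.example.com/">three</a>'
)
print(broken_ratio(sample_html, lambda url: url not in DEAD))
```

By this measure the tools post above would score 24/29 (about 83%) and the ephemeral-content post 5/18 (about 28%), which is why a single cut-off is hard to pick.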
Rule: Centrality
For years, I’ve kept track of intra-site links. When a post is published, I extract all the links to other posts on Gadgetopia. I store these in a database table, so at the bottom of every post, I can provide links to posts which link into that post.
This was an attempt to provide links to posts that were related, but didn’t exist at the time I was writing the first post. It’s worked reasonably well over the years, enabling readers to look “forward in time” to see how the post they’re reading was referenced later on.
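The inbound-link bookkeeping described above can be sketched roughly as follows. This is an illustration in Python, not the site's actual code: the URL pattern for internal posts is an assumption, and an in-memory dictionary stands in for the database table.

```python
import re
from collections import defaultdict

# Assumption: internal posts are linked with URLs shaped like this (hypothetical pattern).
INTERNAL_LINK = re.compile(r'href="https?://(?:www\.)?gadgetopia\.com/post/(\d+)"')


def extract_internal_links(html: str) -> set:
    """Return the IDs of other posts on the site that this HTML links to."""
    return set(INTERNAL_LINK.findall(html))


def build_inbound_index(posts: dict) -> dict:
    """Map each post ID to the set of post IDs that link into it."""
    inbound = defaultdict(set)
    for post_id, html in posts.items():
        for target in extract_internal_links(html):
            if target != post_id:  # ignore self-links
                inbound[target].add(post_id)
    return dict(inbound)


# Hypothetical sample: post 3 links to posts 1 and 2, and post 2 links to post 1.
sample = {
    "1": "<p>The original post.</p>",
    "2": '<p>See <a href="http://gadgetopia.com/post/1">this</a>.</p>',
    "3": '<p><a href="http://gadgetopia.com/post/1">A</a> and '
         '<a href="http://gadgetopia.com/post/2">B</a></p>',
}
inbound = build_inbound_index(sample)
print(sorted(inbound["1"]))  # the posts to list at the bottom of post 1
```

The same index, read in the other direction, is the edge list for the link graph discussed next.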
Using this data, I attempted to capture something else I’ve been interested in for some time: “centrality.”
By centrality, I mean, how central is a post to the “network” of posts on Gadgetopia? All the posts that have an outbound link to another Gadgetopia post, or an inbound link from one, combine to form a network. You can envision a map with cities and highways which connect those cities. Post A might link to Post B which might link to Post C and form something of a link chain. Mathematically, this is known as a “graph”, and we can do all sorts of analysis on it (there’s an entire science to it: Graph Theory).
I found a graphing library (QuickGraph), and calculated paths from one post to another. There were 1,286 posts with an inbound or outbound link (so they were considered part of the larger network/graph), which resulted in about 1.6 million theoretical paths.
It turned out that only 20,944 could be calculated (if two posts were linked to each other but nothing else, they then formed a “link island” that didn’t connect to any of the other 1,284 posts). Each path was a series of “hops,” and I recorded every post that any hop passed through, then counted up the occurrences.
This, it turns out, is called “betweenness centrality.” I didn’t realize it was a known thing until a mathematician responded to my question on the Math Stack Exchange to tell me that I had re-invented an algorithm. Yay!
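The path-counting approach can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the QuickGraph code: links are treated as directed, one shortest path is found per ordered pair of posts with a breadth-first search, and only the posts strictly between the two endpoints are counted. Textbook betweenness centrality additionally splits credit across all shortest paths between a pair, which this sketch does not do.

```python
from collections import Counter, deque


def shortest_path(graph, start, goal):
    """Return one shortest path from start to goal as a list, or None if unreachable."""
    parents = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            path = []
            while node is not None:
                path.append(node)
                node = parents[node]
            return path[::-1]
        for neighbor in graph.get(node, ()):
            if neighbor not in parents:
                parents[neighbor] = node
                queue.append(neighbor)
    return None


def path_counts(graph):
    """Count how often each post sits in the interior of a path between two others."""
    nodes = set(graph) | {n for targets in graph.values() for n in targets}
    counts = Counter()
    for start in nodes:
        for goal in nodes:
            if start == goal:
                continue
            path = shortest_path(graph, start, goal)
            if path:
                counts.update(path[1:-1])  # skip the two endpoints
    return counts


# Hypothetical link graph: a chain A -> B -> C -> D plus an isolated "link island" E <-> F.
links = {"A": ["B"], "B": ["C"], "C": ["D"], "E": ["F"], "F": ["E"]}
counts = path_counts(links)
print(counts["B"], counts["C"])  # B and C each sit inside two paths
```

With 1,286 posts there are 1,286 × 1,285 = 1,652,510 ordered pairs, which matches the "about 1.6 million" figure above. Running one search per pair is wasteful at that size; a single breadth-first search per starting post would produce the same paths far faster.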
Here, then, are the five “most central” Gadgetopia posts:
Those are all very foundational posts. I wrote them all well over a decade ago, linked to them often in the years immediately after, and linked to posts that linked to them in the many years since.
Clearly, centrality is deeply affected by age, since the passing of time makes it more likely that a post will appear in a “link chain”…but that’s not wrong. This is a measure of how central something is, and age is naturally going to affect that.
So, centrality has value. Looking at the list of posts from most central downwards, it absolutely correlates with high-value posts which meet my goals.
Rule: Keywords
The gold standard of a rule would be for it to understand what a post is about, and make a recommendation based on that.
But this is an inherently human-powered concept (-ish; don’t @ me about AI and ML…). And remember that the inability to scale human evaluation is why we’re talking about rules in the first place.
The next best option for automated processing is keyword analysis: what specific words/tokens appear together in content that fits a certain profile?
I’ve identified some words that I feel would signify a valuable post. Here is my first, rough pass at it:
CMS
content management
editor
publishing
workflow
approval
scheduling
expiration
vendor
URL
I started to write my own algorithm to find these words, when I realized I was just re-writing a scoring search engine. So I rigged up a script using Lucene.NET to index all the posts, then run a big query with all those terms concatenated with an OR operator, and then score the result, with the title of a post boosted to 1.2x.
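To show the shape of that scoring, here is a deliberately naive Python stand-in. It is not Lucene.NET and does not reproduce Lucene's relevance formula: it simply counts whole-word, case-insensitive hits for each keyword, adds them up (the OR behaviour), and weights title hits by the 1.2 boost mentioned above. The function names are hypothetical.

```python
import re

KEYWORDS = [
    "CMS", "content management", "editor", "publishing", "workflow",
    "approval", "scheduling", "expiration", "vendor", "URL",
]
TITLE_BOOST = 1.2  # title matches count slightly more than body matches


def count_hits(text: str, phrase: str) -> int:
    """Count whole-word, case-insensitive occurrences of a keyword or phrase."""
    pattern = r"\b" + re.escape(phrase) + r"\b"
    return len(re.findall(pattern, text, flags=re.IGNORECASE))


def keyword_score(title: str, body: str) -> float:
    """Sum keyword hits across all terms, boosting hits in the title."""
    score = 0.0
    for phrase in KEYWORDS:
        score += TITLE_BOOST * count_hits(title, phrase)
        score += count_hits(body, phrase)
    return score


# Hypothetical posts, ranked from highest score to lowest.
sample = [
    ("Choosing a CMS", "The editor and workflow matter. A CMS vendor will say so."),
    ("My new phone", "It has a nice screen."),
]
for title, body in sorted(sample, key=lambda p: keyword_score(*p), reverse=True):
    print(title, keyword_score(title, body))
```

A real search engine also normalizes for document length and down-weights common terms, so long posts do not win purely by repeating a word. That is one reason to hand this job to an existing engine.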
Based on the keyword list above (which is subject to change), here are the top posts:
(Hilariously, when I publish this post, it will no doubt acquire the highest score, since it literally contains a list of every single keyword, perfectly adjacent to each other.)
This method is very promising, because I’m scanning down the list of the top 100 posts by score, and there are maybe three posts I don’t think I would keep.
What’s also interesting is that my list of keywords keeps changing as I look through the results. It shows me posts that I absolutely know I plan to keep, and they help jog my memory of other keywords that might reveal other posts. So, this will take some time to evaluate, as I refine my list of words.
(Yes yes – this could be automated. I could find a set of posts I know I will keep, then use term vectoring to find other posts which are similar to them. And, again, don’t think I haven’t considered the AI/ML options, but I just don’t know enough about that field to make any good decisions around it.)
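For the curious, the term-vector idea can be sketched in plain Python. This is one simple way to do it, assumed for illustration: build a bag-of-words vector from the posts already marked as keepers, then rank other posts by cosine similarity to that combined profile. It uses raw term counts with no stop-word removal or inverse-document-frequency weighting, both of which a serious version would want.

```python
import math
import re
from collections import Counter


def term_vector(text: str) -> Counter:
    """Lower-case the text and count each word-like token."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two term-count vectors (0.0 if either is empty)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_by_similarity(keepers, candidates):
    """Return (similarity, post_id) pairs, most similar to the keeper profile first."""
    profile = Counter()
    for text in keepers:
        profile.update(term_vector(text))
    scored = [
        (cosine_similarity(profile, term_vector(text)), post_id)
        for post_id, text in candidates.items()
    ]
    return sorted(scored, reverse=True)


# Hypothetical data: two known keepers and two unreviewed posts.
keepers = ["content management workflow and approval", "cms vendor workflow"]
candidates = {
    "a": "workflow approval in a cms",
    "b": "my cat sat on the mat",
}
ranking = rank_by_similarity(keepers, candidates)
print([post_id for _, post_id in ranking])
```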
Conclusion
After sifting through old posts for a week, and slicing and dicing them based on the metrics from above, I keep coming back to David’s tag line for Content Chimera:
Content Exploration and Decision-Making At Scale
That’s really what this is: an exploration of content.
Rules are less of a final arbiter, and more a rough tool that helps you explore a repository of content and make preliminary judgments with a relatively high degree of accuracy. I suspect I could select content (1) based purely on rules, or (2) based on pure manual evaluation, and realize that my results are 99% the same.
Additionally, the definition and development of these rules has forced me to think critically about my content and figure out what particular metric signals that a piece of content is worth keeping. Or at least worth reviewing to determine if I should keep it.
For me, I’m looking mainly for:
What I’m going to use my rules for is to rule out content that has none of the above. Looking at those three metrics, I’m pretty sure I can strike out 90% of the site with confidence that I’m not removing anything of value. If something is less than, say, 800 words, with a low keyword score and no centrality to speak of, then it’s just gonna go.
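Expressed as code, that rule-out test is just a conjunction of three thresholds. The Python sketch below is illustrative only: the 800-word limit comes from the paragraph above, while the keyword-score and centrality cut-offs are hypothetical placeholders that would need tuning against real data.

```python
from dataclasses import dataclass


@dataclass
class PostMetrics:
    post_id: str
    word_count: int
    keyword_score: float
    centrality: int  # e.g. number of link paths passing through the post


MIN_WORDS = 800            # from the text: "less than, say, 800 words"
MIN_KEYWORD_SCORE = 1.0    # hypothetical cut-off for a "low" keyword score
MIN_CENTRALITY = 1         # hypothetical: zero means "no centrality to speak of"


def should_remove(m: PostMetrics) -> bool:
    """A post goes only if it fails all three tests; passing any one spares it."""
    return (
        m.word_count < MIN_WORDS
        and m.keyword_score < MIN_KEYWORD_SCORE
        and m.centrality < MIN_CENTRALITY
    )


# Hypothetical examples.
short_link_post = PostMetrics("101", word_count=45, keyword_score=0.0, centrality=0)
short_but_central = PostMetrics("102", word_count=120, keyword_score=0.0, centrality=14)
long_essay = PostMetrics("103", word_count=2400, keyword_score=7.4, centrality=0)

for post in (short_link_post, short_but_central, long_essay):
    print(post.post_id, "remove" if should_remove(post) else "review")
```

Because the conditions are joined with "and", the rule errs toward keeping: anything that clears even one bar lands in the pile for manual review rather than being removed.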
Note that I’m not going to delete it forever – I’m just going to change the status and return a 410 with my contact information. If I see a lot of 410s for a particular piece of content, I can review on a case-by-case basis to see if I made a mistake.
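The "410 with contact information" idea might look something like this minimal WSGI sketch in Python. It is an assumption-laden illustration rather than the site's real stack: the retired path, the contact address, and the in-memory hit counter are all hypothetical.

```python
from collections import Counter

# Hypothetical set of retired post paths and a hypothetical contact address.
REMOVED_PATHS = {"/post/1234"}
CONTACT = "editor@example.com"

# Tally of requests for retired posts, so frequently requested ones can be re-reviewed.
gone_hits = Counter()


def app(environ, start_response):
    """WSGI app: answer 410 Gone for retired posts, 200 OK for everything else."""
    path = environ.get("PATH_INFO", "/")
    if path in REMOVED_PATHS:
        gone_hits[path] += 1
        status = "410 Gone"
        body = (
            "This post has been retired. If you were looking for it, "
            f"please get in touch: {CONTACT}"
        ).encode("utf-8")
    else:
        status = "200 OK"
        body = b"OK"
    headers = [
        ("Content-Type", "text/plain; charset=utf-8"),
        ("Content-Length", str(len(body))),
    ]
    start_response(status, headers)
    return [body]
```

To try it locally, the standard library's `wsgiref.simple_server.make_server("", 8000, app).serve_forever()` will serve it. Reviewing `gone_hits.most_common()` every so often is one way to spot a retired post that people still want.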
As for the remaining 700-800 posts, I’ll review link integrity as a second pass, then manually review from there. And as for traffic, I’ve decided not to pay attention to it, for reasons I explained above.
And this gets to the zen of it all –
We need to acknowledge that content has goals and a purpose, and that purpose is different for every site and every creator. If you don’t know what the goals of your content are, you simply can’t make an informed decision about whether or not to keep it.
Rules do not take the place of goals, and that was never their intention (apologies to David for my reaction to his plan in 2009). They only make you more efficient in finding content that fulfills the goals you’ve defined.