The Peril of Self-Replicating Hyperlinks

By Deane Barker


I built an intranet for a client. One of the functional items is a viewer into an Exchange calendar. We use a handy third-party component to display the contents of an Exchange public folder on a page.

The month and year to be viewed are driven off the querystring. Something like:

/month.aspx?m=11&y=2010

So you can look at any month by writing your own querystring. We check for valid input and everything, but so long as you enter a valid month and year in the querystring, you can (could) look up any logical month in existence, as far ahead or behind as you want.
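In ASP.NET terms, the validation amounts to something like this (a minimal sketch; RenderCalendar and the exact checks are assumptions on my part, not the page's actual code):

protected void Page_Load(object sender, EventArgs e)
{
    int month, year;

    // Reject anything that isn't a well-formed month and year.
    if (!int.TryParse(Request.QueryString["m"], out month) ||
        !int.TryParse(Request.QueryString["y"], out year) ||
        month < 1 || month > 12 || year < 1 || year > 9999)
    {
        Response.StatusCode = 400; // malformed input
        Response.End();
    }

    // Any logical month is fair game -- this is the hole the crawler fell into.
    RenderCalendar(new DateTime(year, month, 1)); // hypothetical helper
}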

Each month has helpful “Next” and “Previous” links on it that form the URL for the next or previous month.
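Those links come from simple date arithmetic, roughly like this (again a sketch, reusing the month and year values parsed above):

// Each rendered month links to two more months, without end.
DateTime current = new DateTime(year, month, 1);
DateTime prev = current.AddMonths(-1);
DateTime next = current.AddMonths(1);

string prevUrl = String.Format("/month.aspx?m={0}&y={1}", prev.Month, prev.Year);
string nextUrl = String.Format("/month.aspx?m={0}&y={1}", next.Month, next.Year);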

Sadly, we’re also indexing the intranet via a Google Mini.

Astute readers will see the problem here…

Two things happened:

  1. The number of pages in the Mini spiked. The client was suddenly hitting their document limit. They only had about 10,000 actual pages of content, but the Mini was claiming it had indexed four or five times that number.

  2. We started to get reports about odd months being returned in search results. Months like “November 2609” for example…

The Mini’s crawler, bless its heart, was dutifully following the “Next” and “Previous” links in the calendar into infinity in either direction. It was, in effect, inventing its own URLs…forever. Every new page in the calendar gave it a new URL it hadn’t seen before. The Mini’s crawler had fallen down the rabbit hole.

Easy problem to fix, but an embarrassing oversight nonetheless. We now drop the “Next” and “Previous” links at 24 months out in either direction, and we throw a 410 for anything outside those bounds in the past and a 404 for anything outside them in the future.
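Sketched in code, the bounds check looks something like this (an illustration of the approach, not the production implementation):

// Clamp the calendar to a 24-month window around the current month.
DateTime requested = new DateTime(year, month, 1);
DateTime thisMonth = new DateTime(DateTime.Today.Year, DateTime.Today.Month, 1);
DateTime lowerBound = thisMonth.AddMonths(-24);
DateTime upperBound = thisMonth.AddMonths(24);

if (requested < lowerBound)
{
    Response.StatusCode = 410; // Gone: too far in the past
    Response.End();
}
else if (requested > upperBound)
{
    Response.StatusCode = 404; // Not Found: too far in the future
    Response.End();
}

// Only emit "Next" and "Previous" links that stay inside the window.
bool showPrev = requested > lowerBound;
bool showNext = requested < upperBound;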

I just checked today, and the number of pages in the Mini came down by 2,000 yesterday, as it rechecks out-of-bounds URLs and gets back 410s and 404s.

I wonder how many sites on the public Internet have this same problem? I wonder if crawlers have any logic to detect this?
