Problem: Pages taken offline 4 months ago are still indexed by Google and Yahoo!
In the course of work this week, we discovered that Yahoo! and Google still have a record of a section of a website we removed nearly 4 months ago.
Surely the search engine robots had revisited the pages, repeatedly received 404-File Not Found errors, and proceeded to remove those pages from their indexes, right?
Apparently not. Here’s one explanation as to why.
Context: Using robots.txt in the process of removing old pages and posting new ones
I had also recently glanced at the robots.txt file for the domain in question, and noticed that there was a disallow rule applied to the pages that we had removed.
Four months ago, we released a new web service to replace one provided by these old pages. For a period of time, we kept both online with forwarders in place. During this time we added an entry to robots.txt to prevent spiders from indexing the old pages, and we encouraged indexing of the new pages by providing links to them.
Once we removed the old pages, we didn’t think to remove the entry from robots.txt. After all, the pages weren’t there, so why would it matter?
Well, my revised theory holds that it does matter.
Hypothesis: Excluding the robot with robots.txt stopped it from even checking that the page was there. So, with no confirmation that the page was gone, it didn’t tell the indexer to update the records.
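To make the hypothesis concrete, here is a minimal sketch using Python's standard urllib.robotparser module. The paths /old-service/ and /new-service/, the domain www.example.com, and the robot name ExampleBot are all made up for illustration; our real robots.txt entry and URLs were different. It only shows that a robot honoring the disallow rule will decline to request the old URL at all, and says nothing about how Google or Yahoo! actually behave internally.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt, standing in for the entry we added.
ROBOTS_TXT = """\
User-agent: *
Disallow: /old-service/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# A well-behaved robot asks this question before requesting a URL.
old_allowed = parser.can_fetch(
    "ExampleBot", "https://www.example.com/old-service/index.html"
)
new_allowed = parser.can_fetch(
    "ExampleBot", "https://www.example.com/new-service/index.html"
)

print("old page may be fetched:", old_allowed)
print("new page may be fetched:", new_allowed)
```

If the robot never requests the old URL, it never sees whatever status code the server would have returned for it.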
Here’s the scenario, from a search engine spider’s perspective.
(This model probably doesn’t technically match what’s going on in reality, but I hope it’s close enough to get some insight from.)
As a spider, I crawl links to pages, and when I find good pages (HTTP response of 200-OK), I read what textual data I can and send it back to the indexer to update our records.
However, if I try to get to a page, but the server tells me the page isn’t there, I send that as a note back to the indexer, and proceed to try to read the next page.
After receiving a few of these File Not Found messages over time about the same page, the indexer will remove that record from the index, as a matter of housekeeping.
Our problem may have been that we posted a robots.txt file that prevented the spiders from even trying to access these pages. When we removed the pages, the spiders never had a chance to get the 404 error, so they never reported a problem with the pages back to the indexer. The indexer, in turn, never triggered its housekeeping and has left the pages referenced in its index.
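The scenario above can be played out as a toy simulation. This is only a sketch of my simplified model, not a description of any real search engine: the Indexer class, the crawl and simulate functions, the /old/page.html path, and the limit of three 404 reports (standing in for "a few") are all assumptions made up for the example.

```python
from dataclasses import dataclass, field

NOT_FOUND_LIMIT = 3  # assumption: "a few" 404 reports before removal


@dataclass
class Indexer:
    records: dict = field(default_factory=dict)  # url -> indexed text
    misses: dict = field(default_factory=dict)   # url -> count of 404 reports

    def report(self, url, status, text=""):
        if status == 200:
            self.records[url] = text
            self.misses.pop(url, None)
        elif status == 404:
            self.misses[url] = self.misses.get(url, 0) + 1
            if self.misses[url] >= NOT_FOUND_LIMIT:
                # Housekeeping: drop a record that keeps coming back missing.
                self.records.pop(url, None)


def crawl(url, site, disallowed_prefixes, indexer):
    """One spider visit. Returns True if the page was actually requested."""
    if any(url.startswith(prefix) for prefix in disallowed_prefixes):
        return False  # excluded by robots.txt: no request, nothing reported
    if url in site:
        indexer.report(url, 200, site[url])
    else:
        indexer.report(url, 404)
    return True


def simulate(disallowed_prefixes, visits=5):
    """Return True if the dead page is still indexed after repeated visits."""
    indexer = Indexer(records={"/old/page.html": "old content"})
    site = {}  # the old page has already been taken offline
    for _ in range(visits):
        crawl("/old/page.html", site, disallowed_prefixes, indexer)
    return "/old/page.html" in indexer.records


still_indexed_with_rule = simulate(["/old/"])
still_indexed_without_rule = simulate([])
print("disallow rule left in place:", still_indexed_with_rule)
print("disallow rule removed:", still_indexed_without_rule)
```

With the disallow rule in place, the spider never reports a 404, so the record survives every visit. Without the rule, the record is dropped once the assumed limit is reached.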
Moral of the story: Don’t screw with spiders.
Had we left robots.txt well enough alone, the spiders would have found the bad links and soon enough the indexes would have been updated. Because we short-circuited their processes, we have preserved index references to pages that have long since died.
