Can robots.txt prevent a dead page from being removed from a search engine’s index?

Problem: Pages taken offline 4 months ago are still indexed by Google and Yahoo!

In the course of work this week, we discovered that Yahoo! and Google still have records of a section of a website we removed nearly 4 months ago.

Surely the search engine robots had revisited the pages, repeatedly received 404 File Not Found errors, and proceeded to remove those pages from their indexes, right?

Apparently not. Here’s one explanation as to why.

Context: Using robots.txt in the process of removing old pages and posting new ones

I had also recently glanced at the robots.txt file for the domain in question and noticed a disallow rule covering the pages we had removed.

Four months ago, we released a new web service to replace one provided by these old pages. For a period of time, we kept both online with forwarders in place. During this time we added an entry to robots.txt to prevent spiders from indexing the old pages, and we encouraged indexing of the new pages by providing links to them.
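
For illustration, the entry would have looked something along these lines (the path here is made up, not copied from the real file):

  User-agent: *
  Disallow: /old-section/

Any spider that honors robots.txt will skip every URL under /old-section/ without ever requesting it.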

Once we removed the old pages, we didn’t think to remove the entry from robots.txt. After all, the pages weren’t there, so why would it matter?

Well, my revised theory holds that it does matter.

Hypothesis: Excluding the robot with robots.txt stopped it from even checking whether the page was still there. So, with no confirmation that the page was gone, it never told the indexer to update its records.

Here’s the scenario, from a search engine spider’s perspective.

(This model probably doesn’t technically match what’s going on in reality, but I hope it’s close enough to get some insight from.)

As a spider, I crawl links to pages, and when I find good pages (HTTP response 200 OK), I read what textual data I can and send it back to the indexer to update our records.

However, if I try to get to a page, but the server tells me the page isn’t there, I send that as a note back to the indexer, and proceed to try to read the next page.

After receiving a few of these File Not Found messages over time about the same page, the indexer will remove that record from the index, as a matter of housekeeping.

Our problem may have been that we posted a robots.txt rule which prevented the spiders from even trying to access these pages. So when we removed the pages, the spiders never had a chance to get the 404 error, they never reported a problem with the pages back to the indexer, and the indexer never triggered the housekeeping that would have dropped those pages from its index.
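
To make that short circuit concrete, here’s a rough sketch in Python of the check a well-behaved crawler makes before it ever requests a page. The host, paths, and user-agent are invented for the example, and no real search engine works exactly this way.

  from urllib import robotparser
  import urllib.error
  import urllib.request

  # Load the site's robots.txt, the same way a polite crawler would.
  rp = robotparser.RobotFileParser()
  rp.set_url("https://www.example.com/robots.txt")
  rp.read()

  url = "https://www.example.com/old-section/page1.html"

  if not rp.can_fetch("ExampleBot", url):
      # Disallowed: the crawler never issues the request, so it never
      # sees the 404 and never reports the page as gone.
      print("Skipping (blocked by robots.txt):", url)
  else:
      try:
          urllib.request.urlopen(url)
          print("200 OK - send the page text to the indexer")
      except urllib.error.HTTPError as err:
          if err.code == 404:
              print("404 Not Found - tell the indexer the page is gone")

With the disallow rule still in place, the code never reaches the request at all, which is exactly why the indexer never heard that the pages were gone.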

Moral of the story: Don’t screw with spiders.

Had we left robots.txt well enough alone, the spiders would have found the bad links, and soon enough the indexes would have been updated. Because we short-circuited their process, we have preserved index references to pages that have long since died.

Restricting search indexing to sections of a web page

If you think of websites as having different page types, with each page type having different sections within it (content sections, navigation sections, footer sections, and so on), it becomes apparent that the value of a particular page is defined by the content that is unique to that page. Sections like footers or site-wide navigation are repeated on every page and give no specific extra value to any one page.

So, it would be helpful to be able to instruct search engine robots to not index specific areas of the web page. Here’s a wireframe of what I’m thinking.

Wireframe showing regions of a page that should and shouldn't be indexed by search engines.

How? Well, I haven’t found a real solution. Here’s an idea though.

We could extend XHTML with a schema that would include the ability to add attributes to elements like DIVs, ULs, OLs, Ps, and so on.

The attributes could be along these lines:

<div robot-follow="yes" robot-index="no">Stuff you don't want indexed here</div>
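
Fleshed out a little, a page using these made-up attributes might look like the sketch below (none of this is recognized by any search engine today; the attribute names are just my invention):

  <body>
    <!-- Unique content: index the text and follow the links. -->
    <div robot-index="yes" robot-follow="yes">
      The article text that makes this page worth finding.
    </div>
    <!-- Site-wide navigation repeated on every page: follow the links, but don't index the text. -->
    <ul robot-index="no" robot-follow="yes">
      <li><a href="/">Home</a></li>
      <li><a href="/about/">About</a></li>
    </ul>
  </body>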

Of course, the makers of the bots would then need to program them to heed these attributes.

So, yeah, all in all, it’s a fairly impractical idea, since nothing implements it. However, if it were supported, I would use it on many websites.

Web Site Visibility class

Today I taught a course in Web Site Visibility. It was fun. It is really the first course I’ve taught in which I could draw heavily on my experience with the web.

It was a small course, and I realized after talking with the participants that the name of the course is misleading. Some people seemed to think the course was more concerned with web site accessibility (how to make web sites work for people with disabilities). Others thought it would have more details on visual design (the visibility part of the title).

The course was concerned with ways of turning prospective site visitors into actual site visitors. So, it covered topics like invisible sites (the dark web), how search spiders work, design and content considerations, site promotion, differences between search engines and directories, and some techniques for assessing how well you are managing your site’s visibility.

So, what’s a better name for the course?

  • Web Site Promotion
  • Web Site Marketing
  • Advertising Your Web Site
  • Introduction to Search Engine Optimization
  • Making Easy-to-Find Web Sites

Other ideas?

Kudos to Johnson Consulting Network

So as part of a competitive analysis for my company, I searched Google for “web consulting in lansing michigan”. The first result is jcn.com (Envision Internet Consulting took the fourth slot). I searched for “internet consulting in lansing michigan”, and jcn.com shows up as the second result. The first is taken by Michtek consulting, whose site happens to be DOWN right now. They have some lame excuse about needing to upgrade their web site because of increased business. Like that’s an excuse to take your site down.

The strange thing is, it seems like most of jcn.com’s business is from providing web hosting services, and not so much from actually working on web sites. Interesting. And, I wonder what the “network” part of Johnson Consulting Network is all about. Guess I’ll ask.