Crawling vs. Indexing: Robots.txt and sitemap.xml

April 15, 2014
Sometimes you need to prevent a site, a page, or everything at a particular path from showing up in Google search. I've heard people say to just disallow the page in the robots.txt file. This is actually incorrect.

If you simply add a disallow to your robots.txt file, it is true that Google will not "crawl" that page. But if the page is listed in your sitemap.xml file, its URL is still submitted to Google and can still end up in the Google index.

Your disallowed content can still show up in Google search.

For example

Scenario:

  • I have created a view with client logos and I don't want each node to be seen. I only want the nodes to appear inside my view.
  • The nodes are at a /clients path.

In this scenario, I will do just one thing:

  • Add a noindex robots meta tag to each page under /clients

A common error I've seen is for people to add:

  • disallow: /clients to robots.txt

I've seen people make this change and assume it will keep the page out of Google search. However, if you are using the sitemap.xml module to create your sitemap.xml file, chances are you also have it set to automatically submit your sitemap to Google and possibly other search engines.

Crawling vs. indexing

You have prevented Google from crawling those pages, but you may have just allowed your sitemap module to autosubmit their URLs to be indexed. Google will not crawl the page or follow links from it, but the URL itself can still be indexed.

Note: You should also configure your sitemap.xml module correctly, but that is another topic.

Robots.txt tells Googlebot and other crawlers what is and is not allowed to be crawled; the noindex tag tells Google Search what is and is not allowed to be indexed and displayed in Google Search.
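
As a rough illustration of that split, here is a minimal Python sketch using the standard library's urllib.robotparser. The robots.txt content and the example.com URLs are hypothetical, made up for the /clients scenario, and the standard-library parser is not Googlebot's own parser. The point is what the file can answer at all: only whether a crawler may fetch a URL. Nothing in it speaks to indexing.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for the /clients scenario (an assumption, not a real site).
ROBOTS_TXT = """\
User-agent: *
Disallow: /clients
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def may_crawl(url: str, user_agent: str = "Googlebot") -> bool:
    """Return True if robots.txt permits this user agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)


# robots.txt answers the crawling question only.
blocked = may_crawl("https://example.com/clients/acme")
allowed = may_crawl("https://example.com/about")
print(blocked, allowed)
```

There is no equivalent "may index" question to ask of robots.txt, which is why the disallow alone does not keep a submitted URL out of search.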

One step further

This part hurts my brain, but there seems to be consensus that you should not add items you want to keep out of search to your robots.txt file. If a page is blocked from crawling, Googlebot never fetches it and so never sees the noindex tag on it. Instead, you should allow the pages to be crawled and put a noindex tag on them, so that Google knows not to display them.
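
To check that a page really carries the tag, something like the following Python sketch could be used. It relies only on the standard library's html.parser. The class name RobotsMetaParser, the helper has_noindex, and the sample PAGE markup are all hypothetical names invented for this example, and the sketch only looks at the generic "robots" meta tag, not crawler-specific ones.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() != "robots":
            return
        content = attr.get("content") or ""
        # Directives are comma-separated, e.g. "noindex, follow".
        for directive in content.split(","):
            directive = directive.strip().lower()
            if directive:
                self.directives.add(directive)


def has_noindex(html: str) -> bool:
    """Return True if the HTML contains a robots meta tag with noindex."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noindex" in parser.directives


# Hypothetical client node page that stays crawlable but asks not to be indexed.
PAGE = (
    '<html><head><meta name="robots" content="noindex, follow"></head>'
    "<body>Client logo</body></html>"
)
print(has_noindex(PAGE))
```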

Development environments

On a development server, this is much simpler. You can tell Apache to send a noindex header with every response. In the Apache httpd.conf file (with mod_headers enabled), add this:

# Globally disallow robots from indexing the development server (requires mod_headers)
Header set X-Robots-Tag "noindex, noarchive, nosnippet"

You can also set this header in an .htaccess file to apply it more granularly, for example to a single directory instead of the whole server.
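
To confirm the header is actually being sent, a small check along these lines may help. This is only a sketch in Python: the function name response_blocks_indexing is hypothetical, the headers are passed in as a plain mapping rather than fetched from a live server, and it ignores the form of the header that is prefixed with a specific crawler name.

```python
from typing import Mapping


def response_blocks_indexing(headers: Mapping[str, str]) -> bool:
    """Return True if an X-Robots-Tag header in the mapping includes noindex."""
    for name, value in headers.items():
        # HTTP header names are case-insensitive.
        if name.lower() != "x-robots-tag":
            continue
        directives = {part.strip().lower() for part in value.split(",")}
        if "noindex" in directives:
            return True
    return False


# Headers as the Apache configuration above would send them.
dev_headers = {"X-Robots-Tag": "noindex, noarchive, nosnippet"}
print(response_blocks_indexing(dev_headers))
```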

What do you think? Agree, disagree, or duh? :)

[Image by giorgio raffaelli]

Comments

Both robots.txt and sitemap.xml are recommendations. For data that is truly non-public, you usually need to return a 403 HTTP status.

Nice post. It's a duh and a jeh all at once. We see a lot of things go wrong with robots.txt and sitemaps because, it seems, all website developers, hobbyist and pro alike, need to do some SEO. They then turn to the usual suspects that are top of mind, which results in people putting regular expressions in robots.txt as well. So they want to disallow a URL and end up taking down a whole tag or category with 40 valuable pages on it.

Truth of the matter is that the real masters of the crawl and indexing process use neither robots.txt nor an XML sitemap. A well designed and structured site should provide ample breadcrumbs for the spider to follow, starting with an effective menu and navigation structure on the front page. Not doing this causes a chain reaction in the indexability of the site as a whole. When we do not cut corners, we manage to get millions of pages indexed in the first year for new websites. How? With followed links to targets we choose, and for the rest it's never cutting corners.

Directives in a robots.txt file are requests, not commands, to web crawlers, which they hopefully choose to honor.

However, malicious crawlers will actually look in robots.txt for the locations of hidden content, which they would otherwise not have discovered if the robots.txt file were not present.

That is why I believe it is better not to specify the location of hidden pages in robots.txt, but instead to use a meta robots tag with a noindex value (or other solutions, if you have access to server configuration).

So what if I submitted an XML sitemap and also listed another sitemap (for videos, for example) in robots.txt? Are both of them going to be indexed, or will one override the other? Here's exactly my confusion...

I actually have my sitemap in the root of my site, and I am now in the process of enriching my content by adding videos and implementing video SEO for them. The video SEO tool advises adding one line to robots.txt that points to the video sitemap hosted on their servers, and I am not sure what to do now. I would like to keep separate sitemaps without combining them into one index file: keep the default approach of adding the video sitemap to robots.txt and leave my other sitemap as it is. How could I do that?