Google not able to fetch robots.txt


Recently I received this message from Webmaster Tools: “Googlebot can’t access your site!”

Over the last 24 hours, Googlebot encountered 87 errors while attempting to access your robots.txt. To ensure that we didn’t crawl any pages listed in that file, we postponed our crawl. Your site’s overall robots.txt error rate is 64.4%

 

Recommended action

  • If the site error rate is 100%:
      • Using a web browser, attempt to access http://xxxxx.com/robots.txt. If you are able to access it from your browser, then your site may be configured to deny access to googlebot. Check the configuration of your firewall and site to ensure that you are not denying access to googlebot.
      • If your robots.txt is a static page, verify that your web service has proper permissions to access the file.
      • If your robots.txt is dynamically generated, verify that the scripts that generate the robots.txt are properly configured and have permission to run. Check the logs for your website to see if your scripts are failing, and if so attempt to diagnose the cause of the failure.
  • If the site error rate is less than 100%:
      • Using Webmaster Tools, find a day with a high error rate and examine the logs for your web server for that day. Look for errors accessing robots.txt in the logs for that day and fix the causes of those errors.
      • The most likely explanation is that your site is overloaded. Contact your hosting provider and discuss reconfiguring your web server or adding more resources to your website.
      • If your site redirects to another hostname, another possible explanation is that a URL on your site is redirecting to a hostname whose serving of its robots.txt file is exhibiting one or more of these issues.

After you’ve fixed the problem, use Fetch as Google to fetch http://xxxxx.com/robots.txt to verify that Googlebot can properly access your site.

 

But when I manually looked at my robots.txt in my browser, all looked fine!

When I fetched it in WMT, though, it came back as inaccessible. This bamboozled me for a little while. I did some searching and found that others had similar issues, but no one seemed to have got to the root of the problem.

So I started to look at my .htaccess file and recalled that I was redirecting my whole site to HTTPS (see this post), and thought perhaps that was the issue.

The next step was to exclude robots.txt from the HTTPS redirect. When I tested again in WMT, all was well and Googlebot fetched the file fine. Problem solved. All I had to do was tweak my .htaccess file as follows:

RewriteEngine On

# condition: the request did not arrive over HTTPS
RewriteCond %{HTTPS} off

# condition: the request is not for robots.txt (excludes it from the redirect)
RewriteCond %{REQUEST_FILENAME} !robots\.txt

# rule: when both conditions hold, redirect permanently to the HTTPS URL
RewriteRule (.*) https://%{HTTP_HOST}%{REQUEST_URI} [R=301,L]
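
To make the logic of the two conditions easier to follow, here is a rough Python model of the decision the rules above make. It is only a sketch, not how Apache evaluates them internally: the function name `https_redirect_target` is made up for illustration, example.com is a placeholder host, and the request path stands in for Apache's `REQUEST_FILENAME`.

```python
import re
from typing import Optional

# Same unanchored pattern as the second RewriteCond.
ROBOTS_PATTERN = re.compile(r"robots\.txt")


def https_redirect_target(https_on: bool, host: str, request_uri: str) -> Optional[str]:
    """Return the HTTPS redirect URL, or None when the request is served as-is.

    Simplification: request_uri stands in for Apache's REQUEST_FILENAME.
    """
    # First condition: only plain-HTTP requests are candidates for the redirect.
    if https_on:
        return None
    # Second condition: anything matching robots.txt is left alone.
    if ROBOTS_PATTERN.search(request_uri):
        return None
    # Both conditions held, so the rule fires.
    return f"https://{host}{request_uri}"


print(https_redirect_target(False, "example.com", "/robots.txt"))  # None
print(https_redirect_target(False, "example.com", "/about"))       # https://example.com/about
```

Both conditions must hold for the redirect to fire, so a plain-HTTP request for robots.txt is answered directly while every other plain-HTTP request is sent to HTTPS.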

 

But then I realised that perhaps this was not required.

What is really required is just to have the https://mysite.com version of my site registered in WMT, not the http:// version!

Thoughts?

 

