Improve robots.txt to avoid crawling of tab pages of the same page

Submitted by Eelco Visser on 18 September 2013 at 13:33

On 18 September 2013 at 13:33 Eelco Visser tagged 65

On 20 September 2013 at 15:03 Elmer van Chastelet commented:

robots.txt only allows denying access to a particular path and its subpaths; it does not support wildcard or regex patterns.

So we cannot disallow crawling of pages matching a pattern such as http://researchr.org/publication/****/references, where **** is a wildcard.
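
A sketch of what robots.txt can express under the prefix-only rules described above; the publication key in the path is a hypothetical example:

    User-agent: *
    # Prefix rules work, but every path has to be listed explicitly:
    Disallow: /publication/some-publication-key/references
    # There is no standard way to express /publication/*/references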

Alternatives:

  • Build and maintain a sitemap index. This requires quite some work and is a candidate for a WebDSL abstraction (a sketch follows this list).
    http://dynamical.biz/blog/seo-technical/sitemap-strategy-large-sites-17.html
  • Add rel="nofollow" to the links we do not want crawled. This can probably also become a WebDSL abstraction where you define patterns for pages, or patterns for page arguments, and a rel="nofollow" attribute is added to links that match these patterns (see the example after this list).
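
For the sitemap alternative, a minimal sketch of a sitemap index in the sitemaps.org format; the sitemap file names and date are hypothetical:

    <?xml version="1.0" encoding="UTF-8"?>
    <sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
      <sitemap>
        <loc>http://researchr.org/sitemap-publications-1.xml</loc>
        <lastmod>2013-09-20</lastmod>
      </sitemap>
      <sitemap>
        <loc>http://researchr.org/sitemap-publications-2.xml</loc>
      </sitemap>
    </sitemapindex>

For the rel="nofollow" alternative, a sketch of the markup such an abstraction would generate for a matching link; the path is the example from this issue:

    <!-- crawlers that honour rel="nofollow" will not follow this link -->
    <a href="/publication/some-publication-key/references" rel="nofollow">References</a>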

On 25 October 2013 at 11:13 Elmer van Chastelet removed tag 65

On 25 October 2013 at 11:13 Elmer van Chastelet tagged 66

On 18 December 2014 at 15:59 Eelco Visser removed tag 66
