Improve robots.txt, avoiding tab pages for the same page

Submitted by Eelco Visser on 18 September 2013 at 13:33

On 18 September 2013 at 13:33 Eelco Visser tagged 65

On 20 September 2013 at 15:03 Elmer van Chastelet commented:

robots.txt only allows denying access to particular paths and their subpaths; it does not support wildcard or regex patterns.

So we cannot deny crawling of pages using patterns such as ****/references, with **** being a wildcard.
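
To illustrate the prefix-only matching described above, here is a minimal sketch using Python's standard `urllib.robotparser`. The rule, the host `example.org`, and the paths are hypothetical and chosen only for illustration; they are not taken from the actual site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: a single literal path prefix, no wildcards.
rules = [
    "User-agent: *",
    "Disallow: /references",
]

parser = RobotFileParser()
parser.parse(rules)


def allowed(path: str) -> bool:
    """Return True if a generic crawler may fetch the given path."""
    # example.org is a placeholder host, not the real site.
    return parser.can_fetch("*", "http://example.org" + path)


print(allowed("/references"))           # the disallowed path itself
print(allowed("/references/page/2"))    # a subpath of the disallowed path
print(allowed("/issue/12/references"))  # same suffix, different prefix
```

Under these assumptions the first two paths are denied because they start with `/references`, while the third is still allowed: the rule matches on the start of the path, so a tab page nested under another page is not covered.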


Possible alternatives:

  • Build and maintain a sitemap index. This requires quite some work and is a candidate for a WebDSL abstraction.
  • Add rel="nofollow" to the links we do not want to be crawled. This can probably also become a WebDSL abstraction in which you define patterns for pages, or for page arguments, so that links matching these patterns get a rel="nofollow" attribute.
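
The pattern-based rel="nofollow" idea could look roughly like the sketch below. It is written in Python rather than WebDSL, and the names `NOFOLLOW_PATTERNS` and `render_link` as well as the example paths are hypothetical; it only shows the matching logic, not how a WebDSL abstraction would be implemented.

```python
from fnmatch import fnmatchcase
from html import escape

# Hypothetical pattern list: links to any ".../references" tab page.
NOFOLLOW_PATTERNS = ["*/references"]


def render_link(href: str, text: str, patterns=NOFOLLOW_PATTERNS) -> str:
    """Render an anchor tag, adding rel="nofollow" when href matches a pattern."""
    # fnmatchcase does case-sensitive shell-style matching; "*" also matches "/".
    nofollow = any(fnmatchcase(href, pattern) for pattern in patterns)
    rel = ' rel="nofollow"' if nofollow else ""
    return f'<a href="{escape(href, quote=True)}"{rel}>{escape(text)}</a>'


print(render_link("/issue/12/references", "References"))
print(render_link("/issue/12", "Issue 12"))
```

With these assumptions, only the first link gets the rel="nofollow" attribute. Keeping the patterns in one list means the decision is made in a single place instead of on every individual link.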

On 25 October 2013 at 11:13 Elmer van Chastelet removed tag 65

On 25 October 2013 at 11:13 Elmer van Chastelet tagged 66

On 18 December 2014 at 15:59 Eelco Visser removed tag 66
