Update robots.txt
Improve robots.txt, avoiding crawling of the separate tab pages of the same page
Submitted by Eelco Visser on 18 September 2013 at 13:33
Issue Log
On 18 September 2013 at 13:33 Eelco Visser tagged 65
On 20 September 2013 at 15:03 Elmer van Chastelet commented:
robots.txt only allows denying access to particular paths and their subpaths; it does not support wildcard or regex patterns. So we cannot deny crawling of pages using patterns such as http://researchr.org/publication/****/references, with **** being a wildcard.
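For illustration, a minimal robots.txt sketch contrasting the path-prefix rules that are supported with the kind of wildcard pattern that is not; the paths used here are hypothetical examples, not actual researchr.org rules:

    User-agent: *
    # Path-prefix rule: blocks /search and everything below it
    Disallow: /search
    # A pattern like the following would need wildcard support:
    # Disallow: /publication/*/references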
Alternatives:
- Build and maintain a sitemap index. This requires quite some work, but is a candidate for a WebDSL abstraction. See http://dynamical.biz/blog/seo-technical/sitemap-strategy-large-sites-17.html
- Add rel="nofollow" to the links we do not want to be crawled (see the sketch below). This can probably also become a WebDSL abstraction where you can define patterns for pages, or patterns for page arguments, that should add a rel="nofollow" attribute to links matching those patterns.
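A minimal HTML sketch of the second alternative; the link target is a hypothetical tab page, not an actual researchr.org URL:

    <!-- rel="nofollow" asks crawlers not to follow this link -->
    <a href="/publication/example-pub/references" rel="nofollow">References</a>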
On 25 October 2013 at 11:13 Elmer van Chastelet removed tag 65
On 25 October 2013 at 11:13 Elmer van Chastelet tagged 66
On 18 December 2014 at 15:59 Eelco Visser removed tag 66