When a page contains action links, a

pagelocation/?__action__link__=1

page is generated and is accessible through an HTTP GET request. Google is capable of discovering these pages. The page is not harmful in itself. However, the generated action_link page is full of double-escaped (and thereby invalid) URLs. Examples are

https://yellowgrass.org/%5C%22/registerUser%5C%22 

and

https://yellowgrass.org/%5C%22http://nixos.org/hydra%5C%22. 

Consequently, Google’s crawler encounters a long list of HTTP 400 errors, which causes it to stop indexing your pages. Furthermore, it causes whatever pages have already been indexed to get the lowest page rank, due to all the bad links on your site.
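
For reference, %5C and %22 are the percent-encodings of a backslash and a double quote, so each bad link is an already-escaped, quoted href that got percent-encoded a second time. A quick check (plain Python, independent of WebDSL itself) makes the double escaping visible:

from urllib.parse import unquote

# %5C decodes to a backslash and %22 to a double quote, so the crawler
# is really requesting paths like /\"/registerUser\".
print(unquote("/%5C%22/registerUser%5C%22"))            # prints: /\"/registerUser\"
print(unquote("/%5C%22http://nixos.org/hydra%5C%22"))   # prints: /\"http://nixos.org/hydra\"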

I guess the two simplest solutions to the problem are: do not make the action_link page available through HTTP GET requests, and exclude syntactically incorrect URLs from crawling. I suppose the first is crucial, yet may not always be sufficient. So, the best approach would in my opinion be to do both. To accomplish the latter, include a robots.txt in the template for new WebDSL projects. An example robots.txt (which I use and which seems to work) would be:

User-agent: *
Disallow: /*__action__link__*
Disallow: /*%5C%22*
Allow: /

Note that the second Disallow would not actually be needed when the first is not crawled. However, it is needed when the crawler has already indexed an action_link page once. Furthermore, it may be useful for robustness (crawling approaches generally do not merely involve following URLs).
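
To sanity-check the patterns before deploying them, here is a minimal sketch (Python, assuming Google-style wildcard matching where * matches any sequence of characters; /somePage is just an illustrative page name, the other URLs are the ones from this report):

import re

# Disallow patterns from the robots.txt above.
DISALLOW_PATTERNS = ["/*__action__link__*", "/*%5C%22*"]

def is_blocked(path):
    # Translate each robots.txt wildcard pattern into a regular expression.
    for pattern in DISALLOW_PATTERNS:
        regex = "^" + re.escape(pattern).replace(r"\*", ".*")
        if re.match(regex, path):
            return True
    return False

# The offending URLs should be blocked ...
assert is_blocked("/%5C%22/registerUser%5C%22")
assert is_blocked("/somePage/?__action__link__=1")
# ... while regular pages remain crawlable.
assert not is_blocked("/registerUser")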

Submitted by Sander Vermolen on 22 February 2010 at 11:49

