Re: [tor-talk] Hidden Services (about tor2web)

Apologies for the subject/thread hijacking.

On 9/19/12 10:13 AM, tor@xxxxxxxxxxxxxxxxxx wrote:
> On 19/09/12 06:36, grarpamp wrote:
> >> People use robots.txt to indicate that they don't want their site
> >> to be added to indexes.
> > They use it to indicate that they don't want their site to be
> > crawled.
> In almost all cases (99% or higher), robots.txt is used to indicate
> that a site shouldn't be crawled, *because* they don't want it to be
> indexed. The intention is painfully clear...

The point has been added to the appropriate ticket:

Please add any ideas or suggestions on the topic there.

However, you should also know that it is already possible today for a
Tor hidden service (TorHS) to block access from Tor2web.

Tor2web sends an X-Tor2web header to announce to the TorHS that the
connection comes from Tor2web.
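
As a rough illustration of that idea, here is a minimal sketch of a
hidden service rejecting such requests. It assumes the service is a
Python WSGI application; the names block_tor2web and hello_app are
made up for the example, and it only checks that the header is present,
since the value Tor2web puts in it is not specified here.

```python
def is_tor2web_request(environ):
    # WSGI exposes the X-Tor2web request header as HTTP_X_TOR2WEB.
    # Assumption: the mere presence of the header is enough to decide.
    return "HTTP_X_TOR2WEB" in environ


def block_tor2web(app):
    """Wrap a WSGI app so that requests relayed by Tor2web get a 403."""
    def middleware(environ, start_response):
        if is_tor2web_request(environ):
            start_response("403 Forbidden",
                           [("Content-Type", "text/plain")])
            return [b"Access through Tor2web is not allowed.\n"]
        return app(environ, start_response)
    return middleware


def hello_app(environ, start_response):
    # Hypothetical stand-in for the real hidden service application.
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"Hello from the hidden service.\n"]


application = block_tor2web(hello_app)
```

Any WSGI server can serve "application". The same check can be done in
the web server configuration instead, without touching the application.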

We added a wiki documentation section explaining how to do it:

Regarding the topic of "robots.txt": in the new Tor2web 3.0, robots.txt
files are "hijacked" in order to prevent Tor2web from being crawled by
public search engines. A list of user agents of Internet spiders is
also blocked.
Both block settings can be disabled from the config file:

Those blocks will probably be less annoying once the spidering behavior
is configurable directly from TorHS sites (for example, by providing
specific Tor2web-related config strings in robots.txt).


p.s. There's a new tor2web domain using Tor2web 3
http://eqt5g4fuenphqinx.tor2web.blutmagie.de :-)
