
Re: [tor-dev] GSOC14 Idea



On 02/27/2014 03:14 AM, Roger Dingledine wrote:
> On Sun, Feb 23, 2014 at 05:38:23PM +0530, Devang Thakkar wrote:
>>       It's Devang here, a coding enthusiast studying at IIT Bombay. I am
>> looking forward to contributing to Tor for the upcoming Google Summer of Code
>> 2014 as a prospective student. So I wanted to know if there was a provision
>> for Web Scraping using Tor. If there is, I would like to know more about it or
>> if there isn't, is it a feasible Summer of Code project?
> 
> Web scraping using Tor is usually regarded as a bad thing -- first
> because it loads down the Tor network much more than normal browsing,
> and second because it makes destination websites more likely to get angry
> with Tor. For example, when Bing starts scraping Google over Tor in order
> to improve their search results, Google responds by making it harder to
> crawl Google over Tor, which impacts normal Tor users reaching Google too.
> 
> So I think we'd be happy to have a project on how to make website scraping
> through Tor less damaging to destinations and thus to users, but I think
> we're unlikely to find a "make it easier to scrape websites through Tor"
> project exciting.

Inconveniently enough, scraping websites (and hidden services) over Tor
is exactly what a lot of the CMU Tor-related research involves.  We have
developed a few in-house tools for it (none of which are anywhere close
to turnkey).  We haven't put any serious thought into making it "less
damaging to destinations," but I think we would be interested in helping
with a project along those lines.  Offhand, though, I dunno if what's
needed is code so much as best-practices documentation (what's an
appropriate level of rate limiting, why you really ought to run a private
entry node, that sort of thing...)
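
To make the rate-limiting point a bit more concrete, here is a minimal
sketch of a per-host limiter.  It is not one of our in-house tools; the
class name, the min_interval parameter, and the 5-second figure in the
usage note are all made up for illustration, and what an appropriate
interval actually is remains the open question.

```python
import time
from urllib.parse import urlsplit


class PerHostRateLimiter:
    """Enforce a minimum delay between requests to the same host.

    Hypothetical helper for illustration only; the clock and sleep
    callables are injectable so the logic can be tested without waiting.
    """

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval  # seconds between hits to one host
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = {}  # host -> earliest time the next request may start

    def wait(self, url):
        """Block until a request to url's host is allowed; return seconds waited."""
        host = urlsplit(url).hostname or ""
        now = self._clock()
        # A host we have never contacted is allowed immediately.
        delay = max(0.0, self._next_allowed.get(host, now) - now)
        if delay > 0:
            self._sleep(delay)
        # The request starts at now + delay; the next one must wait min_interval more.
        self._next_allowed[host] = now + delay + self.min_interval
        return delay
```

A crawler would create one limiter, e.g. PerHostRateLimiter(min_interval=5.0)
(an arbitrary placeholder value), and call limiter.wait(url) before each
fetch.  Requests to different hosts do not delay each other; only repeat
hits on the same destination are spaced out.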

zw


_______________________________________________
tor-dev mailing list
tor-dev@xxxxxxxxxxxxxxxxxxxx
https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-dev