Re: [tor-dev] Searchable Tor descriptor and Metrics data archive - GSoC 2013

To: Praveen Kumar <praveen97uma@xxxxxxxxx>
Subject: Re: [tor-dev] Searchable Tor descriptor and Metrics data archive - GSoC 2013
From: Karsten Loesing <karsten@xxxxxxxxxxxxxx>
Date: Tue, 16 Apr 2013 08:04:46 +0200
Cc: tor-dev@xxxxxxxxxxxxxxxxxxxx

On 4/14/13 1:26 PM, Praveen Kumar wrote:
> Hi Karsten,

Hi Praveen,

> I am so sorry for replying late. I had a seminar presentation on Friday
> and have another on Monday, so I was a little busy studying for it.

No worries!

> I had downloaded about 1GB of server descriptor data from the metrics
> website. I thought of generating some performance metrics for a search
> application with a MySQL backend and with a MongoDB backend in Django.
> So, I implemented two basic apps, one with each backend. I processed
> each file and extracted router_nickname, router_ip, tor_version and
> platform_os as searchable fields for each server descriptor file. At
> the time of writing this email, I had processed around 330,000 files
> for MySQL and have the data of 670,000 files in MongoDB. I cannot
> process all the files, as that 1GB of data is composed of millions of
> files and processing is slow on my system.
> My aim is to issue the same queries to both apps and see which one
> performs better. Both databases are indexed on the same fields. I will
> tell you the metrics the day after tomorrow, i.e. on Tuesday.

Sounds like a fine start.  Be sure to include results of this
performance comparison in your GSoC application!
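
For reference, extracting the four fields mentioned above could look
roughly like the sketch below.  It assumes the usual "router" and
"platform" lines of a server descriptor; the function name, the dict
keys and the sample descriptor are made up for illustration, and the
"platform" line is free-form, so the version/OS split is only a
heuristic.

```python
def parse_server_descriptor(text):
    """Extract a few searchable fields from one server descriptor.

    Illustrative helper only; real descriptors contain many more lines.
    """
    fields = {}
    for line in text.splitlines():
        parts = line.split()
        if not parts:
            continue
        if parts[0] == "router" and len(parts) >= 3:
            # "router" nickname address ORPort SOCKSPort DirPort
            fields["router_nickname"] = parts[1]
            fields["router_ip"] = parts[2]
        elif parts[0] == "platform":
            # Commonly "platform Tor <version> on <OS>", but free-form.
            rest = line[len("platform"):].strip()
            fields["platform"] = rest
            if len(parts) >= 3 and parts[1] == "Tor":
                fields["tor_version"] = parts[2]
            if " on " in rest:
                fields["platform_os"] = rest.split(" on ", 1)[1]
    return fields


# Made-up sample using a documentation-only IP address.
sample = """router ExampleRelay 192.0.2.1 9001 0 9030
platform Tor 0.2.3.25 on Linux
published 2013-04-14 12:00:00
"""
record = parse_server_descriptor(sample)
```

The resulting dict can then be handed to whichever backend is being
measured, so that both databases receive identical records.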

> But, theoretically speaking, MongoDB is fast because every document is
> stored in JSON, it is schemaless and doesn't have to perform any joins etc.
> The indexes that are built are based on B-trees, which have a worst-case
> time complexity of O(log(n)) for insertion, lookup and deletion. MongoDB
> also keeps the indexes in RAM as required, for faster searches and to
> reduce disk reads. MongoDB also has the capability of scaling efficiently.

Well, performance of MongoDB vs. MySQL really depends on the problem
you're trying to solve.  For example, we'll have to perform joins when
storing a network status consensus that references 0..n server
descriptors each of which references 0..1 extra-info descriptors.  See
the descriptor formats page for details:

https://metrics.torproject.org/formats.html
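
To make the join requirement concrete, here is a minimal sketch using
Python's built-in sqlite3 module.  The table and column names are
invented for illustration and are not a proposed schema; the point is
only that a consensus lists 0..n server descriptors and each server
descriptor points to 0..1 extra-info descriptors.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE extra_info (digest TEXT PRIMARY KEY);
CREATE TABLE server_descriptor (
    digest TEXT PRIMARY KEY,
    nickname TEXT,
    extra_info_digest TEXT REFERENCES extra_info(digest)  -- 0..1
);
CREATE TABLE consensus (valid_after TEXT PRIMARY KEY);
CREATE TABLE consensus_entry (  -- 0..n rows per consensus
    valid_after TEXT REFERENCES consensus(valid_after),
    descriptor_digest TEXT REFERENCES server_descriptor(digest)
);
""")

# Made-up data: RelayA has an extra-info descriptor, RelayB does not.
conn.execute("INSERT INTO extra_info VALUES ('e1')")
conn.executemany(
    "INSERT INTO server_descriptor VALUES (?, ?, ?)",
    [("d1", "RelayA", "e1"), ("d2", "RelayB", None)],
)
conn.execute("INSERT INTO consensus VALUES ('2013-04-16 06:00:00')")
conn.executemany(
    "INSERT INTO consensus_entry VALUES (?, ?)",
    [("2013-04-16 06:00:00", "d1"), ("2013-04-16 06:00:00", "d2")],
)

# One consensus -> its server descriptors -> optional extra-info.
rows = conn.execute(
    """
    SELECT s.nickname, e.digest
    FROM consensus_entry AS c
    JOIN server_descriptor AS s ON s.digest = c.descriptor_digest
    LEFT JOIN extra_info AS e ON e.digest = s.extra_info_digest
    WHERE c.valid_after = ?
    ORDER BY s.nickname
    """,
    ("2013-04-16 06:00:00",),
).fetchall()
```

The LEFT JOIN keeps relays that have no extra-info descriptor.  In a
document store, the same lookup would need either embedded copies of
the referenced descriptors or several round trips from the application.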

Also, with respect to scaling, the plan would be to run this application
on a single server along with other services.

So, in general, I'd be careful with "MongoDB is fast because"
statements.  Some of them may be correct in this specific case.  But
there may also be cases where good old SQL has performance advantages
over shiny new NoSQL.

> I am now somewhat in favor of Django Haystack with Solr as the search
> engine. Using MongoDB would require us to spend considerable time
> developing the search interface, which would be responsible for handling
> complicated queries, and then creating appropriate indices to handle
> those queries.

Sounds good!  You should include your preliminary results in your GSoC
application, too.

Best,
Karsten
