[Author Prev][Author Next][Thread Prev][Thread Next][Author Index][Thread Index]

Re: [tor-dev] prop224: HSDir caches question with OOM

To: tor-dev@xxxxxxxxxxxxxxxxxxxx
Subject: Re: [tor-dev] prop224: HSDir caches question with OOM
From: David Goulet <dgoulet@xxxxxxxxx>
Date: Sat, 16 Apr 2016 09:11:20 -0400
Delivered-to: archiver@xxxxxxxx
Delivery-date: Sat, 16 Apr 2016 09:12:14 -0400
In-reply-to: <B6254C6B-5ABD-439C-9BAA-D59C1906FE1E@xxxxxxxxx>
List-archive: <http://lists.torproject.org/pipermail/tor-dev/>
List-help: <mailto:tor-dev-request@lists.torproject.org?subject=help>
List-id: discussion regarding Tor development <tor-dev.lists.torproject.org>
List-post: <mailto:tor-dev@lists.torproject.org>
List-subscribe: <https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-dev>, <mailto:tor-dev-request@lists.torproject.org?subject=subscribe>
List-unsubscribe: <https://lists.torproject.org/cgi-bin/mailman/options/tor-dev>, <mailto:tor-dev-request@lists.torproject.org?subject=unsubscribe>
References: <20160415194742.GJ6139@raoul> <B6254C6B-5ABD-439C-9BAA-D59C1906FE1E@xxxxxxxxx>
Reply-to: tor-dev@xxxxxxxxxxxxxxxxxxxx
Sender: "tor-dev" <tor-dev-bounces@xxxxxxxxxxxxxxxxxxxx>

On 16 Apr (12:49:44), Tim Wilson-Brown - teor wrote:
> 
> > On 16 Apr 2016, at 05:47, David Goulet <dgoulet@xxxxxxxxx> wrote:
> > 
> > Hi!
> > 
> > (For the following, I'm only talking about HS directory.)
> > 
> > Here is my conundrum. I was working on plugging the new cache for proposal 224
> > into our OOM. I stumbled upon a fun problem.
> > 
> > We plan to have both current and new HS protocol in parallel. This means two
> > caches for both version 2 (current) and 3 (prop224). So far so good.
> > 
> > However, here is the problem. Now, our OOM handlers is called in:
> > cell_queues_check_size(). The heuristic is basically:
> > 
> >      /* If we're spending over 20% of the memory limit on hidden service
> >       * descriptors, free them until we're down to 10%.  */
> > 
> > Considering we have _two_ caches now, we somehow have to cleanup both of them
> > "evenly" that is if we have to remove 20kb we either remove 10kb on each (if
> > possible) or we actually clean them by entry expiry time instead regardless of
> > which version.
> > 
> > I think the latter is what we want but then it complicates things a bit since
> > both cache contains _different_ data struct. and indexed with different keys
> > (DIGEST vs DIGEST256). One option here is to go over all entries in both cache,
> > create temporary objects and put them in a sorted list (by expiry time). Those
> > "temp object" would contain something like: 1) cache version, 2) expiry time
> > and 3) index key. Once done, we go over that list and delete entries until we
> > reach our number of bytes we need to remove.
> > 
> > This could work since we would do that when we are at 20% memory but still it
> > requires allocating some small objects to _clear_ our memory... Someone has
> > maybe something _better_ to propose? :)
> > 
> > (keep in mind that minimal change to the current HS code is preferable. I know
> > that unifying both caches somehow would help but that's lots more work! and
> > the new cache for prop224 will be able to support versionning for which our
> > current one doesn't, it's baked for v2.)
> > 
> > I'm asking because this will require some refactoring for which I want to make
> > sure we are in agreement on how to proceed and it doesn't end up not the right
> > solution.
> > 
> > Basically my goal is to move the OOM handler for both v2 and v3 to one single
> > function called hs_cache_handle_oom() and from which we'll do the gymnastic
> > describe above. We'll stop calling rend_cache_clean_v2_descs_as_dir()
> > explicitely after that in cell_queues_check_size().
> 
> Having code that is only executed under rare conditions is error-prone.

For the OOM handler, we have to rely on good testing...

> Allocating memory at OOM time is error-prone.

Yup, this is the part that I don't like also!

> 
> We could add cache entries to the unified list when they are created, rather than at OOM time.
> This also has the benefit of being faster than creating and sorting a large list at OOM time.
> 
> Alternately, we could iterate through each cache like this:
> Find ta, the oldest expiry time in Cache A
> Find tb, the oldest expiry time in Cache B
> Set Cache X to the cache with the oldest descriptor
> Set tx to the minimum expiry time from ta and tb
> Set ty to the maximum expiry time from ta and tb
> While tx <= ty and we are above the desired memory limit
>   Deallocate the oldest descriptor from Cache X
>   Set tx to the oldest expiry time in Cache X
> If we are still above the desired memory limit, repeat this entire process
> 
> We should add entries to a time-sorted list when they are created, otherwise this algorithm is O(n^2).
> 
> A third alternative is that we can iterate through each time period:
> Set K to the oldest expected descriptor age in hours, minus 1 hour
> Deallocate all entries from Cache A that are older than K hours
> Deallocate all entries from Cache B that are older than K hours
> Set K to K - 1 and repeat this process
> 
> This algorithm is O(Kn), which is ok as long as K is small.
> This carries a slight risk of over-deallocating cache entries. Which is OK at OOM time.
> I like this one, because it's simple, performant, and doesn't need any extra memory allocations.

I do also like this one. It's pretty simple and efficient.

Now there is a fourth alternative that Yawning proposed in #tor-dev yesterday
which is always prioritize our v2 cache in the OOM handling that is clean the
v2 before than if we have to go to the v3 cache. It would be an incentive to
"v3 is much more important than v2" kind of thing.

As he describe it, it's a bit like our tap vs ntor situation under pressure,
we prioritize ntor and drop tap if needed.

I'm still quite _unsure_ about this. The v3 will bring more memory pressure
with this second HSDir cache. And my intuition is that most users won't switch
directly to v3 but will probably have a migration path from v2 to v3 like
having the v2 onion on for X months before discontinuing it.

So losing reachability because we decide to drop v2 first could not be
desirable. But then also how often does a HSDir OOM is triggered... ?

Anyway, right now I'm leaning towards your approach teor of just using the
time-period.

More eyes on this would be great :).

Cheers!
David

> 
> Tim
> 
> Tim Wilson-Brown (teor)
> 
> teor2345 at gmail dot com
> PGP 968F094B
> ricochet:ekmygaiu4rzgsk6n
> 
> 
> 



> _______________________________________________
> tor-dev mailing list
> tor-dev@xxxxxxxxxxxxxxxxxxxx
> https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-dev

Attachment: signature.asc
Description: PGP signature

_______________________________________________
tor-dev mailing list
tor-dev@xxxxxxxxxxxxxxxxxxxx
https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-dev

Follow-Ups:
- Re: [tor-dev] prop224: HSDir caches question with OOM
  - From: s7r

References:
- [tor-dev] prop224: HSDir caches question with OOM
  - From: David Goulet
- Re: [tor-dev] prop224: HSDir caches question with OOM
  - From: Tim Wilson-Brown - teor

Prev by Author: [tor-dev] prop224: HSDir caches question with OOM
Next by Author: Re: [tor-dev] prop224: HSDir caches question with OOM
Previous by thread: Re: [tor-dev] prop224: HSDir caches question with OOM
Next by thread: Re: [tor-dev] prop224: HSDir caches question with OOM
Index(es):
- Author
- Thread