
Re: [tor-bugs] #27410 [Core Tor/Tor]: Bug: Non-fatal assertion !(status == HS_CLIENT_FETCH_HAVE_DESC) failed in retry_all_socks_conn_waiting_for_desc at ../src/or/hs_client.c:268



#27410: Bug: Non-fatal assertion !(status == HS_CLIENT_FETCH_HAVE_DESC) failed in
retry_all_socks_conn_waiting_for_desc at ../src/or/hs_client.c:268
---------------------------+------------------------------------
 Reporter:  cstest         |          Owner:  (none)
     Type:  defect         |         Status:  new
 Priority:  Medium         |      Milestone:  Tor: 0.3.5.x-final
Component:  Core Tor/Tor   |        Version:  Tor: 0.3.3.9
 Severity:  Normal         |     Resolution:
 Keywords:  tor-hs, hs-v3  |  Actual Points:
Parent ID:                 |         Points:
 Reviewer:                 |        Sponsor:
---------------------------+------------------------------------

Comment (by dgoulet):

 I have a detailed theory about what can cause this problem:

 It all starts with `hs_client_desc_has_arrived()`, which is called when a
 descriptor is cached (client side). That function goes over all pending
 SOCKS requests that are waiting for that descriptor and changes the state
 of each connection to "wait for circuits".

 However, there is one code path that can make it exit early: when no
 intro points in the descriptor are usable. In that case, we'll unattach
 one SOCKS connection and exit the function. But here comes the problem:
 there could be more SOCKS connections for that same .onion, and they
 won't be unattached, so they ultimately wait until they time out.
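
 To make the shape of the bug concrete, here is a minimal, self-contained
 sketch using simplified stand-in types (not Tor's real structures; the
 real logic lives in `hs_client_desc_has_arrived()` in
 `src/or/hs_client.c`):

 {{{
 #include <stdbool.h>

 enum conn_state { RENDDESC_WAIT, CIRCUIT_WAIT, DETACHED };

 struct conn {
   const char *onion;        /* which .onion this SOCKS request is for */
   enum conn_state state;
 };

 /* Buggy shape: on the first connection whose descriptor has no usable
  * intro points, we detach it and exit immediately.  Every later
  * connection waiting on the same .onion stays in RENDDESC_WAIT until it
  * times out. */
 static void
 desc_has_arrived_buggy(struct conn *conns, int n, bool intros_usable)
 {
   for (int i = 0; i < n; i++) {
     if (conns[i].state != RENDDESC_WAIT)
       continue;
     if (!intros_usable) {
       conns[i].state = DETACHED;
       return;               /* <-- early exit: the bug */
     }
     conns[i].state = CIRCUIT_WAIT;
   }
 }
 }}}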

 This creates the problem that when our directory information changes, the
 HS client retries all SOCKS connections to see if we should refetch the
 descriptor:

 {{{
     /* Order a refetch in case it works this time. */
     status = hs_client_refetch_hsdesc(&edge_conn->hs_ident->identity_pk);
     if (BUG(status == HS_CLIENT_FETCH_HAVE_DESC)) {
       /* This case is unique because it can NOT happen in theory. Once we
        * get a new descriptor, the HS client subsystem is notified
        * immediately and the connections waiting for it are handled, which
        * means the state will change from renddesc wait state. Log this
        * and continue to next connection. */
       continue;
     }
 }}}

 The comment explaining the BUG() is actually wrong for one edge case. In
 reality, we change the state of *one* single SOCKS connection, not all of
 them. So yes, we can end up looking at a SOCKS connection for which we
 have a descriptor whose intro points (IPs), originally flagged as
 unusable, might now be usable again (after an intro point failure cache
 cleanup, for instance). The SOCKS connection state is still
 `RENDDESC_WAIT`, we now have a usable descriptor
 (`HS_CLIENT_FETCH_HAVE_DESC`), and boom.

 I think the solution needs these two elements:

 1. In `hs_client_desc_has_arrived()`, do not exit if the descriptor is
 not usable, but rather continue to loop over all SOCKS connections so we
 can detach them all.

 2. Once a descriptor arrives and is stored in our cache, if it is
 unusable, get rid of it.

 For (1), it is super important that we end up with no SOCKS connection in
 `RENDDESC_WAIT` state if we know that the descriptor is unusable and
 there are no in-flight directory requests for it; a fixed loop is
 sketched below.
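
 Continuing the earlier sketch under the same simplified stand-in types
 (this is not the real Tor code), fix (1) amounts to replacing the early
 exit with a `continue` so every waiting connection gets detached:

 {{{
 /* Fixed shape: keep iterating so every SOCKS connection waiting on the
  * unusable descriptor is detached, not just the first one. */
 static void
 desc_has_arrived_fixed(struct conn *conns, int n, bool intros_usable)
 {
   for (int i = 0; i < n; i++) {
     if (conns[i].state != RENDDESC_WAIT)
       continue;
     if (!intros_usable) {
       conns[i].state = DETACHED;
       continue;             /* keep going: detach them all */
     }
     conns[i].state = CIRCUIT_WAIT;
   }
 }
 }}}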

 About (2), there is this issue where we would create a way to
 continuously re-request the descriptor from the HSDirs on each SOCKS
 request, adding considerable load on the network. In theory, we
 shouldn't, because we have the `last_hid_serv_requests_` cache for this,
 meaning we can't query the same HSDir for the same .onion more than once
 in a 15 minute time frame. HOWEVER, in `hs_client_desc_has_arrived()`,
 if the descriptor is unusable, we purge that cache for the .onion so we
 can attempt again to fetch a descriptor that might have changed on the
 HSDir (`note_connection_attempt_succeeded()`).

 I still think (2) is worth doing, but then we might want to remove that
 purge if the IPs aren't usable, and bring the retry time period down to
 something more acceptable like 60 secs instead of 15 minutes, which is
 crazy high for the user experience.
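
 For illustration, a rate-limit check with the shorter retry window could
 look like the sketch below; the constant and function names are
 hypothetical, not Tor's (the real per-HSDir request times live in the
 `last_hid_serv_requests_` cache):

 {{{
 #include <stdbool.h>
 #include <time.h>

 /* Hypothetical retry window: 60 seconds instead of 15 minutes. */
 #define HS_DESC_REFETCH_RETRY_PERIOD 60

 /* Allow a new fetch from the same HSDir for the same .onion only once
  * the retry period has elapsed since the last attempt. */
 static bool
 can_refetch_hsdesc(time_t last_request, time_t now)
 {
   return (now - last_request) >= HS_DESC_REFETCH_RETRY_PERIOD;
 }
 }}}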

--
Ticket URL: <https://trac.torproject.org/projects/tor/ticket/27410#comment:3>
Tor Bug Tracker & Wiki <https://trac.torproject.org/>
The Tor Project: anonymity online