Re: [tor-bugs] #925 [Tor Relay]: Tor fails badly when accept(2) returns EMFILE or ENFILE



#925: Tor fails badly when accept(2) returns EMFILE or ENFILE
--------------------------------+-------------------------------------------
 Reporter:  riastradh           |         Type:  defect   
   Status:  new                 |     Priority:  minor    
Milestone:  Tor: 0.2.2.x-final  |    Component:  Tor Relay
  Version:  0.2.0.33            |   Resolution:  None     
 Keywords:                      |       Parent:           
--------------------------------+-------------------------------------------

New description:

 If accept(2) in connection_handle_listener_read fails with EMFILE or
 ENFILE, Tor logs a failure and returns to the event loop.  The
 listening socket remains ready for reading, however, so Tor
 immediately tries to accept a connection again.  This leads to tens of
 thousands of logged failures per second.  Here is an excerpt from my
 syslog:

 Feb 11 05:57:36 Tor[20415]: accept failed: Too many open files. Dropping incoming connection.
 Feb 11 05:57:54 last message repeated 301536 times
 Feb 11 05:57:54 Tor[20415]: Failing because we have 1765 connections already. Please raise your ulimit -n.
 Feb 11 05:57:54 Tor[20415]: accept failed: Too many open files. Dropping incoming connection.
 Feb 11 05:58:05 last message repeated 184158 times
 Feb 11 05:58:05 Tor[20415]: Failing because we have 1765 connections already. Please raise your ulimit -n.
 Feb 11 05:58:05 Tor[20415]: accept failed: Too many open files. Dropping incoming connection.
 Feb 11 05:58:13 last message repeated 127556 times
 Feb 11 05:58:13 Tor[20415]: Failing because we have 1765 connections already. Please raise your ulimit -n.
 Feb 11 05:58:13 Tor[20415]: accept failed: Too many open files. Dropping incoming connection.
 Feb 11 05:58:26 last message repeated 223556 times
 Feb 11 05:58:26 Tor[20415]: Failing because we have 1765 connections already. Please raise your ulimit -n.

 I don't know what the right thing to do here is, but spiking the CPU
 and spraying log messages is not a very graceful mode of failure.  One
 way to mitigate the damage might be to close the listening socket,
 which I believe won't be reopened until a minute later.  This is no
 worse for the Tor network than just wedging, and perhaps better, since
 prospective connectors would be refused rather than silently forgotten
 in a flurry of furious logging.
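
 The back-off idea above might be sketched roughly as follows.  This is
 only an illustration under stated assumptions: listener_state_t,
 listener_note_accept_failure(), listener_should_resume() and the
 60-second pause are hypothetical names and values, not Tor code.  The
 sketch only tracks when to pause and resume; closing the listener or
 removing it from the event loop is left to the caller.

```c
#include <errno.h>
#include <stdbool.h>
#include <time.h>

/* Hypothetical pause length, chosen to match the "a minute later" guess. */
#define LISTENER_PAUSE_SECONDS 60

/* Hypothetical per-listener bookkeeping; not a Tor structure. */
typedef struct listener_state {
  bool paused;       /* true while we are not accepting on this listener */
  time_t resume_at;  /* earliest time at which accepting may resume */
} listener_state_t;

/* Return true if this errno value from accept(2) means that the process
 * (EMFILE) or the whole system (ENFILE) has run out of descriptors. */
bool accept_error_is_fd_exhaustion(int err)
{
  return err == EMFILE || err == ENFILE;
}

/* Record a failed accept(2) with errno value `err` at time `now`.
 * Return true if the caller should stop polling (or close) the listener,
 * false if the error is unrelated to descriptor exhaustion. */
bool listener_note_accept_failure(listener_state_t *st, int err, time_t now)
{
  if (!accept_error_is_fd_exhaustion(err))
    return false;
  st->paused = true;
  st->resume_at = now + LISTENER_PAUSE_SECONDS;
  return true;
}

/* Called periodically.  Return true exactly once when the pause has
 * expired, so the caller can start accepting again. */
bool listener_should_resume(listener_state_t *st, time_t now)
{
  if (st->paused && now >= st->resume_at) {
    st->paused = false;
    return true;
  }
  return false;
}
```

 With something like this, the caller would log once when
 listener_note_accept_failure() returns true, instead of once per failed
 accept.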

 Also, it would be nice to document the number of file descriptors
 generally required by a Tor relay, or a formula for computing it.  For
 example, is it proportional to the bandwidth and to the number of
 relays in the Tor network?  Or to the bandwidth and to the number of
 users in the Tor network?  This way, prospective operators of Tor
 relays would not need to repeatedly restart their relays as they test
 incremental bumps in the file descriptor ulimits, unless there is some
 way to bump them without restarting the relay (but I doubt whether
 there is).
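
 On bumping the limit without a restart: POSIX lets a process raise its
 own soft RLIMIT_NOFILE up to the hard limit at run time with
 setrlimit(2); raising the hard limit generally needs privilege.  A
 minimal sketch, in which raise_fd_soft_limit() is a hypothetical helper
 and not a claim about what Tor does:

```c
#include <sys/resource.h>

/* Try to raise the soft limit on open file descriptors to `wanted`,
 * capped at the hard limit.  Returns the soft limit in effect afterwards
 * (unchanged if it was already high enough or if setrlimit() failed),
 * or 0 if the current limit could not be read at all. */
rlim_t raise_fd_soft_limit(rlim_t wanted)
{
  struct rlimit rl;
  if (getrlimit(RLIMIT_NOFILE, &rl) != 0)
    return 0;

  rlim_t current = rl.rlim_cur;
  if (current == RLIM_INFINITY || current >= wanted)
    return current;  /* nothing to do */

  /* An unprivileged process may not exceed the hard limit. */
  rlim_t target = wanted;
  if (rl.rlim_max != RLIM_INFINITY && target > rl.rlim_max)
    target = rl.rlim_max;

  rl.rlim_cur = target;
  if (setrlimit(RLIMIT_NOFILE, &rl) != 0)
    return current;  /* the old soft limit is still in effect */
  return target;
}
```

 This only helps while the soft limit is below the hard limit; it does
 not answer how many descriptors a relay actually needs.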

 (Apologies if this is duplicated: I hit ^A while editing this, in order
 to move to the beginning of the line, but the obnoxious !@#^%&%^& web
 form [and my obnoxiously colluding web browser] interpreted it to mean
 something else for which I quickly hit the stop button.  I don't know
 what hitting ^A actually did.)

 [Automatically added by flyspray2trac: Operating System: All]

--

Comment(by nickm):

 So part one of the "fix" here, if we want to try it, is for
 connection_handle_listener_read() to compare get_n_open_sockets() with
 get_options()->ConnLimit [grep through the rest of that file to see how we
 do it].  If we have too many sockets, then we should immediately close the
 new connection...

 ...and part two is, if we ever get an EMFILE/ENFILE, to reset our idea of
 what our connlimit is based on the number of files we currently have
 open...

 ...but first, we need to look through the code that connects to
 ORs/directories, and make sure that we don't actually treat a completed
 connect() attempt as meaning that a server is up.  I am 97% sure that we
 don't.
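
 Parts one and two above could be sketched roughly like this.  It is
 one reading of the proposal, not the actual patch: conn_budget_t,
 budget_admit_new_socket() and budget_note_accept_errno() are
 hypothetical stand-ins for Tor's own socket accounting, and lowering
 the limit to exactly the current socket count is an assumption about
 what "based on the number of files we currently have open" means.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for the relay's socket accounting. */
typedef struct conn_budget {
  int n_open_sockets;  /* sockets we already count as open */
  int conn_limit;      /* our current belief about how many we may have */
} conn_budget_t;

/* Part one: call after accept() succeeds, before using the new socket.
 * Returns true and counts the socket if we are under the limit; returns
 * false if the caller should close the new connection immediately. */
bool budget_admit_new_socket(conn_budget_t *b)
{
  if (b->n_open_sockets >= b->conn_limit)
    return false;
  b->n_open_sockets++;
  return true;
}

/* Part two: call with errno after accept() fails.  On EMFILE or ENFILE,
 * lower the believed limit to what we demonstrably have open right now.
 * The limit is never raised here. */
void budget_note_accept_errno(conn_budget_t *b, int err)
{
  if ((err == EMFILE || err == ENFILE) &&
      b->n_open_sockets < b->conn_limit)
    b->conn_limit = b->n_open_sockets;
}
```

 Once the limit has been lowered this way, later connections are turned
 away by the part-one check rather than by a failing accept().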

-- 
Ticket URL: <https://trac.torproject.org/projects/tor/ticket/925#comment:5>
Tor Bug Tracker & Wiki <https://trac.torproject.org/>
The Tor Project: anonymity online