[Author Prev][Author Next][Thread Prev][Thread Next][Author Index][Thread Index]

Re: [Libevent-users] Re: libevent2 stuck in Linux, CPU 100%



I drilled down, eventually, to  the bottom of this problem.

When some kind of socket problem is reported on a UDP socket, then the EV_READ is generated but  there is no data is available. So it will go in indefinite cycle with libevent.

The solution is to use recvmsg on a socket in this condition, with flag MSG_ERRQUEUE. It will read the error message and restore the socket condition.

BSD-compatible systems do not do that, this is obviously a Linux lesser-known extension.

I am working on including this workaround into the TURN server code, but my pilot code tests shows that it works perfectly and this is the right solution.

Thanks everybody !
Oleg



On Fri, Jul 19, 2013 at 8:34 PM, Oleg Moskalenko <mom040267@xxxxxxxxx> wrote:
I tried to open/close the UDP socket. It helps. I suppose that this is a Linux UDP networking problem.

Socket flag SO_BSDCOMPAT does not help.

Reopening the socket is not very convenient, of course... but as this is an exceptional situation (file descriptors overload) then it may be OK.

Thanks
Oleg



On Fri, Jul 19, 2013 at 8:25 PM, Oleg Moskalenko <mom040267@xxxxxxxxx> wrote:
OK, I checked that...

The error returned by SO_ERROR is always 0.

The socket is actually "alive": it would accept messages if sent to.

I tried to change it to recvmsg. No changes... This is what I see from strace:

..................................
[pid 24100] recvmsg(8, 0x7fff5e4803a0, MSG_PEEK) = -1 EAGAIN (Resource temporarily unavailable)
[pid 24100] epoll_wait(4, {{EPOLLERR, {u32=8, u64=8}}}, 32, 314) = 1
[pid 24100] clock_gettime(CLOCK_MONOTONIC, {254928, 717051324}) = 0
[pid 24100] gettimeofday({1374289798, 822877}, NULL) = 0
[pid 24100] recvmsg(8, 0x7fff5e4803a0, MSG_PEEK) = -1 EAGAIN (Resource temporarily unavailable)
[pid 24100] epoll_wait(4, {{EPOLLERR, {u32=8, u64=8}}}, 32, 313) = 1
[pid 24100] clock_gettime(CLOCK_MONOTONIC, {254928, 718103692}) = 0
[pid 24100] gettimeofday({1374289798, 823914}, NULL) = 0
....................................

Basically, the socket goes into a "gray" state - non-dead and non-totally-alive.

I wonder if I see the results of the "new" UDP Linux weird behavior (RFC 1122) that many are complaining about, for example:

http://web.mit.edu/Ghudson/info/linux.icmp

I do not see anything like that in non-Linux *NIXes.

Does it make any sense ? I am trying to figure out how it can be fixed at all.

Thanks
Oleg





On Fri, Jul 19, 2013 at 8:28 AM, Nick Mathewson <nickm@xxxxxxxxxxxxx> wrote:
On Fri, Jul 19, 2013 at 9:31 AM, Oleg Moskalenko <mom040267@xxxxxxxxx> wrote:
> Thank you Azat for the suggestion. It seems to me that UDP sockets are
> offenders, somehow it happens only in Linux (I know Linux has some weird UDP
> behavior):
>
> Process 20828 attached with 5 threads - interrupt to quit
> [pid 20831] clock_gettime(CLOCK_MONOTONIC,  <unfinished ...>
> [pid 20832] clock_gettime(CLOCK_MONOTONIC,  <unfinished ...>
> [pid 20831] <... clock_gettime resumed> {205614, 271115090}) = 0
> [pid 20831] gettimeofday( <unfinished ...>
> [pid 20832] <... clock_gettime resumed> {205614, 271926086}) = 0
> [pid 20831] <... gettimeofday resumed> {1374240484, 377784}, NULL) = 0
> [pid 20832] gettimeofday( <unfinished ...>
> [pid 20831] epoll_wait(20,  <unfinished ...>
> [pid 20829] clock_gettime(CLOCK_MONOTONIC,  <unfinished ...>
> [pid 20830] clock_gettime(CLOCK_MONOTONIC,  <unfinished ...>
> [pid 20832] <... gettimeofday resumed> {1374240484, 378418}, NULL) = 0
> [pid 20832] epoll_wait(16,  <unfinished ...>
> [pid 20830] <... clock_gettime resumed> {205614, 273231001}) = 0
> [pid 20829] <... clock_gettime resumed> {205614, 272801617}) = 0
> [pid 20829] gettimeofday( <unfinished ...>
> [pid 20830] gettimeofday( <unfinished ...>
> [pid 20829] <... gettimeofday resumed> {1374240484, 379094}, NULL) = 0
> [pid 20829] epoll_wait(28,  <unfinished ...>
> [pid 20830] <... gettimeofday resumed> {1374240484, 379317}, NULL) = 0
> [pid 20830] epoll_wait(24,  <unfinished ...>
> [pid 20828] recvfrom(8, 0x7fff61df20c0, 4, 2, 0xa9bc20, 0x7fff61df20bc) = -1
> EAGAIN (Resource temporarily unavailable)
> [pid 20828] epoll_wait(4, {{EPOLLERR, {u32=8, u64=8}}}, 32, 19) = 1
> [pid 20828] clock_gettime(CLOCK_MONOTONIC, {205614, 277088474}) = 0
> [pid 20828] gettimeofday({1374240484, 386338}, NULL) = 0
> [pid 20828] recvfrom(8, 0x7fff61df20c0, 4, 2, 0xa9bc20, 0x7fff61df20bc) = -1
> EAGAIN (Resource temporarily unavailable)
> [pid 20828] epoll_wait(4, {{EPOLLERR, {u32=8, u64=8}}}, 32, 12) = 1
> [pid 20828] clock_gettime(CLOCK_MONOTONIC, {205614, 286419826}) = 0
> [pid 20828] gettimeofday({1374240484, 392232}, NULL) = 0
> [pid 20828] recvfrom(8, 0x7fff61df20c0, 4, 2, 0xa9bc20, 0x7fff61df20bc) = -1

Hm. So, epoll_wait is reporting EPOLLERR on fd 8.  The Libevent
epoll.c code treats EPOLLERR as (EV_READ|EV_WRITE).  But when you
recvfrom on the socket, it only says EAGAIN.

So your program sensibly decides to keep listening for events on fd 8,
and epoll keeps telling you that there was an error.

Assuming that this recvfrom is in your code, I'll echo Vsevolod's
question: what happens when you call getsockopt(...SO_ERROR...)  on
the socket in the event handler that calls the recvfrom, to see what
the queued error is?

--
Nick
***********************************************************************
To unsubscribe, send an e-mail to majordomo@xxxxxxxxxxxxx with
unsubscribe libevent-users    in the body.