[Author Prev][Author Next][Thread Prev][Thread Next][Author Index][Thread Index]

Re: [Libevent-users] Re: libevent2 stuck in Linux, CPU 100%



I tried to open/close the UDP socket. It helps. I suppose that this is a Linux UDP networking problem.

Socket flag SO_BSDCOMPAT does not help.

Reopening the socket is not very convenient, of course... but as this is an exceptional situation (file descriptors overload) then it may be OK.

Thanks
Oleg



On Fri, Jul 19, 2013 at 8:25 PM, Oleg Moskalenko <mom040267@xxxxxxxxx> wrote:
OK, I checked that...

The error returned by SO_ERROR is always 0.

The socket is actually "alive": it would accept messages if sent to.

I tried to change it to recvmsg. No changes... This is what I see from strace:

..................................
[pid 24100] recvmsg(8, 0x7fff5e4803a0, MSG_PEEK) = -1 EAGAIN (Resource temporarily unavailable)
[pid 24100] epoll_wait(4, {{EPOLLERR, {u32=8, u64=8}}}, 32, 314) = 1
[pid 24100] clock_gettime(CLOCK_MONOTONIC, {254928, 717051324}) = 0
[pid 24100] gettimeofday({1374289798, 822877}, NULL) = 0
[pid 24100] recvmsg(8, 0x7fff5e4803a0, MSG_PEEK) = -1 EAGAIN (Resource temporarily unavailable)
[pid 24100] epoll_wait(4, {{EPOLLERR, {u32=8, u64=8}}}, 32, 313) = 1
[pid 24100] clock_gettime(CLOCK_MONOTONIC, {254928, 718103692}) = 0
[pid 24100] gettimeofday({1374289798, 823914}, NULL) = 0
....................................

Basically, the socket goes into a "gray" state - non-dead and non-totally-alive.

I wonder if I see the results of the "new" UDP Linux weird behavior (RFC 1122) that many are complaining about, for example:

http://web.mit.edu/Ghudson/info/linux.icmp

I do not see anything like that in non-Linux *NIXes.

Does it make any sense ? I am trying to figure out how it can be fixed at all.

Thanks
Oleg





On Fri, Jul 19, 2013 at 8:28 AM, Nick Mathewson <nickm@xxxxxxxxxxxxx> wrote:
On Fri, Jul 19, 2013 at 9:31 AM, Oleg Moskalenko <mom040267@xxxxxxxxx> wrote:
> Thank you Azat for the suggestion. It seems to me that UDP sockets are
> offenders, somehow it happens only in Linux (I know Linux has some weird UDP
> behavior):
>
> Process 20828 attached with 5 threads - interrupt to quit
> [pid 20831] clock_gettime(CLOCK_MONOTONIC,  <unfinished ...>
> [pid 20832] clock_gettime(CLOCK_MONOTONIC,  <unfinished ...>
> [pid 20831] <... clock_gettime resumed> {205614, 271115090}) = 0
> [pid 20831] gettimeofday( <unfinished ...>
> [pid 20832] <... clock_gettime resumed> {205614, 271926086}) = 0
> [pid 20831] <... gettimeofday resumed> {1374240484, 377784}, NULL) = 0
> [pid 20832] gettimeofday( <unfinished ...>
> [pid 20831] epoll_wait(20,  <unfinished ...>
> [pid 20829] clock_gettime(CLOCK_MONOTONIC,  <unfinished ...>
> [pid 20830] clock_gettime(CLOCK_MONOTONIC,  <unfinished ...>
> [pid 20832] <... gettimeofday resumed> {1374240484, 378418}, NULL) = 0
> [pid 20832] epoll_wait(16,  <unfinished ...>
> [pid 20830] <... clock_gettime resumed> {205614, 273231001}) = 0
> [pid 20829] <... clock_gettime resumed> {205614, 272801617}) = 0
> [pid 20829] gettimeofday( <unfinished ...>
> [pid 20830] gettimeofday( <unfinished ...>
> [pid 20829] <... gettimeofday resumed> {1374240484, 379094}, NULL) = 0
> [pid 20829] epoll_wait(28,  <unfinished ...>
> [pid 20830] <... gettimeofday resumed> {1374240484, 379317}, NULL) = 0
> [pid 20830] epoll_wait(24,  <unfinished ...>
> [pid 20828] recvfrom(8, 0x7fff61df20c0, 4, 2, 0xa9bc20, 0x7fff61df20bc) = -1
> EAGAIN (Resource temporarily unavailable)
> [pid 20828] epoll_wait(4, {{EPOLLERR, {u32=8, u64=8}}}, 32, 19) = 1
> [pid 20828] clock_gettime(CLOCK_MONOTONIC, {205614, 277088474}) = 0
> [pid 20828] gettimeofday({1374240484, 386338}, NULL) = 0
> [pid 20828] recvfrom(8, 0x7fff61df20c0, 4, 2, 0xa9bc20, 0x7fff61df20bc) = -1
> EAGAIN (Resource temporarily unavailable)
> [pid 20828] epoll_wait(4, {{EPOLLERR, {u32=8, u64=8}}}, 32, 12) = 1
> [pid 20828] clock_gettime(CLOCK_MONOTONIC, {205614, 286419826}) = 0
> [pid 20828] gettimeofday({1374240484, 392232}, NULL) = 0
> [pid 20828] recvfrom(8, 0x7fff61df20c0, 4, 2, 0xa9bc20, 0x7fff61df20bc) = -1

Hm. So, epoll_wait is reporting EPOLLERR on fd 8.  The Libevent
epoll.c code treats EPOLLERR as (EV_READ|EV_WRITE).  But when you
recvfrom on the socket, it only says EAGAIN.

So your program sensibly decides to keep listening for events on fd 8,
and epoll keeps telling you that there was an error.

Assuming that this recvfrom is in your code, I'll echo Vsevolod's
question: what happens when you call getsockopt(...SO_ERROR...)  on
the socket in the event handler that calls the recvfrom, to see what
the queued error is?

--
Nick
***********************************************************************
To unsubscribe, send an e-mail to majordomo@xxxxxxxxxxxxx with
unsubscribe libevent-users    in the body.