
Re: [Libevent-users] deadlock in callbacks



On Wed, Feb 18, 2015 at 03:56:11PM -0800, Marco Graziano wrote:
> On Wed, Feb 18, 2015 at 8:43 AM, Marco Graziano <marco@xxxxxxxxxxxxxxxx>
> wrote:
> 
> > On Tue, Feb 17, 2015 at 4:32 AM, Azat Khuzhin <a3at.mail@xxxxxxxxx> wrote:
> >
> >> On Thu, Feb 12, 2015 at 09:16:20AM -0800, Marco Graziano wrote:
> >> > Hi Azat,
> >>
> >> Hi Marco,
> >>
> >> >
> >> > The backtraces I included are from the only threads waiting while
> >> > holding a lock. The other threads are providing convincing signs of
> >> > good health in the trace log.
> >>
> >> Could you please post backtraces for all available threads?
> >> (I hope you don't have 100 of them.)
> >
> >
> Hi Azat,
> 
> As an experiment, I removed all the additional locks and set only the flags
>  BEV_OPT_CLOSE_ON_FREE |  BEV_OPT_THREADSAFE
> 
> I very quickly run into a deadlock, with the following backtraces:
> 
> (gdb) bt
> #0  __lll_lock_wait () at
> ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
> #1  0x00007ffff7770672 in _L_lock_953 () from
> /lib/x86_64-linux-gnu/libpthread.so.0
> #2  0x00007ffff77704da in __GI___pthread_mutex_lock (mutex=0x6399f0) at
> ../nptl/pthread_mutex_lock.c:114

Hi Marco,

First of all, could you post the exact commit hash, to make this specific?
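
For example, assuming you build libevent from a git checkout:

    git rev-parse HEAD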

Could you also check your program under valgrind memcheck/{drd,helgrind}
(but with your internal locks in place, to reduce the number of lines in
the report)?
Maybe you forgot to release a lock somewhere else...
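
For instance (the binary name here is just an example):

    valgrind --tool=memcheck ./cometa
    valgrind --tool=drd ./cometa
    valgrind --tool=helgrind ./cometa

helgrind in particular reports inconsistent lock acquisition orders even
before they turn into an actual deadlock.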

> #3  0x00007ffff7b9686e in evbuffer_get_length (buffer=0x6398d0) at
> buffer.c:610
> #4  0x0000000000404723 in device_write_cb (bev=0x64f350, arg=0x639ca0) at
> cometa.c:672
> #5  0x00007ffff7ba2139 in bufferevent_trigger_nolock_ (options=0, iotype=4,
> bufev=0x64f350) at bufferevent-internal.h:369
> #6  bufferevent_writecb (fd=<optimized out>, event=<optimized out>,
> arg=0x64f350) at bufferevent_sock.c:297
> #7  0x00007ffff7ba7a19 in event_persist_closure (ev=<optimized out>,
> base=0x6218d0) at event.c:1531
> #8  event_process_active_single_queue (base=base@entry=0x6218d0,
> activeq=0x621d20, max_to_process=max_to_process@entry=2147483647,
>     endtime=endtime@entry=0x0) at event.c:1590
> #9  0x00007ffff7ba82cf in event_process_active (base=0x6218d0) at
> event.c:1689
> #10 event_base_loop (base=0x6218d0, flags=0) at event.c:1912
> #11 0x000000000040a4b0 in main () at cometa.c:1579
> (gdb) t 2
> [Switching to thread 2 (Thread 0x7ffff5b30700 (LWP 28878))]
> #0  0x00007ffff68976a3 in epoll_wait () at
> ../sysdeps/unix/syscall-template.S:81
> 81 ../sysdeps/unix/syscall-template.S: No such file or directory.
> (gdb) bt
> #0  0x00007ffff68976a3 in epoll_wait () at
> ../sysdeps/unix/syscall-template.S:81
> #1  0x00007ffff7bb19a8 in epoll_dispatch (base=0x7ffff0000900,
> tv=<optimized out>) at epoll.c:463
> #2  0x00007ffff7ba8106 in event_base_loop (base=0x7ffff0000900, flags=0) at
> event.c:1898
> #3  0x0000000000409faf in device_loop (arg=0x0) at cometa.c:1445
> #4  0x00007ffff776e182 in start_thread (arg=0x7ffff5b30700) at
> pthread_create.c:312
> #5  0x00007ffff689700d in clone () at
> ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
> (gdb) t 3
> [Switching to thread 3 (Thread 0x7ffff532f700 (LWP 28879))]
> #0  0x00007ffff68976a3 in epoll_wait () at
> ../sysdeps/unix/syscall-template.S:81
> 81 in ../sysdeps/unix/syscall-template.S
> (gdb) bt
> #0  0x00007ffff68976a3 in epoll_wait () at
> ../sysdeps/unix/syscall-template.S:81
> #1  0x00007ffff7bb19a8 in epoll_dispatch (base=0x7fffe8000900,
> tv=<optimized out>) at epoll.c:463
> #2  0x00007ffff7ba8106 in event_base_loop (base=0x7fffe8000900, flags=0) at
> event.c:1898
> #3  0x0000000000409faf in device_loop (arg=0x0) at cometa.c:1445
> #4  0x00007ffff776e182 in start_thread (arg=0x7ffff532f700) at
> pthread_create.c:312
> #5  0x00007ffff689700d in clone () at
> ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
> (gdb) t 4
> [Switching to thread 4 (Thread 0x7ffff4b2e700 (LWP 28880))]
> #0  0x00007ffff68976a3 in epoll_wait () at
> ../sysdeps/unix/syscall-template.S:81
> 81 in ../sysdeps/unix/syscall-template.S
> (gdb) bt
> #0  0x00007ffff68976a3 in epoll_wait () at
> ../sysdeps/unix/syscall-template.S:81
> #1  0x00007ffff7bb19a8 in epoll_dispatch (base=0x7fffc0000900,
> tv=<optimized out>) at epoll.c:463
> #2  0x00007ffff7ba8106 in event_base_loop (base=0x7fffc0000900, flags=0) at
> event.c:1898
> #3  0x0000000000409faf in device_loop (arg=0x0) at cometa.c:1445
> #4  0x00007ffff776e182 in start_thread (arg=0x7ffff4b2e700) at
> pthread_create.c:312
> #5  0x00007ffff689700d in clone () at
> ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
> (gdb) t 5
> [Switching to thread 5 (Thread 0x7fffeffff700 (LWP 28881))]
> #0  __lll_lock_wait () at
> ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
> 135 ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S: No such file or
> directory.
> (gdb) bt
> #0  __lll_lock_wait () at
> ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
> #1  0x00007ffff7770672 in _L_lock_953 () from
> /lib/x86_64-linux-gnu/libpthread.so.0
> #2  0x00007ffff77704da in __GI___pthread_mutex_lock (mutex=0x64b320) at
> ../nptl/pthread_mutex_lock.c:114
> #3  0x00007ffff7b9d445 in bufferevent_setcb (bufev=0x64f350,
> readcb=0x405022 <evhttp_read_cb>, writecb=0x4051ce <evhttp_write_cb>,
>     eventcb=0x405129 <evhttp_event_cb>, cbarg=0x64e690) at bufferevent.c:372
> #4  0x0000000000405ecb in send_http_response (connection=0x64e690,
> response_buffer=0x7fffbc0008c0) at cometa.c:977

What does this function do?
But let's start with the valgrind report first.

> #5  0x00000000004040ef in device_chunk_cb (device=0x639ca0) at cometa.c:471

Hang on...

Do you lock this device->pending->bufev somehow (or in some inner frame)?
If so, then a *deadlock is possible*, iff in thread #1 evbuffer_get_length()
is called on this bufev's output; that is the only issue that can happen
here, if I understand your snippets and backtraces correctly.
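
To make the suspected ordering concrete, here is a minimal sketch (the
names bev_a/bev_b and the callbacks are illustrative, not from cometa.c)
of how two BEV_OPT_THREADSAFE bufferevents can deadlock when each
callback reaches over to the other bufferevent:

    #include <event2/buffer.h>
    #include <event2/bufferevent.h>

    /* Without BEV_OPT_UNLOCK_CALLBACKS, libevent invokes a bufferevent's
     * callbacks with that bufferevent's lock already held. */
    static struct bufferevent *bev_a, *bev_b;  /* both BEV_OPT_THREADSAFE */

    /* thread 1: entered with bev_a's lock held */
    static void a_write_cb(struct bufferevent *bev, void *arg)
    {
        /* takes bev_b's lock: order A -> B */
        (void)evbuffer_get_length(bufferevent_get_output(bev_b));
    }

    /* thread 2: entered with bev_b's lock held */
    static void b_read_cb(struct bufferevent *bev, void *arg)
    {
        /* takes bev_a's lock: order B -> A, an ABBA deadlock with
         * a_write_cb() above */
        bufferevent_setcb(bev_a, NULL, a_write_cb, NULL, NULL);
    }

That is exactly the shape of your two blocked frames: thread #1 holds one
bufferevent's lock inside device_write_cb() and blocks in
evbuffer_get_length() on the other bufferevent's buffer, while thread #5
holds that other bufferevent's lock inside device_read_cb() and blocks in
bufferevent_setcb() on the first.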

> #6  0x00000000004044ee in device_read_cb (bev=0x6396d0, arg=0x639ca0) at
> cometa.c:607
> #7  0x00007ffff7ba2324 in bufferevent_trigger_nolock_ (options=0, iotype=2,
> bufev=0x6396d0) at bufferevent-internal.h:366
> #8  bufferevent_readcb (fd=35, event=<optimized out>, arg=0x6396d0) at
> bufferevent_sock.c:187
> #9  0x00007ffff7ba7a19 in event_persist_closure (ev=<optimized out>,
> base=0x7fffbc000900) at event.c:1531
> #10 event_process_active_single_queue (base=base@entry=0x7fffbc000900,
> activeq=0x7fffbc000d50,
>     max_to_process=max_to_process@entry=2147483647, endtime=endtime@entry=0x0)
> at event.c:1590
> #11 0x00007ffff7ba82cf in event_process_active (base=0x7fffbc000900) at
> event.c:1689
> #12 event_base_loop (base=0x7fffbc000900, flags=0) at event.c:1912
> #13 0x0000000000409faf in device_loop (arg=0x0) at cometa.c:1445
> #14 0x00007ffff776e182 in start_thread (arg=0x7fffeffff700) at
> pthread_create.c:312
> #15 0x00007ffff689700d in clone () at
> ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

By the way, next time "(gdb) thread apply all bt" will dump all of these
backtraces in one go.

> 
> Adding BEV_OPT_DEFER_CALLBACKS seems to improve things considerably, and
> I haven't hit a deadlock yet.
> 
> At some point I will have to put some locks back in for synchronization,
> though. I see occasional buffer corruption, possibly from read callbacks
> in different threads interleaving.
> 
> Thanks,
> 
> -Marco G.
> 
> 
> 
> > Hi Azat,
> >
> > The other 2 threads are idle:
> >
> > #0  0x00007ffff68976a3 in epoll_wait () at
> > ../sysdeps/unix/syscall-template.S:81
> > #1  0x00007ffff7bb19a8 in epoll_dispatch (base=0x6218d0, tv=<optimized
> > out>) at epoll.c:463
> > #2  0x00007ffff7ba8106 in event_base_loop (base=0x6218d0, flags=0) at
> > event.c:1898
> > #3  0x000000000040a763 in main () at cometa.c:1583
> >
> >
> >> >
> >> > Definitely the write callback was called tens of thousands of times
> >> > before hitting the deadlock, and I suspect luck or chance has little
> >> > to do with it. It may have more to do with the fact that the read
> >> > callback, on receiving a message, sends it down a bufferevent whose
> >> > write callback is handled by another thread?
> >>
> >
> > Are there any implications of the read callback receiving a message
> > that is then sent to a bufferevent whose write callback runs on a
> > different thread?

I can't imagine any problems with this use case.

Even with your internal locks, the *first* lock that must be acquired is
the bufferevent lock, and it will block if one of the threads is already
holding it (there were some issues with freeing bufferevents, but they
were from before finalizers, so in 2.1 upstream they shouldn't exist
anymore).
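
For what it's worth, a minimal sketch of the option combination discussed
in this thread (assuming base and fd are your event_base and accepted
socket):

    #include <event2/bufferevent.h>

    /* BEV_OPT_DEFER_CALLBACKS runs callbacks from the event loop instead
     * of mid-operation; BEV_OPT_UNLOCK_CALLBACKS drops the bufferevent
     * lock before invoking them. */
    struct bufferevent *bev = bufferevent_socket_new(base, fd,
        BEV_OPT_CLOSE_ON_FREE | BEV_OPT_THREADSAFE |
        BEV_OPT_DEFER_CALLBACKS | BEV_OPT_UNLOCK_CALLBACKS);

With that, your rwlock is the only lock held inside your callbacks, so
the lock ordering stays entirely in your hands.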

Cheers,
Azat.

> >
> >
> >> >
> >> > The fact is that after adding BEV_OPT_UNLOCK_CALLBACKS the deadlock
> >> > has not appeared again. Before, it took about one hour of stress
> >> > testing to appear. If removing the callback lock solved the problem,
> >> > please help me understand what was happening in the first place, and
> >> > whether, for instance, the added lock inside the callback was
> >> > unnecessary or even the cause of the deadlock itself.
> >
> >
> > After additional testing, I confirm that the deadlock has not appeared
> > when setting BEV_OPT_UNLOCK_CALLBACKS.
> >
> > >
> >> > Also, since enabling OpenSSL is coming next, I was wondering if you
> >> > have a word of wisdom or two on how best to tackle the task at hand,
> >> > or which version of libevent is best suited for the job. I am using
> >> > the head of the tree, and if some debugging is what it takes, I would
> >> > be okay with that. I am afraid that locking the callbacks may be
> >> > needed with OpenSSL enabled.
> >>
> >> I prefer to use the latest versions, to be up-to-date.
> >> bufferevent_openssl doesn't require bufferevent locking - look at
> >> bufferevent_openssl_new_impl() in bufferevent_openssl.c
> >>
> >
> > Hence using BEV_OPT_UNLOCK_CALLBACKS seems to be the way to go anyway.
> >
> > Thanks,
> >
> > -Marco G.
> >
> > >
> >> > Thanks for the useful insights.
> >> >
> >> > -Marco G.
> >>
> >> And for next time: could you please avoid top-posting?
> >>
> >
> > ps. Sorry for that. Using GMail makes it too easy to forget.
> >
> >
> >>
> >> Cheers,
> >> Azat.
> >>
> >> >
> >> > On Wed, Feb 11, 2015 at 3:11 PM, Azat Khuzhin <a3at.mail@xxxxxxxxx>
> >> > wrote:
> >> >
> >> > > On Tue, Feb 10, 2015 at 09:06:40PM -0800, Marco Graziano wrote:
> >> > > > I am running into a deadlock with both libevent 2.0 and 2.1 that
> >> > > > I cannot explain. One thread is executing a write callback and
> >> > > > the other thread a read callback of a different bufferevent.
> >> > > > Since they are both accessing the same data structure passed to
> >> > > > the callbacks (arg=0x630c40), I am using a rwlock in the data
> >> > > > structure. I did not expect, though, the evbuffer's internal lock
> >> > > > to interfere with the lock used to synchronize access to the data
> >> > > > structure shared between the callbacks.
> >> > > >
> >> > > > I am including the backtraces and the relevant portions of code.
> >> > > > I would appreciate it if someone could give me any suggestion on
> >> > > > how to work around this.
> >> > > >
> >> > > > Thanks,
> >> > > >
> >> > > > -Marco G.
> >> > > >
> >> > > > (gdb) t 1
> >> > > > [Switching to thread 1 (Thread 0x7ffff7fdd780 (LWP 12008))]
> >> > > > (gdb) bt
> >> > > > #0  __lll_lock_wait () at
> >> > > > ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
> >> > > > #1  0x00007ffff7770672 in _L_lock_953 () from
> >> > > > /lib/x86_64-linux-gnu/libpthread.so.0
> >> > > > #2  0x00007ffff77704da in __GI___pthread_mutex_lock
> >> > > > (mutex=0x620200) at ../nptl/pthread_mutex_lock.c:114
> >> > > > #3  0x00007ffff7b9686e in evbuffer_get_length (buffer=0x630930) at
> >> > > > buffer.c:610
> >> > >
> >> > > I'm not sure that these two backtraces alone can deadlock. Are you?
> >> > > Could it be calling the write callback again and again, and you
> >> > > were just lucky enough to always see such a backtrace?
> >> > >
> >> > > Maybe some other thread acquires some locks in a different order?
> >> > > (I'd bet on this one.)
> >> > >
> >> > > Also, did you check the program under memcheck/drd/helgrind?
> >> > > Again, maybe you were so *lucky* that you overwrote the evbuffer's
> >> > > lock yourself, but that is so unlikely that you could just ignore
> >> > > it.
> >> > >
> >> > > And if you could write a 50-100 line sample that reproduces the
> >> > > issue, it would be very helpful.
> >> > >
> >> > > Cheers,
> >> > > Azat.
> >> > >
> >> > > > #4  0x000000000040494b in device_write_cb (bev=0x123f640,
> >> > > > arg=0x630c40) at cometa.c:667
> >> > > > #5  0x00007ffff7ba2139 in bufferevent_trigger_nolock_ (options=0,
> >> > > > iotype=4, bufev=0x123f640) at bufferevent-internal.h:369
> >> > > > #6  bufferevent_writecb (fd=<optimized out>, event=<optimized out>,
> >> > > > arg=0x123f640) at bufferevent_sock.c:297
> >> > > > #7  0x00007ffff7ba7a19 in event_persist_closure (ev=<optimized out>,
> >> > > > base=0x6218d0) at event.c:1531
> >> > > > #8  event_process_active_single_queue (base=base@entry=0x6218d0,
> >> > > > activeq=0x621d20, max_to_process=max_to_process@entry=2147483647,
> >> > > >     endtime=endtime@entry=0x0) at event.c:1590
> >> > > > #9  0x00007ffff7ba82cf in event_process_active (base=0x6218d0) at
> >> > > > event.c:1689
> >> > > > #10 event_base_loop (base=0x6218d0, flags=0) at event.c:1912
> >> > > > #11 0x000000000040a700 in main () at cometa.c:1571
> >> > > >
> >> > > >
> >> > > > bufferevent 0x123f640 write callback:
> >> > > > 665     pthread_rwlock_wrlock(&device->lock);
> >> > > > 666     if ((device->pending && device->state == WAITING_STATE) ||
> >> > > >                 device->ws_pending) {
> >> > > > 667         int l = evbuffer_get_length(device_bev);
> >>
How is device_bev set up?
(evbuffer_new()/evbuffer_enable_locking()?)
> >>
> >> > > > 668     pthread_rwlock_unlock(&device->lock);
> >> > > > 669     output = bufferevent_get_output(device->pending->bufev);
> >> > > > 670 if (evbuffer_get_length(output) == 0) {
> >> > > > 671 /* free and close the connection */
> >> > > > 672 pthread_rwlock_wrlock(&device->lock);
> >> > > > 673     free_http_connection(device->pending);
> >> > > > 674     device->pending = NULL;
> >> > > > 675     device->state = IDLE_STATE;
> >> > > > 676   device->chunk_length = 0;
> >> > > > 677     device->ntoread = -1;
> >> > > > 678
> >> > > > 679     evbuffer_drain (device_bev, l);
> >> > > > 680   sem_post(&device->active);
> >> > > > 681   pthread_rwlock_unlock(&device->lock);
> >> > > > 682     }
> >> > > > 683    } else {
> >> > > > 684     pthread_rwlock_unlock(&device->lock);
> >> > > > 685    }
> >> > > >
> >> > > >
> >> > > > (gdb) t 3
> >> > > > [Switching to thread 3 (Thread 0x7ffff532f700 (LWP 12013))]
> >> > > > (gdb) bt
> >> > > > #0  pthread_rwlock_wrlock () at
> >> > > > ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_rwlock_wrlock.S:85
> >> > > > #1  0x0000000000404667 in device_read_cb (bev=0x630730,
> >> > > > arg=0x630c40) at cometa.c:604
> >> > > > #2  0x00007ffff7ba2324 in bufferevent_trigger_nolock_ (options=0,
> >> > > > iotype=2, bufev=0x630730) at bufferevent-internal.h:366
> >> > > > #3  bufferevent_readcb (fd=34, event=<optimized out>, arg=0x630730)
> >> > > > at bufferevent_sock.c:187
> >> > > > #4  0x00007ffff7ba7a19 in event_persist_closure (ev=<optimized out>,
> >> > > > base=0x7fffe8000900) at event.c:1531
> >> > > > #5  event_process_active_single_queue (base=base@entry=0x7fffe8000900,
> >> > > > activeq=0x7fffe8000d50,
> >> > > > max_to_process=max_to_process@entry=2147483647, endtime=endtime@entry=0x0)
> >> > > > at event.c:1590
> >> > > > #6  0x00007ffff7ba82cf in event_process_active (base=0x7fffe8000900)
> >> > > > at event.c:1689
> >> > > > #7  event_base_loop (base=0x7fffe8000900, flags=0) at event.c:1912
> >> > > > #8  0x000000000040a1ff in device_loop (arg=0x0) at cometa.c:1437
> >> > > > #9  0x00007ffff776e182 in start_thread (arg=0x7ffff532f700) at
> >> > > > pthread_create.c:312
> >> > > > #10 0x00007ffff689700d in clone () at
> >> > > > ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
> >> > > >
> >> > > > bufferevent 0x630730 read callback:
> >> > > > 604     pthread_rwlock_wrlock(&device->lock);
> >> > > > 605     if (device->state != WAITING_STATE) {
> >> > > > 606
> >> > > > 607         device->chunk_length = 0;
> >> > > > 608         device->ntoread = -1;
> >> > > > 609         /* in case of state == WAITING_STATE, the counters
> >> > > >                 reset is done in the write callback after sending
> >> > > >                 the response */
> >> > > > 610     }
> >> > > > 611     pthread_rwlock_unlock(&device->lock);
> >>
> >
> >