[Author Prev][Author Next][Thread Prev][Thread Next][Author Index][Thread Index]

Re: [Libevent-users] assert in event_del_internal()



Thanks for the reply Azat, After a long investigation we discovered a bad memory module in the ESX server. Once that stick was replaced, we never saw the problem again. The ESX was running about 15,000 instances of our app in a test environment and we would see the crash on just a few every night. Never saw any other app or VM os crash, so I guess our app was the only one hitting that bad memory (statistically makes sense). Thanks memtestx86.

-BillÂ



On Sat, Oct 31, 2015 at 12:36 PM, Azat Khuzhin <a3at.mail@xxxxxxxxx> wrote:
On Wed, Feb 25, 2015 at 09:25:03AM -0500, Bill Vaughan wrote:
> Hi all,
>
> We have been seeing a fairly hard to reproduce problem with crashes inside
> of libevent in our overnight scale tests. We run 15,000 clients in an ESX
> environment and have been getting one or two crashes per night. We are
> running libevent 2.0.21 on this test system. We have moved to 2.0.22 on our
> development branch.
>
> The application is relatively large and complex and it is quite possible we
> have a memory trampler, but valgrind tests so far are clean and code
> inspection hasn't found anything.

What valgrind tools do you use?

AFAIK valgrind (at least memcheck) limited to one thread, i.e. it
emulates multi-threading (and you have multi-threaded app):
 http://valgrind.org/docs/manual/manual-core.html#manual-core.pthreads_perf_sched
  The main thing to point out with respect to threaded programs is
  that your program will use the native threading library, but
  Valgrind serialises execution so that only one (kernel) thread is
  running at a time. This approach avoids the horrible implementation
  problems of implementing a truly multithreaded version of Valgrind,
  but it does mean that threaded apps never use more than one CPU
  simultaneously, even if you have a multiprocessor or multicore
  machine.

Can you try to use gcc/clang asan/tsan instead:
$ gcc -fsanitize=thread ...
$ gcc -fsanitize=address ...

> We recently ran the overnight test with libevent debugging enabled
> (event_enable_debug_mode(), evthread_enable_lock_debuging()) and had the
> following assert in the libevent debug hash table:
>
>
> [err] event_del_internal: noting a del on a non-setup event 0xb56248a0
> (events: 0x0, fd: -1, flags: 0x80)
> === Command detached from window (Mon Feb 23 23:13:14 2015) ===
> === Command terminated with signal 6 (core file generated) (Mon Feb 23
> 23:13:14 2015) ===
>
>
> Using host libthread_db library "/lib/i386-linux-gnu/libthread_db.so.1".
> Core was generated by `./vtc_app -Script-Client29-333'.
> Program terminated with signal 6, Aborted.
> #0Â 0xb775e424 in __kernel_vsyscall ()
> (gdb) where
> #0Â 0xb775e424 in __kernel_vsyscall ()
> #1Â 0xb747a1df in raise () from /lib/i386-linux-gnu/libc.so.6
> #2Â 0xb747d825 in abort () from /lib/i386-linux-gnu/libc.so.6
> #3Â 0x0814bc31 in event_exit (errcode=-559030611) at log.c:79
> #4Â 0x0814bd53 in event_errx (eval=-559030611,
>Â Â Âfmt=0x82105b0 "%s: noting a del on a non-setup event %p (events: 0x%x,
> fd: %d, flags: 0x%x)") at log.c:136
> #5Â 0x0813dc07 in event_del_internal (ev=<optimized out>) at event.c:2265
> #6Â event_process_active_single_queue (activeq=<optimized out>,
> base=<optimized out>) at event.c:1324
> #7Â event_process_active (base=<optimized out>) at event.c:1420
> #8Â event_base_loop (base=0xb6a09d58, flags=2) at event.c:1621
> #9Â 0x08073960 in VtcCommMgr::ThreadLoop (this=0xacdcc008) at
> vtc_app/vtc_comm_mgr.cpp:844
> #10 0xb772cd4c in start_thread () from /lib/i386-linux-gnu/libpthread.so.0
> #11 0xb753ad3e in clone () from /lib/i386-linux-gnu/libc.so.6
>
>
> So looks like an event fired that was not in the hash table, which should
> only happen if the event was freed, from what I can tell. This points to a
> clear application bug, however, there is a bit more to the store.
>
> Examining the libevent hash table that tracks the events, we found this:
>
> == 667 0xb5e248f0 ==
>Â Â Â {node = {hte_next = 0x0}, ptr = 0x0, added = 0}
> global_debug_map.hth_table[666].node.hte_next
> == 668 0xb5600858 ==
>Â Â Â {node = {hte_next = 0x0}, ptr = 0xb5600808, added = 1}
> global_debug_map.hth_table[667].node.hte_next
> == 669 0xb5624950 ==
>Â Â Â {node = {hte_next = 0x0}, ptr = 0xb5624900, added = 1}
> global_debug_map.hth_table[668].node.hte_next
> == 676 0x97a5390 ==
>Â Â Â {node = {hte_next = 0x0}, ptr = 0x97a50a8, added = 0}
> global_debug_map.hth_table[669].node.hte_next
>
>
>
> Bucket 667 has an entry, but the pointer to the event is 0. The hash entry
> should be deleted (mm_free()) when the event is freed, and have a valid
> event pointer otherwise, so the event likely was not freed. All of the
> other entries in the hash look good. The event that is currently being

If you are writing in the area of memory into which you must not write,
then you can't trust all this, since this can be a side effect, not the
core issue.

> processed in the backtrace, 0xb56248a0, maps to bucket 667 based on the
> hash algorithm:Â ((0xb56248a0 >> 6) % 769) == 667, which is our bucket.
> This appears to be some sort of corruption if I am understanding the hash
> table logic correctly.
>
> The event base was good and our application only calls event_free() and not
> any of the other APIs that could free an event from the hash
> (event_reinit() and a couple others).
>
> I see in the 2.0.22 release notes that a race was fixed in event_active().
> Do you think that could cause this? Our next step is to move to 2.0.22 on
> our test system to see if the problem still happens. However, this branch
> of code is very close to a release, so we are very cautious about changing
> things without a clear cause and effect.

Not sure, but of course this could impact.
And since it's been awhile maybe you already tried 2.0.22?

> We have had several engineers reviewing the application code and so far
> have not seen any invalid libevent usages, and as I said, valgrind is
> clean. Clearly we are doing something wrong in the application or have
> found some sort of edge case within the library. And libevent 2.0.21 has
> been very stable as I have used it on this project and others.
>
> The application is complicated enough that I can't easily provide code
> snippets that are meaningful, and since we are not sure how to reproduce
> the problem, I don't have any test code that can reproduce it.
>
> All of the clients are running under Ubuntu 12.04.
>
> Has anyone else seen anything like this?

Sorry for such a long delay.
***********************************************************************
To unsubscribe, send an e-mail to majordomo@xxxxxxxxxxxxx with
unsubscribe libevent-users  in the body.