[Author Prev][Author Next][Thread Prev][Thread Next][Author Index][Thread Index]

[Libevent-users] Re: assert in event_del_internal()



We updated the clients on our ESX test bed to use libevent 2.0.22 and so far (after 43 ours of running 15K clients) have not seen the crash. The only change was moving from libevent 2.0.21 to 2.0.22.

I am thinking the root cause may be addressed by this fix in 2.0.22 as event_active() is called heavily in a multithreaded environment.Â

o Fix race caused by event_active (3c7d6fc vjpai)

Obviously the lack of a crash does not prove it is fixed, but it is a good sign and so far we have not identified any other source.

Thanks,
-Bill


On Wed, Feb 25, 2015 at 9:25 AM, Bill Vaughan <billvaughan@xxxxxxxxx> wrote:
Hi all,

We have been seeing a fairly hard to reproduce problem with crashes inside of libevent in our overnight scale tests. We run 15,000 clients in an ÂESX environment and have been getting one or two crashes per night. We are running libevent 2.0.21 on this test system. We have moved to 2.0.22 on our development branch.

The application is relatively large and complex and it is quite possible we have a memory trampler, but valgrind tests so far are clean and code inspection hasn't found anything.

We recently ran the overnight test with libevent debugging enabled (event_enable_debug_mode(), evthread_enable_lock_debuging()) and had the following assert in the libevent debug hash table:


[err] event_del_internal: noting a del on a non-setup event 0xb56248a0 (events: 0x0, fd: -1, flags: 0x80)
=== Command detached from window (Mon Feb 23 23:13:14 2015) ===
=== Command terminated with signal 6 (core file generated) (Mon Feb 23 23:13:14 2015) ===
Â

Using host libthread_db library "/lib/i386-linux-gnu/libthread_db.so.1".
Core was generated by `./vtc_app -Script-Client29-333'.
Program terminated with signal 6, Aborted.
#0Â 0xb775e424 in __kernel_vsyscall ()
(gdb) where
#0Â 0xb775e424 in __kernel_vsyscall ()
#1Â 0xb747a1df in raise () from /lib/i386-linux-gnu/libc.so.6
#2Â 0xb747d825 in abort () from /lib/i386-linux-gnu/libc.so.6
#3Â 0x0814bc31 in event_exit (errcode=-559030611) at log.c:79
#4Â 0x0814bd53 in event_errx (eval=-559030611,
ÂÂÂ fmt=0x82105b0 "%s: noting a del on a non-setup event %p (events: 0x%x, fd: %d, flags: 0x%x)") at log.c:136
#5Â 0x0813dc07 in event_del_internal (ev=<optimized out>) at event.c:2265
#6Â event_process_active_single_queue (activeq=<optimized out>, base=<optimized out>) at event.c:1324
#7Â event_process_active (base=<optimized out>) at event.c:1420
#8Â event_base_loop (base=0xb6a09d58, flags=2) at event.c:1621
#9Â 0x08073960 in VtcCommMgr::ThreadLoop (this=0xacdcc008) at vtc_app/vtc_comm_mgr.cpp:844
#10 0xb772cd4c in start_thread () from /lib/i386-linux-gnu/libpthread.so.0
#11 0xb753ad3e in clone () from /lib/i386-linux-gnu/libc.so.6


So looks like an event fired that was not in the hash table, which should only happen if the event was freed, from what I can tell. This points to a clear application bug, however, there is a bit more to the store.

Examining the libevent hash table that tracks the events, we found this:

==Â667Â0xb5e248f0 ==
ÂÂÂÂ {node = {hte_next = 0x0},Âptr = 0x0, added = 0}
global_debug_map.hth_table[666].node.hte_next
== 668 0xb5600858 ==
ÂÂÂÂ {node = {hte_next = 0x0}, ptr = 0xb5600808, added = 1}
global_debug_map.hth_table[667].node.hte_next
== 669 0xb5624950 ==
ÂÂÂÂ {node = {hte_next = 0x0}, ptr = 0xb5624900, added = 1}
global_debug_map.hth_table[668].node.hte_next
== 676 0x97a5390 ==
ÂÂÂÂ {node = {hte_next = 0x0}, ptr = 0x97a50a8, added = 0}
global_debug_map.hth_table[669].node.hte_next
Â

Bucket 667 has an entry, but the pointer to the event is 0. The hash entry should be deleted (mm_free()) when the event is freed, and have a valid event pointer otherwise, so the event likely was not freed. All of the other entries in the hash look good. The event that is currently being processed in the backtrace,Â0xb56248a0, maps to bucket 667 based on the hash algorithm: Â((0xb56248a0 >> 6) % 769) == 667, which is our bucket. This appears to be some sort of corruption if I am understanding the hash table logic correctly.

The event base was good and our application only calls event_free() and not any of the other APIs that could free an event from the hash (event_reinit() and a couple others).

I see in the 2.0.22 release notes that a race was fixed in event_active(). Do you think that could cause this? Our next step is to move to 2.0.22 on our test system to see if the problem still happens. However, this branch of code is very close to a release, so we are very cautious about changing things without a clear cause and effect.

We have had several engineers reviewing the application code and so far have not seen any invalid libevent usages, and as I said, valgrind is clean. Clearly we are doing something wrong in the application or have found some sort of edge case within the library. And libevent 2.0.21 has been very stable as I have used it on this project and others.

The application is complicated enough that I can't easily provide code snippets that are meaningful, and since we are not sure how to reproduce the problem, I don't have any test code that can reproduce it.

All of the clients are running under Ubuntu 12.04.

Has anyone else seen anything like this?

Thanks,
-Bill