[Author Prev][Author Next][Thread Prev][Thread Next][Author Index][Thread Index]

[Libevent-users] Race Condition resulting in busy loop



Hi,

I have had all instances of an application unable to break out of the timeout_process() while loop. More precisely, I'm in an infinte loop in https://github.com/libevent/libevent/blob/release-2.1.5-beta/event.c#L3086-L3096


while ((ev = min_heap_top_(&base->timeheap))) {
if (evutil_timercmp(&ev->ev_timeout, &now, >))
break;
/* delete this event from the I/O queues */
event_del_nolock_(ev, EVENT_DEL_NOBLOCK);
event_debug(("timeout_process: event: %p, call %p",
ev, ev->ev_callback));
event_active_nolock_(ev, EV_TIMEOUT, 1);
}



I have gdb access to one process which hasn't been restarted and can print out the values of libevent structures from the stack track if anyone can help shed some light on how this race condition was triggered. 
I should mention that 
a) libevent was built with NDEBUG flag - I've already had one unit test fail because of this and have subsequently to switched to releasing a newer build without NDEBUG. However there were some clients which picked up the library built with NDEBUG.
b) An https request callback was invoked with "evhttp_request_get_response_code" returning 0. (This was a call to S3 in US-EAST-1 during the outage on 02/28/2017, so I don't have much details on how to recreate this issue given that AWS is keeping quiet on the same.)

The following is a snippet from gdb stack trace.

#0  0x00007f212cf942d2 in timeout_process (base=0x153b090) at libevent-2.1.5-beta/event.c:3095
#1  event_base_loop (base=0x153b090, flags=<optimized out>) at libevent-2.1.5-beta/event.c:1909
(gdb) info locals
now = {tv_sec = 1488305174, tv_usec = 870146}
ev = 0x19dafe0
(gdb) p *base
$1 = {evsel = 0x7f212d1be380 <epollops>, evbase = 0x153b330, changelist = {changes = 0x0, n_changes = 0, changes_size = 0}, evsigsel = 0x7f212d1be2c0 <evsigops>, sig = {ev_signal = {ev_evcallback = {evcb_active_next = {tqe_next = 0x0,
          tqe_prev = 0x153b4e0}, evcb_flags = 146, evcb_pri = 0 '\000', evcb_closure = 2 '\002', evcb_cb_union = {evcb_callback = 0x7f212cf9b3f0 <evsig_cb>, evcb_selfcb = 0x7f212cf9b3f0 <evsig_cb>,
          evcb_evfinalize = 0x7f212cf9b3f0 <evsig_cb>, evcb_cbfinalize = 0x7f212cf9b3f0 <evsig_cb>}, evcb_arg = 0x153b090}, ev_timeout_pos = {ev_next_with_common_timeout = {tqe_next = 0xffffffff, tqe_prev = 0x0}, min_heap_idx = -1},
      ev_fd = 5, ev_base = 0x153b090, ev_ = {ev_io = {ev_io_next = {le_next = 0x0, le_prev = 0x17fc760}, ev_timeout = {tv_sec = 0, tv_usec = 0}}, ev_signal = {ev_signal_next = {le_next = 0x0, le_prev = 0x17fc760}, ev_ncalls = 0,
          ev_pncalls = 0x0}}, ev_events = 18, ev_res = 2, ev_timeout = {tv_sec = 0, tv_usec = 0}}, ev_signal_pair = {5, 6}, ev_signal_added = 1, ev_n_signals_added = 6, sh_old = 0x183a680, sh_old_max = 16}, virtual_event_count = 0,
  virtual_event_count_max = 0, event_count = 19, event_count_max = 23, event_count_active = 0, event_count_active_max = 5, event_gotterm = 0, event_break = 0, event_continue = 0, event_running_priority = -1, running_loop = 1,
  n_deferreds_queued = 0, activequeues = 0x153b4e0, nactivequeues = 1, active_later_queue = {tqh_first = 0x0, tqh_last = 0x153b198}, common_timeout_queues = 0x0, n_common_timeouts = 0, n_common_timeouts_allocated = 0, io = {
    entries = 0x153b740, nentries = 32}, sigmap = {entries = 0x1839550, nentries = 32}, timeheap = {p = 0x153c430, n = 5, a = 8}, tv_cache = {tv_sec = 1488305174, tv_usec = 870146}, monotonic_timer = {adjust_monotonic_clock = {
      tv_sec = 0, tv_usec = 0}, last_time = {tv_sec = 1488305174, tv_usec = 870146}}, tv_clock_diff = {tv_sec = 0, tv_usec = 0}, last_updated_clock_diff = 1488305174, th_owner_id = 1, th_base_lock = 0x0, current_event_cond = 0x0,
  current_event_waiters = 0, current_event = 0x0, flags = EVENT_BASE_FLAG_PRECISE_TIMER, max_dispatch_time = {tv_sec = -1, tv_usec = 0}, max_dispatch_callbacks = 2147483647, limit_callbacks_after_prio = 2147483647,
  is_notify_pending = 0, th_notify_fd = {-1, -1}, th_notify = {ev_evcallback = {evcb_active_next = {tqe_next = 0x0, tqe_prev = 0x0}, evcb_flags = 0, evcb_pri = 0 '\000', evcb_closure = 0 '\000', evcb_cb_union = {evcb_callback = 0x0,
        evcb_selfcb = 0x0, evcb_evfinalize = 0x0, evcb_cbfinalize = 0x0}, evcb_arg = 0x0}, ev_timeout_pos = {ev_next_with_common_timeout = {tqe_next = 0x0, tqe_prev = 0x0}, min_heap_idx = 0}, ev_fd = 0, ev_base = 0x0, ev_ = {ev_io = {
        ev_io_next = {le_next = 0x0, le_prev = 0x0}, ev_timeout = {tv_sec = 0, tv_usec = 0}}, ev_signal = {ev_signal_next = {le_next = 0x0, le_prev = 0x0}, ev_ncalls = 0, ev_pncalls = 0x0}}, ev_events = 0, ev_res = 0, ev_timeout = {
      tv_sec = 0, tv_usec = 0}}, th_notify_fn = 0x0, weakrand_seed = {seed = 0}, _once_events_ = {lh_first = 0x0}}
(gdb) p *ev
$2 = {ev_evcallback = {evcb_active_next = {tqe_next = 0x0, tqe_prev = 0x153b4e0}, evcb_flags = 128, evcb_pri = 0 '\000', evcb_closure = 2 '\002', evcb_cb_union = {evcb_callback = 0x7f212cf9d430 <be_openssl_handshakeeventcb>,
      evcb_selfcb = 0x7f212cf9d430 <be_openssl_handshakeeventcb>, evcb_evfinalize = 0x7f212cf9d430 <be_openssl_handshakeeventcb>, evcb_cbfinalize = 0x7f212cf9d430 <be_openssl_handshakeeventcb>}, evcb_arg = 0x19daf50},
  ev_timeout_pos = {ev_next_with_common_timeout = {tqe_next = 0xffffffff, tqe_prev = 0x0}, min_heap_idx = -1}, ev_fd = -1, ev_base = 0x153b090, ev_ = {ev_io = {ev_io_next = {le_next = 0x19daf60, le_prev = 0x19c16a0}, ev_timeout = {
        tv_sec = 0, tv_usec = 0}}, ev_signal = {ev_signal_next = {le_next = 0x19daf60, le_prev = 0x19c16a0}, ev_ncalls = 0, ev_pncalls = 0x0}}, ev_events = 84, ev_res = 1, ev_timeout = {tv_sec = 1488305174, tv_usec = 866138}}
(gdb) p ev->ev_timeout
$3 = {tv_sec = 1488305174, tv_usec = 866138}
(gdb) p now
$4 = {tv_sec = 1488305174, tv_usec = 870146}


As you can observe, the ev_timeout for the event obtained from priority queue is always less than current time and hence it never breaks out of the loop. I have interrupted the process multiple times and it's always within the stack call trace for timeout_process and always using the same "ev" event address from the priority queue.

I assume there are some optimizations behind why the value of "now" is not refreshed within the loop.


--
Sanjiv Raj


Marvin: I've been talking to the ship's computer.
Arthur: And?
Marvin: It hates me.