[Author Prev][Author Next][Thread Prev][Thread Next][Author Index][Thread Index]

Re: [Libevent-users] Signals and priority queues



On Jan 13, 2012, at 9:53 AM, Nick Mathewson wrote:

> On Fri, Jan 13, 2012 at 11:46 AM, Ralph Castain <rhc@xxxxxxxxxxxx> wrote:
> 
>>> Adding some assertions in event_add_internal might track this down.
>>> Trivially, you could do
>>>   if (tv && !tv_is_absolute) {
>>>       /* waiting one billion seconds should be enough for anyone */
>>>       EVUTIL_ASSERT(tv->tv_sec < 1000000000);
>>>   }
>>> 
>>> to try to detect 1 and 2.
>> 
>> Interesting. The above code never tripped, so I dug a little further and found that event_add_internal is never being called with a tv value that is large. I did find it to be a race condition - sometimes the  code completes and exits before I get the error condition report.
>> 
>> The timeout value clearly isn't a garbage value - I dumped the values out, compared to current time as of the error:
>> 
>> warn] select: Invalid argument
>> TV OUT OF SPEC AT CNT 2: value 1326472513:976848 curtime 1326472513:977043
>> Ralph
>> [warn] select: Invalid argument
>> TV OUT OF SPEC AT CNT 3: value 1326472513:977327 curtime 1326472513:977413
>> 
>> So the value is getting updated and appears valid. What's strange is why libevent is passing an absolute time to select as it is supposed to be a relative value per the man page:
> 
> Right.
> 
> [...]
>> Any easy way I can output an identifier that would tell us something about which event is involved? I see that I'm not getting output from the event_debug calls in the code, even though I've configured with debug enabled and called:
>> 
>>        event_enable_debug_mode();
>>        event_set_debug_output(1);
>> 
>> Anything else required to get that output? Would it help?
> 
> Maybe.  I'd look at adding debugging logs or printfs to
> event_base_loop and timeout_next() and timeout_correct(): Those are
> the ones that determine the value of the timeout to be passed to
> evsel->dispatch.  The event that's getting the weird value is set by
>        ev = min_heap_top(&base->timeheap);
> in timeout_next.

Well, well - how interesting! I couldn't find an out-of-bounds value in any of those places, so I looked a little deeper. Put a print statement in timeout_next:

static int
timeout_next(struct event_base *base, struct timeval **tv_p)
{
	/* Caller must hold th_base_lock */
	struct timeval now;
	struct event *ev;
	struct timeval *tv = *tv_p;
	int res = 0;

	ev = min_heap_top(&base->timeheap);

	if (ev == NULL) {
		/* if no time-based events are active wait for I/O */
            fprintf(stderr, "NO TIME-BASED EVENTS ACTIVE - tvp %d\n", (tv == NULL) ? -1 : (int)tv->tv_sec);
		goto out;
	}

        
and sure enough, that print came out every time. So there are no time-based events in the system, which confirms what I had found when auditing the code. I also found that the value being passed into timeout_next is the current system time.

So it looks like, in the absence of a time-based event, you pass the current system time into evsel->dispatch.

I'm wondering if this is a side-effect of using the fake fd events:

    if (event_assign(t2, base, -1, EV_WRITE, t2func, t2) < 0) {
        die("event_assign_term");
    }
    if (event_priority_set(t2, TERMPRI) < 0) {
        die("event_priority_set_term");
    }
    event_active(t2, EV_WRITE, 1);

Other than the signal traps, these are all I have defined. I removed the signal events, but saw no change in behavior. So it must be something about the way I'm doing these events that is causing it to give a bad value to select...

To review, I setup an initial event per above. When that event fires, the handler creates and activates another event (same method as above) that is supposed to send us to the next step of the state machine. The handler then free's the memory holding its event and returns. I don't know if the free is causing an issue (it is the active event that is being free'd) - just needed to avoid the memory leak somehow.

I see select_dispatch getting called after the first event is activated, and the timeout value is fine (0,0). The rest of the event "chain" fires without select being called. At the end of the chain, I see select_dispatch being called again - with the current time as the timeout value.

Any suggestions on where to look next would be most appreciated. Still working on a reliable reproducer. :-/

HTH
Ralph


> -- 
> Nick
> ***********************************************************************
> To unsubscribe, send an e-mail to majordomo@xxxxxxxxxxxxx with
> unsubscribe libevent-users    in the body.

***********************************************************************
To unsubscribe, send an e-mail to majordomo@xxxxxxxxxxxxx with
unsubscribe libevent-users    in the body.