
Re: [Libevent-users] Signals and priority queues



On Jan 13, 2012, at 7:29 AM, Nick Mathewson wrote:

> On Fri, Jan 13, 2012 at 7:47 AM, Ralph Castain <rhc@xxxxxxxxxxxx> wrote:
>> I've been digging further into this, and I believe I have much of it resolved now. However, I have encountered a problem that appears to be something in libevent itself.
>> 
>> I configured libevent with debug enabled, and turned it on at execution - and was barraged by:
>> 
>> [warn] select: Invalid argument
>> 
>> Digging further into the reason, I found that the message comes from the following code in select_dispatch (file select.c):
> 
> Weird that you're using select.c; nearly any other backend would be faster.

It's on a Mac, so select is the only backend we enable there, and speed isn't really an issue. We force that configuration for OMPI purposes. :-/ From our source comments:

     * Default to select() on OS X and poll() everywhere else because
     * various parts of OMPI / ORTE use libevent with pty's.  pty's
     * *only* work with select on OS X (tested on Tiger and Leopard);
     * we *know* that both select and poll work with pty's everywhere
     * else we care about (other mechanisms such as epoll *may* work
     * with pty's -- we have not tested comprehensively with newer
     * versions of Linux, etc.).  So the safe thing to do is:
     *
     * - On OS X, default to using "select" only
     * - Everywhere else, default to using "poll" only (because poll
     *   is more scalable than select)

> 
>> 
>>        res = select(nfds, sop->event_readset_out,
>>            sop->event_writeset_out, NULL, tv);
>> 
>>        EVBASE_ACQUIRE_LOCK(base, th_base_lock);
>> 
>>        check_selectop(sop);
>> 
>>        if (res == -1) {
>>                if (errno != EINTR) {
>>                        event_warn("select");
>>                        return (-1);
>>                }
>> 
>>                return (0);
>>        }
>> 
>> The timeout value being supplied to select_dispatch is being corrupted after the first time thru the routine - it comes into the routine the first time as {0, 0}, but is an illegal value thereafter. Resetting the timeout to the original value resolves the problem.
> 
> What kind of illegal value are you seeing,

1326467251, 774650

> coming from where?

I'm not sure who calls "select_dispatch" - the value is passed into it.

> Are you
> using the common_timeout code?

This is just flowing through from a call to event_loop - I'm not sure of the call chain that takes us down to select_dispatch.

>  What are you doing to "reset the
> timeout" ?

I just hacked select_dispatch to save the timeout from its first call, then restore that value if select fails:

static struct timeval rhctv;
static int rhcfirst=1;
static int rhccnt=0;
static int rhcretry=0;

static int
select_dispatch(struct event_base *base, struct timeval *tv)
{
	int res=0, i, j, nfds;
	struct selectop *sop = base->evbase;

        if (1 == rhcfirst) {
            fprintf(stderr, "ORIGINAL TV %d sec %d usec\n", (int)tv->tv_sec, (int)tv->tv_usec);
            rhctv.tv_sec = tv->tv_sec;
            rhctv.tv_usec = tv->tv_usec;
            rhcfirst = 0;
        }
        rhccnt++;
        rhcretry = 0;

	check_selectop(sop);
	if (sop->resize_out_sets) {
		fd_set *readset_out=NULL, *writeset_out=NULL;
		size_t sz = sop->event_fdsz;
		if (!(readset_out = mm_realloc(sop->event_readset_out, sz)))
			return (-1);
		sop->event_readset_out = readset_out;
		if (!(writeset_out = mm_realloc(sop->event_writeset_out, sz))) {
			/* We don't free readset_out here, since it was
			 * already successfully reallocated. The next time
			 * we call select_dispatch, the realloc will be a
			 * no-op. */
			return (-1);
		}
		sop->event_writeset_out = writeset_out;
		sop->resize_out_sets = 0;
	}

	memcpy(sop->event_readset_out, sop->event_readset_in,
	       sop->event_fdsz);
	memcpy(sop->event_writeset_out, sop->event_writeset_in,
	       sop->event_fdsz);

	nfds = sop->event_fds+1;

 retry:
	EVBASE_RELEASE_LOCK(base, th_base_lock);

	res = select(nfds, sop->event_readset_out,
	    sop->event_writeset_out, NULL, tv);

	EVBASE_ACQUIRE_LOCK(base, th_base_lock);

	check_selectop(sop);

	if (res == -1) {
		if (errno != EINTR) {
			event_warn("select");
                        fprintf(stderr, "TV OUT OF SPEC AT CNT %d: value %d:%d\n", rhccnt, (int)tv->tv_sec, (int)tv->tv_usec);
                        tv->tv_sec = rhctv.tv_sec;
                        tv->tv_usec = rhctv.tv_usec;
                        if (0 == rhcretry) {
                            rhcretry = 1;
                            goto retry;
                        } else {
                            exit(0);
                        }
			return (-1);
		}

		return (0);
	}
...

Retrying select with the restored value always succeeds. The timeout is clearly being overwritten somewhere, but I don't know enough of libevent's internal call sequence to figure out where or why. Note that this happens after looping through the event create/activate sequence we were discussing. I'm going to see if I can create a minimal reproducer based on that code.
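
For anyone chasing the same warning, a small sanity check can flag a bad timeout before it ever reaches select(). The sketch below is not libevent code: timeout_is_sane and MAX_SANE_TIMEOUT_SEC are hypothetical names, and the 100000000-second bound is an assumption chosen for illustration, not a documented limit. The value reported above (1326467251 sec) looks more like an absolute wall-clock timestamp than a relative interval, which is the kind of input this check is meant to catch.

```c
#include <stddef.h>
#include <sys/time.h>

/* Assumed upper bound for a relative timeout, in seconds (roughly three
 * years). This is an illustrative choice, not a value taken from libevent. */
#define MAX_SANE_TIMEOUT_SEC 100000000L

/* Hypothetical helper: returns 1 if tv is a plausible relative timeout
 * for select(), 0 otherwise. A NULL pointer is accepted because select()
 * treats it as "block indefinitely". */
int timeout_is_sane(const struct timeval *tv)
{
    if (tv == NULL)
        return 1;
    if (tv->tv_sec < 0 || tv->tv_sec > MAX_SANE_TIMEOUT_SEC)
        return 0;   /* negative, or large enough to look like an absolute time */
    if (tv->tv_usec < 0 || tv->tv_usec >= 1000000)
        return 0;   /* microseconds must stay within one second */
    return 1;
}
```

Calling timeout_is_sane(tv) just before the select() call, and logging or aborting when it returns 0, would show on which pass the value goes bad without waiting for select to fail.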

> 
> -- 
> Nick
> ***********************************************************************
> To unsubscribe, send an e-mail to majordomo@xxxxxxxxxxxxx with
> unsubscribe libevent-users    in the body.
