[Author Prev][Author Next][Thread Prev][Thread Next][Author Index][Thread Index]
Re: [Libevent-users] epoll erros
Maybe not too much help but it seems like strace would answer
possibilities 1, 2 and 3, that's how I typically track what is happening
with epoll inside libevent.
On Fri, Oct 22, 2010 at 01:38:45PM -0400, Nick Mathewson wrote:
> [summary: Nick looks for a bug, and finds only weirdness. But I think
> we're on to something, and we *are* making progress. Big find: there
> doesn't seem to be anything wrong with the changlist logic; something
> weird is up with epoll here. Could other people please help me look
> into this?]
>
> On Fri, Oct 22, 2010 at 12:20 AM, Gilad Benjamini
> <gilad@xxxxxxxxxxxxxxxxx> wrote:
> >> -----Original Message-----
> >> From: Nick Mathewson <nickm@xxxxxxxxxxxxx>
> >> Sent: Tuesday, September 21, 2010 7:43 PM
> >> To: libevent-users@xxxxxxxxxxxxx
> >> Subject: Re: [Libevent-users] epoll erros
> >> ...
> >>
> >> I just added a couple more detailed debugging messages in the git
> >> master branch to help try to figure out what exactly we're doing to
> >> epoll fd, and why. If you can get the same output with the new stuff
> >> from git master, that would be great.
> >>
> >
> > I am correct to assume these additions made it into 2.0.8-rc ?
> > Assuming they did, I am attaching the output when run with 2.0.8-rc.
> >
> > If that's not enough, please direct me to the specific changes and I'll run
> > with them
>
> Hm. Here's the first instance of the bug at issue here. Let's see if
> it means anything. I've cut out some lines only having to do with
> timeout events, so everything that remains is about the
> soon-to-be-infamous fd 13:
>
> > [debug] event_del: 0x801e7618 (fd 13), callback 0x80105870
>
> Here's the first appearance of fd 13. It's in event 0x801e7618, which
> I'll be calling 618 for short. Note that this is a delete of an event
> that hasn't been added yet. That's allowed.
>
> > [debug] event_add: event: 0x801e7618 (fd 13), EV_WRITE call 0x80105870
>
> Then we add that event again. It's an EV_WRITE event.
>
> > [debug] Epoll ADD(4) on fd 13 okay. [old events were 0; read change was 0; write change was 1]
>
> And now fd 13 hits epoll for the first time. We add it for
> writing(4==EPOLLOUT), since the write change is 1 (1==EV_CHANGE_ADD)
>
> > [debug] epoll_dispatch: epoll_wait reports 1
> > [debug] event_active: 0x801e7618 (fd 13), res 4, callback 0x80105870
> > [debug] event_process_active: event: 0x801e7618, EV_WRITE call 0x80105870
>
> The event triggers! We process it....
>
> > [debug] event_del: 0x801e7618 (fd 13), callback 0x80105870
>
> Then delete it.
>
> > [debug] event_add: event: 0x801e7618 (fd 13), EV_READ call 0x80105870
>
> Now we add it. This time 618 is a write event on fd 13. Interesting!
> It looks like the architecture of this program is to have one event
> per fd (or per fd of this kind), and use event_assign [or event_set if
> this is legacy code] to switch it between reading and writing as
> needed.
>
> This is allowed, but there's a possible bug to watch out for: you must
> never event_assign [or event_set] an event that has been added without
> first deleting it. That is almost certainly not happening here, since
> Gilad has said that his code is calling event_enable_debug_mode(), and
> that is supposed to catch any attempt to event_assign or event_set on
> an added event.
>
> It won't catch attempts to manually reassign the event's ev_events
> while the event is added. I sure hope nobody is doing that. Don't
> touch struct event's fields by hand.
>
> So this is a little bug-prone in the wrong hands, but there's no
> actual bug evident here yet.
>
> > [debug] Epoll MOD(1) on fd 13 okay. [old events were 4; read change was 1; write change was 2]
>
> So now we call epoll_ctl again, with command EPOLL_MOD(1==EPOLLIN).
> MOD is right, because fd 13 was added before. The write
> change(EV_CHANGE_DEL) is correct because we were previously writing.
> The read change is right because we added a read. The old events were
> indeed 4==EV_WRITE.
>
> > [debug] epoll_dispatch: epoll_wait reports 1
> > [debug] event_active: 0x801e7618 (fd 13), res 2, callback 0x80105870
> > [debug] event_process_active: event: 0x801e7618, EV_READ call 0x80105870
>
> The read event triggers.
>
> > [debug] epoll_dispatch: epoll_wait reports 1
> > [debug] event_active: 0x801e7618 (fd 13), res 2, callback 0x80105870
> > [debug] event_process_active: event: 0x801e7618, EV_READ call 0x80105870
>
> The read event triggers again.
>
> > [debug] event_del: 0x801e7618 (fd 13), callback 0x80105870
> > [debug] event_add: event: 0x801e7618 (fd 13), EV_WRITE call 0x80105870
> > [debug] event_del: 0x801e7618 (fd 13), callback 0x80105870
> > [debug] event_add: event: 0x801e7618 (fd 13), EV_READ call 0x80105870
>
> We delete it, then it gets added as a write event, and deleted, and
> added again as a read event.
>
> So that's event_del(read), event_add(write), event_del(write), and
> event_add(read). ADD-then-DEL coalesces into a no-op, but
> DEL-then-ADD coalesces into an ADD that might be redundant.
>
> > [debug] Epoll ADD(1) on fd 13 gave File exists: ADD was redundant
>
> Yup.
>
> > [debug] epoll_dispatch: epoll_wait reports 1
> > [debug] event_active: 0x801e7618 (fd 13), res 2, callback 0x80105870
> > [debug] event_process_active: event: 0x801e7618, EV_READ call 0x80105870
> > [debug] event_del: 0x801e7618 (fd 13), callback 0x80105870
>
> The read triggers, then we delete it.
>
> > [debug] event_process_active: event: 0x801e76a8, call 0x80102126
> > [debug] Epoll DEL(1) on fd 13 gave Bad file descriptor: DEL was unnecessary.
>
> When we go to delete it, we get an EBADF. This happens when the file
> is closed then we try to do an operation on it. It should be fine.
>
> > [debug] epoll_dispatch: epoll_wait reports 0
> > [debug] event_process_active: event: 0x801d5a98, call 0x80102126
> > [debug] event_del: 0x801e76a0 (fd 13), callback 0x80105870
> > [debug] event_add: event: 0x801e76a0 (fd 13), EV_WRITE call 0x80105870
> > [debug] event_del: 0x801e76a0 (fd 13), callback 0x80105870
> > [debug] event_add: event: 0x801e76a0 (fd 13), EV_READ call 0x80105870
>
> Now here comes a new event on fd 13: 0x801e76a0 (6a0 for short).
> Let's see what happens. We delete an (unknown) event, then add it for
> writing, then delete that, then add it again for reading.
>
> > [warn] Epoll ADD(1) on fd 13 failed. Old events were 0; read change was 1 (add); write change was 0 (none): File exists
>
> And boom, there's the bug. This is not consistent with the sequence
> above; if you replay the Epoll lines above with a test socketpair, and
> try to close and reopen the socket at the appropriate time, you'll
> find that you don't get the EEXIST here. I've attached a quick test
> program to demonstrate that this sequence of operations shouldn't get
> an EEXIST.
>
> To recap, the sequence of epoll events on 13 was:
> 1: [debug] Epoll ADD(4) on fd 13 okay. [old events were 0; read change
> was 0; write change was 1]
> 2: [debug] Epoll MOD(1) on fd 13 okay. [old events were 4; read change
> was 1; write change was 2]
> 3: [debug] Epoll ADD(1) on fd 13 gave File exists: ADD was redundant
> 4: (close the fd here, or else you won't see an EBADF in the next step)
> 5: [debug] Epoll DEL(1) on fd 13 gave Bad file descriptor: DEL was unnecessary.
> 6: (reopen the fd here, or else you *will* see an EBADF in the next step.)
> 7: [warn] Epoll ADD(1) on fd 13 failed. Old events were 0; read
> change was 1 (add); write change was 0 (none): File exists
>
> But when I run my test program with Linux 2.6.34.7, I get:
> 1: add(OUT) OK
> 2: mod(IN) OK
> 3: add(IN) 17: File exists
> 4: close
> 5: del(IN) 9: Bad file descriptor
> 6: reopen
> 7: add(IN) OK
>
> So, something's up! The logs have an eexist, an EEXIST, but the Epoll
> operations above don't cause an EEXIST on their own.
>
> A couple of possibilities, none of them very pleasant looking:
> 1) The debug logs are wrong. There are more epoll_ctl operations
> going on than are getting logged.
> 2) I am wrong about the close; something else causes the EBADF on step 5.
> 3) I am wrong about the reopen; something else prevents an EBADF at step 7.
> 4) The debug logs are confused somehow; the events are all happening,
> and no other events are happening, but things are somehow not as they
> seem and there is missing information that would make this all clear.
> 5) My test harness is somehow wrong for your situation; maybe if you
> use pipes or something, everthing is
>
> 1 seems unlikely from a simple read-through of the libevent code:
> there is no way to call epoll_ctl without a message getting logged as
> far as I can tell.
>
> If 2, then what else might cause an EBADF at step 5? If 3, then what
> else might prevent an EBADF at step 7? (A cursory read of
> fs/eventpoll.c in the Linux 2.6.34.7 source turns up nothing for
> either.)
>
> 4 is conceivable under a couple of weird circumstances. Might there
> be more than one event_base in the application, so that fd 13 is
> actually getting added/removed from separate event_bases? Might there
> be multithreading issues? Is the application forking, then calling
> (or not calling) event_reinit, then using Libevent stuff?
>
> 5 is also conceivable; if you can find some way to tweak the attached
> program so that step 7 gets an eexist from a series of operations
> similar to those in your program as logged above, I'd love to know.
>
> --
> Nick
***********************************************************************
To unsubscribe, send an e-mail to majordomo@xxxxxxxxxxxxx with
unsubscribe libevent-users in the body.