[Author Prev][Author Next][Thread Prev][Thread Next][Author Index][Thread Index]

Re: [Libevent-users] epoll erros



On Fri, Oct 22, 2010 at 4:25 PM, Nicholas Marriott
<nicholas.marriott@xxxxxxxxx> wrote:
> I think something needs to walk the file->f_ep_links list on close() and
> remove any epitems where epi->ffd->fd is the fd being closed from the
> tree in epi->ep.
>
> I don't have a Linux box to try this on though :-).

I think you're right.  I don't do kernel hacking myself though, so
maybe somebody else should take this one on.

In the meantime, this doesn't help people using any of the kernels
currently in the wild.  Fortunately, part of the point of Libevent is
to work around weird kernel behaviors.  Since the epoll tree _does_
keep a separate struct epitem for every (file, fd) pair, I believe
that we can work around this behavior by having epoll fall back to
EPOLL_CTL_MOD when EPOLL_CTL_ADD fails with EEXIST.

Incidentally, there's another, more sinister bug hiding here.  When
Libevent's changelist code coalesces a (DEL,ADD) combo, rather than
making it a no-op, it makes it into an "ADD", since it's possible that
the fd was closed in between the DEL and the ADD.  It calls these adds
"precautionary".  When the current epoll code encounters such an add
and the add fails with EEXIST, we currently ignore the failure with a
debug message.  But if the user has been using dup() in the same way
as Gilad's original code, this means that they'll think they have
added the event, when in fact they haven't.

I've attached a series of two patches that try to get working behavior
here.  See the commit messages for details.  Please test them if you
can; they'll probably go into 2.0.9 if nobody finds them to be buggy.
They are also branch epoll_dup in my github repo.

In the process, I also decided that maybe epoll_apply_changes should
have fewer conditional branches, since they tend to be a poor choice
on modern CPUs.  I've got a branch epoll_table in my github repo that
replaces the huge pile of conditionals at the start of
eopll_apply_changes with a table lookup.  It is *NOT* under
consideration for 2.0, since it's a relatively large change and it
doesn't fix a bug, but if anybody wants to have a look, that would be
cool tool.

(My github repo is at git://github.com/nmathewson/Libevent.git )

yrs,
-- 
Nick
From c281aba30e69a501fc183d068894bfa47f891700 Mon Sep 17 00:00:00 2001
From: Nick Mathewson <nickm@xxxxxxxxxxxxxx>
Date: Sun, 24 Oct 2010 11:38:29 -0400
Subject: [PATCH 1/2] Fix a nasty bug related to use of dup() with epoll on Linux

Current versions of the Linux kernel don't seem to remove the struct
epitem for a given (file,fd) combo when the fd is closed unless the
file itself is also completely closed.  This means that if you do:
   fd = dup(fd_orig);
   add(fd);
   close(fd);
   dup2(fd_orig, fd);
   add(fd);
you will get an EEXIST when you should have gotten a success.  This
could cause warnings and dropped events when using dup and epoll.

The solution is pretty simple: when we get an EEXIST from
EPOLL_CTL_ADD, we retry with EPOLL_CTL_MOD.

Unit test included to demonstrate the bug.

Found due to the patient efforts of Gilad Benjamini; diagnosed with
help from Nicholas Marriott.
---
 epoll.c        |   32 ++++++++++++++----------
 test/regress.c |   73 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 92 insertions(+), 13 deletions(-)

diff --git a/epoll.c b/epoll.c
index b574bf4..9c8f0e9 100644
--- a/epoll.c
+++ b/epoll.c
@@ -155,7 +155,6 @@ epoll_apply_changes(struct event_base *base)
 	int op, events;
 
 	for (i = 0; i < changelist->n_changes; ++i) {
-		int precautionary_add = 0;
 		ch = &changelist->changes[i];
 		events = 0;
 
@@ -203,13 +202,12 @@ epoll_apply_changes(struct event_base *base)
 				  on the fd, we need to try the ADD anyway, in
 				  case the fd was closed at some in the
 				  middle.  If it wasn't, the ADD operation
-				  will fail with; that's okay.
-				*/
-				precautionary_add = 1;
+				  will fail with EEXIST; and we retry it as a
+				  MOD. */
+				op = EPOLL_CTL_ADD;
 			} else if (ch->old_events) {
 				op = EPOLL_CTL_MOD;
 			}
-
 		} else if ((ch->read_change & EV_CHANGE_DEL) ||
 		    (ch->write_change & EV_CHANGE_DEL)) {
 			/* If we're deleting anything, we'll want to do a MOD
@@ -255,14 +253,22 @@ epoll_apply_changes(struct event_base *base)
 						(int)epev.events,
 						ch->fd));
 				}
-			} else if (op == EPOLL_CTL_ADD && errno == EEXIST &&
-			    precautionary_add) {
-				/* If a precautionary ADD operation fails with
-				   EEXIST, that's fine too.
-				 */
-				event_debug(("Epoll ADD(%d) on fd %d gave %s: ADD was redundant",
-					(int)epev.events,
-					ch->fd, strerror(errno)));
+			} else if (op == EPOLL_CTL_ADD && errno == EEXIST) {
+				/* If an ADD operation fails with EEXIST,
+				 * either the operation was redundant (as with a
+				 * precautionary add), or we ran into a fun
+				 * kernel bug where using dup*() to duplicate the
+				 * same file into the same fd gives you the same epitem
+				 * rather than a fresh one.  For the second case,
+				 * we must retry with MOD. */
+				if (epoll_ctl(epollop->epfd, EPOLL_CTL_MOD, ch->fd, &epev) == -1) {
+					event_warn("Epoll ADD(%d) on %d retried as MOD; that failed too",
+					    (int)epev.events, ch->fd);
+				} else {
+					event_debug(("Epoll ADD(%d) on %d retried as MOD; succeeded.",
+						(int)epev.events,
+						ch->fd));
+				}
 			} else if (op == EPOLL_CTL_DEL &&
 			    (errno == ENOENT || errno == EBADF ||
 				errno == EPERM)) {
diff --git a/test/regress.c b/test/regress.c
index 21c921b..0cc4017 100644
--- a/test/regress.c
+++ b/test/regress.c
@@ -2061,6 +2061,76 @@ end:
 	}
 }
 
+#ifndef WIN32
+/* You can't do this test on windows, since dup2 doesn't work on sockets */
+
+static void
+dfd_cb(evutil_socket_t fd, short e, void *data)
+{
+	*(int*)data = (int)e;
+}
+
+/* Regression test for our workaround for a fun epoll/linux related bug
+ * where fd2 = dup(fd1); add(fd2); close(fd2); dup2(fd1,fd2); add(fd2)
+ * will get you an EEXIST */
+static void
+test_dup_fd(void *arg)
+{
+	struct basic_test_data *data = arg;
+	struct event_base *base = data->base;
+	struct event *ev1=NULL, *ev2=NULL;
+	int fd, dfd=-1;
+	int ev1_got, ev2_got;
+
+	tt_int_op(write(data->pair[0], "Hello world",
+		strlen("Hello world")), >, 0);
+	fd = data->pair[1];
+
+	dfd = dup(fd);
+	tt_int_op(dfd, >=, 0);
+
+	ev1 = event_new(base, fd, EV_READ|EV_PERSIST, dfd_cb, &ev1_got);
+	ev2 = event_new(base, dfd, EV_READ|EV_PERSIST, dfd_cb, &ev2_got);
+	ev1_got = ev2_got = 0;
+	event_add(ev1, NULL);
+	event_add(ev2, NULL);
+	event_base_loop(base, EVLOOP_ONCE);
+	tt_int_op(ev1_got, ==, EV_READ);
+	tt_int_op(ev2_got, ==, EV_READ);
+
+	/* Now close and delete dfd then dispatch.  We need to do the
+	 * dispatch here so that when we add it later, we think there
+	 * was an intermediate delete. */
+	close(dfd);
+	event_del(ev2);
+	ev1_got = ev2_got = 0;
+	event_base_loop(base, EVLOOP_ONCE);
+	tt_want_int_op(ev1_got, ==, EV_READ);
+	tt_int_op(ev2_got, ==, 0);
+
+	/* Re-duplicate the fd.  We need to get the same duplicated
+	 * value that we closed to provoke the epoll quirk.  Also, we
+	 * need to change the events to write, or else the old lingering
+	 * read event will make the test pass whether the change was
+	 * successful or not. */
+	tt_int_op(dup2(fd, dfd), ==, dfd);
+	event_free(ev2);
+	ev2 = event_new(base, dfd, EV_WRITE|EV_PERSIST, dfd_cb, &ev2_got);
+	event_add(ev2, NULL);
+	ev1_got = ev2_got = 0;
+	event_base_loop(base, EVLOOP_ONCE);
+	tt_want_int_op(ev1_got, ==, EV_READ);
+	tt_int_op(ev2_got, ==, EV_WRITE);
+
+end:
+	if (ev1)
+		event_free(ev1);
+	if (ev2)
+		event_free(ev2);
+	close(dfd);
+}
+#endif
+
 #ifdef _EVENT_DISABLE_MM_REPLACEMENT
 static void
 test_mm_functions(void *arg)
@@ -2229,6 +2299,9 @@ struct testcase_t main_testcases[] = {
 	{ "event_once", test_event_once, TT_ISOLATED, &basic_setup, NULL },
 	{ "event_pending", test_event_pending, TT_ISOLATED, &basic_setup,
 	  NULL },
+#ifndef WIN32
+	{ "dup_fd", test_dup_fd, TT_ISOLATED, &basic_setup, NULL },
+#endif
 	{ "mm_functions", test_mm_functions, TT_FORK, NULL, NULL },
 	BASIC(many_events, TT_ISOLATED),
 
-- 
1.7.2.3

From 2c66983a6d46cbd863b079a1c92c916b41efe0ba Mon Sep 17 00:00:00 2001
From: Nick Mathewson <nickm@xxxxxxxxxxxxxx>
Date: Sun, 24 Oct 2010 11:51:14 -0400
Subject: [PATCH 2/2] Simplify the logic for choosing EPOLL_CTL_ADD vs EPOLL_CTL_MOD

Previously, we chose "ADD" whenever old_events==new_events, (since
we expected the add to fail with EEXIST), or whenever old_events
was==0, and MOD otherwise (i.e., when old_events was nonzero and not
equal to new_events).

But now that we retry failed MOD events as ADD *and* failed ADD
events as MOD, the important thing is now to try to guess right the
largest amount of the time, since guessing right means we do only
one syscall, but guessing wrong means we do two.

When old_events is 0, ADD is probably right (unless we're hitting
the dup bug, when we'll fall back).

And when old_events is set and != new_events, MOD is almost
certainly right for the same reasons as before.

But when old_events is equal to new events, then MOD will work fine
unless we closed and reopened the fd, in which case we'll have to
fall back to the ADD case.  (Redundant del/add pairs are more common
than closes for most use cases.)

This change lets us avoid calculating new_events, which ought to
save a little time in epoll.c
---
 epoll.c |   35 +++++++++++++++++++----------------
 1 files changed, 19 insertions(+), 16 deletions(-)

diff --git a/epoll.c b/epoll.c
index 9c8f0e9..672025f 100644
--- a/epoll.c
+++ b/epoll.c
@@ -169,43 +169,46 @@ epoll_apply_changes(struct event_base *base)
 
 		*/
 
+		/* TODO: Turn this into a switch or a table lookup. */
+
 		if ((ch->read_change & EV_CHANGE_ADD) ||
 		    (ch->write_change & EV_CHANGE_ADD)) {
 			/* If we are adding anything at all, we'll want to do
 			 * either an ADD or a MOD. */
-			short new_events = ch->old_events;
 			events = 0;
 			op = EPOLL_CTL_ADD;
 			if (ch->read_change & EV_CHANGE_ADD) {
 				events |= EPOLLIN;
-				new_events |= EV_READ;
 			} else if (ch->read_change & EV_CHANGE_DEL) {
-				new_events &= ~EV_READ;
+				;
 			} else if (ch->old_events & EV_READ) {
 				events |= EPOLLIN;
 			}
 			if (ch->write_change & EV_CHANGE_ADD) {
 				events |= EPOLLOUT;
-				new_events |= EV_WRITE;
 			} else if (ch->write_change & EV_CHANGE_DEL) {
-				new_events &= ~EV_WRITE;
+				;
 			} else if (ch->old_events & EV_WRITE) {
 				events |= EPOLLOUT;
 			}
 			if ((ch->read_change|ch->write_change) & EV_ET)
 				events |= EPOLLET;
 
-			if (new_events == ch->old_events) {
-				/*
-				  If the changelist has an "add" operation,
-				  but no visible change to the events enabled
-				  on the fd, we need to try the ADD anyway, in
-				  case the fd was closed at some in the
-				  middle.  If it wasn't, the ADD operation
-				  will fail with EEXIST; and we retry it as a
-				  MOD. */
-				op = EPOLL_CTL_ADD;
-			} else if (ch->old_events) {
+			if (ch->old_events) {
+				/* If MOD fails, we retry as an ADD, and if
+				 * ADD fails we will retry as a MOD.  So the
+				 * only hard part here is to guess which one
+				 * will work.  As a heuristic, we'll try
+				 * MOD first if we think there were old
+				 * events and ADD if we think there were none.
+				 *
+				 * We can be wrong about the MOD if the file
+				 * has in fact been closed and re-opened.
+				 *
+				 * We can be wrong about the ADD if the
+				 * the fd has been re-created with a dup()
+				 * of the same file that it was before.
+				 */
 				op = EPOLL_CTL_MOD;
 			}
 		} else if ((ch->read_change & EV_CHANGE_DEL) ||
-- 
1.7.2.3