On Mon, Dec 20, 2010 at 01:32:50AM -0500, Kevin Bowling wrote:
> I'm at wit's end with a libevent threading bug. As part of a disconnect
> client routine, I manually call the errorcb with EVENT_ERROR_EOF. The idea
> was to keep all the cleanup code in one callback but I'm beginning to think
> this was ill-conceived. Somehow, libevent is entering a condition wait and
> it looks like the event base is never releasing the cv mutex.
>
> For the worker threads, the workers are passed a *bev and *ctx pointer
> through a singly linked list that is mutex protected. When a client
> disconnects, a reaper is run in the errorcb to remove any pending work
> entries that match that *bev pointer. This happens mutually exclusive to
> the workers so a worker should not consume a null *bev. I'm guessing the
> errorcb is being called twice.
>
> It may be a longshot asking such a broad question but any advice is
> appreciated.
>
> Here is the backtrace of the worker thread(6), and the event base(1):
If you can give me a method of reproducing the issue with your code, I
can give it a quick lookover.
It would be quite difficult without the game client, which is closed source/pay (and not mine).
I've got a pretty good idea of what is going on. Since I was manually invoking the errorcb function from another thread, the main event base is still doing read/write callbacks in the background. Under race condition, the bev is freed by the worker thread while the read or write callbacks are trying work with it.
I've looked all over the libevent API and source, but I can see no clean method for bufferevent cancellation from another thread. Calling bufferevent_free from the workers could still result in a race.
What I'm leaning toward now is a making an event_base_once() call to schedule a new event that calls the errorcb from the main thread. The event loop will naturally ensure the errorcb and read/write are mutually exclusive. I _think_ my errorcb code will clean out the workers w/o race under such terms (but I really need sleep :)) in which case things are deterministic again. If that fails, I will have to resort to r/w locks in the errorcb and workers which will make the solution a bit more expensive.