Discussion:
active index mismatch in heap?
Brandon Black
2011-02-27 05:27:10 UTC
Permalink
Hi,
I'm getting the following assert() failure rarely when running with
libev 4.04 w/ EV_VERIFY=3 (this on Linux w/ epoll, a multi-threaded
app that uses one loop per thread). I haven't yet managed to catch
the assert while running under a debugger, it only seems to happen
when I'm running a "live" test daemon (I made it so that assert()
output would go to syslog), and even then it's only happened twice in
the course of a week or so:

Assertion '("libev: active index mismatch in heap", ((W)((heap
[i]).w))->active == i)' failed in verify_heap() at ./libev/ev.c:1978

At this point I'm just looking for some pointers to help me track down
what's going on: is this generally likely to be an application bug on
my part (I am using the public API only), or should I be looking more
at libev itself? I'd post code but the application is rather complex
and I haven't been able to reliably reproduce this even with the full
app, much less pare it down to a minimal case yet. In general I'm
only using normal ev_timer, not ev_periodic. Some of my timers are
short one-shot timers, and then there's one repeating timer which is
reset to a new interval with w->repeat = X; ev_timer_again() within
its callback (as part of a double-linked list setup like strategy #4
in ev.pod).

Is there some likely stupid error I could be making in my use of the
ev_timer API that would lead to this assert, that I could double-check
the code for specifically?

-- Brandon
Marc Lehmann
2011-02-27 07:38:10 UTC
Permalink
Post by Brandon Black
Assertion '("libev: active index mismatch in heap", ((W)((heap
[i]).w))->active == i)' failed in verify_heap() at ./libev/ev.c:1978
At this point I'm just looking for some pointers to help me track down
what's going on: is this generally likely to be an application bug on
my part (I am using the public API only), or should I be looking more
at libev itself?
Could be anywhere - what it means is that watchers often (and timers
specifically) store the array index of where they are stored in the ->active
member.

The assert means that the watcher at index i doesn't say it is at index i
(but somewhere else).

Could have lots of reasons - a bug in the heap code, a thread race
condition, something corrupting memory specifically, the index value) and
maybe other reasons.
Post by Brandon Black
Is there some likely stupid error I could be making in my use of the
ev_timer API that would lead to this assert, that I could double-check
the code for specifically?
Yes, but there are lots of possibilities. If you use threads, then a
possibility would be that you stop the watcher in another thread. It could
also be that you free an active watcher.

You could prepend an if statement to the assert check and log the value
of i and the value of ev_active (ANHE_w (i)). If i is somewhere between 1 and
roughly the numebr of active watchers, then it is at leats valid, and the bug
migth be timer mismanagement. If it's some widely different value then it's
likely memory corruption. If it's 0, then it could be a thread race.

However, that is all just "coulds" - every value could be any bug.

As for a bug in the heap manegement itself, I would say it's unlikely, as
that assert only checks whether the active value mismatches, and the only
assignments are:

heap [k] = *minpos;
ev_active (ANHE_w (*minpos)) = k;

heap [k] = he;
ev_active (ANHE_w (he)) = k;

heap [k] = heap [c];
ev_active (ANHE_w (heap [k])) = k;

heap [k] = he;
ev_active (ANHE_w (he)) = k;

heap [k] = heap [p];
ev_active (ANHE_w (heap [k])) = k;

heap [k] = he;
ev_active (ANHE_w (he)) = k;

Which makes explaining mismatches much harder. That leaves non-heap code
that might or might not manipulate ev_active.

As a first step, I would log the offending values of i and ev_active at
position i, and look for thread races, or forgetting to stop a watcher and
then reusing it. (e.g. free (), but also stuff like calling ev_timer_init
on an active timer).
--
The choice of a Deliantra, the free code+content MORPG
-----==- _GNU_ http://www.deliantra.net
----==-- _ generation
---==---(_)__ __ ____ __ Marc Lehmann
--==---/ / _ \/ // /\ \/ / ***@schmorp.de
-=====/_/_//_/\_,_/ /_/\_\
Brandon Black
2011-03-08 15:29:51 UTC
Permalink
Post by Marc Lehmann
Post by Brandon Black
Assertion '("libev: active index mismatch in heap", ((W)((heap
[i]).w))->active == i)' failed in verify_heap() at ./libev/ev.c:1978
[...]
Yes, but there are lots of possibilities. If you use threads, then a
possibility would be that you stop the watcher in another thread. It could
also be that you free an active watcher.
Just for the record, I was eventually able to track down my error, and
it was a corner case where I was failing to stop a timer watcher
before free()-ing it. Thanks for the help

-- Brandon
Marc Lehmann
2011-03-09 09:23:14 UTC
Permalink
Post by Brandon Black
before free()-ing it. Thanks for the help
That's reassuring to hear, one bug less.

I wonder how much the slowdown of very frequent ev_verify's is compared
to running under valgrind (which presumably would have pinpointed this
problem nicely).
--
The choice of a Deliantra, the free code+content MORPG
-----==- _GNU_ http://www.deliantra.net
----==-- _ generation
---==---(_)__ __ ____ __ Marc Lehmann
--==---/ / _ \/ // /\ \/ / ***@schmorp.de
-=====/_/_//_/\_,_/ /_/\_\
Brandon Black
2011-03-09 11:54:00 UTC
Permalink
Post by Marc Lehmann
I wonder how much the slowdown of very frequent ev_verify's is compared
to running under valgrind (which presumably would have pinpointed this
problem nicely).
I tried both with this application. Valgrind is considerably slower,
but works well enough for testing with just a handful of connections.
Running the daemon under gdb for a long time with lots of test clients
was what really nailed down the problem for me in the end, because it
was still fast enough to be usable, and when it aborted from the
assert I could look at a stack trace.

Loading...