* [Patch 01/12] CLD: fix crash in retransmissions
@ 2010-04-18 4:37 Pete Zaitcev
2010-04-19 3:46 ` Jeff Garzik
0 siblings, 1 reply; 3+ messages in thread
From: Pete Zaitcev @ 2010-04-18 4:37 UTC (permalink / raw)
To: Jeff Garzik; +Cc: Project Hail List
For a longest time I was plagued by (very infrequent) crashes like this:
Program received signal SIGSEGV, Segmentation fault.
sess_retry_output (timer=0x92070c0) at session.c:532
532 if (!next_retry || (op->next_retry < next_retry))
(gdb) info threads
* 1 Thread 0xb72f96c0 (LWP 22417) sess_retry_output (timer=0x92070c0) at session.c:532
(gdb) where
#0 sess_retry_output (timer=0x92070c0) at session.c:532
#1 session_retry (timer=0x92070c0) at session.c:565
#2 0x08049aee in cld_timers_run (tlist=0x8056630) at ../lib/libtimer.c:95
#3 0x0804e9cc in main_loop (argc=5, argv=0xbff70bd4) at server.c:983
#4 main (argc=5, argv=0xbff70bd4) at server.c:1138
The crash happens because op is NULL. As it turned out, this happens
if a packet retransmit and a session expiration occur simultaneously
(in the same pass of timers_run). The scenario is:
- timers_run collects expired timers at exec list
- timers_run expires session
- two timer_del are called, but one of them is on exec list already,
so it's ineffective
- session is freed, this zeroes ->data in lists (later op)
- timers_run continues along the exec list, invokes the retransmission
callback, and that crashes with NULL op.
The proposed solution is to rework the timers_run, again. But this
time, we'll make it simpler by observing that timers are ordered by
expiration time. Therefore, we can pull next timer off the list,
expire it, and loop until expiration time is greater than the current
time. No execution list is kept. The integrity of the main list
is assured by never walking it and always referring to the head
anew at each iteration.
This patch appears to fix the problem and stands up to use that
crashed the old code.
Signed-off-by: Pete Zaitcev <zaitcev@redhat.com>
---
include/cld_common.h | 10 ++++++----
lib/libtimer.c | 41 ++++++++++++++++++++---------------------
2 files changed, 26 insertions(+), 25 deletions(-)
I am absurdly proud of this. But the long history of timer problem
in CLD makes me cautious. I already thought that we fixed everything
in the past.
commit c1b4a0c4dda14d9fbc4d5896e9e0dc139c31798b
Author: Master <zaitcev@lembas.zaitcev.lan>
Date: Sat Apr 17 19:02:35 2010 -0600
Fix persistent CLD crashes in retransmits (op == NULL).
diff --git a/include/cld_common.h b/include/cld_common.h
index d269e6c..379bd9e 100644
--- a/include/cld_common.h
+++ b/include/cld_common.h
@@ -24,6 +24,7 @@
#include <stdbool.h>
#include <string.h>
#include <time.h>
+#include <glib.h>
#include <openssl/sha.h>
#include <cld_msg_rpc.h>
@@ -31,10 +32,6 @@
struct hail_log;
-struct cld_timer_list {
- void *list;
-};
-
struct cld_timer {
bool fired;
bool on_list;
@@ -44,6 +41,11 @@ struct cld_timer {
char name[32];
};
+struct cld_timer_list {
+ GList *list; /* of struct cld_timer */
+ time_t runmark;
+};
+
extern void cld_timer_add(struct cld_timer_list *tlist, struct cld_timer *timer,
time_t expires);
extern void cld_timer_del(struct cld_timer_list *tlist, struct cld_timer *timer);
diff --git a/lib/libtimer.c b/lib/libtimer.c
index c6f5241..75514f0 100644
--- a/lib/libtimer.c
+++ b/lib/libtimer.c
@@ -44,6 +44,17 @@ void cld_timer_add(struct cld_timer_list *tlist, struct cld_timer *timer,
if (timer->on_list)
timer_list = g_list_remove(timer_list, timer);
+ /*
+ * This additional resiliency is required by the invocations from
+ * session_retry(). For some reason the computations in it result
+ * in attempts to add timers in the past sometimes, and then we loop
+ * when trying to run those. FIXME: maybe fix that one day.
+ *
+ * Even if we fix the callers, we probably should keep this.
+ */
+ if (expires < tlist->runmark + 1)
+ expires = tlist->runmark + 1;
+
timer->on_list = true;
timer->fired = false;
timer->expires = expires;
@@ -66,38 +77,26 @@ time_t cld_timers_run(struct cld_timer_list *tlist)
struct cld_timer *timer;
time_t now = time(NULL);
time_t next_timeout = 0;
- GList *tmp, *cur;
- GList *timer_list = tlist->list;
- GList *exec_list = NULL;
-
- tmp = timer_list;
- while (tmp) {
- timer = tmp->data;
- cur = tmp;
- tmp = tmp->next;
+ GList *cur;
+ tlist->runmark = now;
+ for (;;) {
+ cur = tlist->list;
+ if (!cur)
+ break;
+ timer = cur->data;
if (timer->expires > now)
break;
- timer_list = g_list_remove_link(timer_list, cur);
- exec_list = g_list_concat(exec_list, cur);
-
+ tlist->list = g_list_delete_link(tlist->list, cur);
timer->on_list = false;
- }
- tlist->list = timer_list;
-
- tmp = exec_list;
- while (tmp) {
- timer = tmp->data;
- tmp = tmp->next;
timer->fired = true;
timer->cb(timer);
}
if (tlist->list) {
- timer_list = tlist->list;
- timer = timer_list->data;
+ timer = tlist->list->data;
if (timer->expires > now)
next_timeout = (timer->expires - now);
else
^ permalink raw reply related [flat|nested] 3+ messages in thread* Re: [Patch 01/12] CLD: fix crash in retransmissions
2010-04-18 4:37 [Patch 01/12] CLD: fix crash in retransmissions Pete Zaitcev
@ 2010-04-19 3:46 ` Jeff Garzik
2010-04-19 4:38 ` Pete Zaitcev
0 siblings, 1 reply; 3+ messages in thread
From: Jeff Garzik @ 2010-04-19 3:46 UTC (permalink / raw)
To: Pete Zaitcev; +Cc: Project Hail List
On 04/18/2010 12:37 AM, Pete Zaitcev wrote:
> For a longest time I was plagued by (very infrequent) crashes like this:
>
> Program received signal SIGSEGV, Segmentation fault.
> sess_retry_output (timer=0x92070c0) at session.c:532
> 532 if (!next_retry || (op->next_retry< next_retry))
> (gdb) info threads
> * 1 Thread 0xb72f96c0 (LWP 22417) sess_retry_output (timer=0x92070c0) at session.c:532
> (gdb) where
> #0 sess_retry_output (timer=0x92070c0) at session.c:532
> #1 session_retry (timer=0x92070c0) at session.c:565
> #2 0x08049aee in cld_timers_run (tlist=0x8056630) at ../lib/libtimer.c:95
> #3 0x0804e9cc in main_loop (argc=5, argv=0xbff70bd4) at server.c:983
> #4 main (argc=5, argv=0xbff70bd4) at server.c:1138
>
> The crash happens because op is NULL. As it turned out, this happens
> if a packet retransmit and a session expiration occur simultaneously
> (in the same pass of timers_run). The scenario is:
> - timers_run collects expired timers at exec list
> - timers_run expires session
> - two timer_del are called, but one of them is on exec list already,
> so it's ineffective
> - session is freed, this zeroes ->data in lists (later op)
> - timers_run continues along the exec list, invokes the retransmission
> callback, and that crashes with NULL op.
>
> The proposed solution is to rework the timers_run, again. But this
> time, we'll make it simpler by observing that timers are ordered by
> expiration time. Therefore, we can pull next timer off the list,
> expire it, and loop until expiration time is greater than the current
> time. No execution list is kept. The integrity of the main list
> is assured by never walking it and always referring to the head
> anew at each iteration.
>
> This patch appears to fix the problem and stands up to use that
> crashed the old code.
>
> Signed-off-by: Pete Zaitcev<zaitcev@redhat.com>
>
> ---
> include/cld_common.h | 10 ++++++----
> lib/libtimer.c | 41 ++++++++++++++++++++---------------------
> 2 files changed, 26 insertions(+), 25 deletions(-)
applied 1-4
Note that I change the email subject line prefix (which is normally
copied by automated tools directly into git) from "CLD" to "libcldc",
when committing to git. I am trying to follow (and encourage others to)
the kernel's method of using a prefix to indicate the subsystem or
section within the current git repo to which a change applies.
To be fully friendly with automated tools, an ideal Project Hail subject
line might read
[cld patch 1/1] update Makefile.am
or
[tabled patch 1/1] update Makefile.am
or
[cld patch 1/1] libcldc: add nncld API, the new new CLD API
Not a big deal, just noting what is most friendly to the git automated
tools, when I'm importing each Hail patch.
Jeff
^ permalink raw reply [flat|nested] 3+ messages in thread
end of thread, other threads:[~2010-04-19 4:38 UTC | newest]
Thread overview: 3+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2010-04-18 4:37 [Patch 01/12] CLD: fix crash in retransmissions Pete Zaitcev
2010-04-19 3:46 ` Jeff Garzik
2010-04-19 4:38 ` Pete Zaitcev
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.