From mboxrd@z Thu Jan 1 00:00:00 1970 From: Jeff Garzik Subject: Re: [Patch 01/12] CLD: fix crash in retransmissions Date: Sun, 18 Apr 2010 23:46:07 -0400 Message-ID: <4BCBD1FF.3080909@garzik.org> References: <20100417223709.3f902df3@redhat.com> Mime-Version: 1.0 Content-Transfer-Encoding: 7bit Return-path: DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma; h=domainkey-signature:received:received:sender:message-id:date:from :user-agent:mime-version:to:cc:subject:references:in-reply-to :content-type:content-transfer-encoding; bh=niG9RYnzRjTpDRpyRGdrYTpepAJ4dj0oDV5DLgiBSM4=; b=sTrwALT1CB1AXoi24dCaV3T95PBjXA6CXaX8TYHawl0zxlGNFzcA0sfvXm4KU0VHmL Xx7UrtBEaqEtKcLISyFkYZofovhBvnbzs7QGPl5uN0QR1QnDXHcIYL/zB9a5FOgf+atx Ummnz4EF+9MSwCh9HKqOVn4EzVxzmOBe+CzUI= In-Reply-To: <20100417223709.3f902df3@redhat.com> Sender: hail-devel-owner@vger.kernel.org List-ID: Content-Type: text/plain; charset="us-ascii"; format="flowed" To: Pete Zaitcev Cc: Project Hail List On 04/18/2010 12:37 AM, Pete Zaitcev wrote: > For a longest time I was plagued by (very infrequent) crashes like this: > > Program received signal SIGSEGV, Segmentation fault. > sess_retry_output (timer=0x92070c0) at session.c:532 > 532 if (!next_retry || (op->next_retry< next_retry)) > (gdb) info threads > * 1 Thread 0xb72f96c0 (LWP 22417) sess_retry_output (timer=0x92070c0) at session.c:532 > (gdb) where > #0 sess_retry_output (timer=0x92070c0) at session.c:532 > #1 session_retry (timer=0x92070c0) at session.c:565 > #2 0x08049aee in cld_timers_run (tlist=0x8056630) at ../lib/libtimer.c:95 > #3 0x0804e9cc in main_loop (argc=5, argv=0xbff70bd4) at server.c:983 > #4 main (argc=5, argv=0xbff70bd4) at server.c:1138 > > The crash happens because op is NULL. As it turned out, this happens > if a packet retransmit and a session expiration occur simultaneously > (in the same pass of timers_run). The scenario is: > - timers_run collects expired timers at exec list > - timers_run expires session > - two timer_del are called, but one of them is on exec list already, > so it's ineffective > - session is freed, this zeroes ->data in lists (later op) > - timers_run continues along the exec list, invokes the retransmission > callback, and that crashes with NULL op. > > The proposed solution is to rework the timers_run, again. But this > time, we'll make it simpler by observing that timers are ordered by > expiration time. Therefore, we can pull next timer off the list, > expire it, and loop until expiration time is greater than the current > time. No execution list is kept. The integrity of the main list > is assured by never walking it and always referring to the head > anew at each iteration. > > This patch appears to fix the problem and stands up to use that > crashed the old code. > > Signed-off-by: Pete Zaitcev > > --- > include/cld_common.h | 10 ++++++---- > lib/libtimer.c | 41 ++++++++++++++++++++--------------------- > 2 files changed, 26 insertions(+), 25 deletions(-) applied 1-4 Note that I change the email subject line prefix (which is normally copied by automated tools directly into git) from "CLD" to "libcldc", when committing to git. I am trying to follow (and encourage others to) the kernel's method of using a prefix to indicate the subsystem or section within the current git repo to which a change applies. To be fully friendly with automated tools, an ideal Project Hail subject line might read [cld patch 1/1] update Makefile.am or [tabled patch 1/1] update Makefile.am or [cld patch 1/1] libcldc: add nncld API, the new new CLD API Not a big deal, just noting what is most friendly to the git automated tools, when I'm importing each Hail patch. Jeff