* [BUG, TEST PATCH] stallout race between SIGCONT and SIGSTOP
@ 2008-09-23 15:53 Joe Korty
2008-09-23 16:35 ` Oleg Nesterov
0 siblings, 1 reply; 7+ messages in thread
From: Joe Korty @ 2008-09-23 15:53 UTC (permalink / raw)
To: Oleg Nesterov; +Cc: Roland McGrath, Jiri Kosina, Andrew Morton, linux-kernel
Since 2.6.25-git16, the Open POSIX Test Suite test sigaction/10-1 on
occasion stalls out. A ^C breaks the test out of the stall.
To see the problem, one must run the test in a loop. The stallout happens
anywhere from 3 to approximately 60 iterations. To make the test runtime
more bearable, I've been using a custom version that is 8x faster than
the original, s/sleep/usleep/g + new sleep constants.
The test in essence does 10 SIGSTOPs and SIGCONTs, interleaved, with a
short delay between each SIGSTOP and SIGCONT, but none (other than the
small delay of a printf) between each SIGCONT and SIGSTOP:
for(i=0; i<10; i++) {
printf("--> Sending SIGSTOP #%d\n", i);
kill (pid, SIGSTOP);
usleep(125000);
printf("--> Sending SIGCONT #%d\n", i);
kill (pid, SIGCONT);
// usleep(125000); /* this is missing from the real 10-1 */
}
When the above commented-out usleep is enabled, the stallout disappears.
If instead of adding a usleep, the printf's are removed, the test stalls
out immediately. Therefore the problem has something to do with a SIGSTOP
being issued 'too soon' after the issuance of a SIGCONT.
Bisection shows that the problem was introduced by
commit e442055193e4584218006e616c9bdce0c5e9ae5c
Author: Oleg Nesterov <oleg@tv-sign.ru>
Date: Wed Apr 30 00:52:44 2008 -0700
This commit adds code that solves serious race problems by deferring the
actual processing of SIGSTOP and SIGCONT to a later time. I suspect it
is this deferring that is making SIGCONT sensitive to a SIGSTOP coming
in too close on its heels.
The following patch, not to be considered seriously, lends credence to
that theory. It reverts a bit of the above commit, forcing SIGCONT (but
not SIGSTOP) to be processed immediately. With this patch, I achieved
some 1,400 runs before manually stopping the test.
Signed-off-by: Joe Korty <joe.korty@ccur.com>
Index: 2.6.27-rc6-git4/kernel/signal.c
===================================================================
--- 2.6.27-rc6-git4.orig/kernel/signal.c 2008-09-17 17:42:35.000000000 -0400
+++ 2.6.27-rc6-git4/kernel/signal.c 2008-09-22 16:07:48.000000000 -0400
@@ -598,6 +598,8 @@
return security_task_kill(t, info, sig, 0);
}
+static void do_notify_parent_cldstop(struct task_struct *tsk, int why);
+
/*
* Handle magic process-wide effects of stop/continue signals. Unlike
* the signal actions, these happen immediately at signal-generation
@@ -682,6 +684,10 @@
signal->flags = why | SIGNAL_STOP_CONTINUED;
signal->group_stop_count = 0;
signal->group_exit_code = 0;
+ signal->flags &= ~SIGNAL_CLD_MASK;
+ spin_unlock(&p->sighand->siglock);
+ do_notify_parent_cldstop(p, CLD_CONTINUED);
+ spin_lock(&p->sighand->siglock);
} else {
/*
* We are not stopped, but there could be a stop
^ permalink raw reply [flat|nested] 7+ messages in thread* Re: [BUG, TEST PATCH] stallout race between SIGCONT and SIGSTOP 2008-09-23 15:53 [BUG, TEST PATCH] stallout race between SIGCONT and SIGSTOP Joe Korty @ 2008-09-23 16:35 ` Oleg Nesterov 2008-09-24 15:05 ` Oleg Nesterov 0 siblings, 1 reply; 7+ messages in thread From: Oleg Nesterov @ 2008-09-23 16:35 UTC (permalink / raw) To: Joe Korty; +Cc: Roland McGrath, Jiri Kosina, Andrew Morton, linux-kernel Sorry! I have to run avay right now, and I will be completely offline tomorrow. I'll return on Thursday. On 09/23, Joe Korty wrote: > > Since 2.6.25-git16, the Open POSIX Test Suite test sigaction/10-1 on > occasion stalls out. A ^C breaks the test out of the stall. > > To see the problem, one must run the test in a loop. The stallout happens > anywhere from 3 to approximately 60 iterations. To make the test runtime > more bearable, I've been using a custom version that is 8x faster than > the original, s/sleep/usleep/g + new sleep constants. > > The test in essence does 10 SIGSTOPs and SIGCONTs, interleaved, with a > short delay between each SIGSTOP and SIGCONT, but none (other than the > small delay of a printf) between each SIGCONT and SIGSTOP: > > for(i=0; i<10; i++) { > printf("--> Sending SIGSTOP #%d\n", i); > kill (pid, SIGSTOP); > usleep(125000); > printf("--> Sending SIGCONT #%d\n", i); > kill (pid, SIGCONT); > // usleep(125000); /* this is missing from the real 10-1 */ > } > > When the above commented-out usleep is enabled, the stallout disappears. > If instead of adding a usleep, the printf's are removed, the test stalls > out immediately. Could you clarify? Do you mean that the task hangs in sys_kill() ? Better yet, to avoid a possible confusion, could you please send me the (modified) source code to re-produce the stall ? > Therefore the problem has something to do with a SIGSTOP > being issued 'too soon' after the issuance of a SIGCONT. > > Bisection shows that the problem was introduced by > > commit e442055193e4584218006e616c9bdce0c5e9ae5c > Author: Oleg Nesterov <oleg@tv-sign.ru> > Date: Wed Apr 30 00:52:44 2008 -0700 > > This commit adds code that solves serious race problems by deferring the > actual processing of SIGSTOP and SIGCONT to a later time. I suspect it > is this deferring that is making SIGCONT sensitive to a SIGSTOP coming > in too close on its heels. > > The following patch, not to be considered seriously, Yes, the patch is not for production, but thanks a lot! I am sure it will help to diagnose the problem. Thanks Joe! Oleg. ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [BUG, TEST PATCH] stallout race between SIGCONT and SIGSTOP 2008-09-23 16:35 ` Oleg Nesterov @ 2008-09-24 15:05 ` Oleg Nesterov 2008-09-24 15:19 ` Joe Korty 0 siblings, 1 reply; 7+ messages in thread From: Oleg Nesterov @ 2008-09-24 15:05 UTC (permalink / raw) To: Joe Korty; +Cc: Roland McGrath, Jiri Kosina, Andrew Morton, linux-kernel On 09/23, Oleg Nesterov wrote: > > I will be completely offline tomorrow. Cancelled... > Better yet, to avoid a possible confusion, could you please send me > the (modified) source code to re-produce the stall ? Thanks Joe for the test-case and for your comments. Joe says: > > So it looks like the test is in error, not the kernel. and I am happy to agree. I think sigaction/10-1.c should be fixed, please see the patch below. In essence the code does: for (1 .. 10) { kill(pid, SIGSTOP); wait_until_we_receive_CLD_STOPPED(); // <---- HANGS kill(pid, SIGCONT); } We can look at the code from the different angle, this loop does kill(pid, SIGCONT); kill(pid, SIGSTOP); wait_until_we_receive_CLD_STOPPED(); // <---- HANGS and it hangs because CLD_CONTINUED (which is ignored by the SIGCHLD handler) masks CLD_STOPPED. And yes, the mentioned patch makes the difference. To clarify, it does not defer the "result" of SIGCONT, it only defers the sending of SIGCHLD with .si_code == CLD_CONTINUED. So, what happens is: kill(SIGCONT) does its job, CLD_CONTINUED will be sent a bit later. kill(SIGSTOP) starts another group-stop. the child sends CLD_CONTINUED, and before the parent deques the signal, it stops and sends CLD_STOPPED. Now, since SIGCHLD is a non-rt signal, the second SIGCHLD is lost, and wait_until_we_receive_CLD_STOPPED() hangs. I did the test patch to be sure: --- 26-rc2/kernel/signal.c~ 2008-09-20 20:37:52.000000000 +0400 +++ 26-rc2/kernel/signal.c 2008-09-24 18:43:34.000000000 +0400 @@ -808,7 +808,7 @@ static int send_signal(int sig, struct s * exactly one non-rt signal, so that we can get more * detailed information about the cause of the signal. */ - if (legacy_queue(pending, sig)) + if (sig != SIGCHLD && legacy_queue(pending, sig)) return 0; /* * fast-pathed signals for kernel-internal things like SIGSTOP and now your test-case doesn't hang. Oleg. --- tmp/posixtestsuite/conformance/interfaces/sigaction/10-1.c~ 2003-04-15 01:18:58.000000000 +0400 +++ tmp/posixtestsuite/conformance/interfaces/sigaction/10-1.c 2008-09-24 18:01:23.000000000 +0400 @@ -19,19 +19,41 @@ #define NUMSTOPS 10 int child_stopped = 0; -int waiting = 1; +int child_continued = 0; +int notification; -void handler(int signo, siginfo_t *info, void *context) +void handler(int signo, siginfo_t *info, void *context) { - if (info && info->si_code == CLD_STOPPED) { + if (!info) + return; + + notification = info->si_code; + + switch (notification) { + case CLD_STOPPED: printf("Child has been stopped\n"); - waiting = 0; child_stopped++; + break; + case CLD_CONTINUED: + printf("Child has been continued\n"); + child_continued++; + break; } } +void wait_for_notification(int val) +{ + struct timeval tv; + + while (notification != val) { + tv.tv_sec = 1; + tv.tv_usec = 0; + if (!select(0, NULL, NULL, NULL, &tv)) + break; + } +} -int main() +int main(void) { pid_t pid; struct sigaction act; @@ -40,7 +62,7 @@ int main() act.sa_sigaction = handler; act.sa_flags = SA_SIGINFO; sigemptyset(&act.sa_mask); - sigaction(SIGCHLD, &act, 0); + sigaction(SIGCHLD, &act, 0); if ((pid = fork()) == 0) { /* child */ @@ -54,37 +76,36 @@ int main() return 0; } else { /* parent */ - int s; + int s; int i; for (i = 0; i < NUMSTOPS; i++) { - waiting = 1; - printf("--> Sending SIGSTOP\n"); + notification = 0; kill(pid, SIGSTOP); /* Don't let the kernel optimize away queued SIGSTOP/SIGCONT signals. */ - while (waiting) { - tv.tv_sec = 1; - tv.tv_usec = 0; - if (!select(0, NULL, NULL, NULL, &tv)) - break; - } - + wait_for_notification(CLD_STOPPED); printf("--> Sending SIGCONT\n"); + notification = 0; kill(pid, SIGCONT); + /* + SIGCHLD doesn't queue, make sure CLD_CONTINUED + doesn't mask the next CLD_STOPPED + */ + wait_for_notification(CLD_CONTINUED); } - + /* POSIX specifies default action to be abnormal termination */ kill(pid, SIGHUP); waitpid(pid, &s, 0); } - if (child_stopped == NUMSTOPS) { + if (child_stopped == NUMSTOPS && child_continued == NUMSTOPS) { printf("Test PASSED\n"); return 0; } @@ -92,4 +113,3 @@ int main() printf("Test FAILED\n"); return -1; } - ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [BUG, TEST PATCH] stallout race between SIGCONT and SIGSTOP 2008-09-24 15:05 ` Oleg Nesterov @ 2008-09-24 15:19 ` Joe Korty 2008-09-24 15:56 ` Oleg Nesterov 0 siblings, 1 reply; 7+ messages in thread From: Joe Korty @ 2008-09-24 15:19 UTC (permalink / raw) To: Oleg Nesterov Cc: Roland McGrath, Jiri Kosina, Andrew Morton, linux-kernel@vger.kernel.org On Wed, Sep 24, 2008 at 11:05:41AM -0400, Oleg Nesterov wrote: > Joe says: >> So it looks like the test is in error, not the kernel. > > and I am happy to agree. > I think sigaction/10-1.c should be fixed, please see the patch below. A year or two ago I sent to Intel some OpenPosixTestSuite fixes, and they were accepted. Send it in (to the people listed in the comments at the front of the .c file), hopefully they are still at Intel. > Now, since SIGCHLD is a non-rt signal, the second SIGCHLD is lost, > and wait_until_we_receive_CLD_STOPPED() hangs. > > I did the test patch to be sure: > > --- 26-rc2/kernel/signal.c~ 2008-09-20 20:37:52.000000000 +0400 > +++ 26-rc2/kernel/signal.c 2008-09-24 18:43:34.000000000 +0400 > @@ -808,7 +808,7 @@ static int send_signal(int sig, struct s > * exactly one non-rt signal, so that we can get more > * detailed information about the cause of the signal. > */ > - if (legacy_queue(pending, sig)) > + if (sig != SIGCHLD && legacy_queue(pending, sig)) > return 0; > /* > * fast-pathed signals for kernel-internal things like SIGSTOP > > and now your test-case doesn't hang. Very interesting! I am not sure this is Posix conformant, as Posix seems to say that posting a SIGSTOP or SIGCHLD clears out all pending SIGSTOPs or SIGCHLDs, so queueing the SIGCHLD might violate the standard. Still it might be workable as it would be hard to see, from userspace, any behavioral difference between queueing (possibly illegal) and synchronous operation (which seems legal), both of which service all SIGCONTs and SIGSTOPs without loss. Joe ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [BUG, TEST PATCH] stallout race between SIGCONT and SIGSTOP 2008-09-24 15:19 ` Joe Korty @ 2008-09-24 15:56 ` Oleg Nesterov 2008-09-24 17:11 ` Joe Korty 0 siblings, 1 reply; 7+ messages in thread From: Oleg Nesterov @ 2008-09-24 15:56 UTC (permalink / raw) To: Joe Korty Cc: Roland McGrath, Jiri Kosina, Andrew Morton, linux-kernel@vger.kernel.org On 09/24, Joe Korty wrote: > > On Wed, Sep 24, 2008 at 11:05:41AM -0400, Oleg Nesterov wrote: > > Joe says: > >> So it looks like the test is in error, not the kernel. > > > > and I am happy to agree. > > I think sigaction/10-1.c should be fixed, please see the patch below. > > A year or two ago I sent to Intel some OpenPosixTestSuite fixes, and they > were accepted. Send it in (to the people listed in the comments at the > front of the .c file), hopefully they are still at Intel. OK, thanks, will do. > > I did the test patch to be sure: > > > > --- 26-rc2/kernel/signal.c~ 2008-09-20 20:37:52.000000000 +0400 > > +++ 26-rc2/kernel/signal.c 2008-09-24 18:43:34.000000000 +0400 > > @@ -808,7 +808,7 @@ static int send_signal(int sig, struct s > > * exactly one non-rt signal, so that we can get more > > * detailed information about the cause of the signal. > > */ > > - if (legacy_queue(pending, sig)) > > + if (sig != SIGCHLD && legacy_queue(pending, sig)) > > return 0; > > /* > > * fast-pathed signals for kernel-internal things like SIGSTOP > > > > and now your test-case doesn't hang. > > Very interesting! I am not sure this is Posix conformant, No, no, the patch is of course wrong, I did it only to check my understanding. > as Posix > seems to say that posting a SIGSTOP or SIGCHLD clears out all pending > SIGSTOPs or SIGCHLDs, Hmm. Are you sure? Anyway, this is not what Linux does. If a non-rt signal is pending, the next signal with the same number is silently ignored. SIGCHLD too. > Still it might be workable Confused. Do you agree the kernel is not buggy? To clarify, none of SIGCONTs/SIGSTOPs is lost. But the test-case assumes that it must always receive SIGCHLD + CLD_STOPPED. This is not true because SIGCHLD is not queueable, and we have another "stream" of SIGCHLDs which carry CLD_CONTINUED. For example, the "opposite" code kill(SIGSTOP); kill(SIGCONT); wait_for_CLD_CONTINUED(); was always wrong, but kill(SIGCONT); kill(SIGSTOP); wait_for_CLD_STOPPED(); happened to work before that commit. But please note that it is wrong anyway. For example, if we have another sub-thread, we can miss CLD_STOPPED even without the commit which changed the timing. Oleg. ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [BUG, TEST PATCH] stallout race between SIGCONT and SIGSTOP 2008-09-24 15:56 ` Oleg Nesterov @ 2008-09-24 17:11 ` Joe Korty 2008-09-24 17:34 ` Oleg Nesterov 0 siblings, 1 reply; 7+ messages in thread From: Joe Korty @ 2008-09-24 17:11 UTC (permalink / raw) To: Oleg Nesterov Cc: Roland McGrath, Jiri Kosina, Andrew Morton, linux-kernel@vger.kernel.org On Wed, Sep 24, 2008 at 11:56:11AM -0400, Oleg Nesterov wrote: > Confused. Do you agree the kernel is not buggy? I agree with you that what the kernel is doing now is correct. Joe ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [BUG, TEST PATCH] stallout race between SIGCONT and SIGSTOP 2008-09-24 17:11 ` Joe Korty @ 2008-09-24 17:34 ` Oleg Nesterov 0 siblings, 0 replies; 7+ messages in thread From: Oleg Nesterov @ 2008-09-24 17:34 UTC (permalink / raw) To: Joe Korty Cc: Roland McGrath, Jiri Kosina, Andrew Morton, linux-kernel@vger.kernel.org On 09/24, Joe Korty wrote: > > On Wed, Sep 24, 2008 at 11:56:11AM -0400, Oleg Nesterov wrote: > > Confused. Do you agree the kernel is not buggy? > > I agree with you that what the kernel is doing now is correct. Great. Thanks a lot for the report and discussion! And thanks for your analysis (sent privately), the problem was completely explained by you. BTW, I forgot to show the test-case you sent me privately, perhaps someone from CC might want to look... It really helped. Oleg. #include <signal.h> #include <stdio.h> #include <stdlib.h> #include <sys/select.h> #include <sys/wait.h> #include <sys/time.h> #include <sys/types.h> #include <unistd.h> #define NUMSTOPS 10 int child_stopped = 0; int waiting = 1; void handler(int signo, siginfo_t *info, void *context) { if (info && info->si_code == CLD_STOPPED) { printf("Child has been stopped\n"); waiting = 0; child_stopped++; } } int main(void) { pid_t pid; struct sigaction act; struct timeval tv; act.sa_sigaction = handler; act.sa_flags = SA_SIGINFO; sigemptyset(&act.sa_mask); sigaction(SIGCHLD, &act, 0); setbuf(stdout, NULL); printf("%d SIGSTOP/SIGCONT combinations to be sent.\n", NUMSTOPS); if ((pid = fork()) == 0) { /* child */ while(1) { /* wait forever, or until we are interrupted by a signal */ tv.tv_sec = 0; tv.tv_usec = 0; select(0, NULL, NULL, NULL, &tv); } return 0; } else { /* parent */ int s; int i; usleep(12000); for (i = 0; i < NUMSTOPS; i++) { waiting = 1; printf("--> Sending SIGSTOP #%d\n", i); kill(pid, SIGSTOP); /* Don't let the kernel optimize away queued SIGSTOP/SIGCONT signals. */ while (waiting) usleep(12000); printf("--> Sending SIGCONT\n"); kill(pid, SIGCONT); // usleep(12000); } /* POSIX specifies default action to be abnormal termination */ usleep(12000); kill(pid, SIGKILL); waitpid(pid, &s, 0); } if (child_stopped == NUMSTOPS) { printf("Test PASSED\n"); return 0; } printf("Test FAILED\n"); return -1; } ^ permalink raw reply [flat|nested] 7+ messages in thread
end of thread, other threads:[~2008-09-24 17:29 UTC | newest] Thread overview: 7+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2008-09-23 15:53 [BUG, TEST PATCH] stallout race between SIGCONT and SIGSTOP Joe Korty 2008-09-23 16:35 ` Oleg Nesterov 2008-09-24 15:05 ` Oleg Nesterov 2008-09-24 15:19 ` Joe Korty 2008-09-24 15:56 ` Oleg Nesterov 2008-09-24 17:11 ` Joe Korty 2008-09-24 17:34 ` Oleg Nesterov
This is an external index of several public inboxes, see mirroring instructions on how to clone and mirror all data and code used by this external index.