All of lore.kernel.org
 help / color / mirror / Atom feed
* [BUG, TEST PATCH] stallout race between SIGCONT and SIGSTOP
@ 2008-09-23 15:53 Joe Korty
  2008-09-23 16:35 ` Oleg Nesterov
  0 siblings, 1 reply; 7+ messages in thread
From: Joe Korty @ 2008-09-23 15:53 UTC (permalink / raw)
  To: Oleg Nesterov; +Cc: Roland McGrath, Jiri Kosina, Andrew Morton, linux-kernel

Since 2.6.25-git16, the Open POSIX Test Suite test sigaction/10-1 on
occasion stalls out.  A ^C breaks the test out of the stall.

To see the problem, one must run the test in a loop.  The stallout happens
anywhere from 3 to approximately 60 iterations.  To make the test runtime
more bearable, I've been using a custom version that is 8x faster than
the original, s/sleep/usleep/g + new sleep constants.

The test in essence does 10 SIGSTOPs and SIGCONTs, interleaved, with a
short delay between each SIGSTOP and SIGCONT, but none (other than the
small delay of a printf) between each SIGCONT and SIGSTOP:

    for(i=0; i<10; i++) {
	printf("--> Sending SIGSTOP #%d\n", i);
	kill (pid, SIGSTOP);
	usleep(125000);
	printf("--> Sending SIGCONT #%d\n", i);
	kill (pid, SIGCONT);
	// usleep(125000); /* this is missing from the real 10-1 */
    }

When the above commented-out usleep is enabled, the stallout disappears.
If instead of adding a usleep, the printf's are removed, the test stalls
out immediately.  Therefore the problem has something to do with a SIGSTOP
being issued 'too soon' after the issuance of a SIGCONT.

Bisection shows that the problem was introduced by

    commit e442055193e4584218006e616c9bdce0c5e9ae5c
    Author: Oleg Nesterov <oleg@tv-sign.ru>
    Date:   Wed Apr 30 00:52:44 2008 -0700

This commit adds code that solves serious race problems by deferring the
actual processing of SIGSTOP and SIGCONT to a later time.  I suspect it
is this deferring that is making SIGCONT sensitive to a SIGSTOP coming
in too close on its heels.

The following patch, not to be considered seriously, lends credence to
that theory.  It reverts a bit of the above commit, forcing SIGCONT (but
not SIGSTOP) to be processed immediately.  With this patch, I achieved
some 1,400 runs before manually stopping the test.

Signed-off-by: Joe Korty <joe.korty@ccur.com>

Index: 2.6.27-rc6-git4/kernel/signal.c
===================================================================
--- 2.6.27-rc6-git4.orig/kernel/signal.c	2008-09-17 17:42:35.000000000 -0400
+++ 2.6.27-rc6-git4/kernel/signal.c	2008-09-22 16:07:48.000000000 -0400
@@ -598,6 +598,8 @@
 	return security_task_kill(t, info, sig, 0);
 }
 
+static void do_notify_parent_cldstop(struct task_struct *tsk, int why);
+
 /*
  * Handle magic process-wide effects of stop/continue signals. Unlike
  * the signal actions, these happen immediately at signal-generation
@@ -682,6 +684,10 @@
 			signal->flags = why | SIGNAL_STOP_CONTINUED;
 			signal->group_stop_count = 0;
 			signal->group_exit_code = 0;
+			signal->flags &= ~SIGNAL_CLD_MASK;
+			spin_unlock(&p->sighand->siglock);
+			do_notify_parent_cldstop(p, CLD_CONTINUED);
+			spin_lock(&p->sighand->siglock);
 		} else {
 			/*
 			 * We are not stopped, but there could be a stop

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [BUG, TEST PATCH] stallout race between SIGCONT and SIGSTOP
  2008-09-23 15:53 [BUG, TEST PATCH] stallout race between SIGCONT and SIGSTOP Joe Korty
@ 2008-09-23 16:35 ` Oleg Nesterov
  2008-09-24 15:05   ` Oleg Nesterov
  0 siblings, 1 reply; 7+ messages in thread
From: Oleg Nesterov @ 2008-09-23 16:35 UTC (permalink / raw)
  To: Joe Korty; +Cc: Roland McGrath, Jiri Kosina, Andrew Morton, linux-kernel

Sorry! I have to run avay right now, and I will be completely offline
tomorrow. I'll return on Thursday.

On 09/23, Joe Korty wrote:
>
> Since 2.6.25-git16, the Open POSIX Test Suite test sigaction/10-1 on
> occasion stalls out.  A ^C breaks the test out of the stall.
>
> To see the problem, one must run the test in a loop.  The stallout happens
> anywhere from 3 to approximately 60 iterations.  To make the test runtime
> more bearable, I've been using a custom version that is 8x faster than
> the original, s/sleep/usleep/g + new sleep constants.
>
> The test in essence does 10 SIGSTOPs and SIGCONTs, interleaved, with a
> short delay between each SIGSTOP and SIGCONT, but none (other than the
> small delay of a printf) between each SIGCONT and SIGSTOP:
>
>     for(i=0; i<10; i++) {
> 	printf("--> Sending SIGSTOP #%d\n", i);
> 	kill (pid, SIGSTOP);
> 	usleep(125000);
> 	printf("--> Sending SIGCONT #%d\n", i);
> 	kill (pid, SIGCONT);
> 	// usleep(125000); /* this is missing from the real 10-1 */
>     }
>
> When the above commented-out usleep is enabled, the stallout disappears.
> If instead of adding a usleep, the printf's are removed, the test stalls
> out immediately.

Could you clarify? Do you mean that the task hangs in sys_kill() ?

Better yet, to avoid a possible confusion, could you please send me
the (modified) source code to re-produce the stall ?

> Therefore the problem has something to do with a SIGSTOP
> being issued 'too soon' after the issuance of a SIGCONT.
>
> Bisection shows that the problem was introduced by
>
>     commit e442055193e4584218006e616c9bdce0c5e9ae5c
>     Author: Oleg Nesterov <oleg@tv-sign.ru>
>     Date:   Wed Apr 30 00:52:44 2008 -0700
>
> This commit adds code that solves serious race problems by deferring the
> actual processing of SIGSTOP and SIGCONT to a later time.  I suspect it
> is this deferring that is making SIGCONT sensitive to a SIGSTOP coming
> in too close on its heels.
>
> The following patch, not to be considered seriously,

Yes, the patch is not for production, but thanks a lot! I am sure it will
help to diagnose the problem.

Thanks Joe!

Oleg.


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [BUG, TEST PATCH] stallout race between SIGCONT and SIGSTOP
  2008-09-23 16:35 ` Oleg Nesterov
@ 2008-09-24 15:05   ` Oleg Nesterov
  2008-09-24 15:19     ` Joe Korty
  0 siblings, 1 reply; 7+ messages in thread
From: Oleg Nesterov @ 2008-09-24 15:05 UTC (permalink / raw)
  To: Joe Korty; +Cc: Roland McGrath, Jiri Kosina, Andrew Morton, linux-kernel

On 09/23, Oleg Nesterov wrote:
>
> I will be completely offline tomorrow.

Cancelled...

> Better yet, to avoid a possible confusion, could you please send me
> the (modified) source code to re-produce the stall ?

Thanks Joe for the test-case and for your comments.

Joe says:
>
> So it looks like the test is in error, not the kernel.

and I am happy to agree.

I think sigaction/10-1.c should be fixed, please see the patch below.

In essence the code does:

	for (1 .. 10) {
		kill(pid, SIGSTOP);
		wait_until_we_receive_CLD_STOPPED(); // <---- HANGS
		kill(pid, SIGCONT);
	}

We can look at the code from the different angle, this loop does

		kill(pid, SIGCONT);
		kill(pid, SIGSTOP);
		wait_until_we_receive_CLD_STOPPED(); // <---- HANGS

and it hangs because CLD_CONTINUED (which is ignored by the SIGCHLD
handler) masks CLD_STOPPED.

And yes, the mentioned patch makes the difference. To clarify, it
does not defer the "result" of SIGCONT, it only defers the sending
of SIGCHLD with .si_code == CLD_CONTINUED.

So, what happens is:

	kill(SIGCONT) does its job, CLD_CONTINUED will be sent a
	bit later.

	kill(SIGSTOP) starts another group-stop.

	the child sends CLD_CONTINUED, and before the parent deques
	the signal, it stops and sends CLD_STOPPED.

Now, since SIGCHLD is a non-rt signal, the second SIGCHLD is lost,
and wait_until_we_receive_CLD_STOPPED() hangs.

I did the test patch to be sure:

	--- 26-rc2/kernel/signal.c~	2008-09-20 20:37:52.000000000 +0400
	+++ 26-rc2/kernel/signal.c	2008-09-24 18:43:34.000000000 +0400
	@@ -808,7 +808,7 @@ static int send_signal(int sig, struct s
		 * exactly one non-rt signal, so that we can get more
		 * detailed information about the cause of the signal.
		 */
	-	if (legacy_queue(pending, sig))
	+	if (sig != SIGCHLD && legacy_queue(pending, sig))
			return 0;
		/*
		 * fast-pathed signals for kernel-internal things like SIGSTOP

and now your test-case doesn't hang.

Oleg.

--- tmp/posixtestsuite/conformance/interfaces/sigaction/10-1.c~	2003-04-15 01:18:58.000000000 +0400
+++ tmp/posixtestsuite/conformance/interfaces/sigaction/10-1.c	2008-09-24 18:01:23.000000000 +0400
@@ -19,19 +19,41 @@
 #define NUMSTOPS 10
 
 int child_stopped = 0;
-int waiting = 1;
+int child_continued = 0;
+int notification;
 
-void handler(int signo, siginfo_t *info, void *context) 
+void handler(int signo, siginfo_t *info, void *context)
 {
-	if (info && info->si_code == CLD_STOPPED) {
+	if (!info)
+		return;
+
+	notification = info->si_code;
+
+	switch (notification) {
+	case CLD_STOPPED:
 		printf("Child has been stopped\n");
-		waiting = 0;
 		child_stopped++;
+		break;
+	case CLD_CONTINUED:
+		printf("Child has been continued\n");
+		child_continued++;
+		break;
 	}
 }
 
+void wait_for_notification(int val)
+{
+	struct timeval tv;
+
+	while (notification != val) {
+		tv.tv_sec = 1;
+		tv.tv_usec = 0;
+		if (!select(0, NULL, NULL, NULL, &tv))
+			break;
+	}
+}
 
-int main()
+int main(void)
 {
 	pid_t pid;
 	struct sigaction act;
@@ -40,7 +62,7 @@ int main()
 	act.sa_sigaction = handler;
 	act.sa_flags = SA_SIGINFO;
 	sigemptyset(&act.sa_mask);
-	sigaction(SIGCHLD,  &act, 0);     
+	sigaction(SIGCHLD,  &act, 0);
 
 	if ((pid = fork()) == 0) {
 		/* child */
@@ -54,37 +76,36 @@ int main()
 		return 0;
 	} else {
 		/* parent */
-		int s; 		
+		int s;
 		int i;
 
 		for (i = 0; i < NUMSTOPS; i++) {
-			waiting = 1;
-
 			printf("--> Sending SIGSTOP\n");
+			notification = 0;
 			kill(pid, SIGSTOP);
 
 			/*
 			  Don't let the kernel optimize away queued
 			  SIGSTOP/SIGCONT signals.
 			*/
-			while (waiting) {
-				tv.tv_sec = 1;
-				tv.tv_usec = 0;
-				if (!select(0, NULL, NULL, NULL, &tv))
-				  break;
-			}
-
+			wait_for_notification(CLD_STOPPED);
 
 			printf("--> Sending SIGCONT\n");
+			notification = 0;
 			kill(pid, SIGCONT);
+			/*
+			  SIGCHLD doesn't queue, make sure CLD_CONTINUED
+			  doesn't mask the next CLD_STOPPED
+			*/
+			wait_for_notification(CLD_CONTINUED);
 		}
-		
+
 		/* POSIX specifies default action to be abnormal termination */
 		kill(pid, SIGHUP);
 		waitpid(pid, &s, 0);
 	}
 
-	if (child_stopped == NUMSTOPS) {
+	if (child_stopped == NUMSTOPS && child_continued == NUMSTOPS) {
 		printf("Test PASSED\n");
 		return 0;
 	}
@@ -92,4 +113,3 @@ int main()
 	printf("Test FAILED\n");
 	return -1;
 }
-


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [BUG, TEST PATCH] stallout race between SIGCONT and SIGSTOP
  2008-09-24 15:05   ` Oleg Nesterov
@ 2008-09-24 15:19     ` Joe Korty
  2008-09-24 15:56       ` Oleg Nesterov
  0 siblings, 1 reply; 7+ messages in thread
From: Joe Korty @ 2008-09-24 15:19 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Roland McGrath, Jiri Kosina, Andrew Morton,
	linux-kernel@vger.kernel.org

On Wed, Sep 24, 2008 at 11:05:41AM -0400, Oleg Nesterov wrote:
> Joe says:
>> So it looks like the test is in error, not the kernel.
> 
> and I am happy to agree.
> I think sigaction/10-1.c should be fixed, please see the patch below.

A year or two ago I sent to Intel some OpenPosixTestSuite fixes, and they
were accepted.  Send it in (to the people listed in the comments at the
front of the .c file), hopefully they are still at Intel.


> Now, since SIGCHLD is a non-rt signal, the second SIGCHLD is lost,
> and wait_until_we_receive_CLD_STOPPED() hangs.
> 
> I did the test patch to be sure:
> 
>         --- 26-rc2/kernel/signal.c~     2008-09-20 20:37:52.000000000 +0400
>         +++ 26-rc2/kernel/signal.c      2008-09-24 18:43:34.000000000 +0400
>         @@ -808,7 +808,7 @@ static int send_signal(int sig, struct s
>                  * exactly one non-rt signal, so that we can get more
>                  * detailed information about the cause of the signal.
>                  */
>         -       if (legacy_queue(pending, sig))
>         +       if (sig != SIGCHLD && legacy_queue(pending, sig))
>                         return 0;
>                 /*
>                  * fast-pathed signals for kernel-internal things like SIGSTOP
> 
> and now your test-case doesn't hang.

Very interesting!  I am not sure this is Posix conformant, as Posix
seems to say that posting a SIGSTOP or SIGCHLD clears out all pending
SIGSTOPs or SIGCHLDs, so queueing the SIGCHLD might violate the standard.
Still it might be workable as it would be hard to see, from userspace, any
behavioral difference between queueing (possibly illegal) and synchronous
operation (which seems legal), both of which service all SIGCONTs and
SIGSTOPs without loss.

Joe

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [BUG, TEST PATCH] stallout race between SIGCONT and SIGSTOP
  2008-09-24 15:19     ` Joe Korty
@ 2008-09-24 15:56       ` Oleg Nesterov
  2008-09-24 17:11         ` Joe Korty
  0 siblings, 1 reply; 7+ messages in thread
From: Oleg Nesterov @ 2008-09-24 15:56 UTC (permalink / raw)
  To: Joe Korty
  Cc: Roland McGrath, Jiri Kosina, Andrew Morton,
	linux-kernel@vger.kernel.org

On 09/24, Joe Korty wrote:
>
> On Wed, Sep 24, 2008 at 11:05:41AM -0400, Oleg Nesterov wrote:
> > Joe says:
> >> So it looks like the test is in error, not the kernel.
> >
> > and I am happy to agree.
> > I think sigaction/10-1.c should be fixed, please see the patch below.
>
> A year or two ago I sent to Intel some OpenPosixTestSuite fixes, and they
> were accepted.  Send it in (to the people listed in the comments at the
> front of the .c file), hopefully they are still at Intel.

OK, thanks, will do.

> > I did the test patch to be sure:
> >
> >         --- 26-rc2/kernel/signal.c~     2008-09-20 20:37:52.000000000 +0400
> >         +++ 26-rc2/kernel/signal.c      2008-09-24 18:43:34.000000000 +0400
> >         @@ -808,7 +808,7 @@ static int send_signal(int sig, struct s
> >                  * exactly one non-rt signal, so that we can get more
> >                  * detailed information about the cause of the signal.
> >                  */
> >         -       if (legacy_queue(pending, sig))
> >         +       if (sig != SIGCHLD && legacy_queue(pending, sig))
> >                         return 0;
> >                 /*
> >                  * fast-pathed signals for kernel-internal things like SIGSTOP
> >
> > and now your test-case doesn't hang.
>
> Very interesting!  I am not sure this is Posix conformant,

No, no, the patch is of course wrong, I did it only to check my
understanding.

> as Posix
> seems to say that posting a SIGSTOP or SIGCHLD clears out all pending
> SIGSTOPs or SIGCHLDs,

Hmm. Are you sure?

Anyway, this is not what Linux does. If a non-rt signal is pending, the
next signal with the same number is silently ignored. SIGCHLD too.

> Still it might be workable

Confused. Do you agree the kernel is not buggy?

To clarify, none of SIGCONTs/SIGSTOPs is lost. But the test-case assumes
that it must always receive SIGCHLD + CLD_STOPPED. This is not true because
SIGCHLD is not queueable, and we have another "stream" of SIGCHLDs which
carry CLD_CONTINUED.

For example, the "opposite" code

	kill(SIGSTOP);
	kill(SIGCONT);
	wait_for_CLD_CONTINUED();

was always wrong, but

	kill(SIGCONT);
	kill(SIGSTOP);
	wait_for_CLD_STOPPED();

happened to work before that commit. But please note that it is wrong
anyway. For example, if we have another sub-thread, we can miss
CLD_STOPPED even without the commit which changed the timing.

Oleg.


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [BUG, TEST PATCH] stallout race between SIGCONT and SIGSTOP
  2008-09-24 15:56       ` Oleg Nesterov
@ 2008-09-24 17:11         ` Joe Korty
  2008-09-24 17:34           ` Oleg Nesterov
  0 siblings, 1 reply; 7+ messages in thread
From: Joe Korty @ 2008-09-24 17:11 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Roland McGrath, Jiri Kosina, Andrew Morton,
	linux-kernel@vger.kernel.org

On Wed, Sep 24, 2008 at 11:56:11AM -0400, Oleg Nesterov wrote:
> Confused. Do you agree the kernel is not buggy?

I agree with you that what the kernel is doing now is correct.
Joe

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [BUG, TEST PATCH] stallout race between SIGCONT and SIGSTOP
  2008-09-24 17:11         ` Joe Korty
@ 2008-09-24 17:34           ` Oleg Nesterov
  0 siblings, 0 replies; 7+ messages in thread
From: Oleg Nesterov @ 2008-09-24 17:34 UTC (permalink / raw)
  To: Joe Korty
  Cc: Roland McGrath, Jiri Kosina, Andrew Morton,
	linux-kernel@vger.kernel.org

On 09/24, Joe Korty wrote:
>
> On Wed, Sep 24, 2008 at 11:56:11AM -0400, Oleg Nesterov wrote:
> > Confused. Do you agree the kernel is not buggy?
>
> I agree with you that what the kernel is doing now is correct.

Great. Thanks a lot for the report and discussion! And thanks
for your analysis (sent privately), the problem was completely
explained by you.

BTW, I forgot to show the test-case you sent me privately, perhaps
someone from CC might want to look... It really helped.

Oleg.

#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/select.h>
#include <sys/wait.h>
#include <sys/time.h>
#include <sys/types.h>
#include <unistd.h>

#define NUMSTOPS 10

int child_stopped = 0;
int waiting = 1;

void handler(int signo, siginfo_t *info, void *context) 
{
	if (info && info->si_code == CLD_STOPPED) {
		printf("Child has been stopped\n");
		waiting = 0;
		child_stopped++;
	}
}


int main(void)
{
	pid_t pid;
	struct sigaction act;
	struct timeval tv;

	act.sa_sigaction = handler;
	act.sa_flags = SA_SIGINFO;
	sigemptyset(&act.sa_mask);
	sigaction(SIGCHLD,  &act, 0);     

	setbuf(stdout, NULL);
	printf("%d SIGSTOP/SIGCONT combinations to be sent.\n", NUMSTOPS);

	if ((pid = fork()) == 0) {
		/* child */
		while(1) {
			/* wait forever, or until we are 
			   interrupted by a signal */
			tv.tv_sec = 0;
			tv.tv_usec = 0;
			select(0, NULL, NULL, NULL, &tv);
		}
		return 0;
	} else {
		/* parent */
		int s; 		
		int i;

		usleep(12000);
		for (i = 0; i < NUMSTOPS; i++) {
			waiting = 1;

			printf("--> Sending SIGSTOP #%d\n", i);
			kill(pid, SIGSTOP);

			/*
			  Don't let the kernel optimize away queued
			  SIGSTOP/SIGCONT signals.
			*/
			while (waiting)
				usleep(12000);

			printf("--> Sending SIGCONT\n");
			kill(pid, SIGCONT);
			// usleep(12000);
		}

		/* POSIX specifies default action to be abnormal termination */
		usleep(12000);
		kill(pid, SIGKILL);
		waitpid(pid, &s, 0);
	}

	if (child_stopped == NUMSTOPS) {
		printf("Test PASSED\n");
		return 0;
	}

	printf("Test FAILED\n");
	return -1;
}


^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2008-09-24 17:29 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2008-09-23 15:53 [BUG, TEST PATCH] stallout race between SIGCONT and SIGSTOP Joe Korty
2008-09-23 16:35 ` Oleg Nesterov
2008-09-24 15:05   ` Oleg Nesterov
2008-09-24 15:19     ` Joe Korty
2008-09-24 15:56       ` Oleg Nesterov
2008-09-24 17:11         ` Joe Korty
2008-09-24 17:34           ` Oleg Nesterov

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.