Possible bug introduced in commit 9b84cca

linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* Possible bug introduced in commit 9b84cca
@ 2011-12-28 18:55 Denys Vlasenko
  2011-12-28 21:07 ` Denys Vlasenko
  2011-12-29 11:32 ` Oleg Nesterov
  0 siblings, 2 replies; 15+ messages in thread
From: Denys Vlasenko @ 2011-12-28 18:55 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Denys Vlasenko, Oleg Nesterov, linux-kernel, Łukasz Michalik,
	Dmitry V. Levin

[-- Attachment #1: Type: text/plain, Size: 1887 bytes --]

Hi Tejun, Oleg,

Apologies if you are already informed about this bug
by people who originally discovered it.

Looks like after commit 9b84cca, waitpid under strace
sometimes returns bogus ECHILD while child does exist.

I did not yet confirm that the bug appeared exactly
at this commit - Łukasz says that.

I confirmed that bug exists on kernels 3.1.6 (in Fedora)
and 3.1.0-rc4 (vanilla).

We have a testcase which spawns N threads, each of them
performs an infinite loop "fork, exit in child, waitpid
in parent for the child". When straced, sometimes waitpid
returns ECHILD. In fact, there is no need to run many threads -
I just saw it happening with single thread on 4-CPU machine
when I ran "strace -otestcase1.LOG -f ./testcase1 1".
This machine uses 3.1.0-rc4.

Please find testcase attached.

Also please find testcase1.LOG attached.
The key part is here:

931   clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0xf763dbd8) = 1048
1048  exit_group(42)                    = ?
931   waitpid(1048,  <unfinished ...>
1048  +++ exited with 42 +++
931   <... waitpid resumed> 0xf763d3a0, 0) = -1 ECHILD (No child processes)

To complicate matters, this is observed only under development
version of strace. Old (released) versions of strace do not
let ptraced processes to die - they detach from them when
they think they are going to die (such as when they enter _exit()
or receive a "deadly" signal). Which is a aesthetically horrible and
logically buggy (racy) hack, so we are removing it from strace.
Łukasz says that old strace versions (ones which still use the hack)
don't trigger the bug.

For testing, I will send you strace source tree and pre-compiled
strace binary in a separate email. Alternatively, pull latest
strace git and "autoreconf -fvi && ./configure && make" it.

-- 
vda

[-- Attachment #2: testcase1.c --]
[-- Type: text/x-csrc, Size: 948 bytes --]

#include <errno.h>
#include <stdlib.h>
#include <unistd.h>
#include <string.h>
#include <sys/sysinfo.h>
#include <sys/wait.h>
#include <stdio.h>
#include <pthread.h>

void* worker(void *arg)
{
	while (1) {
		pid_t p = fork();
		if (-1 == p) { /* error */
			perror("fork");
			_exit(EXIT_FAILURE);
		}
		if (0 == p) { /* child */
			_exit(42);
		}
		/* parent */
		int stat_loc;
		int s = waitpid(p, &stat_loc, 0);
		if (-1 == s) {
			perror("waitpid");
			_exit(EXIT_FAILURE);
		}
	}
}

int main(int argc, char **argv)
{
	int pool_size = get_nprocs() * 4;

	if (argv[1])
		pool_size = atoi(argv[1]);
	printf("Poolsize: %d\n", pool_size);

	pthread_t thread_id;
	int i;
	for (i = 0; i != pool_size; ++i) {
		if (pthread_create(&thread_id, NULL, worker, NULL) != 0) {
			perror("pthread_create");
			_exit(EXIT_FAILURE);
		}
	}

	/* Prevent exiting: wait for last thread (forever) */
	void *retval;
	pthread_join(thread_id, &retval);

	return 43;
}

[-- Attachment #3: testcase1.LOG.bz2 --]
[-- Type: application/x-bzip2, Size: 3172 bytes --]

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Possible bug introduced in commit 9b84cca
  2011-12-28 18:55 Possible bug introduced in commit 9b84cca Denys Vlasenko
@ 2011-12-28 21:07 ` Denys Vlasenko
  2011-12-28 21:23   ` Łukasz Michalik
  2011-12-29 11:32 ` Oleg Nesterov
  1 sibling, 1 reply; 15+ messages in thread
From: Denys Vlasenko @ 2011-12-28 21:07 UTC (permalink / raw)
  To: Denys Vlasenko
  Cc: Tejun Heo, Oleg Nesterov, linux-kernel, Łukasz Michalik,
	Dmitry V. Levin

On 12/28/2011 07:55 PM, Denys Vlasenko wrote:
> Hi Tejun, Oleg,
>
> Apologies if you are already informed about this bug
> by people who originally discovered it.
>
> Looks like after commit 9b84cca, waitpid under strace
> sometimes returns bogus ECHILD while child does exist.
>
> I did not yet confirm that the bug appeared exactly
> at this commit - Łukasz says that.
>
> I confirmed that bug exists on kernels 3.1.6 (in Fedora)
> and 3.1.0-rc4 (vanilla).

I tested it under a few more Fedora kernels.
On kernel-PAE-2.6.39-1.fc16.i686 bug does not trigger.
On next Fedora kernel, kernel-PAE-3.0-0.rc1.git0.2.fc16.i686,
it triggers.

And indeed, these kernels differ in kernel/exit.c::wait_consider_task()
only by commit 9b84cca.

Oleg, if you need to test it, I have the setup on my machine.

On Thu, Nov 17, 2011 at 10:25:26AM +0100, Łukasz Michalik wrote:
 > The bug only happens on recent linux.  I've bisected the kernel and
 > found that it was introduced by 9b84cca2564b9 [2] in linux.git, so
 > pretty much everything since 3.0-rc2 is affected.

Łukasz, looks like 3.0-rc1 is affected too.

-- 
vda

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Possible bug introduced in commit 9b84cca
  2011-12-28 21:07 ` Denys Vlasenko
@ 2011-12-28 21:23   ` Łukasz Michalik
  0 siblings, 0 replies; 15+ messages in thread
From: Łukasz Michalik @ 2011-12-28 21:23 UTC (permalink / raw)
  To: Denys Vlasenko
  Cc: Denys Vlasenko, Tejun Heo, Oleg Nesterov, linux-kernel,
	Dmitry V. Levin

[-- Attachment #1: Type: text/plain, Size: 524 bytes --]

On 22:07 2011-12-28 +0100, Denys Vlasenko wrote:
> On Thu, Nov 17, 2011 at 10:25:26AM +0100, Łukasz Michalik wrote:
> > The bug only happens on recent linux.  I've bisected the kernel and
> > found that it was introduced by 9b84cca2564b9 [2] in linux.git, so
> > pretty much everything since 3.0-rc2 is affected.
> 
> Łukasz, looks like 3.0-rc1 is affected too.
> 

Yeah, my git foo failed me back then.  3.0-rc1 is the the first
upstream tag that contains that change.

-- 
Pozdrawiam,
Łukasz P. Michalik

[-- Attachment #2: Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Possible bug introduced in commit 9b84cca
  2011-12-28 18:55 Possible bug introduced in commit 9b84cca Denys Vlasenko
  2011-12-28 21:07 ` Denys Vlasenko
@ 2011-12-29 11:32 ` Oleg Nesterov
  2011-12-29 12:05   ` Oleg Nesterov
  1 sibling, 1 reply; 15+ messages in thread
From: Oleg Nesterov @ 2011-12-29 11:32 UTC (permalink / raw)
  To: Denys Vlasenko
  Cc: Tejun Heo, Denys Vlasenko, linux-kernel, Łukasz Michalik,
	Dmitry V. Levin

On 12/28, Denys Vlasenko wrote:
>
> Looks like after commit 9b84cca, waitpid under strace
> sometimes returns bogus ECHILD while child does exist.
>
> I did not yet confirm that the bug appeared exactly
> at this commit - Łukasz says that.
>
> I confirmed that bug exists on kernels 3.1.6 (in Fedora)
> and 3.1.0-rc4 (vanilla).
>
> We have a testcase which spawns N threads, each of them
> performs an infinite loop "fork, exit in child, waitpid
> in parent for the child". When straced, sometimes waitpid
> returns ECHILD.

You mean, the natural parent gets ECHILD, not strace?

> The key part is here:
>
> 931   clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0xf763dbd8) = 1048
> 1048  exit_group(42)                    = ?
> 931   waitpid(1048,  <unfinished ...>
> 1048  +++ exited with 42 +++
> 931   <... waitpid resumed> 0xf763d3a0, 0) = -1 ECHILD (No child processes)

Argh. I seem to understand

I didn't check, but I think the offending commit is 823b018e5b1196d8
"job control: Small reorganization of wait_consider_task()".

ptracer sees EXIT_ZOMBIE and temporary sets EXIT_DEAD, this fools
the ->real_parent.

I need to think. The fix should be simple, but perhaps it is the
time to kill EXIT_DEAD altogether. I'll try to make the patch
after vacation. In the next year ;)

Thanks a lot Denys!

Oleg.


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Possible bug introduced in commit 9b84cca
  2011-12-29 11:32 ` Oleg Nesterov
@ 2011-12-29 12:05   ` Oleg Nesterov
  2012-01-03 14:29     ` Oleg Nesterov
  0 siblings, 1 reply; 15+ messages in thread
From: Oleg Nesterov @ 2011-12-29 12:05 UTC (permalink / raw)
  To: Denys Vlasenko
  Cc: Tejun Heo, Denys Vlasenko, linux-kernel, Łukasz Michalik,
	Dmitry V. Levin

On 12/29, Oleg Nesterov wrote:
>
> On 12/28, Denys Vlasenko wrote:
> >
> > Looks like after commit 9b84cca, waitpid under strace
> > sometimes returns bogus ECHILD while child does exist.
> >
> > I did not yet confirm that the bug appeared exactly
> > at this commit - Łukasz says that.
> >
> > I confirmed that bug exists on kernels 3.1.6 (in Fedora)
> > and 3.1.0-rc4 (vanilla).
> >
> > We have a testcase which spawns N threads, each of them
> > performs an infinite loop "fork, exit in child, waitpid
> > in parent for the child". When straced, sometimes waitpid
> > returns ECHILD.
>
> You mean, the natural parent gets ECHILD, not strace?
>
> > The key part is here:
> >
> > 931   clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0xf763dbd8) = 1048
> > 1048  exit_group(42)                    = ?
> > 931   waitpid(1048,  <unfinished ...>
> > 1048  +++ exited with 42 +++
> > 931   <... waitpid resumed> 0xf763d3a0, 0) = -1 ECHILD (No child processes)
>
> Argh. I seem to understand
>
> I didn't check, but I think the offending commit is 823b018e5b1196d8
> "job control: Small reorganization of wait_consider_task()".
>
> ptracer sees EXIT_ZOMBIE and temporary sets EXIT_DEAD, this fools
> the ->real_parent.
>
> I need to think. The fix should be simple, but perhaps it is the
> time to kill EXIT_DEAD altogether. I'll try to make the patch
> after vacation. In the next year ;)
>
> Thanks a lot Denys!

I've made the simple test-case, it triggers the bug.

Oleg.

#include <sys/ptrace.h>
#include <sys/wait.h>
#include <unistd.h>
#include <stdio.h>
#include <assert.h>

int main(void)
{
	int pid, status;

	pid = fork();
	if (!pid) {
		for (;;) {
			if (!fork())
				return 0x23;
			assert(waitpid(-1, &status, 0) > 0);
			assert(status == 0x2300);
		}
	}

	assert(ptrace(PTRACE_ATTACH, pid, 0,0) == 0);
	assert(waitpid(-1, NULL, 0) == pid);

	assert(ptrace(PTRACE_SETOPTIONS, pid, 0, PTRACE_O_TRACEFORK) == 0);

	for (;;) {
		ptrace(PTRACE_CONT, pid, 0, 0);
		pid = waitpid(-1, NULL, 0);
		assert(pid > 0);
	}

	return 0;
}


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Possible bug introduced in commit 9b84cca
  2011-12-29 12:05   ` Oleg Nesterov
@ 2012-01-03 14:29     ` Oleg Nesterov
  2012-01-03 15:44       ` ptrace fixes for 3.2 Oleg Nesterov
  0 siblings, 1 reply; 15+ messages in thread
From: Oleg Nesterov @ 2012-01-03 14:29 UTC (permalink / raw)
  To: Denys Vlasenko
  Cc: Tejun Heo, Denys Vlasenko, linux-kernel, Łukasz Michalik,
	Dmitry V. Levin

On 12/29, Oleg Nesterov wrote:
>
> On 12/29, Oleg Nesterov wrote:
> >
> > On 12/28, Denys Vlasenko wrote:
> > >
> > > 931   clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0xf763dbd8) = 1048
> > > 1048  exit_group(42)                    = ?
> > > 931   waitpid(1048,  <unfinished ...>
> > > 1048  +++ exited with 42 +++
> > > 931   <... waitpid resumed> 0xf763d3a0, 0) = -1 ECHILD (No child processes)
> >
> I've made the simple test-case, it triggers the bug.

I am going to send the hack^Wpatch below to Linus as a temporary
workaround for 3.2.

Any chance you can test it?

I did the testing too with my test-case, but it would be nice to
know if it works for you.

Thanks,

Oleg.

--- x/kernel/exit.c~	2011-11-22 18:46:22.000000000 +0100
+++ x/kernel/exit.c	2012-01-03 14:37:17.000000000 +0100
@@ -1540,8 +1540,11 @@ static int wait_consider_task(struct wai
 	}
 
 	/* dead body doesn't have much to contribute */
-	if (p->exit_state == EXIT_DEAD)
+	if (unlikely(p->exit_state == EXIT_DEAD)) {
+		if (likely(!ptrace) && unlikely(ptrace_reparented(p)))
+			wo->notask_error = 0;
 		return 0;
+	}
 
 	/* slay zombie? */
 	if (p->exit_state == EXIT_ZOMBIE) {


^ permalink raw reply	[flat|nested] 15+ messages in thread

* ptrace fixes for 3.2
  2012-01-03 14:29     ` Oleg Nesterov
@ 2012-01-03 15:44       ` Oleg Nesterov
  2012-01-03 16:30         ` Tejun Heo
  2012-01-03 16:51         ` Denys Vlasenko
  0 siblings, 2 replies; 15+ messages in thread
From: Oleg Nesterov @ 2012-01-03 15:44 UTC (permalink / raw)
  To: Denys Vlasenko, Tejun Heo
  Cc: Denys Vlasenko, linux-kernel, Łukasz Michalik,
	Dmitry V. Levin

On 01/03, Oleg Nesterov wrote:
>
> I am going to send the hack^Wpatch below to Linus as a temporary
> workaround for 3.2.

The same patch with the changelog.

Denys, Tejun, any chance you can review it before I send it to Linus ?

Also, I am going to send this one:

	http://marc.info/?l=linux-kernel&m=131705871825598&w=2

could you please take a look?

Oleg.

------------------------------------------------------------------------------
[PATCH] ptrace: partially fix the do_wait(WEXITED) vs EXIT_DEAD->EXIT_ZOMBIE race

Test-case:

	int main(void)
	{
		int pid, status;

		pid = fork();
		if (!pid) {
			for (;;) {
				if (!fork())
					return 0;
				if (waitpid(-1, &status, 0) < 0) {
					printf("ERR!! wait: %m\n");
					return 0;
				}
			}
		}

		assert(ptrace(PTRACE_ATTACH, pid, 0,0) == 0);
		assert(waitpid(-1, NULL, 0) == pid);

		assert(ptrace(PTRACE_SETOPTIONS, pid, 0, PTRACE_O_TRACEFORK) == 0);

		do {
			ptrace(PTRACE_CONT, pid, 0, 0);
			pid = waitpid(-1, NULL, 0);
		} while (pid > 0);

		return 1;
	}

It fails because ->real_parent sees its child in EXIT_DEAD state
while the tracer is going to change the state back to EXIT_ZOMBIE
in wait_task_zombie().

The offending commit is 823b018e which moved the EXIT_DEAD check,
but in fact we should not blame it. The original code was not
correct as well because it didn't take ptrace_reparented() into
account and because we can't really trust ->ptrace.

This patch adds the additional check to close this particular
race but it doesn't solve the whole problem. We simply can't
rely on ->ptrace in this case, it can be cleared if the tracer
is multithreaded by the exiting ->parent.

I think we should kill EXIT_DEAD altogether, we should always
remove the soon-to-be-reaped child from ->children or at least
we should never do the DEAD->ZOMBIE transition. But this is too
complex for 3.2.

Also, I think wait_consider_task() needs more fixes. I do not
think we should clear ->notask_error without WEXITED in this
case, but this is what we do in the EXIT_ZOMBIE case.

Reported-by: Denys Vlasenko <vda.linux@googlemail.com>
Cc: v3.0.. <stable@kernel.org>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
 kernel/exit.c |    9 ++++++++-
 1 files changed, 8 insertions(+), 1 deletions(-)

diff --git a/kernel/exit.c b/kernel/exit.c
index d0b7d98..e6e01b9 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -1540,8 +1540,15 @@ static int wait_consider_task(struct wait_opts *wo, int ptrace,
 	}

 	/* dead body doesn't have much to contribute */
-	if (p->exit_state == EXIT_DEAD)
+	if (unlikely(p->exit_state == EXIT_DEAD)) {
+		/*
+		 * But do not ignore this task until the tracer does
+		 * wait_task_zombie()->do_notify_parent().
+		 */
+		if (likely(!ptrace) && unlikely(ptrace_reparented(p)))
+			wo->notask_error = 0;
 		return 0;
+	}

 	/* slay zombie? */
 	if (p->exit_state == EXIT_ZOMBIE) {
-- 
1.5.5.1

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: ptrace fixes for 3.2
  2012-01-03 15:44       ` ptrace fixes for 3.2 Oleg Nesterov
@ 2012-01-03 16:30         ` Tejun Heo
  2012-01-03 17:09           ` Oleg Nesterov
  2012-01-03 16:51         ` Denys Vlasenko
  1 sibling, 1 reply; 15+ messages in thread
From: Tejun Heo @ 2012-01-03 16:30 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Denys Vlasenko, Denys Vlasenko, linux-kernel,
	Łukasz Michalik, Dmitry V. Levin

Hello, Oleg.

On Tue, Jan 03, 2012 at 04:44:04PM +0100, Oleg Nesterov wrote:
> It fails because ->real_parent sees its child in EXIT_DEAD state
> while the tracer is going to change the state back to EXIT_ZOMBIE
> in wait_task_zombie().

Argh.... EXIT_ZOMBIE -> DEAD -> ZOMBIE dancing in wait_task_zombie()
is just nasty.  Didn't realize it was doing that.  :(

> The offending commit is 823b018e which moved the EXIT_DEAD check,
> but in fact we should not blame it. The original code was not
> correct as well because it didn't take ptrace_reparented() into
> account and because we can't really trust ->ptrace.
> 
> This patch adds the additional check to close this particular
> race but it doesn't solve the whole problem. We simply can't
> rely on ->ptrace in this case, it can be cleared if the tracer
> is multithreaded by the exiting ->parent.

I'm not following this part.  Can you please explain it in a bit more
detail?

> I think we should kill EXIT_DEAD altogether, we should always
> remove the soon-to-be-reaped child from ->children or at least
> we should never do the DEAD->ZOMBIE transition. But this is too
> complex for 3.2.

Agreed.  Removing the reverse transition shouldn't be too difficult
and can be done without affecting fast non-ptrace path.  ie. if the
child is ptraced, drop readlock, grab writelock, recheck, buffer
states to copy out to userland, detach and transit to DEAD if
necessary.

> Also, I think wait_consider_task() needs more fixes. I do not
> think we should clear ->notask_error without WEXITED in this
> case, but this is what we do in the EXIT_ZOMBIE case.

Hmmm... I'm not sure about that.  Why do you think so?

> Reported-by: Denys Vlasenko <vda.linux@googlemail.com>
> Cc: v3.0.. <stable@kernel.org>
> Signed-off-by: Oleg Nesterov <oleg@redhat.com>

Anyways, the fix looks good to me.

 Acked-by: Tejun Heo <tj@kernel.org>

Thank you.

-- 
tejun

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: ptrace fixes for 3.2
  2012-01-03 15:44       ` ptrace fixes for 3.2 Oleg Nesterov
  2012-01-03 16:30         ` Tejun Heo
@ 2012-01-03 16:51         ` Denys Vlasenko
  2012-01-04  9:00           ` Łukasz Michalik
  1 sibling, 1 reply; 15+ messages in thread
From: Denys Vlasenko @ 2012-01-03 16:51 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Denys Vlasenko, Tejun Heo, linux-kernel, Łukasz Michalik,
	Dmitry V. Levin

On 01/03/2012 04:44 PM, Oleg Nesterov wrote:
> On 01/03, Oleg Nesterov wrote:
>>
>> I am going to send the hack^Wpatch below to Linus as a temporary
>> workaround for 3.2.
>
> The same patch with the changelog.
>
> Denys, Tejun, any chance you can review it before I send it to Linus ?

Testing it on top of Fedora 16's kernel-3.1.6-1.fc16:

Confirmed that before patch both tests fail (in about a minute of run time
at worst), and with patch they do not fail.

-- 
vda

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: ptrace fixes for 3.2
  2012-01-03 16:30         ` Tejun Heo
@ 2012-01-03 17:09           ` Oleg Nesterov
  2012-01-03 19:18             ` Tejun Heo
  0 siblings, 1 reply; 15+ messages in thread
From: Oleg Nesterov @ 2012-01-03 17:09 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Denys Vlasenko, Denys Vlasenko, linux-kernel,
	Łukasz Michalik, Dmitry V. Levin

Hi Tejun,

On 01/03, Tejun Heo wrote:
>
> On Tue, Jan 03, 2012 at 04:44:04PM +0100, Oleg Nesterov wrote:
> > It fails because ->real_parent sees its child in EXIT_DEAD state
> > while the tracer is going to change the state back to EXIT_ZOMBIE
> > in wait_task_zombie().
>
> Argh.... EXIT_ZOMBIE -> DEAD -> ZOMBIE dancing in wait_task_zombie()
> is just nasty.  Didn't realize it was doing that.  :(

We both missed this ;)

> > The offending commit is 823b018e which moved the EXIT_DEAD check,
> > but in fact we should not blame it. The original code was not
> > correct as well because it didn't take ptrace_reparented() into
> > account and because we can't really trust ->ptrace.
> >
> > This patch adds the additional check to close this particular
> > race but it doesn't solve the whole problem. We simply can't
> > rely on ->ptrace in this case, it can be cleared if the tracer
> > is multithreaded by the exiting ->parent.
>
> I'm not following this part.  Can you please explain it in a bit more
> detail?

Before 823b018e the code was:

	if (!ptrace && p->ptrace) {
		wo->notask_error = 0;
		return 0;
	}

	if (p->exit_state == EXIT_DEAD)
		return 0;

There are 2 problems:

	1. it is not correct to clear ->notask_error unless
	   this child is ptrace_reparented(). Nobody will
	   wakeup us if EXIT_DEAD was set by our sub-thread.

	2. We can not rely on ->ptrace to detect this case.

	   Suppose that the tracer is multithreaded, it has
	   two threads T1 and T2, T1 traces our child.

	   - T2 does do_wait(WEXITED), sets EXIT_DEAD, drops
	     tasklist_lock.

	   - T1 exits and does __ptrace_detach(), this means
	     __ptrace_unlink() and nothing more.

	   - Now, real_parent does do_wait() and sees the
	     EXIT_DEAD child but ->ptrace = 0.

	   - finally T2 sets EXIT_DEAD but it is too late,

The patch doesn't solve the 2nd (btw very old) problem. Fortunately
this race is very unlikely.

> > I think we should kill EXIT_DEAD altogether, we should always
> > remove the soon-to-be-reaped child from ->children or at least
> > we should never do the DEAD->ZOMBIE transition. But this is too
> > complex for 3.2.
>
> Agreed.  Removing the reverse transition shouldn't be too difficult
> and can be done without affecting fast non-ptrace path.  ie. if the
> child is ptraced, drop readlock, grab writelock, recheck, buffer
> states to copy out to userland, detach and transit to DEAD if
> necessary.

Yes.

> > Also, I think wait_consider_task() needs more fixes. I do not
> > think we should clear ->notask_error without WEXITED in this
> > case, but this is what we do in the EXIT_ZOMBIE case.
>
> Hmmm... I'm not sure about that.  Why do you think so?

I am not sure too. But why do_wait() should sleep if it is called
without WEXITED (lets ignore WCONTINUED) and the child is ZOMBIE?
I think it should return -ECHILD, like it does if the child is not
traced.

IOW. Suppose we have a single EXIT_ZOMBIE child. If it is not traced,
do_wait(WSTOPPED) returns -ECHILD. If the child is traced (by another
process) do_wait() sleeps until detach just to return the same error.
This looks a bit strange.

>  Acked-by: Tejun Heo <tj@kernel.org>

Great, thanks.

Oleg.


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: ptrace fixes for 3.2
  2012-01-03 17:09           ` Oleg Nesterov
@ 2012-01-03 19:18             ` Tejun Heo
  2012-01-04 11:35               ` Oleg Nesterov
  0 siblings, 1 reply; 15+ messages in thread
From: Tejun Heo @ 2012-01-03 19:18 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Denys Vlasenko, Denys Vlasenko, linux-kernel,
	Łukasz Michalik, Dmitry V. Levin

Hello,

On Tue, Jan 03, 2012 at 06:09:27PM +0100, Oleg Nesterov wrote:
> > > The offending commit is 823b018e which moved the EXIT_DEAD check,
> > > but in fact we should not blame it. The original code was not
> > > correct as well because it didn't take ptrace_reparented() into
> > > account and because we can't really trust ->ptrace.
> > >
> > > This patch adds the additional check to close this particular
> > > race but it doesn't solve the whole problem. We simply can't
> > > rely on ->ptrace in this case, it can be cleared if the tracer
> > > is multithreaded by the exiting ->parent.
> >
> > I'm not following this part.  Can you please explain it in a bit more
> > detail?
> 
> Before 823b018e the code was:
> 
> 	if (!ptrace && p->ptrace) {
> 		wo->notask_error = 0;
> 		return 0;
> 	}
> 
> 	if (p->exit_state == EXIT_DEAD)
> 		return 0;
> 
> There are 2 problems:
> 
> 	1. it is not correct to clear ->notask_error unless
> 	   this child is ptrace_reparented(). Nobody will
> 	   wakeup us if EXIT_DEAD was set by our sub-thread.

Yeah, if that happens between normal and ptrace wait_consider_task()
calls.

> 	2. We can not rely on ->ptrace to detect this case.
> 
> 	   Suppose that the tracer is multithreaded, it has
> 	   two threads T1 and T2, T1 traces our child.
> 
> 	   - T2 does do_wait(WEXITED), sets EXIT_DEAD, drops
> 	     tasklist_lock.
> 
> 	   - T1 exits and does __ptrace_detach(), this means
> 	     __ptrace_unlink() and nothing more.
> 
> 	   - Now, real_parent does do_wait() and sees the
> 	     EXIT_DEAD child but ->ptrace = 0.
> 
> 	   - finally T2 sets EXIT_DEAD but it is too late,

You mean EXIT_ZOMBIE here, right?  So, to rephrase, ZOMBIE -> DEAD ->
ZOMBIE dancing may race against tracer detaching.  Urgh... why do we
even allow tasks which aren't the direct tracer to wait for ptraced
children anyway when ptrace ops are restricted to the ptracing task?

> > > Also, I think wait_consider_task() needs more fixes. I do not
> > > think we should clear ->notask_error without WEXITED in this
> > > case, but this is what we do in the EXIT_ZOMBIE case.
> >
> > Hmmm... I'm not sure about that.  Why do you think so?
> 
> I am not sure too. But why do_wait() should sleep if it is called
> without WEXITED (lets ignore WCONTINUED) and the child is ZOMBIE?
> I think it should return -ECHILD, like it does if the child is not
> traced.
> 
> IOW. Suppose we have a single EXIT_ZOMBIE child. If it is not traced,
> do_wait(WSTOPPED) returns -ECHILD. If the child is traced (by another
> process) do_wait() sleeps until detach just to return the same error.
> This looks a bit strange.

I don't think it really matter either way.  To me, both behaviors seem
understandable and I don't think the current behavior would cause any
real problem.

Thanks a lot for the explanation.

-- 
tejun

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: ptrace fixes for 3.2
  2012-01-03 16:51         ` Denys Vlasenko
@ 2012-01-04  9:00           ` Łukasz Michalik
  0 siblings, 0 replies; 15+ messages in thread
From: Łukasz Michalik @ 2012-01-04  9:00 UTC (permalink / raw)
  To: Denys Vlasenko
  Cc: Oleg Nesterov, Denys Vlasenko, Tejun Heo, linux-kernel,
	Dmitry V. Levin

[-- Attachment #1: Type: text/plain, Size: 719 bytes --]

On 17:51 2012-01-03 +0100, Denys Vlasenko wrote:
> On 01/03/2012 04:44 PM, Oleg Nesterov wrote:
> >On 01/03, Oleg Nesterov wrote:
> >>
> >>I am going to send the hack^Wpatch below to Linus as a temporary
> >>workaround for 3.2.
> >
> >The same patch with the changelog.
> >
> >Denys, Tejun, any chance you can review it before I send it to Linus ?
> 
> Testing it on top of Fedora 16's kernel-3.1.6-1.fc16:
> 
> Confirmed that before patch both tests fail (in about a minute of run time
> at worst), and with patch they do not fail.
> 

Tested on top vanilla's HEAD and stable 3.1.7, with both testcases and
both strace and sydbox.  Confirming same as Denys.

-- 
Pozdrawiam,
Łukasz P. Michalik

[-- Attachment #2: Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: ptrace fixes for 3.2
  2012-01-03 19:18             ` Tejun Heo
@ 2012-01-04 11:35               ` Oleg Nesterov
  2012-01-04 15:31                 ` Tejun Heo
  0 siblings, 1 reply; 15+ messages in thread
From: Oleg Nesterov @ 2012-01-04 11:35 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Denys Vlasenko, Denys Vlasenko, linux-kernel,
	Łukasz Michalik, Dmitry V. Levin

Hi,

On 01/03, Tejun Heo wrote:
>
> On Tue, Jan 03, 2012 at 06:09:27PM +0100, Oleg Nesterov wrote:
>
> > 	2. We can not rely on ->ptrace to detect this case.
> >
> > 	   Suppose that the tracer is multithreaded, it has
> > 	   two threads T1 and T2, T1 traces our child.
> >
> > 	   - T2 does do_wait(WEXITED), sets EXIT_DEAD, drops
> > 	     tasklist_lock.
> >
> > 	   - T1 exits and does __ptrace_detach(), this means
> > 	     __ptrace_unlink() and nothing more.
> >
> > 	   - Now, real_parent does do_wait() and sees the
> > 	     EXIT_DEAD child but ->ptrace = 0.
> >
> > 	   - finally T2 sets EXIT_DEAD but it is too late,
>
> You mean EXIT_ZOMBIE here, right?

Yes, typo...

> So, to rephrase, ZOMBIE -> DEAD ->
> ZOMBIE dancing may race against tracer detaching.  Urgh... why do we
> even allow tasks which aren't the direct tracer to wait for ptraced
> children anyway when ptrace ops are restricted to the ptracing task?

If only I knew ;) I am not even sure this is by design... OTOH, it is
not clear how we should restrict do_wait() if the tracee is not
ptrace_reparented().

Anyway we can't change this.

> > I am not sure too. But why do_wait() should sleep if it is called
> > without WEXITED (lets ignore WCONTINUED) and the child is ZOMBIE?
> > I think it should return -ECHILD, like it does if the child is not
> > traced.
> >
> > IOW. Suppose we have a single EXIT_ZOMBIE child. If it is not traced,
> > do_wait(WSTOPPED) returns -ECHILD. If the child is traced (by another
> > process) do_wait() sleeps until detach just to return the same error.
> > This looks a bit strange.
>
> I don't think it really matter either way.  To me, both behaviors seem
> understandable and I don't think the current behavior would cause any
> real problem.

Yes I agree this is minor, surely not a bug.

OK, in any case this has nothing to do with this fix, I'll remove this
sentence from the changelog to avoid the confusion.

Tejun, do you see any problem with the 2nd fix below?

Oleg.

------------------------------------------------------------------------------
Can't understand how did we miss this, but WARN_ON_ONCE(!ptrace)
in do_signal_stop() is not right. Debugger can resume the stopped
task, and it can clone the _untraced_ thread running in the stopped
group.

And even if the new thread is auto-attached, we have the problems
with JOBCTL_STOP_SIGMASK.

I do not want to discuss the "proper" solution here. I think the
necessary changes are simple, but I do not think this is the 3.1
material, and 3.1 needs some trivial fix.

What do you think about the patch below? I am going to send it to
Linus unless you see something wrong.

And I'd like to explain why I prefer to add the (temporary) hack
into __ptrace_unlink() instead of adding

	if (unlikely(ptrace) && current->ptrace) {
		...
		child->jobctl = current->jobctl & JOBCTL_STOP_SIGMASK;
		...
	}

into ptrace_init_task(). I think that jobctl & JOBCTL_STOP_SIGMASK"
should be cleanuped. Look, once we set JOBCTL_STOP_SIGMASK we never
clear it. Yes, this doesn't really matter but this can hide an error,
for example this can "fool" the WARN_ON(!signr) in do_jobctl_trap().
Imho, at least SIGCONT should clear this part of ->jobctl. IOW, it
should be non-zero only if/when it has effect.

That is why I do not want to change ptrace_init_task() until we
decide what should we do with CLONE_THREAD && SIGNAL_STOP_STOPPED,
to avoid the bogus (jobctl & JOBCTL_STOP_SIGMASK) != 0. IOW I prefer
the change in __ptrace_unlink() to catch the other possible problems
like this.

Also, I'd like to defend 6634ae10
"ptrace_init_task: initialize child->jobctl explicitly" which can
be blamed for the 2nd problem. Afaics, this change is really needed
and it fixes the bug. The changelog says "Currently this is harmless"
but this is not right, dup_task_struct() can happen between SIGSTOP and
SIGCONT, in this case the child can have the wrong JOBCTL_STOP_PENDING.

So, what do you think?

Oleg.

-------------------------------------------------------------------------------
[PATCH for 3.1] ptrace_unlink: ensure JOBCTL_STOP_SIGMASK is not zero

This is the temporary simple fix for 3.1, we need more changes in this
area.

1. do_signal_stop() assumes that the running untraced thread in the
   stopped thread group is not possible. This was our goal but it is
   not yet achieved: a stopped-but-resumed tracee can clone the running
   thread which can initiate another group-stop.

   Remove WARN_ON_ONCE(!current->ptrace).

2. A new thread always starts with ->jobctl = 0. If it is auto-attached
   and this group is stopped, __ptrace_unlink() sets JOBCTL_STOP_PENDING
   but JOBCTL_STOP_SIGMASK part is zero, this triggers WANR_ON(!signr)
   in do_jobctl_trap() if another debugger attaches.

   Change __ptrace_unlink() to set the artificial SIGSTOP for report.

   Alternatively we could change ptrace_init_task() to copy signr from
   current, but this means we can copy it for no reason and hide the
   possible similar problems.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---

 kernel/signal.c |    2 --
 kernel/ptrace.c |   13 ++++++++++++-
 2 files changed, 12 insertions(+), 3 deletions(-)

--- 3.1/kernel/signal.c~2_traced_stop_and_clone	2011-09-25 19:14:32.000000000 +0200
+++ 3.1/kernel/signal.c	2011-09-25 19:53:54.000000000 +0200
@@ -1986,8 +1986,6 @@ static bool do_signal_stop(int signr)
 		 */
 		if (!(sig->flags & SIGNAL_STOP_STOPPED))
 			sig->group_exit_code = signr;
-		else
-			WARN_ON_ONCE(!current->ptrace);

 		sig->group_stop_count = 0;

--- 3.1/kernel/ptrace.c~2_traced_stop_and_clone	2011-09-25 19:40:57.000000000 +0200
+++ 3.1/kernel/ptrace.c	2011-09-25 20:05:25.000000000 +0200
@@ -96,9 +96,20 @@ void __ptrace_unlink(struct task_struct 
 	 */
 	if (!(child->flags & PF_EXITING) &&
 	    (child->signal->flags & SIGNAL_STOP_STOPPED ||
-	     child->signal->group_stop_count))
+	     child->signal->group_stop_count)) {
 		child->jobctl |= JOBCTL_STOP_PENDING;

+		/*
+		 * This is only possible if this thread was cloned by the
+		 * traced task running in the stopped group, set the signal
+		 * for the future reports.
+		 * FIXME: we should change ptrace_init_task() to handle this
+		 * case.
+		 */
+		if (!(child->jobctl & JOBCTL_STOP_SIGMASK))
+			child->jobctl |= SIGSTOP;
+	}
+
 	/*
 	 * If transition to TASK_STOPPED is pending or in TASK_TRACED, kick
 	 * @child in the butt.  Note that @resume should be used iff @child

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: ptrace fixes for 3.2
  2012-01-04 11:35               ` Oleg Nesterov
@ 2012-01-04 15:31                 ` Tejun Heo
  2012-01-04 15:59                   ` Oleg Nesterov
  0 siblings, 1 reply; 15+ messages in thread
From: Tejun Heo @ 2012-01-04 15:31 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Denys Vlasenko, Denys Vlasenko, linux-kernel,
	Łukasz Michalik, Dmitry V. Levin

Hello,

On Wed, Jan 04, 2012 at 12:35:34PM +0100, Oleg Nesterov wrote:
> Can't understand how did we miss this, but WARN_ON_ONCE(!ptrace)
> in do_signal_stop() is not right. Debugger can resume the stopped
> task, and it can clone the _untraced_ thread running in the stopped
> group.

Right, we should be setting JOBCTL_STOP_PENDING for newly cloned tasks
if sigstop is in effect.

> And even if the new thread is auto-attached, we have the problems
> with JOBCTL_STOP_SIGMASK.
> 
> I do not want to discuss the "proper" solution here. I think the
> necessary changes are simple, but I do not think this is the 3.1
> material, and 3.1 needs some trivial fix.

Yeah, sure thing.

> What do you think about the patch below? I am going to send it to
> Linus unless you see something wrong.
> 
> And I'd like to explain why I prefer to add the (temporary) hack
> into __ptrace_unlink() instead of adding
> 
> 	if (unlikely(ptrace) && current->ptrace) {
> 		...
> 		child->jobctl = current->jobctl & JOBCTL_STOP_SIGMASK;
> 		...
> 	}
> 
> into ptrace_init_task(). I think that jobctl & JOBCTL_STOP_SIGMASK"
> should be cleanuped. Look, once we set JOBCTL_STOP_SIGMASK we never
> clear it. Yes, this doesn't really matter but this can hide an error,
> for example this can "fool" the WARN_ON(!signr) in do_jobctl_trap().
> Imho, at least SIGCONT should clear this part of ->jobctl. IOW, it
> should be non-zero only if/when it has effect.

I don't recall the details but there was somewhing convoluted about
clearing the signal number.  There's some non-obvious case where the
signr is used after what appears to be the apparent lifespan.  I
*think* we already talked about this but could be imagining things.
But, yeah, sure, if we can do it without complicating stuff, why not?

> That is why I do not want to change ptrace_init_task() until we
> decide what should we do with CLONE_THREAD && SIGNAL_STOP_STOPPED,
> to avoid the bogus (jobctl & JOBCTL_STOP_SIGMASK) != 0. IOW I prefer
> the change in __ptrace_unlink() to catch the other possible problems
> like this.

As long as it gets fixed properly in the next devel cycle, I don't
have any objection.

> Also, I'd like to defend 6634ae10
> "ptrace_init_task: initialize child->jobctl explicitly" which can
> be blamed for the 2nd problem. Afaics, this change is really needed
> and it fixes the bug. The changelog says "Currently this is harmless"
> but this is not right, dup_task_struct() can happen between SIGSTOP and
> SIGCONT, in this case the child can have the wrong JOBCTL_STOP_PENDING.
> 
> So, what do you think?

Looks good to me, provided proper fix is coming soon. :p

> -------------------------------------------------------------------------------
> [PATCH for 3.1] ptrace_unlink: ensure JOBCTL_STOP_SIGMASK is not zero
> 
> This is the temporary simple fix for 3.1, we need more changes in this
> area.
> 
> 1. do_signal_stop() assumes that the running untraced thread in the
>    stopped thread group is not possible. This was our goal but it is
>    not yet achieved: a stopped-but-resumed tracee can clone the running
>    thread which can initiate another group-stop.
> 
>    Remove WARN_ON_ONCE(!current->ptrace).
> 
> 2. A new thread always starts with ->jobctl = 0. If it is auto-attached
>    and this group is stopped, __ptrace_unlink() sets JOBCTL_STOP_PENDING
>    but JOBCTL_STOP_SIGMASK part is zero, this triggers WANR_ON(!signr)
>    in do_jobctl_trap() if another debugger attaches.
> 
>    Change __ptrace_unlink() to set the artificial SIGSTOP for report.
> 
>    Alternatively we could change ptrace_init_task() to copy signr from
>    current, but this means we can copy it for no reason and hide the
>    possible similar problems.
> 
> Signed-off-by: Oleg Nesterov <oleg@redhat.com>

 Acked-by: Tejun Heo <tj@kernel.org>

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: ptrace fixes for 3.2
  2012-01-04 15:31                 ` Tejun Heo
@ 2012-01-04 15:59                   ` Oleg Nesterov
  0 siblings, 0 replies; 15+ messages in thread
From: Oleg Nesterov @ 2012-01-04 15:59 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Denys Vlasenko, Denys Vlasenko, linux-kernel,
	Łukasz Michalik, Dmitry V. Levin

Hello,

On 01/04, Tejun Heo wrote:
>
> On Wed, Jan 04, 2012 at 12:35:34PM +0100, Oleg Nesterov wrote:
> > Can't understand how did we miss this, but WARN_ON_ONCE(!ptrace)
> > in do_signal_stop() is not right. Debugger can resume the stopped
> > task, and it can clone the _untraced_ thread running in the stopped
> > group.
>
> Right, we should be setting JOBCTL_STOP_PENDING for newly cloned tasks
> if sigstop is in effect.

Yes, probably this is the natural choice.

> Looks good to me, provided proper fix is coming soon. :p

Good.

Yes, this is the temporary fix for 3.2.

Thanks!

Oleg.


^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2012-01-04 16:05 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2011-12-28 18:55 Possible bug introduced in commit 9b84cca Denys Vlasenko
2011-12-28 21:07 ` Denys Vlasenko
2011-12-28 21:23   ` Łukasz Michalik
2011-12-29 11:32 ` Oleg Nesterov
2011-12-29 12:05   ` Oleg Nesterov
2012-01-03 14:29     ` Oleg Nesterov
2012-01-03 15:44       ` ptrace fixes for 3.2 Oleg Nesterov
2012-01-03 16:30         ` Tejun Heo
2012-01-03 17:09           ` Oleg Nesterov
2012-01-03 19:18             ` Tejun Heo
2012-01-04 11:35               ` Oleg Nesterov
2012-01-04 15:31                 ` Tejun Heo
2012-01-04 15:59                   ` Oleg Nesterov
2012-01-03 16:51         ` Denys Vlasenko
2012-01-04  9:00           ` Łukasz Michalik

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).