All of lore.kernel.org
 help / color / mirror / Atom feed
From: Andrew Morton <akpm@linux-foundation.org>
To: ebiederm@xmission.com (Eric W. Biederman)
Cc: Oleg Nesterov <oleg@redhat.com>,
	LKML <linux-kernel@vger.kernel.org>,
	Pavel Emelyanov <xemul@parallels.com>,
	Cyrill Gorcunov <gorcunov@openvz.org>,
	Louis Rilling <louis.rilling@kerlabs.com>,
	Mike Galbraith <efault@gmx.de>
Subject: Re: [PATCH] pidns: Guarantee that the pidns init will be the last pidns process reaped. v2
Date: Tue, 22 May 2012 12:23:15 -0700	[thread overview]
Message-ID: <20120522122315.c3f2118c.akpm@linux-foundation.org> (raw)
In-Reply-To: <87d35x5ank.fsf_-_@xmission.com>

On Mon, 21 May 2012 18:20:31 -0600
ebiederm@xmission.com (Eric W. Biederman) wrote:

> 
> Today we have a two-fold bug.  Sometimes release_task on pid == 1 in a
> pid namespace can run before other processes in a pid namespace have had
> release task called.  With the result that pid_ns_release_proc can be
> called before the last proc_flus_task() is done using
> upid->ns->proc_mnt, resulting in the use of a stale pointer.  This same
> set of circumstances can lead to waitpid(...) returning for a processes
> started with clone(CLONE_NEWPID) before the every process in the pid
> namespace has actually exited.
> 
> To fix this modify zap_pid_ns_processess wait until all other processes
> in the pid namespace have exited, even EXIT_DEAD zombies.
> 
> The delay_group_leader and related tests ensure that the thread gruop
> leader will be the last thread of a process group to be reaped, or to
> become EXIT_DEAD and self reap.  With the change to zap_pid_ns_processes
> we get the guarantee that pid == 1 in a pid namespace will be the last
> task that release_task is called on.
> 
> With pid == 1 being the last task to pass through release_task
> pid_ns_release_proc can no longer be called too early nor can wait
> return before all of the EXIT_DEAD tasks in a pid namespace have exited.
> 
> ...
>
> index d8bd3b42..abc4fc0 100644
> --- a/kernel/exit.c
> +++ b/kernel/exit.c
> @@ -64,15 +64,26 @@ static void exit_mm(struct task_struct * tsk);
>  static void __unhash_process(struct task_struct *p, bool group_dead)
>  {
>  	nr_threads--;
> -	detach_pid(p, PIDTYPE_PID);
>  	if (group_dead) {
> +		struct task_struct *parent;
> +
>  		detach_pid(p, PIDTYPE_PGID);
>  		detach_pid(p, PIDTYPE_SID);
>  
>  		list_del_rcu(&p->tasks);
>  		list_del_init(&p->sibling);
>  		__this_cpu_dec(process_counts);
> +
> +		/* If we are the last child process in a pid namespace

like this:
		/*
		 * If ...

> +		 * to be reaped notify the child_reaper.

s/reaped/reaped,/

More seriously, it isn't a very good comment.  It tells us "what" the
code is doing (which is pretty obvious from reading it), but it didn't
tell us "why" it is doing this.  Why do PID namespaces need special
handling here?  What's the backstory??

> +		 */
> +		parent = p->real_parent;
> +		if ((task_active_pid_ns(p)->child_reaper == parent) &&
> +		    list_empty(&parent->children) &&
> +		    (parent->flags & PF_EXITING))
> +			wake_up_process(parent);
>  	}
> +	detach_pid(p, PIDTYPE_PID);
>  	list_del_rcu(&p->thread_group);
>  }
>  
> diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c
> index b98b0ed..ba1cbb8 100644
> --- a/kernel/pid_namespace.c
> +++ b/kernel/pid_namespace.c
> @@ -189,6 +189,17 @@ void zap_pid_ns_processes(struct pid_namespace *pid_ns)
>  		rc = sys_wait4(-1, NULL, __WALL, NULL);
>  	} while (rc != -ECHILD);
>  
> +	read_lock(&tasklist_lock);
> +	for (;;) {
> +		__set_current_state(TASK_UNINTERRUPTIBLE);
> +		if (list_empty(&current->children))
> +			break;
> +		read_unlock(&tasklist_lock);
> +		schedule();
> +		read_lock(&tasklist_lock);
> +	}
> +	read_unlock(&tasklist_lock);

Well.

a) This loop can leave the thread in state TASK_UNINTERRUPTIBLE,
   which looks wrong.

b) Given that the waking side is also testing list_empty(), I think
   you might need set_current_state() here, with the barrier.  So that
   this thread gets the correct view of the list_head wrt the waker's
   view.

c) Did we really need to bang on tasklist_lock so many times?  There
   doesn't seem a nicer way of structuring this :(

d) Anyone who reads this code a year from now will come away
   thinking "wtf".

   IOW, wtf?  We sit in a dead loop waiting for ->children to drain.
   But why?  Who is draining them?  What are the dynamics here? Why
   do we care?

   IOW, it needs a comment!!

  parent reply	other threads:[~2012-05-22 19:23 UTC|newest]

Thread overview: 71+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2012-04-28  9:19 [RFC PATCH] namespaces: fix leak on fork() failure Mike Galbraith
2012-04-28 14:26 ` Oleg Nesterov
2012-04-29  4:13   ` Mike Galbraith
2012-04-29  7:57   ` Eric W. Biederman
2012-04-29  9:49     ` Mike Galbraith
2012-04-29 16:58     ` Oleg Nesterov
2012-04-30  2:59       ` Eric W. Biederman
2012-04-30  3:25         ` Mike Galbraith
2012-05-02 12:40         ` Oleg Nesterov
2012-05-02 17:37           ` Eric W. Biederman
2012-04-30  3:01       ` [PATCH] " Mike Galbraith
     [not found]         ` <m1zk9rmyh4.fsf@fess.ebiederm.org>
2012-05-01 20:42           ` Andrew Morton
2012-05-03  3:12             ` Mike Galbraith
2012-05-03 14:56               ` Mike Galbraith
2012-05-04  4:27                 ` Mike Galbraith
2012-05-04  7:55                   ` Eric W. Biederman
2012-05-04  8:34                     ` Mike Galbraith
2012-05-04  9:45                     ` Mike Galbraith
2012-05-04 14:13                       ` Eric W. Biederman
2012-05-04 14:49                         ` Mike Galbraith
2012-05-04 15:36                           ` Eric W. Biederman
2012-05-04 16:57                             ` Mike Galbraith
2012-05-04 20:29                               ` Eric W. Biederman
2012-05-05  5:56                                 ` Mike Galbraith
2012-05-05  6:08                                   ` Mike Galbraith
2012-05-05  7:12                                     ` Mike Galbraith
2012-05-05 11:37                                       ` Eric W. Biederman
2012-05-07 21:51                                       ` [PATCH] vfs: Speed up deactivate_super for non-modular filesystems Eric W. Biederman
2012-05-07 22:17                                         ` Al Viro
2012-05-07 23:56                                           ` Paul E. McKenney
2012-05-08  1:07                                             ` Eric W. Biederman
2012-05-08  4:53                                               ` Mike Galbraith
2012-05-09  7:55                                               ` Nick Piggin
2012-05-09 11:02                                                 ` Eric W. Biederman
2012-05-09 11:02                                                   ` Eric W. Biederman
2012-05-15  8:40                                                   ` Nick Piggin
2012-05-16  0:34                                                     ` Eric W. Biederman
2012-05-16  0:34                                                       ` Eric W. Biederman
2012-05-09 13:59                                                 ` Paul E. McKenney
2012-05-04  8:03                 ` [PATCH] Re: [RFC PATCH] namespaces: fix leak on fork() failure Eric W. Biederman
2012-05-04  8:19                   ` Mike Galbraith
2012-05-04  8:54                     ` Mike Galbraith
2012-05-07  0:32             ` [PATCH 0/3] pidns: Closing the pid namespace exit race Eric W. Biederman
2012-05-07  0:33               ` [PATCH 1/3] pidns: Use task_active_pid_ns in do_notify_parent Eric W. Biederman
2012-05-07  0:35               ` [PATCH 2/3] pidns: Guarantee that the pidns init will be the last pidns process reaped Eric W. Biederman
2012-05-08 22:50                 ` Andrew Morton
2012-05-16 18:39                 ` Oleg Nesterov
2012-05-16 19:34                   ` Oleg Nesterov
2012-05-16 20:54                   ` Eric W. Biederman
2012-05-17 17:00                     ` Oleg Nesterov
2012-05-17 21:46                       ` Eric W. Biederman
2012-05-18 12:39                         ` Oleg Nesterov
2012-05-19  0:03                           ` Eric W. Biederman
2012-05-21 12:44                             ` Oleg Nesterov
2012-05-22  0:16                               ` Eric W. Biederman
2012-05-22  0:20                               ` [PATCH] pidns: Guarantee that the pidns init will be the last pidns process reaped. v2 Eric W. Biederman
2012-05-22 16:54                                 ` Oleg Nesterov
2012-05-22 19:23                                 ` Andrew Morton [this message]
2012-05-23 14:52                                   ` Oleg Nesterov
2012-05-25 15:15                                     ` [PATCH -mm] pidns-guarantee-that-the-pidns-init-will-be-the-last-pidns-process-r eaped-v2-fix-fix Oleg Nesterov
2012-05-25 15:59                                       ` [PATCH -mm 0/1] pidns: find_new_reaper() can no longer switch to init_pid_ns.child_reaper Oleg Nesterov
2012-05-25 16:00                                         ` [PATCH -mm 1/1] " Oleg Nesterov
2012-05-25 21:43                                           ` Eric W. Biederman
2012-05-27 19:10                                             ` [PATCH v2 -mm 0/1] " Oleg Nesterov
2012-05-27 19:11                                               ` [PATCH v2 -mm 1/1] " Oleg Nesterov
2012-05-29  6:34                                                 ` Eric W. Biederman
2012-05-25 21:25                                       ` [PATCH -mm] pidns-guarantee-that-the-pidns-init-will-be-the-last-pidns-process-r eaped-v2-fix-fix Eric W. Biederman
2012-05-27 18:41                                         ` [PATCH -mm v2] " Oleg Nesterov
2012-05-07  0:35               ` [PATCH 3/3] pidns: Make killed children autoreap Eric W. Biederman
2012-05-08 22:51                 ` Andrew Morton
2012-04-30 13:57 ` [RFC PATCH] namespaces: fix leak on fork() failure Mike Galbraith

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20120522122315.c3f2118c.akpm@linux-foundation.org \
    --to=akpm@linux-foundation.org \
    --cc=ebiederm@xmission.com \
    --cc=efault@gmx.de \
    --cc=gorcunov@openvz.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=louis.rilling@kerlabs.com \
    --cc=oleg@redhat.com \
    --cc=xemul@parallels.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.