[PATCH review 0/3] pid namespaces fixes

public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed

* [PATCH review 0/3] pid namespaces fixes
@ 2012-12-22  4:56 Eric W. Biederman
  2012-12-22  4:57 ` [PATCH review 1/3] pidns: Outlaw thread creation after unshare(CLONE_NEWPID) Eric W. Biederman
                   ` (2 more replies)
  0 siblings, 3 replies; 10+ messages in thread
From: Eric W. Biederman @ 2012-12-22  4:56 UTC (permalink / raw)
  To: Oleg Nesterov; +Cc: Linux Containers, linux-kernel, Serge E. Hallyn


Oleg assuming I am not blind these patches should fix the issues you
spotted in the pid namespace as well as one additional one that I found
during testing.

Anyone with an extra set of eyeballs that wants to look over this code
and double check to make certain I am not doing something stupid would
be welcome.

These patches are against 3.8-rc1 and I hope to get the into linux-next
and on to Linus shortly.

Eric W. Biederman (3):
      pidns: Outlaw thread creation after unshare(CLONE_NEWPID)
      pidns: Stop pid allocation when init dies
      proc: Allow proc_free_inum to be called from any context

 fs/proc/generic.c             |   13 +++++++------
 include/linux/pid.h           |    1 +
 include/linux/pid_namespace.h |    4 +++-
 kernel/fork.c                 |    8 ++++++++
 kernel/pid.c                  |   13 ++++++++++---
 kernel/pid_namespace.c        |    4 ++++
 6 files changed, 33 insertions(+), 10 deletions(-)

^ permalink raw reply	[flat|nested] 10+ messages in thread

* [PATCH review 1/3] pidns: Outlaw thread creation after unshare(CLONE_NEWPID)
  2012-12-22  4:56 [PATCH review 0/3] pid namespaces fixes Eric W. Biederman
@ 2012-12-22  4:57 ` Eric W. Biederman
  2012-12-22 19:39   ` Rob Landley
  2012-12-22  4:58 ` [PATCH review 2/3] pidns: Stop pid allocation when init dies Eric W. Biederman
  2012-12-22  4:58 ` [PATCH review 3/3] proc: Allow proc_free_inum to be called from any context Eric W. Biederman
  2 siblings, 1 reply; 10+ messages in thread
From: Eric W. Biederman @ 2012-12-22  4:57 UTC (permalink / raw)
  To: Oleg Nesterov; +Cc: Linux Containers, linux-kernel, Serge E. Hallyn


The sequence:
unshare(CLONE_NEWPID)
clone(CLONE_THREAD|CLONE_SIGHAND|CLONE_VM)

Creates a new process in the new pid namespace without setting
pid_ns->child_reaper.  After forking this results in a NULL
pointer dereference.

Avoid this and other nonsense scenarios that can show up after
creating a new pid namespace with unshare by adding a new
check in copy_prodcess.

Pointed-out-by:  Oleg Nesterov <oleg@redhat.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 kernel/fork.c |    8 ++++++++
 1 files changed, 8 insertions(+), 0 deletions(-)

diff --git a/kernel/fork.c b/kernel/fork.c
index a31b823..65ca6d2 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1166,6 +1166,14 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 				current->signal->flags & SIGNAL_UNKILLABLE)
 		return ERR_PTR(-EINVAL);
 
+	/*
+	 * If the new process will be in a different pid namespace
+	 * don't allow the creation of threads.
+	 */
+	if ((clone_flags & (CLONE_VM|CLONE_NEWPID)) &&
+	    (task_active_pid_ns(current) != current->nsproxy->pid_ns))
+		return ERR_PTR(-EINVAL);
+
 	retval = security_task_create(clone_flags);
 	if (retval)
 		goto fork_out;
-- 
1.7.5.4


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: [PATCH review 1/3] pidns: Outlaw thread creation after unshare(CLONE_NEWPID)
  2012-12-22  4:57 ` [PATCH review 1/3] pidns: Outlaw thread creation after unshare(CLONE_NEWPID) Eric W. Biederman
@ 2012-12-22 19:39   ` Rob Landley
  2012-12-22 20:16     ` Eric W. Biederman
  0 siblings, 1 reply; 10+ messages in thread
From: Rob Landley @ 2012-12-22 19:39 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: Oleg Nesterov, Linux Containers, linux-kernel

On 12/21/2012 10:57:34 PM, Eric W. Biederman wrote:
> 
> The sequence:
> unshare(CLONE_NEWPID)
> clone(CLONE_THREAD|CLONE_SIGHAND|CLONE_VM)
> 
> Creates a new process in the new pid namespace without setting
> pid_ns->child_reaper.  After forking this results in a NULL
> pointer dereference.
> 
> Avoid this and other nonsense scenarios that can show up after
> creating a new pid namespace with unshare by adding a new
> check in copy_prodcess.
> 
> Pointed-out-by:  Oleg Nesterov <oleg@redhat.com>
> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
> ---
>  kernel/fork.c |    8 ++++++++
>  1 files changed, 8 insertions(+), 0 deletions(-)
> 
> diff --git a/kernel/fork.c b/kernel/fork.c
> index a31b823..65ca6d2 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -1166,6 +1166,14 @@ static struct task_struct  
> *copy_process(unsigned long clone_flags,
>  				current->signal->flags &  
> SIGNAL_UNKILLABLE)
>  		return ERR_PTR(-EINVAL);
> 
> +	/*
> +	 * If the new process will be in a different pid namespace
> +	 * don't allow the creation of threads.
> +	 */
> +	if ((clone_flags & (CLONE_VM|CLONE_NEWPID)) &&
> +	    (task_active_pid_ns(current) != current->nsproxy->pid_ns))
> +		return ERR_PTR(-EINVAL);
> +

Since the first bit will trigger if clone_flags has just CLONE_VM  
without CLONE_NEWPID, or vice versa, I'm guessing this is a fast path  
optimization? (Otherwise you meant (clone_flags &  
(CLONE_VM|CLONE_NEWPID)) == CLONE_VM|CLONE_NEWPID ?)

(Just trying to wrap my head around it...)

Rob

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH review 1/3] pidns: Outlaw thread creation after unshare(CLONE_NEWPID)
  2012-12-22 19:39   ` Rob Landley
@ 2012-12-22 20:16     ` Eric W. Biederman
  0 siblings, 0 replies; 10+ messages in thread
From: Eric W. Biederman @ 2012-12-22 20:16 UTC (permalink / raw)
  To: Rob Landley; +Cc: Oleg Nesterov, Linux Containers, linux-kernel

Rob Landley <rob@landley.net> writes:

> On 12/21/2012 10:57:34 PM, Eric W. Biederman wrote:
>> 
>> The sequence:
>> unshare(CLONE_NEWPID)
>> clone(CLONE_THREAD|CLONE_SIGHAND|CLONE_VM)
>> 
>> Creates a new process in the new pid namespace without setting
>> pid_ns->child_reaper.  After forking this results in a NULL
>> pointer dereference.
>> 
>> Avoid this and other nonsense scenarios that can show up after
>> creating a new pid namespace with unshare by adding a new
>> check in copy_prodcess.
>> 
>> Pointed-out-by:  Oleg Nesterov <oleg@redhat.com>
>> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
>> ---
>>  kernel/fork.c |    8 ++++++++
>>  1 files changed, 8 insertions(+), 0 deletions(-)
>> 
>> diff --git a/kernel/fork.c b/kernel/fork.c
>> index a31b823..65ca6d2 100644
>> --- a/kernel/fork.c
>> +++ b/kernel/fork.c
>> @@ -1166,6 +1166,14 @@ static struct task_struct  
>> *copy_process(unsigned long clone_flags,
>>  				current->signal->flags &  
>> SIGNAL_UNKILLABLE)
>>  		return ERR_PTR(-EINVAL);
>> 
>> +	/*
>> +	 * If the new process will be in a different pid namespace
>> +	 * don't allow the creation of threads.
>> +	 */
>> +	if ((clone_flags & (CLONE_VM|CLONE_NEWPID)) &&
>> +	    (task_active_pid_ns(current) != current->nsproxy->pid_ns))
>> +		return ERR_PTR(-EINVAL);
>> +
>
> Since the first bit will trigger if clone_flags has just CLONE_VM  
> without CLONE_NEWPID, or vice versa, I'm guessing this is a fast path  
> optimization? (Otherwise you meant (clone_flags &  
> (CLONE_VM|CLONE_NEWPID)) == CLONE_VM|CLONE_NEWPID ?)
>
> (Just trying to wrap my head around it...)

Actually I mean (clone_flags & (CLONE_THREAD | CLONE_SIGHAND | CLONE_VM | CLONE_NEWPID))
CLONE_THREAD and CLONE_SIGHAND imply CLONE_VM... and that is enfored above.

I don't mean all of those flags must be in place.

CLONE_NEWPID is an optimization in that the test is also in copy_pid_ns
but there is no point in going to all of the work to get there if we are
going to be testing for this scenario anyway.

I definitely don't mean (clone_flags & (CLONE_VM|CLONE_NEWPID)) ==
(CLONE_VM|CLONE_NEWPID)).  The task_active_pid_ns(current) !=
current->nsproxy->pid_ns case is what tests to see if the pid namespace
has already been unshared.

The sequence "unshare(CLONE_NEWPID); unshare(CLONE_NEWPID);" is nonsense
and that is what CLONE_NEWPID is about in that test.

Similary the sequence "unshare(CLONE_NEWPID); clone(CLONE_THREAD);" is
nonsense and what the CLONE_VM is the test is for.  There are also a
number of other nonsense thread like states that CLONE_VM also catches
and prevents.

Eric

^ permalink raw reply	[flat|nested] 10+ messages in thread

* [PATCH review 2/3] pidns: Stop pid allocation when init dies
  2012-12-22  4:56 [PATCH review 0/3] pid namespaces fixes Eric W. Biederman
  2012-12-22  4:57 ` [PATCH review 1/3] pidns: Outlaw thread creation after unshare(CLONE_NEWPID) Eric W. Biederman
@ 2012-12-22  4:58 ` Eric W. Biederman
  2012-12-22 16:54   ` Oleg Nesterov
  2012-12-22  4:58 ` [PATCH review 3/3] proc: Allow proc_free_inum to be called from any context Eric W. Biederman
  2 siblings, 1 reply; 10+ messages in thread
From: Eric W. Biederman @ 2012-12-22  4:58 UTC (permalink / raw)
  To: Oleg Nesterov; +Cc: Linux Containers, linux-kernel, Serge E. Hallyn


Oleg pointed out that in a pid namespace the sequence.
- pid 1 becomes a zombie
- setns(thepidns), fork,...
- reaping pid 1.
- The injected processes exiting.

Can lead to processes attempting access their child reaper and
instead following a stale pointer.

That waitpid for init can return before all of the processes in
the pid namespace have exited is also unfortunate.

Avoid these problems by disabling the allocation of new pids in a pid
namespace when init dies, instead of when the last process in a pid
namespace is reaped.

Pointed-out-by:  Oleg Nesterov <oleg@redhat.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 include/linux/pid.h           |    1 +
 include/linux/pid_namespace.h |    4 +++-
 kernel/pid.c                  |   13 ++++++++++---
 kernel/pid_namespace.c        |    4 ++++
 4 files changed, 18 insertions(+), 4 deletions(-)

diff --git a/include/linux/pid.h b/include/linux/pid.h
index b152d44..2381c97 100644
--- a/include/linux/pid.h
+++ b/include/linux/pid.h
@@ -121,6 +121,7 @@ int next_pidmap(struct pid_namespace *pid_ns, unsigned int last);
 
 extern struct pid *alloc_pid(struct pid_namespace *ns);
 extern void free_pid(struct pid *pid);
+extern void disable_pid_allocation(struct pid_namespace *ns);
 
 /*
  * ns_of_pid() returns the pid namespace in which the specified pid was
diff --git a/include/linux/pid_namespace.h b/include/linux/pid_namespace.h
index bf28599..215e5e3 100644
--- a/include/linux/pid_namespace.h
+++ b/include/linux/pid_namespace.h
@@ -21,7 +21,7 @@ struct pid_namespace {
 	struct kref kref;
 	struct pidmap pidmap[PIDMAP_ENTRIES];
 	int last_pid;
-	int nr_hashed;
+	unsigned int nr_hashed;
 	struct task_struct *child_reaper;
 	struct kmem_cache *pid_cachep;
 	unsigned int level;
@@ -42,6 +42,8 @@ struct pid_namespace {
 
 extern struct pid_namespace init_pid_ns;
 
+#define PIDNS_HASH_ADDING (1U << 31)
+
 #ifdef CONFIG_PID_NS
 static inline struct pid_namespace *get_pid_ns(struct pid_namespace *ns)
 {
diff --git a/kernel/pid.c b/kernel/pid.c
index 36aa02f..3a5c872 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -270,7 +270,6 @@ void free_pid(struct pid *pid)
 			wake_up_process(ns->child_reaper);
 			break;
 		case 0:
-			ns->nr_hashed = -1;
 			schedule_work(&ns->proc_work);
 			break;
 		}
@@ -319,7 +318,7 @@ struct pid *alloc_pid(struct pid_namespace *ns)
 
 	upid = pid->numbers + ns->level;
 	spin_lock_irq(&pidmap_lock);
-	if (ns->nr_hashed < 0)
+	if (ns->nr_hashed < PIDNS_HASH_ADDING)
 		goto out_unlock;
 	for ( ; upid >= pid->numbers; --upid) {
 		hlist_add_head_rcu(&upid->pid_chain,
@@ -342,6 +341,14 @@ out_free:
 	goto out;
 }
 
+void disable_pid_allocation(struct pid_namespace *ns)
+{
+	spin_lock_irq(&pidmap_lock);
+	if (ns->nr_hashed >= PIDNS_HASH_ADDING)
+		ns->nr_hashed -= PIDNS_HASH_ADDING;
+	spin_unlock_irq(&pidmap_lock);
+}
+
 struct pid *find_pid_ns(int nr, struct pid_namespace *ns)
 {
 	struct hlist_node *elem;
@@ -584,7 +591,7 @@ void __init pidmap_init(void)
 	/* Reserve PID 0. We never call free_pidmap(0) */
 	set_bit(0, init_pid_ns.pidmap[0].page);
 	atomic_dec(&init_pid_ns.pidmap[0].nr_free);
-	init_pid_ns.nr_hashed = 1;
+	init_pid_ns.nr_hashed = 1 + PIDNS_HASH_ADDING;
 
 	init_pid_ns.pid_cachep = KMEM_CACHE(pid,
 			SLAB_HWCACHE_ALIGN | SLAB_PANIC);
diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c
index fdbd0cd..c1c3dc1 100644
--- a/kernel/pid_namespace.c
+++ b/kernel/pid_namespace.c
@@ -115,6 +115,7 @@ static struct pid_namespace *create_pid_namespace(struct user_namespace *user_ns
 	ns->level = level;
 	ns->parent = get_pid_ns(parent_pid_ns);
 	ns->user_ns = get_user_ns(user_ns);
+	ns->nr_hashed = PIDNS_HASH_ADDING;
 	INIT_WORK(&ns->proc_work, proc_cleanup_work);
 
 	set_bit(0, ns->pidmap[0].page);
@@ -181,6 +182,9 @@ void zap_pid_ns_processes(struct pid_namespace *pid_ns)
 	int rc;
 	struct task_struct *task, *me = current;
 
+	/* Don't allow any more processes into the pid namespace */
+	disable_pid_allocation(pid_ns);
+
 	/* Ignore SIGCHLD causing any terminated children to autoreap */
 	spin_lock_irq(&me->sighand->siglock);
 	me->sighand->action[SIGCHLD - 1].sa.sa_handler = SIG_IGN;
-- 
1.7.5.4


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: [PATCH review 2/3] pidns: Stop pid allocation when init dies
  2012-12-22  4:58 ` [PATCH review 2/3] pidns: Stop pid allocation when init dies Eric W. Biederman
@ 2012-12-22 16:54   ` Oleg Nesterov
  2012-12-22 20:31     ` Eric W. Biederman
  2012-12-25  8:24     ` [PATCH review 2/3 take 2] " Eric W. Biederman
  0 siblings, 2 replies; 10+ messages in thread
From: Oleg Nesterov @ 2012-12-22 16:54 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: Linux Containers, linux-kernel, Serge E. Hallyn

On 12/21, Eric W. Biederman wrote:
>
> --- a/include/linux/pid_namespace.h
> +++ b/include/linux/pid_namespace.h
> @@ -21,7 +21,7 @@ struct pid_namespace {
>  	struct kref kref;
>  	struct pidmap pidmap[PIDMAP_ENTRIES];
>  	int last_pid;
> -	int nr_hashed;
> +	unsigned int nr_hashed;
>  	struct task_struct *child_reaper;
>  	struct kmem_cache *pid_cachep;
>  	unsigned int level;
> @@ -42,6 +42,8 @@ struct pid_namespace {
>
>  extern struct pid_namespace init_pid_ns;
>
> +#define PIDNS_HASH_ADDING (1U << 31)

Yes, agreed. We can't rely on PF_EXITING/whatever, we need the explicit
flag.

1/2 looks fine too. Only one nit about init_pid_ns below...

> @@ -319,7 +318,7 @@ struct pid *alloc_pid(struct pid_namespace *ns)
>
>  	upid = pid->numbers + ns->level;
>  	spin_lock_irq(&pidmap_lock);
> -	if (ns->nr_hashed < 0)
> +	if (ns->nr_hashed < PIDNS_HASH_ADDING)

I won't insist, but perhaps if "(!(nr_hashed & PIDNS_HASH_ADDING))"
looks more understandable.

> +void disable_pid_allocation(struct pid_namespace *ns)
> +{
> +	spin_lock_irq(&pidmap_lock);
> +	if (ns->nr_hashed >= PIDNS_HASH_ADDING)

Do we really need this check? It seems that PIDNS_HASH_ADDING
bit must be always set when disable_pid_allocation() is called.

> +		ns->nr_hashed -= PIDNS_HASH_ADDING;

Anyway, nr_hashed &= ~PIDNS_HASH_ADDING looks simpler and doesn't
need a check.

But again, I won't insist this is minor and subjective.

>  struct pid *find_pid_ns(int nr, struct pid_namespace *ns)
>  {
>  	struct hlist_node *elem;
> @@ -584,7 +591,7 @@ void __init pidmap_init(void)
>  	/* Reserve PID 0. We never call free_pidmap(0) */
>  	set_bit(0, init_pid_ns.pidmap[0].page);
>  	atomic_dec(&init_pid_ns.pidmap[0].nr_free);
> -	init_pid_ns.nr_hashed = 1;
> +	init_pid_ns.nr_hashed = 1 + PIDNS_HASH_ADDING;

The obly chunk which doesn't look exactly correct to me, although this
doesn't really matter. Hmm, actually the code was already wrong before
this patch.

I think init_pid_ns.nr_hashed should be PIDNS_HASH_ADDING, we should not
add 1 to account the unused zero pid, and kernel_thread(kernel_init) was
not called yet.

Oleg.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH review 2/3] pidns: Stop pid allocation when init dies
  2012-12-22 16:54   ` Oleg Nesterov
@ 2012-12-22 20:31     ` Eric W. Biederman
  2012-12-25  8:24     ` [PATCH review 2/3 take 2] " Eric W. Biederman
  1 sibling, 0 replies; 10+ messages in thread
From: Eric W. Biederman @ 2012-12-22 20:31 UTC (permalink / raw)
  To: Oleg Nesterov; +Cc: Linux Containers, linux-kernel, Serge E. Hallyn

Oleg Nesterov <oleg@redhat.com> writes:

> On 12/21, Eric W. Biederman wrote:
>>
>> --- a/include/linux/pid_namespace.h
>> +++ b/include/linux/pid_namespace.h
>> @@ -21,7 +21,7 @@ struct pid_namespace {
>>  	struct kref kref;
>>  	struct pidmap pidmap[PIDMAP_ENTRIES];
>>  	int last_pid;
>> -	int nr_hashed;
>> +	unsigned int nr_hashed;
>>  	struct task_struct *child_reaper;
>>  	struct kmem_cache *pid_cachep;
>>  	unsigned int level;
>> @@ -42,6 +42,8 @@ struct pid_namespace {
>>
>>  extern struct pid_namespace init_pid_ns;
>>
>> +#define PIDNS_HASH_ADDING (1U << 31)
>
> Yes, agreed. We can't rely on PF_EXITING/whatever, we need the explicit
> flag.

The simpler and more comprehensible we can make this code the better. 
We have had too many surprises in this code because of complex failure
modes.

> 1/2 looks fine too. Only one nit about init_pid_ns below...

Then I will add your acked-by to the first patch.

>> @@ -319,7 +318,7 @@ struct pid *alloc_pid(struct pid_namespace *ns)
>>
>>  	upid = pid->numbers + ns->level;
>>  	spin_lock_irq(&pidmap_lock);
>> -	if (ns->nr_hashed < 0)
>> +	if (ns->nr_hashed < PIDNS_HASH_ADDING)
>
> I won't insist, but perhaps if "(!(nr_hashed & PIDNS_HASH_ADDING))"
> looks more understandable.

I will stare at it both ways and post an updated patch.

I'm not certain which form I like better.  Certainly the decrements
are doing a double duty.

>> +void disable_pid_allocation(struct pid_namespace *ns)
>> +{
>> +	spin_lock_irq(&pidmap_lock);
>> +	if (ns->nr_hashed >= PIDNS_HASH_ADDING)
>
> Do we really need this check? It seems that PIDNS_HASH_ADDING
> bit must be always set when disable_pid_allocation() is called.
>
>> +		ns->nr_hashed -= PIDNS_HASH_ADDING;
>
> Anyway, nr_hashed &= ~PIDNS_HASH_ADDING looks simpler and doesn't
> need a check.

That I agree with.

> But again, I won't insist this is minor and subjective.
>
>>  struct pid *find_pid_ns(int nr, struct pid_namespace *ns)
>>  {
>>  	struct hlist_node *elem;
>> @@ -584,7 +591,7 @@ void __init pidmap_init(void)
>>  	/* Reserve PID 0. We never call free_pidmap(0) */
>>  	set_bit(0, init_pid_ns.pidmap[0].page);
>>  	atomic_dec(&init_pid_ns.pidmap[0].nr_free);
>> -	init_pid_ns.nr_hashed = 1;
>> +	init_pid_ns.nr_hashed = 1 + PIDNS_HASH_ADDING;
>
> The obly chunk which doesn't look exactly correct to me, although this
> doesn't really matter. Hmm, actually the code was already wrong before
> this patch.
>
> I think init_pid_ns.nr_hashed should be PIDNS_HASH_ADDING, we should not
> add 1 to account the unused zero pid, and kernel_thread(kernel_init) was
> not called yet.

Good point because the zero pid does not get hashed.  Who knows perhaps
with a little more evolution create_pid_ns can be used to create the
initial pid namespace.

I am also going to add "BUILD_BUG_ON(PID_MAX_LIMIT >= PIDNS_HASH_ADDING);"
to document that the pid values and PIDNS_HASH_ADDING can't overlap.

Eric	


^ permalink raw reply	[flat|nested] 10+ messages in thread

* [PATCH review 2/3 take 2] pidns: Stop pid allocation when init dies
  2012-12-22 16:54   ` Oleg Nesterov
  2012-12-22 20:31     ` Eric W. Biederman
@ 2012-12-25  8:24     ` Eric W. Biederman
  2012-12-25 16:59       ` Oleg Nesterov
  1 sibling, 1 reply; 10+ messages in thread
From: Eric W. Biederman @ 2012-12-25  8:24 UTC (permalink / raw)
  To: Oleg Nesterov; +Cc: Linux Containers, linux-kernel, Serge E. Hallyn


Oleg pointed out that in a pid namespace the sequence.
- pid 1 becomes a zombie
- setns(thepidns), fork,...
- reaping pid 1.
- The injected processes exiting.

Can lead to processes attempting access their child reaper and
instead following a stale pointer.

That waitpid for init can return before all of the processes in
the pid namespace have exited is also unfortunate.

Avoid these problems by disabling the allocation of new pids in a pid
namespace when init dies, instead of when the last process in a pid
namespace is reaped.

Pointed-out-by:  Oleg Nesterov <oleg@redhat.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 include/linux/pid.h           |    1 +
 include/linux/pid_namespace.h |    4 +++-
 kernel/pid.c                  |   15 ++++++++++++---
 kernel/pid_namespace.c        |    4 ++++
 4 files changed, 20 insertions(+), 4 deletions(-)

diff --git a/include/linux/pid.h b/include/linux/pid.h
index b152d44..2381c97 100644
--- a/include/linux/pid.h
+++ b/include/linux/pid.h
@@ -121,6 +121,7 @@ int next_pidmap(struct pid_namespace *pid_ns, unsigned int last);
 
 extern struct pid *alloc_pid(struct pid_namespace *ns);
 extern void free_pid(struct pid *pid);
+extern void disable_pid_allocation(struct pid_namespace *ns);
 
 /*
  * ns_of_pid() returns the pid namespace in which the specified pid was
diff --git a/include/linux/pid_namespace.h b/include/linux/pid_namespace.h
index bf28599..215e5e3 100644
--- a/include/linux/pid_namespace.h
+++ b/include/linux/pid_namespace.h
@@ -21,7 +21,7 @@ struct pid_namespace {
 	struct kref kref;
 	struct pidmap pidmap[PIDMAP_ENTRIES];
 	int last_pid;
-	int nr_hashed;
+	unsigned int nr_hashed;
 	struct task_struct *child_reaper;
 	struct kmem_cache *pid_cachep;
 	unsigned int level;
@@ -42,6 +42,8 @@ struct pid_namespace {
 
 extern struct pid_namespace init_pid_ns;
 
+#define PIDNS_HASH_ADDING (1U << 31)
+
 #ifdef CONFIG_PID_NS
 static inline struct pid_namespace *get_pid_ns(struct pid_namespace *ns)
 {
diff --git a/kernel/pid.c b/kernel/pid.c
index 36aa02f..de9af60 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -270,7 +270,6 @@ void free_pid(struct pid *pid)
 			wake_up_process(ns->child_reaper);
 			break;
 		case 0:
-			ns->nr_hashed = -1;
 			schedule_work(&ns->proc_work);
 			break;
 		}
@@ -319,7 +318,7 @@ struct pid *alloc_pid(struct pid_namespace *ns)
 
 	upid = pid->numbers + ns->level;
 	spin_lock_irq(&pidmap_lock);
-	if (ns->nr_hashed < 0)
+	if (!(ns->nr_hashed & PIDNS_HASH_ADDING))
 		goto out_unlock;
 	for ( ; upid >= pid->numbers; --upid) {
 		hlist_add_head_rcu(&upid->pid_chain,
@@ -342,6 +341,13 @@ out_free:
 	goto out;
 }
 
+void disable_pid_allocation(struct pid_namespace *ns)
+{
+	spin_lock_irq(&pidmap_lock);
+	ns->nr_hashed &= ~PIDNS_HASH_ADDING;
+	spin_unlock_irq(&pidmap_lock);
+}
+
 struct pid *find_pid_ns(int nr, struct pid_namespace *ns)
 {
 	struct hlist_node *elem;
@@ -573,6 +579,9 @@ void __init pidhash_init(void)
 
 void __init pidmap_init(void)
 {
+	/* Veryify no one has done anything silly */
+	BUILD_BUG_ON(PID_MAX_LIMIT >= PIDNS_HASH_ADDING);
+
 	/* bump default and minimum pid_max based on number of cpus */
 	pid_max = min(pid_max_max, max_t(int, pid_max,
 				PIDS_PER_CPU_DEFAULT * num_possible_cpus()));
@@ -584,7 +593,7 @@ void __init pidmap_init(void)
 	/* Reserve PID 0. We never call free_pidmap(0) */
 	set_bit(0, init_pid_ns.pidmap[0].page);
 	atomic_dec(&init_pid_ns.pidmap[0].nr_free);
-	init_pid_ns.nr_hashed = 1;
+	init_pid_ns.nr_hashed = PIDNS_HASH_ADDING;
 
 	init_pid_ns.pid_cachep = KMEM_CACHE(pid,
 			SLAB_HWCACHE_ALIGN | SLAB_PANIC);
diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c
index fdbd0cd..c1c3dc1 100644
--- a/kernel/pid_namespace.c
+++ b/kernel/pid_namespace.c
@@ -115,6 +115,7 @@ static struct pid_namespace *create_pid_namespace(struct user_namespace *user_ns
 	ns->level = level;
 	ns->parent = get_pid_ns(parent_pid_ns);
 	ns->user_ns = get_user_ns(user_ns);
+	ns->nr_hashed = PIDNS_HASH_ADDING;
 	INIT_WORK(&ns->proc_work, proc_cleanup_work);
 
 	set_bit(0, ns->pidmap[0].page);
@@ -181,6 +182,9 @@ void zap_pid_ns_processes(struct pid_namespace *pid_ns)
 	int rc;
 	struct task_struct *task, *me = current;
 
+	/* Don't allow any more processes into the pid namespace */
+	disable_pid_allocation(pid_ns);
+
 	/* Ignore SIGCHLD causing any terminated children to autoreap */
 	spin_lock_irq(&me->sighand->siglock);
 	me->sighand->action[SIGCHLD - 1].sa.sa_handler = SIG_IGN;
-- 
1.7.5.4


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: [PATCH review 2/3 take 2] pidns: Stop pid allocation when init dies
  2012-12-25  8:24     ` [PATCH review 2/3 take 2] " Eric W. Biederman
@ 2012-12-25 16:59       ` Oleg Nesterov
  0 siblings, 0 replies; 10+ messages in thread
From: Oleg Nesterov @ 2012-12-25 16:59 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: Linux Containers, linux-kernel, Serge E. Hallyn

On 12/25, Eric W. Biederman wrote:
>
> Avoid these problems by disabling the allocation of new pids in a pid
> namespace when init dies, instead of when the last process in a pid
> namespace is reaped.

Reviewed-by: Oleg Nesterov <oleg@redhat.com>


^ permalink raw reply	[flat|nested] 10+ messages in thread

* [PATCH review 3/3] proc: Allow proc_free_inum to be called from any context
  2012-12-22  4:56 [PATCH review 0/3] pid namespaces fixes Eric W. Biederman
  2012-12-22  4:57 ` [PATCH review 1/3] pidns: Outlaw thread creation after unshare(CLONE_NEWPID) Eric W. Biederman
  2012-12-22  4:58 ` [PATCH review 2/3] pidns: Stop pid allocation when init dies Eric W. Biederman
@ 2012-12-22  4:58 ` Eric W. Biederman
  2 siblings, 0 replies; 10+ messages in thread
From: Eric W. Biederman @ 2012-12-22  4:58 UTC (permalink / raw)
  To: Oleg Nesterov; +Cc: Linux Containers, linux-kernel, Serge E. Hallyn


While testing the pid namespace code I hit this nasty warning.

[  176.262617] ------------[ cut here ]------------
[  176.263388] WARNING: at /home/eric/projects/linux/linux-userns-devel/kernel/softirq.c:160 local_bh_enable_ip+0x7a/0xa0()
[  176.265145] Hardware name: Bochs
[  176.265677] Modules linked in:
[  176.266341] Pid: 742, comm: bash Not tainted 3.7.0userns+ #18
[  176.266564] Call Trace:
[  176.266564]  [<ffffffff810a539f>] warn_slowpath_common+0x7f/0xc0
[  176.266564]  [<ffffffff810a53fa>] warn_slowpath_null+0x1a/0x20
[  176.266564]  [<ffffffff810ad9ea>] local_bh_enable_ip+0x7a/0xa0
[  176.266564]  [<ffffffff819308c9>] _raw_spin_unlock_bh+0x19/0x20
[  176.266564]  [<ffffffff8123dbda>] proc_free_inum+0x3a/0x50
[  176.266564]  [<ffffffff8111d0dc>] free_pid_ns+0x1c/0x80
[  176.266564]  [<ffffffff8111d195>] put_pid_ns+0x35/0x50
[  176.266564]  [<ffffffff810c608a>] put_pid+0x4a/0x60
[  176.266564]  [<ffffffff8146b177>] tty_ioctl+0x717/0xc10
[  176.266564]  [<ffffffff810aa4d5>] ? wait_consider_task+0x855/0xb90
[  176.266564]  [<ffffffff81086bf9>] ? default_spin_lock_flags+0x9/0x10
[  176.266564]  [<ffffffff810cab0a>] ? remove_wait_queue+0x5a/0x70
[  176.266564]  [<ffffffff811e37e8>] do_vfs_ioctl+0x98/0x550
[  176.266564]  [<ffffffff810b8a0f>] ? recalc_sigpending+0x1f/0x60
[  176.266564]  [<ffffffff810b9127>] ? __set_task_blocked+0x37/0x80
[  176.266564]  [<ffffffff810ab95b>] ? sys_wait4+0xab/0xf0
[  176.266564]  [<ffffffff811e3d31>] sys_ioctl+0x91/0xb0
[  176.266564]  [<ffffffff810a95f0>] ? task_stopped_code+0x50/0x50
[  176.266564]  [<ffffffff81939199>] system_call_fastpath+0x16/0x1b
[  176.266564] ---[ end trace 387af88219ad6143 ]---

It turns out that spin_unlock_bh(proc_inum_lock) is not safe when
put_pid is called with another spinlock held and irqs disabled.

For now take the easy path and use spin_lock_irqsave(proc_inum_lock)
in proc_free_inum and spin_loc_irq in proc_alloc_inum(proc_inum_lock).

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 fs/proc/generic.c |   13 +++++++------
 1 files changed, 7 insertions(+), 6 deletions(-)

diff --git a/fs/proc/generic.c b/fs/proc/generic.c
index e064f56..76ddae8 100644
--- a/fs/proc/generic.c
+++ b/fs/proc/generic.c
@@ -352,18 +352,18 @@ retry:
 	if (!ida_pre_get(&proc_inum_ida, GFP_KERNEL))
 		return -ENOMEM;
 
-	spin_lock_bh(&proc_inum_lock);
+	spin_lock_irq(&proc_inum_lock);
 	error = ida_get_new(&proc_inum_ida, &i);
-	spin_unlock_bh(&proc_inum_lock);
+	spin_unlock_irq(&proc_inum_lock);
 	if (error == -EAGAIN)
 		goto retry;
 	else if (error)
 		return error;
 
 	if (i > UINT_MAX - PROC_DYNAMIC_FIRST) {
-		spin_lock_bh(&proc_inum_lock);
+		spin_lock_irq(&proc_inum_lock);
 		ida_remove(&proc_inum_ida, i);
-		spin_unlock_bh(&proc_inum_lock);
+		spin_unlock_irq(&proc_inum_lock);
 		return -ENOSPC;
 	}
 	*inum = PROC_DYNAMIC_FIRST + i;
@@ -372,9 +372,10 @@ retry:
 
 void proc_free_inum(unsigned int inum)
 {
-	spin_lock_bh(&proc_inum_lock);
+	unsigned long flags;
+	spin_lock_irqsave(&proc_inum_lock, flags);
 	ida_remove(&proc_inum_ida, inum - PROC_DYNAMIC_FIRST);
-	spin_unlock_bh(&proc_inum_lock);
+	spin_unlock_irqrestore(&proc_inum_lock, flags);
 }
 
 static void *proc_follow_link(struct dentry *dentry, struct nameidata *nd)
-- 
1.7.5.4


^ permalink raw reply related	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2012-12-25 16:59 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2012-12-22  4:56 [PATCH review 0/3] pid namespaces fixes Eric W. Biederman
2012-12-22  4:57 ` [PATCH review 1/3] pidns: Outlaw thread creation after unshare(CLONE_NEWPID) Eric W. Biederman
2012-12-22 19:39   ` Rob Landley
2012-12-22 20:16     ` Eric W. Biederman
2012-12-22  4:58 ` [PATCH review 2/3] pidns: Stop pid allocation when init dies Eric W. Biederman
2012-12-22 16:54   ` Oleg Nesterov
2012-12-22 20:31     ` Eric W. Biederman
2012-12-25  8:24     ` [PATCH review 2/3 take 2] " Eric W. Biederman
2012-12-25 16:59       ` Oleg Nesterov
2012-12-22  4:58 ` [PATCH review 3/3] proc: Allow proc_free_inum to be called from any context Eric W. Biederman

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox