public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
* [PATCH 0/1][V3] Handle reboot in a child pid namespace
@ 2011-12-04 20:24 Daniel Lezcano
  2011-12-04 20:24 ` [PATCH 1/1][V3] Add reboot_pid_ns to handle the reboot syscall Daniel Lezcano
  2011-12-04 21:27 ` [PATCH 0/1][V3] Handle reboot in a child pid namespace Henrique de Moraes Holschuh
  0 siblings, 2 replies; 15+ messages in thread
From: Daniel Lezcano @ 2011-12-04 20:24 UTC (permalink / raw)
  To: akpm; +Cc: serge.hallyn, oleg, containers, gkurz, linux-kernel

ChangeLog:

 * V3
   - removed lock and serialization of pid_ns_reboot
 * V2
   - added a lock for the pid namespace to prevent racy call
     to the 'reboot' syscall
   - Moved 'reboot' command assigned in zap_pid_ns_processes
     instead of wait_task_zombie
   - added tasklist lock around force_sig
   - added do_exit in pid_ns_reboot
   - used task_active_pid_ns instead of declaring a new variable in sys_reboot
   - moved code up before POWER_OFF changed to HALT in sys_reboot

Daniel Lezcano (1):
  Add reboot_pid_ns to handle the reboot syscall

 include/linux/pid_namespace.h |    8 +++++++-
 kernel/pid_namespace.c        |   33 +++++++++++++++++++++++++++++++++
 kernel/sys.c                  |    3 +++
 3 files changed, 43 insertions(+), 1 deletions(-)

-- 
1.7.5.4


^ permalink raw reply	[flat|nested] 15+ messages in thread

* [PATCH 1/1][V3] Add reboot_pid_ns to handle the reboot syscall
  2011-12-04 20:24 [PATCH 0/1][V3] Handle reboot in a child pid namespace Daniel Lezcano
@ 2011-12-04 20:24 ` Daniel Lezcano
  2011-12-05 18:35   ` Serge Hallyn
                     ` (2 more replies)
  2011-12-04 21:27 ` [PATCH 0/1][V3] Handle reboot in a child pid namespace Henrique de Moraes Holschuh
  1 sibling, 3 replies; 15+ messages in thread
From: Daniel Lezcano @ 2011-12-04 20:24 UTC (permalink / raw)
  To: akpm; +Cc: serge.hallyn, oleg, containers, gkurz, linux-kernel

In the case of a child pid namespace, rebooting the system does not
really makes sense. When the pid namespace is used in conjunction
with the other namespaces in order to create a linux container, the
reboot syscall leads to some problems.

A container can reboot the host. That can be fixed by dropping
the sys_reboot capability but we are unable to correctly to poweroff/
halt/reboot a container and the container stays stuck at the shutdown
time with the container's init process waiting indefinitively.

After several attempts, no solution from userspace was found to reliabily
handle the shutdown from a container.

This patch propose to store the reboot value in the 16 upper bits of the
exit code from the processes belonging to a pid namespace which has
rebooted. When the reboot syscall is called and we are not in the initial
pid namespace, we kill the pid namespace.

By this way the parent process of the child pid namespace to know if
it rebooted or not and take the right decision.

Signed-off-by: Daniel Lezcano <daniel.lezcano@free.fr>
Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
---
 include/linux/pid_namespace.h |    8 +++++++-
 kernel/pid_namespace.c        |   33 +++++++++++++++++++++++++++++++++
 kernel/sys.c                  |    3 +++
 3 files changed, 43 insertions(+), 1 deletions(-)

diff --git a/include/linux/pid_namespace.h b/include/linux/pid_namespace.h
index e7cf666..3279596 100644
--- a/include/linux/pid_namespace.h
+++ b/include/linux/pid_namespace.h
@@ -32,6 +32,7 @@ struct pid_namespace {
 #endif
 	gid_t pid_gid;
 	int hide_pid;
+	int reboot;
 };
 
 extern struct pid_namespace init_pid_ns;
@@ -47,6 +48,7 @@ static inline struct pid_namespace *get_pid_ns(struct pid_namespace *ns)
 extern struct pid_namespace *copy_pid_ns(unsigned long flags, struct pid_namespace *ns);
 extern void free_pid_ns(struct kref *kref);
 extern void zap_pid_ns_processes(struct pid_namespace *pid_ns);
+extern int reboot_pid_ns(struct pid_namespace *pid_ns, int cmd);
 
 static inline void put_pid_ns(struct pid_namespace *ns)
 {
@@ -74,11 +76,15 @@ static inline void put_pid_ns(struct pid_namespace *ns)
 {
 }
 
-
 static inline void zap_pid_ns_processes(struct pid_namespace *ns)
 {
 	BUG();
 }
+
+static inline int reboot_pid_ns(struct pid_namespace *pid_ns, int cmd)
+{
+	BUG();
+}
 #endif /* CONFIG_PID_NS */
 
 extern struct pid_namespace *task_active_pid_ns(struct task_struct *tsk);
diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c
index a896839..c7a85ea 100644
--- a/kernel/pid_namespace.c
+++ b/kernel/pid_namespace.c
@@ -15,6 +15,7 @@
 #include <linux/acct.h>
 #include <linux/slab.h>
 #include <linux/proc_fs.h>
+#include <linux/reboot.h>
 
 #define BITS_PER_PAGE		(PAGE_SIZE*8)
 
@@ -187,6 +188,9 @@ void zap_pid_ns_processes(struct pid_namespace *pid_ns)
 		rc = sys_wait4(-1, NULL, __WALL, NULL);
 	} while (rc != -ECHILD);
 
+	if (pid_ns->reboot)
+		current->signal->group_exit_code = pid_ns->reboot;
+
 	acct_exit_ns(pid_ns);
 	return;
 }
@@ -221,6 +225,35 @@ static struct ctl_table pid_ns_ctl_table[] = {
 
 static struct ctl_path kern_path[] = { { .procname = "kernel", }, { } };
 
+int reboot_pid_ns(struct pid_namespace *pid_ns, int cmd)
+{
+	switch(cmd) {
+	case LINUX_REBOOT_CMD_RESTART2:
+	case LINUX_REBOOT_CMD_RESTART:
+		pid_ns->reboot = SYSTEM_RESTART << 16;
+		break;
+
+	case LINUX_REBOOT_CMD_HALT:
+		pid_ns->reboot = SYSTEM_HALT << 16;
+		break;
+
+	case LINUX_REBOOT_CMD_POWER_OFF:
+		pid_ns->reboot = SYSTEM_POWER_OFF << 16;
+		break;
+	default:
+		return -EINVAL;
+	}
+
+	read_lock(&tasklist_lock);
+	force_sig(SIGKILL, pid_ns->child_reaper);
+	read_unlock(&tasklist_lock);
+
+	do_exit(0);
+
+	/* Not reached */
+	return 0;
+}
+
 static __init int pid_namespaces_init(void)
 {
 	pid_ns_cachep = KMEM_CACHE(pid_namespace, SLAB_PANIC);
diff --git a/kernel/sys.c b/kernel/sys.c
index ddf8155..31acf63 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -444,6 +444,9 @@ SYSCALL_DEFINE4(reboot, int, magic1, int, magic2, unsigned int, cmd,
 	                magic2 != LINUX_REBOOT_MAGIC2C))
 		return -EINVAL;
 
+	if (task_active_pid_ns(current) != &init_pid_ns)
+		return reboot_pid_ns(task_active_pid_ns(current), cmd);
+
 	/* Instead of trying to make the power_off code look like
 	 * halt when pm_power_off is not set do it the easy way.
 	 */
-- 
1.7.5.4


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [PATCH 0/1][V3] Handle reboot in a child pid namespace
  2011-12-04 20:24 [PATCH 0/1][V3] Handle reboot in a child pid namespace Daniel Lezcano
  2011-12-04 20:24 ` [PATCH 1/1][V3] Add reboot_pid_ns to handle the reboot syscall Daniel Lezcano
@ 2011-12-04 21:27 ` Henrique de Moraes Holschuh
  2011-12-04 23:08   ` Daniel Lezcano
  1 sibling, 1 reply; 15+ messages in thread
From: Henrique de Moraes Holschuh @ 2011-12-04 21:27 UTC (permalink / raw)
  To: Daniel Lezcano; +Cc: akpm, serge.hallyn, oleg, containers, gkurz, linux-kernel

On Sun, 04 Dec 2011, Daniel Lezcano wrote:
>  * V3
>    - removed lock and serialization of pid_ns_reboot
>  * V2
>    - added a lock for the pid namespace to prevent racy call
>      to the 'reboot' syscall
>    - Moved 'reboot' command assigned in zap_pid_ns_processes
>      instead of wait_task_zombie
>    - added tasklist lock around force_sig
>    - added do_exit in pid_ns_reboot
>    - used task_active_pid_ns instead of declaring a new variable in sys_reboot
>    - moved code up before POWER_OFF changed to HALT in sys_reboot

Daniel, can you address Miquel's concern?  Is it a valid concern, or
not?  I assume CAP_REBOOT functionality is still in place inside the
container, so it really does look like userspace would need to know
whether it should drop CAP_REBOOT or not, in order to automatically use
the new feature.

-- 
  "One disk to rule them all, One disk to find them. One disk to bring
  them all and in the darkness grind them. In the Land of Redmond
  where the shadows lie." -- The Silicon Valley Tarot
  Henrique Holschuh

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 0/1][V3] Handle reboot in a child pid namespace
  2011-12-04 21:27 ` [PATCH 0/1][V3] Handle reboot in a child pid namespace Henrique de Moraes Holschuh
@ 2011-12-04 23:08   ` Daniel Lezcano
  2011-12-05 20:49     ` Daniel Lezcano
  2011-12-05 20:50     ` Oleg Nesterov
  0 siblings, 2 replies; 15+ messages in thread
From: Daniel Lezcano @ 2011-12-04 23:08 UTC (permalink / raw)
  To: Henrique de Moraes Holschuh
  Cc: akpm, serge.hallyn, oleg, containers, gkurz, linux-kernel

On 12/04/2011 10:27 PM, Henrique de Moraes Holschuh wrote:
> On Sun, 04 Dec 2011, Daniel Lezcano wrote:
>>  * V3
>>    - removed lock and serialization of pid_ns_reboot
>>  * V2
>>    - added a lock for the pid namespace to prevent racy call
>>      to the 'reboot' syscall
>>    - Moved 'reboot' command assigned in zap_pid_ns_processes
>>      instead of wait_task_zombie
>>    - added tasklist lock around force_sig
>>    - added do_exit in pid_ns_reboot
>>    - used task_active_pid_ns instead of declaring a new variable in sys_reboot
>>    - moved code up before POWER_OFF changed to HALT in sys_reboot
> Daniel, can you address Miquel's concern?  Is it a valid concern, or
> not?  I assume CAP_REBOOT functionality is still in place inside the
> container, so it really does look like userspace would need to know
> whether it should drop CAP_REBOOT or not, in order to automatically use
> the new feature.

Hmm, I missed its email. I think it is worth to have such ability to
detect how behaves the reboot syscall vs the pid ns. At present, if we
call 'reboot' in a child pid namespace, that will affect the host, we
are changing this behavior with this patch. I don't think there is any
application doing a shutdown from a child pid namespace, that don't
makes sense as the shutdown is invoked after killing all the processes
on the system and that could only be done from the init_pid_ns.

I would like to address this in a separate patch in order to discuss the
best way to do that. Adding a fake 'reboot' parameter returning EINVAL
or 0 seems a good solution to detect at runtime if the shutdown is
correctly supported inside a container.


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 1/1][V3] Add reboot_pid_ns to handle the reboot syscall
  2011-12-04 20:24 ` [PATCH 1/1][V3] Add reboot_pid_ns to handle the reboot syscall Daniel Lezcano
@ 2011-12-05 18:35   ` Serge Hallyn
  2011-12-05 20:42   ` Oleg Nesterov
  2011-12-07  1:16   ` Andrew Morton
  2 siblings, 0 replies; 15+ messages in thread
From: Serge Hallyn @ 2011-12-05 18:35 UTC (permalink / raw)
  To: Daniel Lezcano; +Cc: akpm, oleg, containers, gkurz, linux-kernel

On 12/04/2011 02:24 PM, Daniel Lezcano wrote:
> In the case of a child pid namespace, rebooting the system does not
> really makes sense. When the pid namespace is used in conjunction
> with the other namespaces in order to create a linux container, the
> reboot syscall leads to some problems.
>
> A container can reboot the host. That can be fixed by dropping
> the sys_reboot capability but we are unable to correctly to poweroff/
> halt/reboot a container and the container stays stuck at the shutdown
> time with the container's init process waiting indefinitively.
>
> After several attempts, no solution from userspace was found to reliabily
> handle the shutdown from a container.
>
> This patch propose to store the reboot value in the 16 upper bits of the
> exit code from the processes belonging to a pid namespace which has
> rebooted. When the reboot syscall is called and we are not in the initial
> pid namespace, we kill the pid namespace.
>
> By this way the parent process of the child pid namespace to know if
> it rebooted or not and take the right decision.
>
> Signed-off-by: Daniel Lezcano<daniel.lezcano@free.fr>

Tested-by: Serge Hallyn <serge.hallyn@canonical.com>

Tested it with reboot(2), worked as expected.

thanks,
-serge

> Acked-by: Serge Hallyn<serge.hallyn@canonical.com>
> ---
>   include/linux/pid_namespace.h |    8 +++++++-
>   kernel/pid_namespace.c        |   33 +++++++++++++++++++++++++++++++++
>   kernel/sys.c                  |    3 +++
>   3 files changed, 43 insertions(+), 1 deletions(-)
>
> diff --git a/include/linux/pid_namespace.h b/include/linux/pid_namespace.h
> index e7cf666..3279596 100644
> --- a/include/linux/pid_namespace.h
> +++ b/include/linux/pid_namespace.h
> @@ -32,6 +32,7 @@ struct pid_namespace {
>   #endif
>   	gid_t pid_gid;
>   	int hide_pid;
> +	int reboot;
>   };
>
>   extern struct pid_namespace init_pid_ns;
> @@ -47,6 +48,7 @@ static inline struct pid_namespace *get_pid_ns(struct pid_namespace *ns)
>   extern struct pid_namespace *copy_pid_ns(unsigned long flags, struct pid_namespace *ns);
>   extern void free_pid_ns(struct kref *kref);
>   extern void zap_pid_ns_processes(struct pid_namespace *pid_ns);
> +extern int reboot_pid_ns(struct pid_namespace *pid_ns, int cmd);
>
>   static inline void put_pid_ns(struct pid_namespace *ns)
>   {
> @@ -74,11 +76,15 @@ static inline void put_pid_ns(struct pid_namespace *ns)
>   {
>   }
>
> -
>   static inline void zap_pid_ns_processes(struct pid_namespace *ns)
>   {
>   	BUG();
>   }
> +
> +static inline int reboot_pid_ns(struct pid_namespace *pid_ns, int cmd)
> +{
> +	BUG();
> +}
>   #endif /* CONFIG_PID_NS */
>
>   extern struct pid_namespace *task_active_pid_ns(struct task_struct *tsk);
> diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c
> index a896839..c7a85ea 100644
> --- a/kernel/pid_namespace.c
> +++ b/kernel/pid_namespace.c
> @@ -15,6 +15,7 @@
>   #include<linux/acct.h>
>   #include<linux/slab.h>
>   #include<linux/proc_fs.h>
> +#include<linux/reboot.h>
>
>   #define BITS_PER_PAGE		(PAGE_SIZE*8)
>
> @@ -187,6 +188,9 @@ void zap_pid_ns_processes(struct pid_namespace *pid_ns)
>   		rc = sys_wait4(-1, NULL, __WALL, NULL);
>   	} while (rc != -ECHILD);
>
> +	if (pid_ns->reboot)
> +		current->signal->group_exit_code = pid_ns->reboot;
> +
>   	acct_exit_ns(pid_ns);
>   	return;
>   }
> @@ -221,6 +225,35 @@ static struct ctl_table pid_ns_ctl_table[] = {
>
>   static struct ctl_path kern_path[] = { { .procname = "kernel", }, { } };
>
> +int reboot_pid_ns(struct pid_namespace *pid_ns, int cmd)
> +{
> +	switch(cmd) {
> +	case LINUX_REBOOT_CMD_RESTART2:
> +	case LINUX_REBOOT_CMD_RESTART:
> +		pid_ns->reboot = SYSTEM_RESTART<<  16;
> +		break;
> +
> +	case LINUX_REBOOT_CMD_HALT:
> +		pid_ns->reboot = SYSTEM_HALT<<  16;
> +		break;
> +
> +	case LINUX_REBOOT_CMD_POWER_OFF:
> +		pid_ns->reboot = SYSTEM_POWER_OFF<<  16;
> +		break;
> +	default:
> +		return -EINVAL;
> +	}
> +
> +	read_lock(&tasklist_lock);
> +	force_sig(SIGKILL, pid_ns->child_reaper);
> +	read_unlock(&tasklist_lock);
> +
> +	do_exit(0);
> +
> +	/* Not reached */
> +	return 0;
> +}
> +
>   static __init int pid_namespaces_init(void)
>   {
>   	pid_ns_cachep = KMEM_CACHE(pid_namespace, SLAB_PANIC);
> diff --git a/kernel/sys.c b/kernel/sys.c
> index ddf8155..31acf63 100644
> --- a/kernel/sys.c
> +++ b/kernel/sys.c
> @@ -444,6 +444,9 @@ SYSCALL_DEFINE4(reboot, int, magic1, int, magic2, unsigned int, cmd,
>   	                magic2 != LINUX_REBOOT_MAGIC2C))
>   		return -EINVAL;
>
> +	if (task_active_pid_ns(current) !=&init_pid_ns)
> +		return reboot_pid_ns(task_active_pid_ns(current), cmd);
> +
>   	/* Instead of trying to make the power_off code look like
>   	 * halt when pm_power_off is not set do it the easy way.
>   	 */


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 1/1][V3] Add reboot_pid_ns to handle the reboot syscall
  2011-12-04 20:24 ` [PATCH 1/1][V3] Add reboot_pid_ns to handle the reboot syscall Daniel Lezcano
  2011-12-05 18:35   ` Serge Hallyn
@ 2011-12-05 20:42   ` Oleg Nesterov
  2011-12-05 21:16     ` Daniel Lezcano
  2011-12-05 21:17     ` Daniel Lezcano
  2011-12-07  1:16   ` Andrew Morton
  2 siblings, 2 replies; 15+ messages in thread
From: Oleg Nesterov @ 2011-12-05 20:42 UTC (permalink / raw)
  To: Daniel Lezcano; +Cc: akpm, serge.hallyn, containers, gkurz, linux-kernel

On 12/04, Daniel Lezcano wrote:
>
> This patch propose to store the reboot value in the 16 upper bits of the
> exit code from the processes belonging to a pid namespace which has
> rebooted. When the reboot syscall is called and we are not in the initial
> pid namespace, we kill the pid namespace.
>
> By this way the parent process of the child pid namespace to know if
> it rebooted or not and take the right decision.
>
> Signed-off-by: Daniel Lezcano <daniel.lezcano@free.fr>
> Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
> ---
>  include/linux/pid_namespace.h |    8 +++++++-
>  kernel/pid_namespace.c        |   33 +++++++++++++++++++++++++++++++++
>  kernel/sys.c                  |    3 +++
>  3 files changed, 43 insertions(+), 1 deletions(-)

Signed-off-by: Oleg Nesterov <oleg@redhat.com>


> diff --git a/include/linux/pid_namespace.h b/include/linux/pid_namespace.h
> index e7cf666..3279596 100644
> --- a/include/linux/pid_namespace.h
> +++ b/include/linux/pid_namespace.h
> @@ -32,6 +32,7 @@ struct pid_namespace {
>  #endif
>  	gid_t pid_gid;
>  	int hide_pid;
> +	int reboot;
>  };
>  
>  extern struct pid_namespace init_pid_ns;
> @@ -47,6 +48,7 @@ static inline struct pid_namespace *get_pid_ns(struct pid_namespace *ns)
>  extern struct pid_namespace *copy_pid_ns(unsigned long flags, struct pid_namespace *ns);
>  extern void free_pid_ns(struct kref *kref);
>  extern void zap_pid_ns_processes(struct pid_namespace *pid_ns);
> +extern int reboot_pid_ns(struct pid_namespace *pid_ns, int cmd);
>  
>  static inline void put_pid_ns(struct pid_namespace *ns)
>  {
> @@ -74,11 +76,15 @@ static inline void put_pid_ns(struct pid_namespace *ns)
>  {
>  }
>  
> -
>  static inline void zap_pid_ns_processes(struct pid_namespace *ns)
>  {
>  	BUG();
>  }
> +
> +static inline int reboot_pid_ns(struct pid_namespace *pid_ns, int cmd)
> +{
> +	BUG();
> +}
>  #endif /* CONFIG_PID_NS */
>  
>  extern struct pid_namespace *task_active_pid_ns(struct task_struct *tsk);
> diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c
> index a896839..c7a85ea 100644
> --- a/kernel/pid_namespace.c
> +++ b/kernel/pid_namespace.c
> @@ -15,6 +15,7 @@
>  #include <linux/acct.h>
>  #include <linux/slab.h>
>  #include <linux/proc_fs.h>
> +#include <linux/reboot.h>
>  
>  #define BITS_PER_PAGE		(PAGE_SIZE*8)
>  
> @@ -187,6 +188,9 @@ void zap_pid_ns_processes(struct pid_namespace *pid_ns)
>  		rc = sys_wait4(-1, NULL, __WALL, NULL);
>  	} while (rc != -ECHILD);
>  
> +	if (pid_ns->reboot)
> +		current->signal->group_exit_code = pid_ns->reboot;
> +
>  	acct_exit_ns(pid_ns);
>  	return;
>  }
> @@ -221,6 +225,35 @@ static struct ctl_table pid_ns_ctl_table[] = {
>  
>  static struct ctl_path kern_path[] = { { .procname = "kernel", }, { } };
>  
> +int reboot_pid_ns(struct pid_namespace *pid_ns, int cmd)
> +{
> +	switch(cmd) {
> +	case LINUX_REBOOT_CMD_RESTART2:
> +	case LINUX_REBOOT_CMD_RESTART:
> +		pid_ns->reboot = SYSTEM_RESTART << 16;
> +		break;
> +
> +	case LINUX_REBOOT_CMD_HALT:
> +		pid_ns->reboot = SYSTEM_HALT << 16;
> +		break;
> +
> +	case LINUX_REBOOT_CMD_POWER_OFF:
> +		pid_ns->reboot = SYSTEM_POWER_OFF << 16;
> +		break;
> +	default:
> +		return -EINVAL;
> +	}
> +
> +	read_lock(&tasklist_lock);
> +	force_sig(SIGKILL, pid_ns->child_reaper);
> +	read_unlock(&tasklist_lock);
> +
> +	do_exit(0);
> +
> +	/* Not reached */
> +	return 0;
> +}
> +
>  static __init int pid_namespaces_init(void)
>  {
>  	pid_ns_cachep = KMEM_CACHE(pid_namespace, SLAB_PANIC);
> diff --git a/kernel/sys.c b/kernel/sys.c
> index ddf8155..31acf63 100644
> --- a/kernel/sys.c
> +++ b/kernel/sys.c
> @@ -444,6 +444,9 @@ SYSCALL_DEFINE4(reboot, int, magic1, int, magic2, unsigned int, cmd,
>  	                magic2 != LINUX_REBOOT_MAGIC2C))
>  		return -EINVAL;
>  
> +	if (task_active_pid_ns(current) != &init_pid_ns)
> +		return reboot_pid_ns(task_active_pid_ns(current), cmd);
> +
>  	/* Instead of trying to make the power_off code look like
>  	 * halt when pm_power_off is not set do it the easy way.
>  	 */
> -- 
> 1.7.5.4
> 


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 0/1][V3] Handle reboot in a child pid namespace
  2011-12-04 23:08   ` Daniel Lezcano
@ 2011-12-05 20:49     ` Daniel Lezcano
  2011-12-05 20:51       ` Oleg Nesterov
  2011-12-05 20:50     ` Oleg Nesterov
  1 sibling, 1 reply; 15+ messages in thread
From: Daniel Lezcano @ 2011-12-05 20:49 UTC (permalink / raw)
  To: oleg
  Cc: Henrique de Moraes Holschuh, containers, linux-kernel, akpm,
	Serge E. Hallyn

On 12/05/2011 12:08 AM, Daniel Lezcano wrote:
> On 12/04/2011 10:27 PM, Henrique de Moraes Holschuh wrote:
>> On Sun, 04 Dec 2011, Daniel Lezcano wrote:
>>>  * V3
>>>    - removed lock and serialization of pid_ns_reboot
>>>  * V2
>>>    - added a lock for the pid namespace to prevent racy call
>>>      to the 'reboot' syscall
>>>    - Moved 'reboot' command assigned in zap_pid_ns_processes
>>>      instead of wait_task_zombie
>>>    - added tasklist lock around force_sig
>>>    - added do_exit in pid_ns_reboot
>>>    - used task_active_pid_ns instead of declaring a new variable in sys_reboot
>>>    - moved code up before POWER_OFF changed to HALT in sys_reboot

Oleg ? Are you fine with this patch ?

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 0/1][V3] Handle reboot in a child pid namespace
  2011-12-04 23:08   ` Daniel Lezcano
  2011-12-05 20:49     ` Daniel Lezcano
@ 2011-12-05 20:50     ` Oleg Nesterov
  2011-12-05 22:38       ` Miquel van Smoorenburg
  1 sibling, 1 reply; 15+ messages in thread
From: Oleg Nesterov @ 2011-12-05 20:50 UTC (permalink / raw)
  To: Daniel Lezcano
  Cc: Henrique de Moraes Holschuh, akpm, serge.hallyn, containers,
	gkurz, linux-kernel

On 12/05, Daniel Lezcano wrote:
>
> On 12/04/2011 10:27 PM, Henrique de Moraes Holschuh wrote:
> > On Sun, 04 Dec 2011, Daniel Lezcano wrote:
> > Daniel, can you address Miquel's concern?  Is it a valid concern, or
> > not?  I assume CAP_REBOOT functionality is still in place inside the
> > container, so it really does look like userspace would need to know
> > whether it should drop CAP_REBOOT or not, in order to automatically use
> > the new feature.
>
> Hmm, I missed its email.

Me too... so I am not sure I really understand the problem.

> I would like to address this in a separate patch in order to discuss the
> best way to do that.

Agreed.

> Adding a fake 'reboot' parameter returning EINVAL
> or 0 seems a good solution to detect at runtime if the shutdown is
> correctly supported inside a container.

Or, perhaps, we can implement sys_reboot(REBOOT_SHOULD_NOT_WORK),
sub-init can call it to disable the shutdown ?

This needs the trivial modifications in zap_pid_ns_processes()
and reboot_pid_ns().

Oleg.


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 0/1][V3] Handle reboot in a child pid namespace
  2011-12-05 20:49     ` Daniel Lezcano
@ 2011-12-05 20:51       ` Oleg Nesterov
  0 siblings, 0 replies; 15+ messages in thread
From: Oleg Nesterov @ 2011-12-05 20:51 UTC (permalink / raw)
  To: Daniel Lezcano
  Cc: Henrique de Moraes Holschuh, containers, linux-kernel, akpm,
	Serge E. Hallyn

On 12/05, Daniel Lezcano wrote:
>
> On 12/05/2011 12:08 AM, Daniel Lezcano wrote:
> > On 12/04/2011 10:27 PM, Henrique de Moraes Holschuh wrote:
> >> On Sun, 04 Dec 2011, Daniel Lezcano wrote:
> >>>  * V3
> >>>    - removed lock and serialization of pid_ns_reboot
> >>>  * V2
> >>>    - added a lock for the pid namespace to prevent racy call
> >>>      to the 'reboot' syscall
> >>>    - Moved 'reboot' command assigned in zap_pid_ns_processes
> >>>      instead of wait_task_zombie
> >>>    - added tasklist lock around force_sig
> >>>    - added do_exit in pid_ns_reboot
> >>>    - used task_active_pid_ns instead of declaring a new variable in sys_reboot
> >>>    - moved code up before POWER_OFF changed to HALT in sys_reboot
>
> Oleg ? Are you fine with this patch ?

I already sent my reviwed-by ;)

Oleg.


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 1/1][V3] Add reboot_pid_ns to handle the reboot syscall
  2011-12-05 20:42   ` Oleg Nesterov
@ 2011-12-05 21:16     ` Daniel Lezcano
  2011-12-05 21:17     ` Daniel Lezcano
  1 sibling, 0 replies; 15+ messages in thread
From: Daniel Lezcano @ 2011-12-05 21:16 UTC (permalink / raw)
  To: Oleg Nesterov; +Cc: akpm, serge.hallyn, containers, gkurz, linux-kernel

On 12/05/2011 09:42 PM, Oleg Nesterov wrote:
> On 12/04, Daniel Lezcano wrote:
>> This patch propose to store the reboot value in the 16 upper bits of the
>> exit code from the processes belonging to a pid namespace which has
>> rebooted. When the reboot syscall is called and we are not in the initial
>> pid namespace, we kill the pid namespace.
>>
>> By this way the parent process of the child pid namespace to know if
>> it rebooted or not and take the right decision.
>>
>> Signed-off-by: Daniel Lezcano <daniel.lezcano@free.fr>
>> Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
>> ---
>>  include/linux/pid_namespace.h |    8 +++++++-
>>  kernel/pid_namespace.c        |   33 +++++++++++++++++++++++++++++++++
>>  kernel/sys.c                  |    3 +++
>>  3 files changed, 43 insertions(+), 1 deletions(-)
> Signed-off-by: Oleg Nesterov <oleg@redhat.com>

Looks like our emails crossed by the wire :)
Thanks again Oleg for your comments and your help.

  -- Daniel

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 1/1][V3] Add reboot_pid_ns to handle the reboot syscall
  2011-12-05 20:42   ` Oleg Nesterov
  2011-12-05 21:16     ` Daniel Lezcano
@ 2011-12-05 21:17     ` Daniel Lezcano
  1 sibling, 0 replies; 15+ messages in thread
From: Daniel Lezcano @ 2011-12-05 21:17 UTC (permalink / raw)
  To: akpm; +Cc: Oleg Nesterov, serge.hallyn, containers, gkurz, linux-kernel

On 12/05/2011 09:42 PM, Oleg Nesterov wrote:
> On 12/04, Daniel Lezcano wrote:
>> This patch propose to store the reboot value in the 16 upper bits of the
>> exit code from the processes belonging to a pid namespace which has
>> rebooted. When the reboot syscall is called and we are not in the initial
>> pid namespace, we kill the pid namespace.
>>
>> By this way the parent process of the child pid namespace to know if
>> it rebooted or not and take the right decision.
>>
>> Signed-off-by: Daniel Lezcano <daniel.lezcano@free.fr>
>> Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
>> ---
>>  include/linux/pid_namespace.h |    8 +++++++-
>>  kernel/pid_namespace.c        |   33 +++++++++++++++++++++++++++++++++
>>  kernel/sys.c                  |    3 +++
>>  3 files changed, 43 insertions(+), 1 deletions(-)
> Signed-off-by: Oleg Nesterov <oleg@redhat.com>

Andrew, do you think it's possible to take this patch upstream ?

Thanks
  -- Daniel

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 0/1][V3] Handle reboot in a child pid namespace
  2011-12-05 20:50     ` Oleg Nesterov
@ 2011-12-05 22:38       ` Miquel van Smoorenburg
  0 siblings, 0 replies; 15+ messages in thread
From: Miquel van Smoorenburg @ 2011-12-05 22:38 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Daniel Lezcano, Henrique de Moraes Holschuh, Serge Hallyn,
	Miquel van Smoorenburg, linux-kernel

In article <xs4all.20111205205035.GB7422@redhat.com> you write:
>On 12/05, Daniel Lezcano wrote:
>>
>> On 12/04/2011 10:27 PM, Henrique de Moraes Holschuh wrote:
>> > On Sun, 04 Dec 2011, Daniel Lezcano wrote:
>> > Daniel, can you address Miquel's concern?  Is it a valid concern, or
>> > not?  I assume CAP_REBOOT functionality is still in place inside the
>> > container, so it really does look like userspace would need to know
>> > whether it should drop CAP_REBOOT or not, in order to automatically use
>> > the new feature.
>>
>> Hmm, I missed its email.
>
>Me too... so I am not sure I really understand the problem.

In order to use this new functionality, a container has to have
the CAP_REBOOT capability.

So if that container setup runs on a modern kernel with this patch
applied all is well and sys_reboot() will just reboot the container.
But on an older kernel, that sys_reboot() call will reboot the host.
You really really want to prevent that.

So there should be a way for the system setting up the container
(e.g. lxc-start) to know if this new reboot-the-container-instead-
of-the-host is implemented. If not, it should drop CAP_REBOOT.

That's why I proposed adding a LINUX_REBOOT_CMD_ISCONTAINER
(or whatever) pseudo command for sys_reboot that would only
return 0 for the reboot-the-name-space version of sys_reboot().

lxc-start or equivalent would then do:

	if (sys_reboot(LINUX_REBOOT_CMD_ISCONTAINER) != 0)
		cap_drop(CAP_REBOOT);

Mike.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 1/1][V3] Add reboot_pid_ns to handle the reboot syscall
  2011-12-04 20:24 ` [PATCH 1/1][V3] Add reboot_pid_ns to handle the reboot syscall Daniel Lezcano
  2011-12-05 18:35   ` Serge Hallyn
  2011-12-05 20:42   ` Oleg Nesterov
@ 2011-12-07  1:16   ` Andrew Morton
  2011-12-07 15:12     ` Oleg Nesterov
  2011-12-07 21:36     ` Daniel Lezcano
  2 siblings, 2 replies; 15+ messages in thread
From: Andrew Morton @ 2011-12-07  1:16 UTC (permalink / raw)
  To: Daniel Lezcano
  Cc: serge.hallyn, oleg, containers, gkurz, linux-kernel,
	Michael Kerrisk, Ulrich Drepper

On Sun,  4 Dec 2011 21:24:50 +0100
Daniel Lezcano <daniel.lezcano@free.fr> wrote:

> In the case of a child pid namespace, rebooting the system does not
> really makes sense. When the pid namespace is used in conjunction
> with the other namespaces in order to create a linux container, the
> reboot syscall leads to some problems.
> 
> A container can reboot the host. That can be fixed by dropping
> the sys_reboot capability but we are unable to correctly to poweroff/
> halt/reboot a container and the container stays stuck at the shutdown
> time with the container's init process waiting indefinitively.
> 
> After several attempts, no solution from userspace was found to reliabily
> handle the shutdown from a container.
> 
> This patch propose to store the reboot value in the 16 upper bits of the
> exit code from the processes belonging to a pid namespace which has
> rebooted. When the reboot syscall is called and we are not in the initial
> pid namespace, we kill the pid namespace.
> 
> By this way the parent process of the child pid namespace to know if
> it rebooted or not and take the right decision.

hm, modifying the exit code in this manner is a strange interface.  I
didn't see that coming.  Perhaps some additional justification for this
idea should be added to the changelog, along with discussion of
alternative schemes.  I don't immediately see any problems with it,
but, odd...  I wonder what potential it has to upset existing
userspace.

Also, do we need to reserve all upper 16 bits of the exit code?  It
would be better to formally leave some free for future use.

Also, this alters the interface of wait4() and friends, yes?  If so,
that should be documented in the manpages.  Please Cc Michael on such
changes.

Also, this affects the data delivered by taskstats, I believe.  Please
check this, test it, document it in the changelog and update
getdelays.c appropriately.

Also, glibc might be affected.  For symmetry we might want to add a
WIFREBOOT() or something.  And we now expect waitid() to fill in extra
bits in siginfo_t.si_status, which assumes that glibc (and other
libc's!) aren't using a u8 in there somewhere.  etcetera.  This all
should be tested, and reviewed by Uli (please).

>
> ...
>
> --- a/kernel/sys.c
> +++ b/kernel/sys.c
> @@ -444,6 +444,9 @@ SYSCALL_DEFINE4(reboot, int, magic1, int, magic2, unsigned int, cmd,
>  	                magic2 != LINUX_REBOOT_MAGIC2C))
>  		return -EINVAL;
>  
> +	if (task_active_pid_ns(current) != &init_pid_ns)
> +		return reboot_pid_ns(task_active_pid_ns(current), cmd);

This adds a bunch of useless code if CONFIG_PID_NS=n.  It would be
better to do

#ifdef CONFIG_PID_NS
extern void pidns_handle_reboot(int cmd);
#else
static inline void pidns_handle_reboot(int cmd)
{
}
#endif



^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 1/1][V3] Add reboot_pid_ns to handle the reboot syscall
  2011-12-07  1:16   ` Andrew Morton
@ 2011-12-07 15:12     ` Oleg Nesterov
  2011-12-07 21:36     ` Daniel Lezcano
  1 sibling, 0 replies; 15+ messages in thread
From: Oleg Nesterov @ 2011-12-07 15:12 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Daniel Lezcano, serge.hallyn, containers, gkurz, linux-kernel,
	Michael Kerrisk, Ulrich Drepper

On 12/06, Andrew Morton wrote:
> On Sun,  4 Dec 2011 21:24:50 +0100
> Daniel Lezcano <daniel.lezcano@free.fr> wrote:
>
> > This patch propose to store the reboot value in the 16 upper bits of the
> > exit code from the processes belonging to a pid namespace which has
> > rebooted. When the reboot syscall is called and we are not in the initial
> > pid namespace, we kill the pid namespace.
> >
> > By this way the parent process of the child pid namespace to know if
> > it rebooted or not and take the right decision.
>
> hm, modifying the exit code in this manner is a strange interface.  I
> didn't see that coming.  Perhaps some additional justification for this
> idea should be added to the changelog, along with discussion of
> alternative schemes.  I don't immediately see any problems with it,
> but, odd...  I wonder what potential it has to upset existing
> userspace.

Alternatively, we could do something like

	switch (reboot) {
	case LINUX_REBOOT_CMD_RESTART:
		exit_code = SIGHUP;
		break;
	case LINUX_REBOOT_CMD_HALT:
		exit_code = SIGINT;
		break;
	...
	}

this way the parent can check WIFSIGNALED/WTERMSIG instead of upper bits.
This was the initial suggestion, and personally I like this more.


But I do not think this can upset existing userspace. __WEXITSTATUS()
reports the lower bits only, it can't see the extra info we add.

> Also, this affects the data delivered by taskstats, I believe.  Please
> check this, test it, document it in the changelog and update
> getdelays.c appropriately.

No, taskstats report ->exit_code. This doesn't look right btw. But
in any case I do not think this can break something.

> Also, glibc might be affected.  For symmetry we might want to add a
> WIFREBOOT() or something.

We already use these upper bits to report the ptrace events, in the
same manner. I do not think this has something to do with libc.

Although it could probably have another macro to read this info.

> And we now expect waitid() to fill in extra
> bits in siginfo_t.si_status, which assumes that glibc (and other
> libc's!) aren't using a u8 in there somewhere.  etcetera.  This all
> should be tested, and reviewed by Uli (please).

Again, ptrace already puts the extra info this way when the tracee
stops.

Oleg.


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 1/1][V3] Add reboot_pid_ns to handle the reboot syscall
  2011-12-07  1:16   ` Andrew Morton
  2011-12-07 15:12     ` Oleg Nesterov
@ 2011-12-07 21:36     ` Daniel Lezcano
  1 sibling, 0 replies; 15+ messages in thread
From: Daniel Lezcano @ 2011-12-07 21:36 UTC (permalink / raw)
  To: Andrew Morton
  Cc: serge.hallyn, oleg, containers, gkurz, linux-kernel,
	Michael Kerrisk, Ulrich Drepper

On 12/07/2011 02:16 AM, Andrew Morton wrote:
> On Sun,  4 Dec 2011 21:24:50 +0100
> Daniel Lezcano <daniel.lezcano@free.fr> wrote:
>
>> In the case of a child pid namespace, rebooting the system does not
>> really makes sense. When the pid namespace is used in conjunction
>> with the other namespaces in order to create a linux container, the
>> reboot syscall leads to some problems.
>>
>> A container can reboot the host. That can be fixed by dropping
>> the sys_reboot capability but we are unable to correctly to poweroff/
>> halt/reboot a container and the container stays stuck at the shutdown
>> time with the container's init process waiting indefinitively.
>>
>> After several attempts, no solution from userspace was found to reliabily
>> handle the shutdown from a container.
>>
>> This patch propose to store the reboot value in the 16 upper bits of the
>> exit code from the processes belonging to a pid namespace which has
>> rebooted. When the reboot syscall is called and we are not in the initial
>> pid namespace, we kill the pid namespace.
>>
>> By this way the parent process of the child pid namespace to know if
>> it rebooted or not and take the right decision.
> hm, modifying the exit code in this manner is a strange interface.  I
> didn't see that coming.  Perhaps some additional justification for this
> idea should be added to the changelog, along with discussion of
> alternative schemes.  I don't immediately see any problems with it,
> but, odd...  I wonder what potential it has to upset existing
> userspace.

At the first glance, that shouldn't interact with the userspace application.
This happens only when waiting the init process of a child pid namespace
if one process of this namespace call 'reboot'. I don't have the
pretension to know all the applications of the world but this case
should be very rare expect for the containers implementations, lxc and
libvirt's LXC, which need this feature.

Well written applications should use WIFEXITED, WIFSIGNALED, ... These
macros mask the upper bits of the exit code. IMHO, the applications
shouldn't be impacted by this modification.

> Also, do we need to reserve all upper 16 bits of the exit code?  It
> would be better to formally leave some free for future use.

Yes, that's right, 4 bits should be enough.
> Also, this alters the interface of wait4() and friends, yes?  If so,
> that should be documented in the manpages.  Please Cc Michael on such
> changes.

Sure, I will add a more precise description in the changelog for Michael.

> Also, this affects the data delivered by taskstats, I believe.  Please
> check this, test it, document it in the changelog and update
> getdelays.c appropriately.

Thanks for spotting this. I will check.

> Also, glibc might be affected.  For symmetry we might want to add a
> WIFREBOOT() or something.  And we now expect waitid() to fill in extra
> bits in siginfo_t.si_status, which assumes that glibc (and other
> libc's!) aren't using a u8 in there somewhere.  etcetera.  This all
> should be tested, and reviewed by Uli (please).

Ok, I will add some test cases to the the patch 0/1 with the macro
definition and in addition a better description.

Thanks for your comments.

  -- Daniel

>> ...
>>
>> --- a/kernel/sys.c
>> +++ b/kernel/sys.c
>> @@ -444,6 +444,9 @@ SYSCALL_DEFINE4(reboot, int, magic1, int, magic2, unsigned int, cmd,
>>  	                magic2 != LINUX_REBOOT_MAGIC2C))
>>  		return -EINVAL;
>>  
>> +	if (task_active_pid_ns(current) != &init_pid_ns)
>> +		return reboot_pid_ns(task_active_pid_ns(current), cmd);
> This adds a bunch of useless code if CONFIG_PID_NS=n.  It would be
> better to do
>
> #ifdef CONFIG_PID_NS
> extern void pidns_handle_reboot(int cmd);
> #else
> static inline void pidns_handle_reboot(int cmd)
> {
> }
> #endif
>
>
>

    * Unknown - detected
    * English
    * French

    * English
    * French

<javascript:void(0);><#>

^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2011-12-07 21:37 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2011-12-04 20:24 [PATCH 0/1][V3] Handle reboot in a child pid namespace Daniel Lezcano
2011-12-04 20:24 ` [PATCH 1/1][V3] Add reboot_pid_ns to handle the reboot syscall Daniel Lezcano
2011-12-05 18:35   ` Serge Hallyn
2011-12-05 20:42   ` Oleg Nesterov
2011-12-05 21:16     ` Daniel Lezcano
2011-12-05 21:17     ` Daniel Lezcano
2011-12-07  1:16   ` Andrew Morton
2011-12-07 15:12     ` Oleg Nesterov
2011-12-07 21:36     ` Daniel Lezcano
2011-12-04 21:27 ` [PATCH 0/1][V3] Handle reboot in a child pid namespace Henrique de Moraes Holschuh
2011-12-04 23:08   ` Daniel Lezcano
2011-12-05 20:49     ` Daniel Lezcano
2011-12-05 20:51       ` Oleg Nesterov
2011-12-05 20:50     ` Oleg Nesterov
2011-12-05 22:38       ` Miquel van Smoorenburg

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox