* [PATCH 0/1][V3] Handle reboot in a child pid namespace
@ 2011-12-04 20:24 Daniel Lezcano
[not found] ` <1323030290-22216-1-git-send-email-daniel.lezcano-GANU6spQydw@public.gmane.org>
[not found] ` <20111204212756.GB16362@khazad-dum.debian.net>
0 siblings, 2 replies; 14+ messages in thread
From: Daniel Lezcano @ 2011-12-04 20:24 UTC (permalink / raw)
To: akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b
Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
oleg-H+wXaHxf7aLQT0dZR+AlfA, linux-kernel-u79uwXL29TY76Z2rM5mHXA
ChangeLog:
* V3
- removed lock and serialization of pid_ns_reboot
* V2
- added a lock for the pid namespace to prevent racy call
to the 'reboot' syscall
- Moved 'reboot' command assigned in zap_pid_ns_processes
instead of wait_task_zombie
- added tasklist lock around force_sig
- added do_exit in pid_ns_reboot
- used task_active_pid_ns instead of declaring a new variable in sys_reboot
- moved code up before POWER_OFF changed to HALT in sys_reboot
Daniel Lezcano (1):
Add reboot_pid_ns to handle the reboot syscall
include/linux/pid_namespace.h | 8 +++++++-
kernel/pid_namespace.c | 33 +++++++++++++++++++++++++++++++++
kernel/sys.c | 3 +++
3 files changed, 43 insertions(+), 1 deletions(-)
--
1.7.5.4
^ permalink raw reply [flat|nested] 14+ messages in thread[parent not found: <1323030290-22216-1-git-send-email-daniel.lezcano-GANU6spQydw@public.gmane.org>]
* [PATCH 1/1][V3] Add reboot_pid_ns to handle the reboot syscall [not found] ` <1323030290-22216-1-git-send-email-daniel.lezcano-GANU6spQydw@public.gmane.org> @ 2011-12-04 20:24 ` Daniel Lezcano [not found] ` <1323030290-22216-2-git-send-email-daniel.lezcano-GANU6spQydw@public.gmane.org> 2011-12-04 21:27 ` [PATCH 0/1][V3] Handle reboot in a child pid namespace Henrique de Moraes Holschuh 1 sibling, 1 reply; 14+ messages in thread From: Daniel Lezcano @ 2011-12-04 20:24 UTC (permalink / raw) To: akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, oleg-H+wXaHxf7aLQT0dZR+AlfA, linux-kernel-u79uwXL29TY76Z2rM5mHXA In the case of a child pid namespace, rebooting the system does not really makes sense. When the pid namespace is used in conjunction with the other namespaces in order to create a linux container, the reboot syscall leads to some problems. A container can reboot the host. That can be fixed by dropping the sys_reboot capability but we are unable to correctly to poweroff/ halt/reboot a container and the container stays stuck at the shutdown time with the container's init process waiting indefinitively. After several attempts, no solution from userspace was found to reliabily handle the shutdown from a container. This patch propose to store the reboot value in the 16 upper bits of the exit code from the processes belonging to a pid namespace which has rebooted. When the reboot syscall is called and we are not in the initial pid namespace, we kill the pid namespace. By this way the parent process of the child pid namespace to know if it rebooted or not and take the right decision. Signed-off-by: Daniel Lezcano <daniel.lezcano-GANU6spQydw@public.gmane.org> Acked-by: Serge Hallyn <serge.hallyn-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org> --- include/linux/pid_namespace.h | 8 +++++++- kernel/pid_namespace.c | 33 +++++++++++++++++++++++++++++++++ kernel/sys.c | 3 +++ 3 files changed, 43 insertions(+), 1 deletions(-) diff --git a/include/linux/pid_namespace.h b/include/linux/pid_namespace.h index e7cf666..3279596 100644 --- a/include/linux/pid_namespace.h +++ b/include/linux/pid_namespace.h @@ -32,6 +32,7 @@ struct pid_namespace { #endif gid_t pid_gid; int hide_pid; + int reboot; }; extern struct pid_namespace init_pid_ns; @@ -47,6 +48,7 @@ static inline struct pid_namespace *get_pid_ns(struct pid_namespace *ns) extern struct pid_namespace *copy_pid_ns(unsigned long flags, struct pid_namespace *ns); extern void free_pid_ns(struct kref *kref); extern void zap_pid_ns_processes(struct pid_namespace *pid_ns); +extern int reboot_pid_ns(struct pid_namespace *pid_ns, int cmd); static inline void put_pid_ns(struct pid_namespace *ns) { @@ -74,11 +76,15 @@ static inline void put_pid_ns(struct pid_namespace *ns) { } - static inline void zap_pid_ns_processes(struct pid_namespace *ns) { BUG(); } + +static inline int reboot_pid_ns(struct pid_namespace *pid_ns, int cmd) +{ + BUG(); +} #endif /* CONFIG_PID_NS */ extern struct pid_namespace *task_active_pid_ns(struct task_struct *tsk); diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c index a896839..c7a85ea 100644 --- a/kernel/pid_namespace.c +++ b/kernel/pid_namespace.c @@ -15,6 +15,7 @@ #include <linux/acct.h> #include <linux/slab.h> #include <linux/proc_fs.h> +#include <linux/reboot.h> #define BITS_PER_PAGE (PAGE_SIZE*8) @@ -187,6 +188,9 @@ void zap_pid_ns_processes(struct pid_namespace *pid_ns) rc = sys_wait4(-1, NULL, __WALL, NULL); } while (rc != -ECHILD); + if (pid_ns->reboot) + current->signal->group_exit_code = pid_ns->reboot; + acct_exit_ns(pid_ns); return; } @@ -221,6 +225,35 @@ static struct ctl_table pid_ns_ctl_table[] = { static struct ctl_path kern_path[] = { { .procname = "kernel", }, { } }; +int reboot_pid_ns(struct pid_namespace *pid_ns, int cmd) +{ + switch(cmd) { + case LINUX_REBOOT_CMD_RESTART2: + case LINUX_REBOOT_CMD_RESTART: + pid_ns->reboot = SYSTEM_RESTART << 16; + break; + + case LINUX_REBOOT_CMD_HALT: + pid_ns->reboot = SYSTEM_HALT << 16; + break; + + case LINUX_REBOOT_CMD_POWER_OFF: + pid_ns->reboot = SYSTEM_POWER_OFF << 16; + break; + default: + return -EINVAL; + } + + read_lock(&tasklist_lock); + force_sig(SIGKILL, pid_ns->child_reaper); + read_unlock(&tasklist_lock); + + do_exit(0); + + /* Not reached */ + return 0; +} + static __init int pid_namespaces_init(void) { pid_ns_cachep = KMEM_CACHE(pid_namespace, SLAB_PANIC); diff --git a/kernel/sys.c b/kernel/sys.c index ddf8155..31acf63 100644 --- a/kernel/sys.c +++ b/kernel/sys.c @@ -444,6 +444,9 @@ SYSCALL_DEFINE4(reboot, int, magic1, int, magic2, unsigned int, cmd, magic2 != LINUX_REBOOT_MAGIC2C)) return -EINVAL; + if (task_active_pid_ns(current) != &init_pid_ns) + return reboot_pid_ns(task_active_pid_ns(current), cmd); + /* Instead of trying to make the power_off code look like * halt when pm_power_off is not set do it the easy way. */ -- 1.7.5.4 ^ permalink raw reply related [flat|nested] 14+ messages in thread
[parent not found: <1323030290-22216-2-git-send-email-daniel.lezcano-GANU6spQydw@public.gmane.org>]
* Re: [PATCH 1/1][V3] Add reboot_pid_ns to handle the reboot syscall [not found] ` <1323030290-22216-2-git-send-email-daniel.lezcano-GANU6spQydw@public.gmane.org> @ 2011-12-05 18:35 ` Serge Hallyn 2011-12-05 20:42 ` Oleg Nesterov 2011-12-07 1:16 ` Andrew Morton 2 siblings, 0 replies; 14+ messages in thread From: Serge Hallyn @ 2011-12-05 18:35 UTC (permalink / raw) To: Daniel Lezcano Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, oleg-H+wXaHxf7aLQT0dZR+AlfA, linux-kernel-u79uwXL29TY76Z2rM5mHXA On 12/04/2011 02:24 PM, Daniel Lezcano wrote: > In the case of a child pid namespace, rebooting the system does not > really makes sense. When the pid namespace is used in conjunction > with the other namespaces in order to create a linux container, the > reboot syscall leads to some problems. > > A container can reboot the host. That can be fixed by dropping > the sys_reboot capability but we are unable to correctly to poweroff/ > halt/reboot a container and the container stays stuck at the shutdown > time with the container's init process waiting indefinitively. > > After several attempts, no solution from userspace was found to reliabily > handle the shutdown from a container. > > This patch propose to store the reboot value in the 16 upper bits of the > exit code from the processes belonging to a pid namespace which has > rebooted. When the reboot syscall is called and we are not in the initial > pid namespace, we kill the pid namespace. > > By this way the parent process of the child pid namespace to know if > it rebooted or not and take the right decision. > > Signed-off-by: Daniel Lezcano<daniel.lezcano-GANU6spQydw@public.gmane.org> Tested-by: Serge Hallyn <serge.hallyn-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org> Tested it with reboot(2), worked as expected. thanks, -serge > Acked-by: Serge Hallyn<serge.hallyn-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org> > --- > include/linux/pid_namespace.h | 8 +++++++- > kernel/pid_namespace.c | 33 +++++++++++++++++++++++++++++++++ > kernel/sys.c | 3 +++ > 3 files changed, 43 insertions(+), 1 deletions(-) > > diff --git a/include/linux/pid_namespace.h b/include/linux/pid_namespace.h > index e7cf666..3279596 100644 > --- a/include/linux/pid_namespace.h > +++ b/include/linux/pid_namespace.h > @@ -32,6 +32,7 @@ struct pid_namespace { > #endif > gid_t pid_gid; > int hide_pid; > + int reboot; > }; > > extern struct pid_namespace init_pid_ns; > @@ -47,6 +48,7 @@ static inline struct pid_namespace *get_pid_ns(struct pid_namespace *ns) > extern struct pid_namespace *copy_pid_ns(unsigned long flags, struct pid_namespace *ns); > extern void free_pid_ns(struct kref *kref); > extern void zap_pid_ns_processes(struct pid_namespace *pid_ns); > +extern int reboot_pid_ns(struct pid_namespace *pid_ns, int cmd); > > static inline void put_pid_ns(struct pid_namespace *ns) > { > @@ -74,11 +76,15 @@ static inline void put_pid_ns(struct pid_namespace *ns) > { > } > > - > static inline void zap_pid_ns_processes(struct pid_namespace *ns) > { > BUG(); > } > + > +static inline int reboot_pid_ns(struct pid_namespace *pid_ns, int cmd) > +{ > + BUG(); > +} > #endif /* CONFIG_PID_NS */ > > extern struct pid_namespace *task_active_pid_ns(struct task_struct *tsk); > diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c > index a896839..c7a85ea 100644 > --- a/kernel/pid_namespace.c > +++ b/kernel/pid_namespace.c > @@ -15,6 +15,7 @@ > #include<linux/acct.h> > #include<linux/slab.h> > #include<linux/proc_fs.h> > +#include<linux/reboot.h> > > #define BITS_PER_PAGE (PAGE_SIZE*8) > > @@ -187,6 +188,9 @@ void zap_pid_ns_processes(struct pid_namespace *pid_ns) > rc = sys_wait4(-1, NULL, __WALL, NULL); > } while (rc != -ECHILD); > > + if (pid_ns->reboot) > + current->signal->group_exit_code = pid_ns->reboot; > + > acct_exit_ns(pid_ns); > return; > } > @@ -221,6 +225,35 @@ static struct ctl_table pid_ns_ctl_table[] = { > > static struct ctl_path kern_path[] = { { .procname = "kernel", }, { } }; > > +int reboot_pid_ns(struct pid_namespace *pid_ns, int cmd) > +{ > + switch(cmd) { > + case LINUX_REBOOT_CMD_RESTART2: > + case LINUX_REBOOT_CMD_RESTART: > + pid_ns->reboot = SYSTEM_RESTART<< 16; > + break; > + > + case LINUX_REBOOT_CMD_HALT: > + pid_ns->reboot = SYSTEM_HALT<< 16; > + break; > + > + case LINUX_REBOOT_CMD_POWER_OFF: > + pid_ns->reboot = SYSTEM_POWER_OFF<< 16; > + break; > + default: > + return -EINVAL; > + } > + > + read_lock(&tasklist_lock); > + force_sig(SIGKILL, pid_ns->child_reaper); > + read_unlock(&tasklist_lock); > + > + do_exit(0); > + > + /* Not reached */ > + return 0; > +} > + > static __init int pid_namespaces_init(void) > { > pid_ns_cachep = KMEM_CACHE(pid_namespace, SLAB_PANIC); > diff --git a/kernel/sys.c b/kernel/sys.c > index ddf8155..31acf63 100644 > --- a/kernel/sys.c > +++ b/kernel/sys.c > @@ -444,6 +444,9 @@ SYSCALL_DEFINE4(reboot, int, magic1, int, magic2, unsigned int, cmd, > magic2 != LINUX_REBOOT_MAGIC2C)) > return -EINVAL; > > + if (task_active_pid_ns(current) !=&init_pid_ns) > + return reboot_pid_ns(task_active_pid_ns(current), cmd); > + > /* Instead of trying to make the power_off code look like > * halt when pm_power_off is not set do it the easy way. > */ ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH 1/1][V3] Add reboot_pid_ns to handle the reboot syscall [not found] ` <1323030290-22216-2-git-send-email-daniel.lezcano-GANU6spQydw@public.gmane.org> 2011-12-05 18:35 ` Serge Hallyn @ 2011-12-05 20:42 ` Oleg Nesterov [not found] ` <20111205204238.GA7422-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> 2011-12-07 1:16 ` Andrew Morton 2 siblings, 1 reply; 14+ messages in thread From: Oleg Nesterov @ 2011-12-05 20:42 UTC (permalink / raw) To: Daniel Lezcano Cc: akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA On 12/04, Daniel Lezcano wrote: > > This patch propose to store the reboot value in the 16 upper bits of the > exit code from the processes belonging to a pid namespace which has > rebooted. When the reboot syscall is called and we are not in the initial > pid namespace, we kill the pid namespace. > > By this way the parent process of the child pid namespace to know if > it rebooted or not and take the right decision. > > Signed-off-by: Daniel Lezcano <daniel.lezcano-GANU6spQydw@public.gmane.org> > Acked-by: Serge Hallyn <serge.hallyn-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org> > --- > include/linux/pid_namespace.h | 8 +++++++- > kernel/pid_namespace.c | 33 +++++++++++++++++++++++++++++++++ > kernel/sys.c | 3 +++ > 3 files changed, 43 insertions(+), 1 deletions(-) Signed-off-by: Oleg Nesterov <oleg-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> > diff --git a/include/linux/pid_namespace.h b/include/linux/pid_namespace.h > index e7cf666..3279596 100644 > --- a/include/linux/pid_namespace.h > +++ b/include/linux/pid_namespace.h > @@ -32,6 +32,7 @@ struct pid_namespace { > #endif > gid_t pid_gid; > int hide_pid; > + int reboot; > }; > > extern struct pid_namespace init_pid_ns; > @@ -47,6 +48,7 @@ static inline struct pid_namespace *get_pid_ns(struct pid_namespace *ns) > extern struct pid_namespace *copy_pid_ns(unsigned long flags, struct pid_namespace *ns); > extern void free_pid_ns(struct kref *kref); > extern void zap_pid_ns_processes(struct pid_namespace *pid_ns); > +extern int reboot_pid_ns(struct pid_namespace *pid_ns, int cmd); > > static inline void put_pid_ns(struct pid_namespace *ns) > { > @@ -74,11 +76,15 @@ static inline void put_pid_ns(struct pid_namespace *ns) > { > } > > - > static inline void zap_pid_ns_processes(struct pid_namespace *ns) > { > BUG(); > } > + > +static inline int reboot_pid_ns(struct pid_namespace *pid_ns, int cmd) > +{ > + BUG(); > +} > #endif /* CONFIG_PID_NS */ > > extern struct pid_namespace *task_active_pid_ns(struct task_struct *tsk); > diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c > index a896839..c7a85ea 100644 > --- a/kernel/pid_namespace.c > +++ b/kernel/pid_namespace.c > @@ -15,6 +15,7 @@ > #include <linux/acct.h> > #include <linux/slab.h> > #include <linux/proc_fs.h> > +#include <linux/reboot.h> > > #define BITS_PER_PAGE (PAGE_SIZE*8) > > @@ -187,6 +188,9 @@ void zap_pid_ns_processes(struct pid_namespace *pid_ns) > rc = sys_wait4(-1, NULL, __WALL, NULL); > } while (rc != -ECHILD); > > + if (pid_ns->reboot) > + current->signal->group_exit_code = pid_ns->reboot; > + > acct_exit_ns(pid_ns); > return; > } > @@ -221,6 +225,35 @@ static struct ctl_table pid_ns_ctl_table[] = { > > static struct ctl_path kern_path[] = { { .procname = "kernel", }, { } }; > > +int reboot_pid_ns(struct pid_namespace *pid_ns, int cmd) > +{ > + switch(cmd) { > + case LINUX_REBOOT_CMD_RESTART2: > + case LINUX_REBOOT_CMD_RESTART: > + pid_ns->reboot = SYSTEM_RESTART << 16; > + break; > + > + case LINUX_REBOOT_CMD_HALT: > + pid_ns->reboot = SYSTEM_HALT << 16; > + break; > + > + case LINUX_REBOOT_CMD_POWER_OFF: > + pid_ns->reboot = SYSTEM_POWER_OFF << 16; > + break; > + default: > + return -EINVAL; > + } > + > + read_lock(&tasklist_lock); > + force_sig(SIGKILL, pid_ns->child_reaper); > + read_unlock(&tasklist_lock); > + > + do_exit(0); > + > + /* Not reached */ > + return 0; > +} > + > static __init int pid_namespaces_init(void) > { > pid_ns_cachep = KMEM_CACHE(pid_namespace, SLAB_PANIC); > diff --git a/kernel/sys.c b/kernel/sys.c > index ddf8155..31acf63 100644 > --- a/kernel/sys.c > +++ b/kernel/sys.c > @@ -444,6 +444,9 @@ SYSCALL_DEFINE4(reboot, int, magic1, int, magic2, unsigned int, cmd, > magic2 != LINUX_REBOOT_MAGIC2C)) > return -EINVAL; > > + if (task_active_pid_ns(current) != &init_pid_ns) > + return reboot_pid_ns(task_active_pid_ns(current), cmd); > + > /* Instead of trying to make the power_off code look like > * halt when pm_power_off is not set do it the easy way. > */ > -- > 1.7.5.4 > ^ permalink raw reply [flat|nested] 14+ messages in thread
[parent not found: <20111205204238.GA7422-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>]
* Re: [PATCH 1/1][V3] Add reboot_pid_ns to handle the reboot syscall [not found] ` <20111205204238.GA7422-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> @ 2011-12-05 21:16 ` Daniel Lezcano 2011-12-05 21:17 ` Daniel Lezcano 1 sibling, 0 replies; 14+ messages in thread From: Daniel Lezcano @ 2011-12-05 21:16 UTC (permalink / raw) To: Oleg Nesterov Cc: akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA On 12/05/2011 09:42 PM, Oleg Nesterov wrote: > On 12/04, Daniel Lezcano wrote: >> This patch propose to store the reboot value in the 16 upper bits of the >> exit code from the processes belonging to a pid namespace which has >> rebooted. When the reboot syscall is called and we are not in the initial >> pid namespace, we kill the pid namespace. >> >> By this way the parent process of the child pid namespace to know if >> it rebooted or not and take the right decision. >> >> Signed-off-by: Daniel Lezcano <daniel.lezcano-GANU6spQydw@public.gmane.org> >> Acked-by: Serge Hallyn <serge.hallyn-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org> >> --- >> include/linux/pid_namespace.h | 8 +++++++- >> kernel/pid_namespace.c | 33 +++++++++++++++++++++++++++++++++ >> kernel/sys.c | 3 +++ >> 3 files changed, 43 insertions(+), 1 deletions(-) > Signed-off-by: Oleg Nesterov <oleg-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> Looks like our emails crossed by the wire :) Thanks again Oleg for your comments and your help. -- Daniel ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH 1/1][V3] Add reboot_pid_ns to handle the reboot syscall [not found] ` <20111205204238.GA7422-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> 2011-12-05 21:16 ` Daniel Lezcano @ 2011-12-05 21:17 ` Daniel Lezcano 1 sibling, 0 replies; 14+ messages in thread From: Daniel Lezcano @ 2011-12-05 21:17 UTC (permalink / raw) To: akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Oleg Nesterov, linux-kernel-u79uwXL29TY76Z2rM5mHXA On 12/05/2011 09:42 PM, Oleg Nesterov wrote: > On 12/04, Daniel Lezcano wrote: >> This patch propose to store the reboot value in the 16 upper bits of the >> exit code from the processes belonging to a pid namespace which has >> rebooted. When the reboot syscall is called and we are not in the initial >> pid namespace, we kill the pid namespace. >> >> By this way the parent process of the child pid namespace to know if >> it rebooted or not and take the right decision. >> >> Signed-off-by: Daniel Lezcano <daniel.lezcano-GANU6spQydw@public.gmane.org> >> Acked-by: Serge Hallyn <serge.hallyn-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org> >> --- >> include/linux/pid_namespace.h | 8 +++++++- >> kernel/pid_namespace.c | 33 +++++++++++++++++++++++++++++++++ >> kernel/sys.c | 3 +++ >> 3 files changed, 43 insertions(+), 1 deletions(-) > Signed-off-by: Oleg Nesterov <oleg-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> Andrew, do you think it's possible to take this patch upstream ? Thanks -- Daniel ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH 1/1][V3] Add reboot_pid_ns to handle the reboot syscall [not found] ` <1323030290-22216-2-git-send-email-daniel.lezcano-GANU6spQydw@public.gmane.org> 2011-12-05 18:35 ` Serge Hallyn 2011-12-05 20:42 ` Oleg Nesterov @ 2011-12-07 1:16 ` Andrew Morton [not found] ` <20111206171617.e31bc3a6.akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org> 2 siblings, 1 reply; 14+ messages in thread From: Andrew Morton @ 2011-12-07 1:16 UTC (permalink / raw) To: Daniel Lezcano Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, oleg-H+wXaHxf7aLQT0dZR+AlfA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Michael Kerrisk, Ulrich Drepper On Sun, 4 Dec 2011 21:24:50 +0100 Daniel Lezcano <daniel.lezcano-GANU6spQydw@public.gmane.org> wrote: > In the case of a child pid namespace, rebooting the system does not > really makes sense. When the pid namespace is used in conjunction > with the other namespaces in order to create a linux container, the > reboot syscall leads to some problems. > > A container can reboot the host. That can be fixed by dropping > the sys_reboot capability but we are unable to correctly to poweroff/ > halt/reboot a container and the container stays stuck at the shutdown > time with the container's init process waiting indefinitively. > > After several attempts, no solution from userspace was found to reliabily > handle the shutdown from a container. > > This patch propose to store the reboot value in the 16 upper bits of the > exit code from the processes belonging to a pid namespace which has > rebooted. When the reboot syscall is called and we are not in the initial > pid namespace, we kill the pid namespace. > > By this way the parent process of the child pid namespace to know if > it rebooted or not and take the right decision. hm, modifying the exit code in this manner is a strange interface. I didn't see that coming. Perhaps some additional justification for this idea should be added to the changelog, along with discussion of alternative schemes. I don't immediately see any problems with it, but, odd... I wonder what potential it has to upset existing userspace. Also, do we need to reserve all upper 16 bits of the exit code? It would be better to formally leave some free for future use. Also, this alters the interface of wait4() and friends, yes? If so, that should be documented in the manpages. Please Cc Michael on such changes. Also, this affects the data delivered by taskstats, I believe. Please check this, test it, document it in the changelog and update getdelays.c appropriately. Also, glibc might be affected. For symmetry we might want to add a WIFREBOOT() or something. And we now expect waitid() to fill in extra bits in siginfo_t.si_status, which assumes that glibc (and other libc's!) aren't using a u8 in there somewhere. etcetera. This all should be tested, and reviewed by Uli (please). > > ... > > --- a/kernel/sys.c > +++ b/kernel/sys.c > @@ -444,6 +444,9 @@ SYSCALL_DEFINE4(reboot, int, magic1, int, magic2, unsigned int, cmd, > magic2 != LINUX_REBOOT_MAGIC2C)) > return -EINVAL; > > + if (task_active_pid_ns(current) != &init_pid_ns) > + return reboot_pid_ns(task_active_pid_ns(current), cmd); This adds a bunch of useless code if CONFIG_PID_NS=n. It would be better to do #ifdef CONFIG_PID_NS extern void pidns_handle_reboot(int cmd); #else static inline void pidns_handle_reboot(int cmd) { } #endif ^ permalink raw reply [flat|nested] 14+ messages in thread
[parent not found: <20111206171617.e31bc3a6.akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>]
* Re: [PATCH 1/1][V3] Add reboot_pid_ns to handle the reboot syscall [not found] ` <20111206171617.e31bc3a6.akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org> @ 2011-12-07 15:12 ` Oleg Nesterov 2011-12-07 21:36 ` Daniel Lezcano 1 sibling, 0 replies; 14+ messages in thread From: Oleg Nesterov @ 2011-12-07 15:12 UTC (permalink / raw) To: Andrew Morton Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Michael Kerrisk, Ulrich Drepper On 12/06, Andrew Morton wrote: > On Sun, 4 Dec 2011 21:24:50 +0100 > Daniel Lezcano <daniel.lezcano-GANU6spQydw@public.gmane.org> wrote: > > > This patch propose to store the reboot value in the 16 upper bits of the > > exit code from the processes belonging to a pid namespace which has > > rebooted. When the reboot syscall is called and we are not in the initial > > pid namespace, we kill the pid namespace. > > > > By this way the parent process of the child pid namespace to know if > > it rebooted or not and take the right decision. > > hm, modifying the exit code in this manner is a strange interface. I > didn't see that coming. Perhaps some additional justification for this > idea should be added to the changelog, along with discussion of > alternative schemes. I don't immediately see any problems with it, > but, odd... I wonder what potential it has to upset existing > userspace. Alternatively, we could do something like switch (reboot) { case LINUX_REBOOT_CMD_RESTART: exit_code = SIGHUP; break; case LINUX_REBOOT_CMD_HALT: exit_code = SIGINT; break; ... } this way the parent can check WIFSIGNALED/WTERMSIG instead of upper bits. This was the initial suggestion, and personally I like this more. But I do not think this can upset existing userspace. __WEXITSTATUS() reports the lower bits only, it can't see the extra info we add. > Also, this affects the data delivered by taskstats, I believe. Please > check this, test it, document it in the changelog and update > getdelays.c appropriately. No, taskstats report ->exit_code. This doesn't look right btw. But in any case I do not think this can break something. > Also, glibc might be affected. For symmetry we might want to add a > WIFREBOOT() or something. We already use these upper bits to report the ptrace events, in the same manner. I do not think this has something to do with libc. Although it could probably have another macro to read this info. > And we now expect waitid() to fill in extra > bits in siginfo_t.si_status, which assumes that glibc (and other > libc's!) aren't using a u8 in there somewhere. etcetera. This all > should be tested, and reviewed by Uli (please). Again, ptrace already puts the extra info this way when the tracee stops. Oleg. ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH 1/1][V3] Add reboot_pid_ns to handle the reboot syscall [not found] ` <20111206171617.e31bc3a6.akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org> 2011-12-07 15:12 ` Oleg Nesterov @ 2011-12-07 21:36 ` Daniel Lezcano 1 sibling, 0 replies; 14+ messages in thread From: Daniel Lezcano @ 2011-12-07 21:36 UTC (permalink / raw) To: Andrew Morton Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, oleg-H+wXaHxf7aLQT0dZR+AlfA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Michael Kerrisk, Ulrich Drepper On 12/07/2011 02:16 AM, Andrew Morton wrote: > On Sun, 4 Dec 2011 21:24:50 +0100 > Daniel Lezcano <daniel.lezcano-GANU6spQydw@public.gmane.org> wrote: > >> In the case of a child pid namespace, rebooting the system does not >> really makes sense. When the pid namespace is used in conjunction >> with the other namespaces in order to create a linux container, the >> reboot syscall leads to some problems. >> >> A container can reboot the host. That can be fixed by dropping >> the sys_reboot capability but we are unable to correctly to poweroff/ >> halt/reboot a container and the container stays stuck at the shutdown >> time with the container's init process waiting indefinitively. >> >> After several attempts, no solution from userspace was found to reliabily >> handle the shutdown from a container. >> >> This patch propose to store the reboot value in the 16 upper bits of the >> exit code from the processes belonging to a pid namespace which has >> rebooted. When the reboot syscall is called and we are not in the initial >> pid namespace, we kill the pid namespace. >> >> By this way the parent process of the child pid namespace to know if >> it rebooted or not and take the right decision. > hm, modifying the exit code in this manner is a strange interface. I > didn't see that coming. Perhaps some additional justification for this > idea should be added to the changelog, along with discussion of > alternative schemes. I don't immediately see any problems with it, > but, odd... I wonder what potential it has to upset existing > userspace. At the first glance, that shouldn't interact with the userspace application. This happens only when waiting the init process of a child pid namespace if one process of this namespace call 'reboot'. I don't have the pretension to know all the applications of the world but this case should be very rare expect for the containers implementations, lxc and libvirt's LXC, which need this feature. Well written applications should use WIFEXITED, WIFSIGNALED, ... These macros mask the upper bits of the exit code. IMHO, the applications shouldn't be impacted by this modification. > Also, do we need to reserve all upper 16 bits of the exit code? It > would be better to formally leave some free for future use. Yes, that's right, 4 bits should be enough. > Also, this alters the interface of wait4() and friends, yes? If so, > that should be documented in the manpages. Please Cc Michael on such > changes. Sure, I will add a more precise description in the changelog for Michael. > Also, this affects the data delivered by taskstats, I believe. Please > check this, test it, document it in the changelog and update > getdelays.c appropriately. Thanks for spotting this. I will check. > Also, glibc might be affected. For symmetry we might want to add a > WIFREBOOT() or something. And we now expect waitid() to fill in extra > bits in siginfo_t.si_status, which assumes that glibc (and other > libc's!) aren't using a u8 in there somewhere. etcetera. This all > should be tested, and reviewed by Uli (please). Ok, I will add some test cases to the the patch 0/1 with the macro definition and in addition a better description. Thanks for your comments. -- Daniel >> ... >> >> --- a/kernel/sys.c >> +++ b/kernel/sys.c >> @@ -444,6 +444,9 @@ SYSCALL_DEFINE4(reboot, int, magic1, int, magic2, unsigned int, cmd, >> magic2 != LINUX_REBOOT_MAGIC2C)) >> return -EINVAL; >> >> + if (task_active_pid_ns(current) != &init_pid_ns) >> + return reboot_pid_ns(task_active_pid_ns(current), cmd); > This adds a bunch of useless code if CONFIG_PID_NS=n. It would be > better to do > > #ifdef CONFIG_PID_NS > extern void pidns_handle_reboot(int cmd); > #else > static inline void pidns_handle_reboot(int cmd) > { > } > #endif > > > * Unknown - detected * English * French * English * French <javascript:void(0);><#> ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH 0/1][V3] Handle reboot in a child pid namespace [not found] ` <1323030290-22216-1-git-send-email-daniel.lezcano-GANU6spQydw@public.gmane.org> 2011-12-04 20:24 ` [PATCH 1/1][V3] Add reboot_pid_ns to handle the reboot syscall Daniel Lezcano @ 2011-12-04 21:27 ` Henrique de Moraes Holschuh 1 sibling, 0 replies; 14+ messages in thread From: Henrique de Moraes Holschuh @ 2011-12-04 21:27 UTC (permalink / raw) To: Daniel Lezcano Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, oleg-H+wXaHxf7aLQT0dZR+AlfA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b On Sun, 04 Dec 2011, Daniel Lezcano wrote: > * V3 > - removed lock and serialization of pid_ns_reboot > * V2 > - added a lock for the pid namespace to prevent racy call > to the 'reboot' syscall > - Moved 'reboot' command assigned in zap_pid_ns_processes > instead of wait_task_zombie > - added tasklist lock around force_sig > - added do_exit in pid_ns_reboot > - used task_active_pid_ns instead of declaring a new variable in sys_reboot > - moved code up before POWER_OFF changed to HALT in sys_reboot Daniel, can you address Miquel's concern? Is it a valid concern, or not? I assume CAP_REBOOT functionality is still in place inside the container, so it really does look like userspace would need to know whether it should drop CAP_REBOOT or not, in order to automatically use the new feature. -- "One disk to rule them all, One disk to find them. One disk to bring them all and in the darkness grind them. In the Land of Redmond where the shadows lie." -- The Silicon Valley Tarot Henrique Holschuh ^ permalink raw reply [flat|nested] 14+ messages in thread
[parent not found: <20111204212756.GB16362@khazad-dum.debian.net>]
[parent not found: <20111204212756.GB16362-ZGHd14iZgfaRjzvQDGKj+xxZW9W5cXbT@public.gmane.org>]
* Re: [PATCH 0/1][V3] Handle reboot in a child pid namespace [not found] ` <20111204212756.GB16362-ZGHd14iZgfaRjzvQDGKj+xxZW9W5cXbT@public.gmane.org> @ 2011-12-04 23:08 ` Daniel Lezcano [not found] ` <4EDBFD67.1040009-GANU6spQydw@public.gmane.org> 0 siblings, 1 reply; 14+ messages in thread From: Daniel Lezcano @ 2011-12-04 23:08 UTC (permalink / raw) To: Henrique de Moraes Holschuh Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, oleg-H+wXaHxf7aLQT0dZR+AlfA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b On 12/04/2011 10:27 PM, Henrique de Moraes Holschuh wrote: > On Sun, 04 Dec 2011, Daniel Lezcano wrote: >> * V3 >> - removed lock and serialization of pid_ns_reboot >> * V2 >> - added a lock for the pid namespace to prevent racy call >> to the 'reboot' syscall >> - Moved 'reboot' command assigned in zap_pid_ns_processes >> instead of wait_task_zombie >> - added tasklist lock around force_sig >> - added do_exit in pid_ns_reboot >> - used task_active_pid_ns instead of declaring a new variable in sys_reboot >> - moved code up before POWER_OFF changed to HALT in sys_reboot > Daniel, can you address Miquel's concern? Is it a valid concern, or > not? I assume CAP_REBOOT functionality is still in place inside the > container, so it really does look like userspace would need to know > whether it should drop CAP_REBOOT or not, in order to automatically use > the new feature. Hmm, I missed its email. I think it is worth to have such ability to detect how behaves the reboot syscall vs the pid ns. At present, if we call 'reboot' in a child pid namespace, that will affect the host, we are changing this behavior with this patch. I don't think there is any application doing a shutdown from a child pid namespace, that don't makes sense as the shutdown is invoked after killing all the processes on the system and that could only be done from the init_pid_ns. I would like to address this in a separate patch in order to discuss the best way to do that. Adding a fake 'reboot' parameter returning EINVAL or 0 seems a good solution to detect at runtime if the shutdown is correctly supported inside a container. ^ permalink raw reply [flat|nested] 14+ messages in thread
[parent not found: <4EDBFD67.1040009-GANU6spQydw@public.gmane.org>]
* Re: [PATCH 0/1][V3] Handle reboot in a child pid namespace [not found] ` <4EDBFD67.1040009-GANU6spQydw@public.gmane.org> @ 2011-12-05 20:49 ` Daniel Lezcano [not found] ` <4EDD2E5C.1050107-GANU6spQydw@public.gmane.org> 2011-12-05 20:50 ` Oleg Nesterov 1 sibling, 1 reply; 14+ messages in thread From: Daniel Lezcano @ 2011-12-05 20:49 UTC (permalink / raw) To: oleg-H+wXaHxf7aLQT0dZR+AlfA Cc: akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Henrique de Moraes Holschuh On 12/05/2011 12:08 AM, Daniel Lezcano wrote: > On 12/04/2011 10:27 PM, Henrique de Moraes Holschuh wrote: >> On Sun, 04 Dec 2011, Daniel Lezcano wrote: >>> * V3 >>> - removed lock and serialization of pid_ns_reboot >>> * V2 >>> - added a lock for the pid namespace to prevent racy call >>> to the 'reboot' syscall >>> - Moved 'reboot' command assigned in zap_pid_ns_processes >>> instead of wait_task_zombie >>> - added tasklist lock around force_sig >>> - added do_exit in pid_ns_reboot >>> - used task_active_pid_ns instead of declaring a new variable in sys_reboot >>> - moved code up before POWER_OFF changed to HALT in sys_reboot Oleg ? Are you fine with this patch ? ^ permalink raw reply [flat|nested] 14+ messages in thread
[parent not found: <4EDD2E5C.1050107-GANU6spQydw@public.gmane.org>]
* Re: [PATCH 0/1][V3] Handle reboot in a child pid namespace [not found] ` <4EDD2E5C.1050107-GANU6spQydw@public.gmane.org> @ 2011-12-05 20:51 ` Oleg Nesterov 0 siblings, 0 replies; 14+ messages in thread From: Oleg Nesterov @ 2011-12-05 20:51 UTC (permalink / raw) To: Daniel Lezcano Cc: akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Henrique de Moraes Holschuh On 12/05, Daniel Lezcano wrote: > > On 12/05/2011 12:08 AM, Daniel Lezcano wrote: > > On 12/04/2011 10:27 PM, Henrique de Moraes Holschuh wrote: > >> On Sun, 04 Dec 2011, Daniel Lezcano wrote: > >>> * V3 > >>> - removed lock and serialization of pid_ns_reboot > >>> * V2 > >>> - added a lock for the pid namespace to prevent racy call > >>> to the 'reboot' syscall > >>> - Moved 'reboot' command assigned in zap_pid_ns_processes > >>> instead of wait_task_zombie > >>> - added tasklist lock around force_sig > >>> - added do_exit in pid_ns_reboot > >>> - used task_active_pid_ns instead of declaring a new variable in sys_reboot > >>> - moved code up before POWER_OFF changed to HALT in sys_reboot > > Oleg ? Are you fine with this patch ? I already sent my reviwed-by ;) Oleg. ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH 0/1][V3] Handle reboot in a child pid namespace [not found] ` <4EDBFD67.1040009-GANU6spQydw@public.gmane.org> 2011-12-05 20:49 ` Daniel Lezcano @ 2011-12-05 20:50 ` Oleg Nesterov 1 sibling, 0 replies; 14+ messages in thread From: Oleg Nesterov @ 2011-12-05 20:50 UTC (permalink / raw) To: Daniel Lezcano Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Henrique de Moraes Holschuh, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b On 12/05, Daniel Lezcano wrote: > > On 12/04/2011 10:27 PM, Henrique de Moraes Holschuh wrote: > > On Sun, 04 Dec 2011, Daniel Lezcano wrote: > > Daniel, can you address Miquel's concern? Is it a valid concern, or > > not? I assume CAP_REBOOT functionality is still in place inside the > > container, so it really does look like userspace would need to know > > whether it should drop CAP_REBOOT or not, in order to automatically use > > the new feature. > > Hmm, I missed its email. Me too... so I am not sure I really understand the problem. > I would like to address this in a separate patch in order to discuss the > best way to do that. Agreed. > Adding a fake 'reboot' parameter returning EINVAL > or 0 seems a good solution to detect at runtime if the shutdown is > correctly supported inside a container. Or, perhaps, we can implement sys_reboot(REBOOT_SHOULD_NOT_WORK), sub-init can call it to disable the shutdown ? This needs the trivial modifications in zap_pid_ns_processes() and reboot_pid_ns(). Oleg. ^ permalink raw reply [flat|nested] 14+ messages in thread
end of thread, other threads:[~2011-12-07 21:36 UTC | newest]
Thread overview: 14+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2011-12-04 20:24 [PATCH 0/1][V3] Handle reboot in a child pid namespace Daniel Lezcano
[not found] ` <1323030290-22216-1-git-send-email-daniel.lezcano-GANU6spQydw@public.gmane.org>
2011-12-04 20:24 ` [PATCH 1/1][V3] Add reboot_pid_ns to handle the reboot syscall Daniel Lezcano
[not found] ` <1323030290-22216-2-git-send-email-daniel.lezcano-GANU6spQydw@public.gmane.org>
2011-12-05 18:35 ` Serge Hallyn
2011-12-05 20:42 ` Oleg Nesterov
[not found] ` <20111205204238.GA7422-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2011-12-05 21:16 ` Daniel Lezcano
2011-12-05 21:17 ` Daniel Lezcano
2011-12-07 1:16 ` Andrew Morton
[not found] ` <20111206171617.e31bc3a6.akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
2011-12-07 15:12 ` Oleg Nesterov
2011-12-07 21:36 ` Daniel Lezcano
2011-12-04 21:27 ` [PATCH 0/1][V3] Handle reboot in a child pid namespace Henrique de Moraes Holschuh
[not found] ` <20111204212756.GB16362@khazad-dum.debian.net>
[not found] ` <20111204212756.GB16362-ZGHd14iZgfaRjzvQDGKj+xxZW9W5cXbT@public.gmane.org>
2011-12-04 23:08 ` Daniel Lezcano
[not found] ` <4EDBFD67.1040009-GANU6spQydw@public.gmane.org>
2011-12-05 20:49 ` Daniel Lezcano
[not found] ` <4EDD2E5C.1050107-GANU6spQydw@public.gmane.org>
2011-12-05 20:51 ` Oleg Nesterov
2011-12-05 20:50 ` Oleg Nesterov
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox