* [PATCH 0/1][V2] Handle reboot in a child pid namespace
@ 2011-12-03 23:51 Daniel Lezcano
2011-12-03 23:51 ` [PATCH 1/1][v2] Add reboot_pid_ns to handle the reboot syscall Daniel Lezcano
0 siblings, 1 reply; 5+ messages in thread
From: Daniel Lezcano @ 2011-12-03 23:51 UTC (permalink / raw)
To: akpm; +Cc: serge.hallyn, oleg, containers, gkurz, linux-kernel
ChangeLog:
* V2
- added a lock for the pid namespace to prevent racy call
to the 'reboot' syscall
- Moved 'reboot' command assigned in zap_pid_ns_processes
instead of wait_task_zombie
- added tasklist lock around force_sig
- added do_exit in pid_ns_reboot
- used task_active_pid_ns instead of declaring a new variable in sys_reboot
- moved code up before POWER_OFF changed to HALT in sys_reboot
Daniel Lezcano (1):
Add reboot_pid_ns to handle the reboot syscall
include/linux/pid_namespace.h | 9 ++++++-
kernel/pid_namespace.c | 54 +++++++++++++++++++++++++++++++++++++++++
kernel/sys.c | 3 ++
3 files changed, 65 insertions(+), 1 deletions(-)
--
1.7.5.4
^ permalink raw reply [flat|nested] 5+ messages in thread* [PATCH 1/1][v2] Add reboot_pid_ns to handle the reboot syscall 2011-12-03 23:51 [PATCH 0/1][V2] Handle reboot in a child pid namespace Daniel Lezcano @ 2011-12-03 23:51 ` Daniel Lezcano 2011-12-04 15:45 ` Oleg Nesterov 2011-12-04 15:58 ` Miquel van Smoorenburg 0 siblings, 2 replies; 5+ messages in thread From: Daniel Lezcano @ 2011-12-03 23:51 UTC (permalink / raw) To: akpm; +Cc: serge.hallyn, oleg, containers, gkurz, linux-kernel In the case of a child pid namespace, rebooting the system does not really makes sense. When the pid namespace is used in conjunction with the other namespaces in order to create a linux container, the reboot syscall leads to some problems. A container can reboot the host. That can be fixed by dropping the sys_reboot capability but we are unable to correctly to poweroff/ halt/reboot a container and the container stays stuck at the shutdown time with the container's init process waiting indefinitively. After several attempts, no solution from userspace was found to reliabily handle the shutdown from a container. This patch propose to store the reboot value in the 16 upper bits of the exit code from the processes belonging to a pid namespace which has rebooted. When the reboot syscall is called and we are not in the initial pid namespace, we kill the pid namespace. By this way the parent process of the child pid namespace to know if it rebooted or not and take the right decision. Signed-off-by: Daniel Lezcano <daniel.lezcano@free.fr> Acked-by: Serge Hallyn <serge.hallyn@canonical.com> --- include/linux/pid_namespace.h | 9 ++++++- kernel/pid_namespace.c | 54 +++++++++++++++++++++++++++++++++++++++++ kernel/sys.c | 3 ++ 3 files changed, 65 insertions(+), 1 deletions(-) diff --git a/include/linux/pid_namespace.h b/include/linux/pid_namespace.h index e7cf666..f5f1f60 100644 --- a/include/linux/pid_namespace.h +++ b/include/linux/pid_namespace.h @@ -32,6 +32,8 @@ struct pid_namespace { #endif gid_t pid_gid; int hide_pid; + int reboot; + spinlock_t reboot_lock; }; extern struct pid_namespace init_pid_ns; @@ -47,6 +49,7 @@ static inline struct pid_namespace *get_pid_ns(struct pid_namespace *ns) extern struct pid_namespace *copy_pid_ns(unsigned long flags, struct pid_namespace *ns); extern void free_pid_ns(struct kref *kref); extern void zap_pid_ns_processes(struct pid_namespace *pid_ns); +extern int reboot_pid_ns(struct pid_namespace *pid_ns, int cmd); static inline void put_pid_ns(struct pid_namespace *ns) { @@ -74,11 +77,15 @@ static inline void put_pid_ns(struct pid_namespace *ns) { } - static inline void zap_pid_ns_processes(struct pid_namespace *ns) { BUG(); } + +static inline int reboot_pid_ns(struct pid_namespace *pid_ns, int cmd) +{ + BUG(); +} #endif /* CONFIG_PID_NS */ extern struct pid_namespace *task_active_pid_ns(struct task_struct *tsk); diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c index a896839..a1fe60c 100644 --- a/kernel/pid_namespace.c +++ b/kernel/pid_namespace.c @@ -15,6 +15,7 @@ #include <linux/acct.h> #include <linux/slab.h> #include <linux/proc_fs.h> +#include <linux/reboot.h> #define BITS_PER_PAGE (PAGE_SIZE*8) @@ -90,6 +91,7 @@ static struct pid_namespace *create_pid_namespace(struct pid_namespace *parent_p kref_init(&ns->kref); ns->level = level; ns->parent = get_pid_ns(parent_pid_ns); + spin_lock_init(&ns->reboot_lock); set_bit(0, ns->pidmap[0].page); atomic_set(&ns->pidmap[0].nr_free, BITS_PER_PAGE - 1); @@ -187,6 +189,9 @@ void zap_pid_ns_processes(struct pid_namespace *pid_ns) rc = sys_wait4(-1, NULL, __WALL, NULL); } while (rc != -ECHILD); + if (pid_ns->reboot) + current->signal->group_exit_code = pid_ns->reboot; + acct_exit_ns(pid_ns); return; } @@ -221,6 +226,55 @@ static struct ctl_table pid_ns_ctl_table[] = { static struct ctl_path kern_path[] = { { .procname = "kernel", }, { } }; +static inline int __reboot_pid_ns(struct pid_namespace *pid_ns, int cmd) +{ + /* A reboot already occured, just ignore this request + * all the processes of the pid namespace will be killed + * by the previous 'reboot' call + */ + if (pid_ns->reboot) + return -EBUSY; + + switch(cmd) { + case LINUX_REBOOT_CMD_RESTART2: + case LINUX_REBOOT_CMD_RESTART: + pid_ns->reboot = SYSTEM_RESTART << 16; + break; + + case LINUX_REBOOT_CMD_HALT: + pid_ns->reboot = SYSTEM_HALT << 16; + break; + + case LINUX_REBOOT_CMD_POWER_OFF: + pid_ns->reboot = SYSTEM_POWER_OFF << 16; + break; + default: + return -EINVAL; + } + + return 0; +} + +int reboot_pid_ns(struct pid_namespace *pid_ns, int cmd) +{ + int ret; + + spin_lock(&pid_ns->reboot_lock); + ret = __reboot_pid_ns(pid_ns, cmd); + spin_unlock(&pid_ns->reboot_lock); + if (ret) + goto out; + + read_lock(&tasklist_lock); + force_sig(SIGKILL, pid_ns->child_reaper); + read_unlock(&tasklist_lock); + + do_exit(0); + /* Not reached */ +out: + return ret; +} + static __init int pid_namespaces_init(void) { pid_ns_cachep = KMEM_CACHE(pid_namespace, SLAB_PANIC); diff --git a/kernel/sys.c b/kernel/sys.c index ddf8155..31acf63 100644 --- a/kernel/sys.c +++ b/kernel/sys.c @@ -444,6 +444,9 @@ SYSCALL_DEFINE4(reboot, int, magic1, int, magic2, unsigned int, cmd, magic2 != LINUX_REBOOT_MAGIC2C)) return -EINVAL; + if (task_active_pid_ns(current) != &init_pid_ns) + return reboot_pid_ns(task_active_pid_ns(current), cmd); + /* Instead of trying to make the power_off code look like * halt when pm_power_off is not set do it the easy way. */ -- 1.7.5.4 ^ permalink raw reply related [flat|nested] 5+ messages in thread
* Re: [PATCH 1/1][v2] Add reboot_pid_ns to handle the reboot syscall 2011-12-03 23:51 ` [PATCH 1/1][v2] Add reboot_pid_ns to handle the reboot syscall Daniel Lezcano @ 2011-12-04 15:45 ` Oleg Nesterov 2011-12-04 19:25 ` Daniel Lezcano 2011-12-04 15:58 ` Miquel van Smoorenburg 1 sibling, 1 reply; 5+ messages in thread From: Oleg Nesterov @ 2011-12-04 15:45 UTC (permalink / raw) To: Daniel Lezcano; +Cc: akpm, serge.hallyn, containers, gkurz, linux-kernel On 12/04, Daniel Lezcano wrote: > > @@ -32,6 +32,8 @@ struct pid_namespace { > #endif > gid_t pid_gid; > int hide_pid; > + int reboot; > + spinlock_t reboot_lock; > }; Well. I was thinking about the serialization too, but this ->reboot_lock asks for v3 imho ;) First of all, do we really care? force_sig(SIGKILL, child_reaper) can't race with itself, it does nothing if init is already killed. So why it is important to protect pid_ns->reboot? Yes, it is possible to change it again if two callers do sys_reboot() "at the same time". But in this case we can't predict which caller wins anyway, so why should we worry? IOW. Say, we have the 2 tasks doing HALT and RESTART in parallel. It is possible that HALT sets ->reboot and kills init first, then RESTART changes ->reboot and the second force_sig() does nothing. In this case we can simply pretend that RESTART wins and the dying init kills HALT before it calls sys_reboot(). In any case. Even if you want to serialize, instead of adding the new lock reboot_pid_ns() can simply do: if (cmpxchg(&pid_ns->reboot, 0, reboot) != 0) return -EBUSY; this looks much simpler to me. Oleg. ^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [PATCH 1/1][v2] Add reboot_pid_ns to handle the reboot syscall 2011-12-04 15:45 ` Oleg Nesterov @ 2011-12-04 19:25 ` Daniel Lezcano 0 siblings, 0 replies; 5+ messages in thread From: Daniel Lezcano @ 2011-12-04 19:25 UTC (permalink / raw) To: Oleg Nesterov; +Cc: akpm, serge.hallyn, containers, gkurz, linux-kernel On 12/04/2011 04:45 PM, Oleg Nesterov wrote: > On 12/04, Daniel Lezcano wrote: >> @@ -32,6 +32,8 @@ struct pid_namespace { >> #endif >> gid_t pid_gid; >> int hide_pid; >> + int reboot; >> + spinlock_t reboot_lock; >> }; > Well. I was thinking about the serialization too, but this > ->reboot_lock asks for v3 imho ;) > > First of all, do we really care? force_sig(SIGKILL, child_reaper) > can't race with itself, it does nothing if init is already killed. > > So why it is important to protect pid_ns->reboot? Yes, it is possible > to change it again if two callers do sys_reboot() "at the same time". > But in this case we can't predict which caller wins anyway, so why > should we worry? > > IOW. Say, we have the 2 tasks doing HALT and RESTART in parallel. > It is possible that HALT sets ->reboot and kills init first, then > RESTART changes ->reboot and the second force_sig() does nothing. > In this case we can simply pretend that RESTART wins and the dying > init kills HALT before it calls sys_reboot(). In the case of racy access, your argument makes sense but it is also to prevent multiple calls to 'reboot'. In the init_pid_ns, when a shutdown is on the way, the lock will prevent another task to invoke a machine restart. But anyway, we can get ride of this lock and the serialization, it is a nit we can fix later if that makes sense with the couple of lines you specified below. > In any case. Even if you want to serialize, instead of adding the > new lock reboot_pid_ns() can simply do: > > if (cmpxchg(&pid_ns->reboot, 0, reboot) != 0) > return -EBUSY; > > this looks much simpler to me. Yes, definitively :) Thanks -- Daniel ^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [PATCH 1/1][v2] Add reboot_pid_ns to handle the reboot syscall 2011-12-03 23:51 ` [PATCH 1/1][v2] Add reboot_pid_ns to handle the reboot syscall Daniel Lezcano 2011-12-04 15:45 ` Oleg Nesterov @ 2011-12-04 15:58 ` Miquel van Smoorenburg 1 sibling, 0 replies; 5+ messages in thread From: Miquel van Smoorenburg @ 2011-12-04 15:58 UTC (permalink / raw) To: daniel.lezcano; +Cc: linux-kernel In article <xs4all.1322956280-13831-2-git-send-email-daniel.lezcano@free.fr> you write: >This patch propose to store the reboot value in the 16 upper bits of the >exit code from the processes belonging to a pid namespace which has >rebooted. When the reboot syscall is called and we are not in the initial >pid namespace, we kill the pid namespace. Should there be a way to find out if the reboot systemcall is containerized or not (other than comparing kernel versions once this is mainlined) ? For example, a tool like lxc-start might want to disable CAP_SYS_REBOOT if the reboot system call is not containerized. Perhaps a LINUX_REBOOT_CMD_ISCONTAINER (or ISVIRTUAL, whatever, bikeshed away) should be added that simply returns 0 if sys_reboot is containerized. It would return -EINVAL as it does today if it's not. Thoughts ? Mike. ^ permalink raw reply [flat|nested] 5+ messages in thread
end of thread, other threads:[~2011-12-04 19:26 UTC | newest] Thread overview: 5+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2011-12-03 23:51 [PATCH 0/1][V2] Handle reboot in a child pid namespace Daniel Lezcano 2011-12-03 23:51 ` [PATCH 1/1][v2] Add reboot_pid_ns to handle the reboot syscall Daniel Lezcano 2011-12-04 15:45 ` Oleg Nesterov 2011-12-04 19:25 ` Daniel Lezcano 2011-12-04 15:58 ` Miquel van Smoorenburg
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox