* [RFC] catching sys_reboot syscall
@ 2011-08-08 21:14 Daniel Lezcano
2011-08-10 20:10 ` Bruno Prémont
2011-08-20 11:03 ` [RFC] catching sys_reboot syscall Pavel Machek
0 siblings, 2 replies; 10+ messages in thread
From: Daniel Lezcano @ 2011-08-08 21:14 UTC (permalink / raw)
To: Linux Kernel Mailing List; +Cc: LXC Development
Hi all,
I am trying to solve one of the problems the linux container community
is facing :
How to catch when a process called the 'reboot' syscall in a container ?
In the case of a VPS, when we shutdown/halt/reboot the container, the
reboot utility will invoke the sys_reboot syscall which has the bad
effect to reboot the host. The way to fix that is to drop the
CAP_SYS_REBOOT capability in the container.
In this case, the container shutdowns correctly but, at the end, the
init process is waiting indefinitely and we have the containers stuck
with one process (the init process).
In order to fix that, we used a hypervisor process, parent of the
container's init process, watching for the container's utmp file and
detecting when the runlevel changes. When this runlevel change is
detected we wait for the container to have one process left and then we
kill the container's init.
That works well if we modify the distro configuration files, we make
/var/run to not be a tmpfs and we remove all the files inside this
directory when the container boots. *But* as soon as we upgrade the
container distro, all the tweaks are lost. So this method works but at
the cost of tweaking the containers configuration files again and again,
each time there is an update, which is not tolerable in a production
environment.
This problem is easy to solve with a small hack in the kernel:
diff --git a/include/asm-generic/siginfo.h b/include/asm-generic/siginfo.h
index 942d30b..61f3d02 100644
--- a/include/asm-generic/siginfo.h
+++ b/include/asm-generic/siginfo.h
@@ -218,7 +218,8 @@ typedef struct siginfo {
#define CLD_TRAPPED (__SI_CHLD|4) /* traced child has trapped */
#define CLD_STOPPED (__SI_CHLD|5) /* child has stopped */
#define CLD_CONTINUED (__SI_CHLD|6) /* stopped child has continued */
-#define NSIGCHLD 6
+#define CLD_REBOOTED (__SI_CHLD|7) /* process was killed by a
reboot */
+#define NSIGCHLD 7
/*
* SIGPOLL si_codes
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 67f5688..41e0889 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2103,6 +2103,7 @@ extern int kill_pgrp(struct pid *pid, int sig, int
priv);
extern int kill_pid(struct pid *pid, int sig, int priv);
extern int kill_proc_info(int, struct siginfo *, pid_t);
extern int do_notify_parent(struct task_struct *, int);
+extern void do_notify_parent_cldreboot(struct task_struct *, int, char *);
extern void __wake_up_parent(struct task_struct *p, struct task_struct
*parent);
extern void force_sig(int, struct task_struct *);
extern int send_sig(int, struct task_struct *, int);
diff --git a/kernel/signal.c b/kernel/signal.c
index 4e3cff1..09fc254 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -1522,6 +1522,46 @@ int do_notify_parent(struct task_struct *tsk, int
sig)
return ret;
}
+void do_notify_parent_cldreboot(struct task_struct *tsk, int why, char
*buffer)
+{
+ struct siginfo info = { };
+ struct task_struct *parent;
+ struct sighand_struct *sighand;
+ unsigned long flags;
+
+ if (task_ptrace(tsk))
+ parent = tsk->parent;
+ else {
+ tsk = tsk->group_leader;
+ parent = tsk->real_parent;
+ }
+
+ info.si_signo = SIGCHLD;
+ info.si_errno = 0;
+ info.si_status = why;
+
+ rcu_read_lock();
+ info.si_pid = task_pid_nr_ns(tsk, parent->nsproxy->pid_ns);
+ info.si_uid = __task_cred(tsk)->uid;
+ rcu_read_unlock();
+
+ info.si_utime = cputime_to_clock_t(tsk->utime);
+ info.si_stime = cputime_to_clock_t(tsk->stime);
+
+ info.si_code = CLD_REBOOTED;
+
+ sighand = parent->sighand;
+ spin_lock_irqsave(&sighand->siglock, flags);
+ if (sighand->action[SIGCHLD-1].sa.sa_handler != SIG_IGN &&
+ sighand->action[SIGCHLD-1].sa.sa_flags & SA_CLDREBOOT)
+ __group_send_sig_info(SIGCHLD, &info, parent);
+ /*
+ * Even if SIGCHLD is not generated, we must wake up wait4 calls.
+ */
+ __wake_up_parent(tsk, parent);
+ spin_unlock_irqrestore(&sighand->siglock, flags);
+}
+
static void do_notify_parent_cldstop(struct task_struct *tsk, int why)
{
struct siginfo info;
diff --git a/kernel/sys.c b/kernel/sys.c
index 18da702..b50449c 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -361,6 +361,13 @@ void kernel_power_off(void)
}
EXPORT_SYMBOL_GPL(kernel_power_off);
+static void pid_namespace_reboot(struct pid_namespace *pid_ns,
+ int cmd, char *buffer)
+{
+ struct task_struct *tsk = pid_ns->child_reaper;
+ do_notify_parent_cldreboot(tsk, cmd, buffer);
+}
+
static DEFINE_MUTEX(reboot_mutex);
/*
@@ -374,13 +381,10 @@ static DEFINE_MUTEX(reboot_mutex);
SYSCALL_DEFINE4(reboot, int, magic1, int, magic2, unsigned int, cmd,
void __user *, arg)
{
- char buffer[256];
+ struct pid_namespace *pid_ns = current->nsproxy->pid_ns;
+ char buffer[256] = { 0 };
int ret = 0;
- /* We only trust the superuser with rebooting the system. */
- if (!capable(CAP_SYS_BOOT))
- return -EPERM;
-
/* For safety, we require "magic" arguments. */
if (magic1 != LINUX_REBOOT_MAGIC1 ||
(magic2 != LINUX_REBOOT_MAGIC2 &&
@@ -395,12 +399,45 @@ SYSCALL_DEFINE4(reboot, int, magic1, int, magic2,
unsigned int, cmd,
if ((cmd == LINUX_REBOOT_CMD_POWER_OFF) && !pm_power_off)
cmd = LINUX_REBOOT_CMD_HALT;
+ /* check the cmd parameter */
+ if (cmd != LINUX_REBOOT_CMD_RESTART &&
+#ifdef CONFIG_KEXEC
+ cmd != LINUX_REBOOT_CMD_KEXEC &&
+#endif
+#ifdef CONFIG_HIBERNATION
+ cmd != LINUX_REBOOT_CMD_SW_SUSPEND &&
+#endif
+ cmd != LINUX_REBOOT_CMD_CAD_ON &&
+ cmd != LINUX_REBOOT_CMD_CAD_OFF &&
+ cmd != LINUX_REBOOT_CMD_HALT &&
+ cmd != LINUX_REBOOT_CMD_POWER_OFF &&
+ cmd != LINUX_REBOOT_CMD_RESTART2)
+ return -EINVAL;
+
+ if (cmd == LINUX_REBOOT_CMD_RESTART2)
+ if (strncpy_from_user(&buffer[0], arg, sizeof(buffer) - 1) < 0)
+ return -EFAULT;
+
+ /* If we are not in the initial pid namespace, we send a signal
+ * to the parent of this init pid namespace, notifying a shutdown
+ * occured */
+ if (pid_ns != &init_pid_ns)
+ pid_namespace_reboot(pid_ns, cmd, buffer);
+
+ /* We only trust the superuser with rebooting the system. */
+ if (!capable(CAP_SYS_BOOT))
+ return -EPERM;
+
mutex_lock(&reboot_mutex);
switch (cmd) {
case LINUX_REBOOT_CMD_RESTART:
kernel_restart(NULL);
break;
+ case LINUX_REBOOT_CMD_RESTART2:
+ kernel_restart(buffer);
+ break;
+
case LINUX_REBOOT_CMD_CAD_ON:
C_A_D = 1;
break;
@@ -419,16 +456,6 @@ SYSCALL_DEFINE4(reboot, int, magic1, int, magic2,
unsigned int, cmd,
do_exit(0);
break;
- case LINUX_REBOOT_CMD_RESTART2:
- if (strncpy_from_user(&buffer[0], arg, sizeof(buffer) - 1) < 0) {
- ret = -EFAULT;
- break;
- }
- buffer[sizeof(buffer) - 1] = '\0';
-
- kernel_restart(buffer);
- break;
-
#ifdef CONFIG_KEXEC
case LINUX_REBOOT_CMD_KEXEC:
ret = kernel_kexec();
@@ -440,10 +467,6 @@ SYSCALL_DEFINE4(reboot, int, magic1, int, magic2,
unsigned int, cmd,
ret = hibernate();
break;
#endif
-
- default:
- ret = -EINVAL;
- break;
}
mutex_unlock(&reboot_mutex);
return ret;
With this patch, the container's init parent will receive a SIGCHLD,
with ssi_code set with the sys_reboot parameter value, when one of the
process in the container invoke sys_reboot. This solution works very
well and solves the problem but is it acceptable ? I don't see any
alternative usable for other use cases. It is probable nobody cares
about sys_reboot was called because there is nothing to do as the system
will halt a few microseconds after :)
Does anyone have an idea ?
Thanks in advance
-- Daniel
^ permalink raw reply related [flat|nested] 10+ messages in thread
* Re: [RFC] catching sys_reboot syscall
2011-08-08 21:14 [RFC] catching sys_reboot syscall Daniel Lezcano
@ 2011-08-10 20:10 ` Bruno Prémont
2011-08-10 20:49 ` Daniel Lezcano
2011-08-20 11:03 ` [RFC] catching sys_reboot syscall Pavel Machek
1 sibling, 1 reply; 10+ messages in thread
From: Bruno Prémont @ 2011-08-10 20:10 UTC (permalink / raw)
To: Daniel Lezcano; +Cc: Linux Kernel Mailing List, LXC Development, containers
Hi Daniel,
[I'm adding containers ml as we had a discussion there some time ago
for this feature]
One comment below:
On Mon, 08 August 2011 Daniel Lezcano <daniel.lezcano@free.fr> wrote:
> Hi all,
>
> I am trying to solve one of the problems the linux container community
> is facing :
>
> How to catch when a process called the 'reboot' syscall in a container ?
>
> In the case of a VPS, when we shutdown/halt/reboot the container, the
> reboot utility will invoke the sys_reboot syscall which has the bad
> effect to reboot the host. The way to fix that is to drop the
> CAP_SYS_REBOOT capability in the container.
>
> In this case, the container shutdowns correctly but, at the end, the
> init process is waiting indefinitely and we have the containers stuck
> with one process (the init process).
>
> In order to fix that, we used a hypervisor process, parent of the
> container's init process, watching for the container's utmp file and
> detecting when the runlevel changes. When this runlevel change is
> detected we wait for the container to have one process left and then we
> kill the container's init.
>
> That works well if we modify the distro configuration files, we make
> /var/run to not be a tmpfs and we remove all the files inside this
> directory when the container boots. *But* as soon as we upgrade the
> container distro, all the tweaks are lost. So this method works but at
> the cost of tweaking the containers configuration files again and again,
> each time there is an update, which is not tolerable in a production
> environment.
>
> This problem is easy to solve with a small hack in the kernel:
>
> diff --git a/include/asm-generic/siginfo.h b/include/asm-generic/siginfo.h
> index 942d30b..61f3d02 100644
> --- a/include/asm-generic/siginfo.h
> +++ b/include/asm-generic/siginfo.h
> @@ -218,7 +218,8 @@ typedef struct siginfo {
> #define CLD_TRAPPED (__SI_CHLD|4) /* traced child has trapped */
> #define CLD_STOPPED (__SI_CHLD|5) /* child has stopped */
> #define CLD_CONTINUED (__SI_CHLD|6) /* stopped child has continued */
> -#define NSIGCHLD 6
> +#define CLD_REBOOTED (__SI_CHLD|7) /* process was killed by a
> reboot */
> +#define NSIGCHLD 7
>
> /*
> * SIGPOLL si_codes
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 67f5688..41e0889 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -2103,6 +2103,7 @@ extern int kill_pgrp(struct pid *pid, int sig, int
> priv);
> extern int kill_pid(struct pid *pid, int sig, int priv);
> extern int kill_proc_info(int, struct siginfo *, pid_t);
> extern int do_notify_parent(struct task_struct *, int);
> +extern void do_notify_parent_cldreboot(struct task_struct *, int, char *);
> extern void __wake_up_parent(struct task_struct *p, struct task_struct
> *parent);
> extern void force_sig(int, struct task_struct *);
> extern int send_sig(int, struct task_struct *, int);
> diff --git a/kernel/signal.c b/kernel/signal.c
> index 4e3cff1..09fc254 100644
> --- a/kernel/signal.c
> +++ b/kernel/signal.c
> @@ -1522,6 +1522,46 @@ int do_notify_parent(struct task_struct *tsk, int
> sig)
> return ret;
> }
>
> +void do_notify_parent_cldreboot(struct task_struct *tsk, int why, char
> *buffer)
> +{
> + struct siginfo info = { };
> + struct task_struct *parent;
> + struct sighand_struct *sighand;
> + unsigned long flags;
> +
> + if (task_ptrace(tsk))
> + parent = tsk->parent;
> + else {
> + tsk = tsk->group_leader;
> + parent = tsk->real_parent;
> + }
> +
> + info.si_signo = SIGCHLD;
> + info.si_errno = 0;
> + info.si_status = why;
> +
> + rcu_read_lock();
> + info.si_pid = task_pid_nr_ns(tsk, parent->nsproxy->pid_ns);
> + info.si_uid = __task_cred(tsk)->uid;
> + rcu_read_unlock();
> +
> + info.si_utime = cputime_to_clock_t(tsk->utime);
> + info.si_stime = cputime_to_clock_t(tsk->stime);
> +
> + info.si_code = CLD_REBOOTED;
> +
> + sighand = parent->sighand;
> + spin_lock_irqsave(&sighand->siglock, flags);
> + if (sighand->action[SIGCHLD-1].sa.sa_handler != SIG_IGN &&
> + sighand->action[SIGCHLD-1].sa.sa_flags & SA_CLDREBOOT)
> + __group_send_sig_info(SIGCHLD, &info, parent);
> + /*
> + * Even if SIGCHLD is not generated, we must wake up wait4 calls.
> + */
> + __wake_up_parent(tsk, parent);
> + spin_unlock_irqrestore(&sighand->siglock, flags);
> +}
> +
> static void do_notify_parent_cldstop(struct task_struct *tsk, int why)
> {
> struct siginfo info;
> diff --git a/kernel/sys.c b/kernel/sys.c
> index 18da702..b50449c 100644
> --- a/kernel/sys.c
> +++ b/kernel/sys.c
> @@ -361,6 +361,13 @@ void kernel_power_off(void)
> }
> EXPORT_SYMBOL_GPL(kernel_power_off);
>
> +static void pid_namespace_reboot(struct pid_namespace *pid_ns,
> + int cmd, char *buffer)
> +{
> + struct task_struct *tsk = pid_ns->child_reaper;
> + do_notify_parent_cldreboot(tsk, cmd, buffer);
> +}
> +
> static DEFINE_MUTEX(reboot_mutex);
>
> /*
> @@ -374,13 +381,10 @@ static DEFINE_MUTEX(reboot_mutex);
> SYSCALL_DEFINE4(reboot, int, magic1, int, magic2, unsigned int, cmd,
> void __user *, arg)
> {
> - char buffer[256];
> + struct pid_namespace *pid_ns = current->nsproxy->pid_ns;
> + char buffer[256] = { 0 };
> int ret = 0;
>
> - /* We only trust the superuser with rebooting the system. */
> - if (!capable(CAP_SYS_BOOT))
> - return -EPERM;
> -
> /* For safety, we require "magic" arguments. */
> if (magic1 != LINUX_REBOOT_MAGIC1 ||
> (magic2 != LINUX_REBOOT_MAGIC2 &&
> @@ -395,12 +399,45 @@ SYSCALL_DEFINE4(reboot, int, magic1, int, magic2,
> unsigned int, cmd,
> if ((cmd == LINUX_REBOOT_CMD_POWER_OFF) && !pm_power_off)
> cmd = LINUX_REBOOT_CMD_HALT;
>
> + /* check the cmd parameter */
> + if (cmd != LINUX_REBOOT_CMD_RESTART &&
> +#ifdef CONFIG_KEXEC
> + cmd != LINUX_REBOOT_CMD_KEXEC &&
> +#endif
> +#ifdef CONFIG_HIBERNATION
> + cmd != LINUX_REBOOT_CMD_SW_SUSPEND &&
> +#endif
> + cmd != LINUX_REBOOT_CMD_CAD_ON &&
> + cmd != LINUX_REBOOT_CMD_CAD_OFF &&
> + cmd != LINUX_REBOOT_CMD_HALT &&
> + cmd != LINUX_REBOOT_CMD_POWER_OFF &&
> + cmd != LINUX_REBOOT_CMD_RESTART2)
> + return -EINVAL;
> +
> + if (cmd == LINUX_REBOOT_CMD_RESTART2)
> + if (strncpy_from_user(&buffer[0], arg, sizeof(buffer) - 1) < 0)
> + return -EFAULT;
> +
> + /* If we are not in the initial pid namespace, we send a signal
> + * to the parent of this init pid namespace, notifying a shutdown
> + * occured */
> + if (pid_ns != &init_pid_ns)
> + pid_namespace_reboot(pid_ns, cmd, buffer);
Should there be a return here?
Or does pid_namespace_reboot() never return by submitting signal to
parent?
Thanks,
Bruno
> +
> + /* We only trust the superuser with rebooting the system. */
> + if (!capable(CAP_SYS_BOOT))
> + return -EPERM;
> +
> mutex_lock(&reboot_mutex);
> switch (cmd) {
> case LINUX_REBOOT_CMD_RESTART:
> kernel_restart(NULL);
> break;
>
> + case LINUX_REBOOT_CMD_RESTART2:
> + kernel_restart(buffer);
> + break;
> +
> case LINUX_REBOOT_CMD_CAD_ON:
> C_A_D = 1;
> break;
> @@ -419,16 +456,6 @@ SYSCALL_DEFINE4(reboot, int, magic1, int, magic2,
> unsigned int, cmd,
> do_exit(0);
> break;
>
> - case LINUX_REBOOT_CMD_RESTART2:
> - if (strncpy_from_user(&buffer[0], arg, sizeof(buffer) - 1) < 0) {
> - ret = -EFAULT;
> - break;
> - }
> - buffer[sizeof(buffer) - 1] = '\0';
> -
> - kernel_restart(buffer);
> - break;
> -
> #ifdef CONFIG_KEXEC
> case LINUX_REBOOT_CMD_KEXEC:
> ret = kernel_kexec();
> @@ -440,10 +467,6 @@ SYSCALL_DEFINE4(reboot, int, magic1, int, magic2,
> unsigned int, cmd,
> ret = hibernate();
> break;
> #endif
> -
> - default:
> - ret = -EINVAL;
> - break;
> }
> mutex_unlock(&reboot_mutex);
> return ret;
>
>
> With this patch, the container's init parent will receive a SIGCHLD,
> with ssi_code set with the sys_reboot parameter value, when one of the
> process in the container invoke sys_reboot. This solution works very
> well and solves the problem but is it acceptable ? I don't see any
> alternative usable for other use cases. It is probable nobody cares
> about sys_reboot was called because there is nothing to do as the system
> will halt a few microseconds after :)
>
> Does anyone have an idea ?
>
> Thanks in advance
> -- Daniel
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [RFC] catching sys_reboot syscall
2011-08-10 20:10 ` Bruno Prémont
@ 2011-08-10 20:49 ` Daniel Lezcano
2011-08-11 16:30 ` Bruno Prémont
0 siblings, 1 reply; 10+ messages in thread
From: Daniel Lezcano @ 2011-08-10 20:49 UTC (permalink / raw)
To: Bruno Prémont; +Cc: Linux Kernel Mailing List, LXC Development, containers
On 08/10/2011 10:10 PM, Bruno Prémont wrote:
> Hi Daniel,
>
> [I'm adding containers ml as we had a discussion there some time ago
> for this feature]
[ ... ]
>> + if (cmd == LINUX_REBOOT_CMD_RESTART2)
>> + if (strncpy_from_user(&buffer[0], arg, sizeof(buffer) - 1) < 0)
>> + return -EFAULT;
>> +
>> + /* If we are not in the initial pid namespace, we send a signal
>> + * to the parent of this init pid namespace, notifying a shutdown
>> + * occured */
>> + if (pid_ns != &init_pid_ns)
>> + pid_namespace_reboot(pid_ns, cmd, buffer);
> Should there be a return here?
> Or does pid_namespace_reboot() never return by submitting signal to
> parent?
Yes, it does not return a value, like 'do_notify_parent_cldstop'
Thanks
-- Daniel
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [RFC] catching sys_reboot syscall
2011-08-10 20:49 ` Daniel Lezcano
@ 2011-08-11 16:30 ` Bruno Prémont
2011-08-11 16:49 ` Daniel Lezcano
0 siblings, 1 reply; 10+ messages in thread
From: Bruno Prémont @ 2011-08-11 16:30 UTC (permalink / raw)
To: Daniel Lezcano; +Cc: Linux Kernel Mailing List, LXC Development, containers
On Wed, 10 August 2011 Daniel Lezcano <daniel.lezcano@free.fr> wrote:
> On 08/10/2011 10:10 PM, Bruno Prémont wrote:
> > Hi Daniel,
> >
> > [I'm adding containers ml as we had a discussion there some time ago
> > for this feature]
>
> [ ... ]
>
> >> + if (cmd == LINUX_REBOOT_CMD_RESTART2)
> >> + if (strncpy_from_user(&buffer[0], arg, sizeof(buffer) - 1) < 0)
> >> + return -EFAULT;
> >> +
> >> + /* If we are not in the initial pid namespace, we send a signal
> >> + * to the parent of this init pid namespace, notifying a shutdown
> >> + * occured */
> >> + if (pid_ns != &init_pid_ns)
> >> + pid_namespace_reboot(pid_ns, cmd, buffer);
> > Should there be a return here?
> > Or does pid_namespace_reboot() never return by submitting signal to
> > parent?
>
> Yes, it does not return a value, like 'do_notify_parent_cldstop'
So execution flow continues reaching the whole "host reboot code"?
That's not so good as it then prevents using CAP_SYS_BOOT inside PID namespace
to limit access to rebooting the container from inside as giving a process
inside container CAP_SYS_BOOT would cause host to reboot (and when not given
process inside container would get -EPERM in all cases).
Wouldn't the following be better?:
...
+
+ /* We only trust the superuser with rebooting the system. */
+ if (!capable(CAP_SYS_BOOT))
+ return -EPERM;
+
+ /* If we are not in the initial pid namespace, we send a signal
+ * to the parent of this init pid namespace, notifying a shutdown
+ * occured */
+ if (pid_ns != &init_pid_ns) {
+ pid_namespace_reboot(pid_ns, cmd, buffer);
+ return 0;
+ }
+
mutex_lock(&reboot_mutex);
switch (cmd) {
...
If I misunderstood, please correct me.
Thanks,
Bruno
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [RFC] catching sys_reboot syscall
2011-08-11 16:30 ` Bruno Prémont
@ 2011-08-11 16:49 ` Daniel Lezcano
2011-08-11 17:04 ` Bruno Prémont
0 siblings, 1 reply; 10+ messages in thread
From: Daniel Lezcano @ 2011-08-11 16:49 UTC (permalink / raw)
To: Bruno Prémont; +Cc: Linux Kernel Mailing List, LXC Development, containers
On 08/11/2011 06:30 PM, Bruno Prémont wrote:
> On Wed, 10 August 2011 Daniel Lezcano <daniel.lezcano@free.fr> wrote:
>> On 08/10/2011 10:10 PM, Bruno Prémont wrote:
>>> Hi Daniel,
>>>
>>> [I'm adding containers ml as we had a discussion there some time ago
>>> for this feature]
>> [ ... ]
>>
>>>> + if (cmd == LINUX_REBOOT_CMD_RESTART2)
>>>> + if (strncpy_from_user(&buffer[0], arg, sizeof(buffer) - 1) < 0)
>>>> + return -EFAULT;
>>>> +
>>>> + /* If we are not in the initial pid namespace, we send a signal
>>>> + * to the parent of this init pid namespace, notifying a shutdown
>>>> + * occured */
>>>> + if (pid_ns != &init_pid_ns)
>>>> + pid_namespace_reboot(pid_ns, cmd, buffer);
>>> Should there be a return here?
>>> Or does pid_namespace_reboot() never return by submitting signal to
>>> parent?
>> Yes, it does not return a value, like 'do_notify_parent_cldstop'
> So execution flow continues reaching the whole "host reboot code"?
>
> That's not so good as it then prevents using CAP_SYS_BOOT inside PID namespace
> to limit access to rebooting the container from inside as giving a process
> inside container CAP_SYS_BOOT would cause host to reboot (and when not given
> process inside container would get -EPERM in all cases).
>
> Wouldn't the following be better?:
> ...
> +
> + /* We only trust the superuser with rebooting the system. */
> + if (!capable(CAP_SYS_BOOT))
> + return -EPERM;
> +
> + /* If we are not in the initial pid namespace, we send a signal
> + * to the parent of this init pid namespace, notifying a shutdown
> + * occured */
> + if (pid_ns != &init_pid_ns) {
> + pid_namespace_reboot(pid_ns, cmd, buffer);
> + return 0;
> + }
> +
> mutex_lock(&reboot_mutex);
> switch (cmd) {
> ...
>
>
> If I misunderstood, please correct me.
Yep, this is what I did at the beginning but I realized I was closing
the door for future applications using the pid namespaces. The pid
namespace could be used by another kind of application, not a container,
running some administrative tasks so they may want to shutdown the host
from a different pid namespace.
For this reason, to prevent this execution flow, the container has to
drop the CAP_SYS_BOOT in addition of taking care of the SIGCHLD signal
with CLDREBOOT.
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [RFC] catching sys_reboot syscall
2011-08-11 16:49 ` Daniel Lezcano
@ 2011-08-11 17:04 ` Bruno Prémont
2011-08-11 18:10 ` [lxc-devel] " Daniel Lezcano
2011-08-11 18:10 ` Serge Hallyn
0 siblings, 2 replies; 10+ messages in thread
From: Bruno Prémont @ 2011-08-11 17:04 UTC (permalink / raw)
To: Daniel Lezcano; +Cc: Linux Kernel Mailing List, LXC Development, containers
On Thu, 11 August 2011 Daniel Lezcano <daniel.lezcano@free.fr> wrote:
> On 08/11/2011 06:30 PM, Bruno Prémont wrote:
> > On Wed, 10 August 2011 Daniel Lezcano <daniel.lezcano@free.fr> wrote:
> >> On 08/10/2011 10:10 PM, Bruno Prémont wrote:
> >>> Hi Daniel,
> >>>
> >>> [I'm adding containers ml as we had a discussion there some time ago
> >>> for this feature]
> >> [ ... ]
> >>
> >>>> + if (cmd == LINUX_REBOOT_CMD_RESTART2)
> >>>> + if (strncpy_from_user(&buffer[0], arg, sizeof(buffer) - 1) < 0)
> >>>> + return -EFAULT;
> >>>> +
> >>>> + /* If we are not in the initial pid namespace, we send a signal
> >>>> + * to the parent of this init pid namespace, notifying a shutdown
> >>>> + * occured */
> >>>> + if (pid_ns != &init_pid_ns)
> >>>> + pid_namespace_reboot(pid_ns, cmd, buffer);
> >>> Should there be a return here?
> >>> Or does pid_namespace_reboot() never return by submitting signal to
> >>> parent?
> >> Yes, it does not return a value, like 'do_notify_parent_cldstop'
> > So execution flow continues reaching the whole "host reboot code"?
> >
> > That's not so good as it then prevents using CAP_SYS_BOOT inside PID namespace
> > to limit access to rebooting the container from inside as giving a process
> > inside container CAP_SYS_BOOT would cause host to reboot (and when not given
> > process inside container would get -EPERM in all cases).
> >
> > Wouldn't the following be better?:
> > ...
> > +
> > + /* We only trust the superuser with rebooting the system. */
> > + if (!capable(CAP_SYS_BOOT))
> > + return -EPERM;
> > +
> > + /* If we are not in the initial pid namespace, we send a signal
> > + * to the parent of this init pid namespace, notifying a shutdown
> > + * occured */
> > + if (pid_ns != &init_pid_ns) {
> > + pid_namespace_reboot(pid_ns, cmd, buffer);
> > + return 0;
> > + }
> > +
> > mutex_lock(&reboot_mutex);
> > switch (cmd) {
> > ...
> >
> >
> > If I misunderstood, please correct me.
>
> Yep, this is what I did at the beginning but I realized I was closing
> the door for future applications using the pid namespaces. The pid
> namespace could be used by another kind of application, not a container,
> running some administrative tasks so they may want to shutdown the host
> from a different pid namespace.
>
> For this reason, to prevent this execution flow, the container has to
> drop the CAP_SYS_BOOT in addition of taking care of the SIGCHLD signal
> with CLDREBOOT.
Ok, though for later source code readers to know adding/extending comment
would be nice.
Maybe something like
+ /* If we are not in the initial pid namespace, we send a signal
+ * to the parent of this init pid namespace, notifying a shutdown
+ * occured
+ * NOTE: if process has CAP_SYS_BOOT it will additionally have the
+ * same effect as if it was not namespaced */
How would all of this integrate with the ongoing work on user namespaces?
Maybe that one should later be the differentiator for who may or may not
trigger the host reboot.
In addition sending the signal to parent process seems moot as chances are
that parent process will never have the opportunity to see the signal when
the host is being rebooted.
Then a construct like the following would give a better hint to the reader:
...
+
+ /* We only trust the superuser with rebooting the system. */
+ if (!capable(CAP_SYS_BOOT)) {
+ /* If we are not in the initial pid namespace, we send a signal
+ * to the parent of this init pid namespace, notifying a shutdown
+ * occured */
+ if (pid_ns != &init_pid_ns)
+ pid_namespace_reboot(pid_ns, cmd, buffer);
+
+ return -EPERM;
+ }
+
...
Thanks,
Bruno
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [lxc-devel] [RFC] catching sys_reboot syscall
2011-08-11 17:04 ` Bruno Prémont
@ 2011-08-11 18:10 ` Daniel Lezcano
2011-08-11 18:10 ` Serge Hallyn
1 sibling, 0 replies; 10+ messages in thread
From: Daniel Lezcano @ 2011-08-11 18:10 UTC (permalink / raw)
To: Bruno Prémont
Cc: containers, LXC, Linux Kernel Mailing List, Development
On 08/11/2011 07:04 PM, Bruno Prémont wrote:
> On Thu, 11 August 2011 Daniel Lezcano <daniel.lezcano@free.fr> wrote:
>> On 08/11/2011 06:30 PM, Bruno Prémont wrote:
>>> On Wed, 10 August 2011 Daniel Lezcano <daniel.lezcano@free.fr> wrote:
>>>> On 08/10/2011 10:10 PM, Bruno Prémont wrote:
>>>>> Hi Daniel,
>>>>>
>>>>> [I'm adding containers ml as we had a discussion there some time ago
>>>>> for this feature]
>>>> [ ... ]
>>>>
>>>>>> + if (cmd == LINUX_REBOOT_CMD_RESTART2)
>>>>>> + if (strncpy_from_user(&buffer[0], arg, sizeof(buffer) - 1) < 0)
>>>>>> + return -EFAULT;
>>>>>> +
>>>>>> + /* If we are not in the initial pid namespace, we send a signal
>>>>>> + * to the parent of this init pid namespace, notifying a shutdown
>>>>>> + * occured */
>>>>>> + if (pid_ns != &init_pid_ns)
>>>>>> + pid_namespace_reboot(pid_ns, cmd, buffer);
>>>>> Should there be a return here?
>>>>> Or does pid_namespace_reboot() never return by submitting signal to
>>>>> parent?
>>>> Yes, it does not return a value, like 'do_notify_parent_cldstop'
>>> So execution flow continues reaching the whole "host reboot code"?
>>>
>>> That's not so good as it then prevents using CAP_SYS_BOOT inside PID namespace
>>> to limit access to rebooting the container from inside as giving a process
>>> inside container CAP_SYS_BOOT would cause host to reboot (and when not given
>>> process inside container would get -EPERM in all cases).
>>>
>>> Wouldn't the following be better?:
>>> ...
>>> +
>>> + /* We only trust the superuser with rebooting the system. */
>>> + if (!capable(CAP_SYS_BOOT))
>>> + return -EPERM;
>>> +
>>> + /* If we are not in the initial pid namespace, we send a signal
>>> + * to the parent of this init pid namespace, notifying a shutdown
>>> + * occured */
>>> + if (pid_ns != &init_pid_ns) {
>>> + pid_namespace_reboot(pid_ns, cmd, buffer);
>>> + return 0;
>>> + }
>>> +
>>> mutex_lock(&reboot_mutex);
>>> switch (cmd) {
>>> ...
>>>
>>>
>>> If I misunderstood, please correct me.
>>
>> Yep, this is what I did at the beginning but I realized I was closing
>> the door for future applications using the pid namespaces. The pid
>> namespace could be used by another kind of application, not a container,
>> running some administrative tasks so they may want to shutdown the host
>> from a different pid namespace.
>>
>> For this reason, to prevent this execution flow, the container has to
>> drop the CAP_SYS_BOOT in addition of taking care of the SIGCHLD signal
>> with CLDREBOOT.
>
> Ok, though for later source code readers to know adding/extending comment
> would be nice.
> Maybe something like
>
> + /* If we are not in the initial pid namespace, we send a signal
> + * to the parent of this init pid namespace, notifying a shutdown
> + * occured
> + * NOTE: if process has CAP_SYS_BOOT it will additionally have the
> + * same effect as if it was not namespaced */
>
>
> How would all of this integrate with the ongoing work on user namespaces?
> Maybe that one should later be the differentiator for who may or may not
> trigger the host reboot.
I think if you are in a different user namespace than the init one, the
process won't be able to reboot.
I talked with Serge about that and he should execute the
pid_namespace_reboot if it is 'ns_capable' of rebooting the host.
But I think that does not collide after all.
> In addition sending the signal to parent process seems moot as chances are
> that parent process will never have the opportunity to see the signal when
> the host is being rebooted.
Right.
> Then a construct like the following would give a better hint to the reader:
> ...
> +
> + /* We only trust the superuser with rebooting the system. */
> + if (!capable(CAP_SYS_BOOT)) {
> + /* If we are not in the initial pid namespace, we send a signal
> + * to the parent of this init pid namespace, notifying a shutdown
> + * occured */
> + if (pid_ns != &init_pid_ns)
> + pid_namespace_reboot(pid_ns, cmd, buffer);
> +
> + return -EPERM;
> + }
Ok, let me respin the patchset and change that. I will submit the patch
to akpm and lkml. Let's see what they think about this approach.
Thanks
-- Daniel
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [RFC] catching sys_reboot syscall
2011-08-11 17:04 ` Bruno Prémont
2011-08-11 18:10 ` [lxc-devel] " Daniel Lezcano
@ 2011-08-11 18:10 ` Serge Hallyn
2011-08-11 18:40 ` [PATCH] add pid->user_ns Serge Hallyn
1 sibling, 1 reply; 10+ messages in thread
From: Serge Hallyn @ 2011-08-11 18:10 UTC (permalink / raw)
To: Bruno Prémont
Cc: Daniel Lezcano, containers, Linux Kernel Mailing List,
LXC Development
Quoting Bruno Prémont (bonbons@linux-vserver.org):
> On Thu, 11 August 2011 Daniel Lezcano <daniel.lezcano@free.fr> wrote:
> > On 08/11/2011 06:30 PM, Bruno Prémont wrote:
> > > On Wed, 10 August 2011 Daniel Lezcano <daniel.lezcano@free.fr> wrote:
> > >> On 08/10/2011 10:10 PM, Bruno Prémont wrote:
> > >>> Hi Daniel,
> > >>>
> > >>> [I'm adding containers ml as we had a discussion there some time ago
> > >>> for this feature]
> > >> [ ... ]
> > >>
> > >>>> + if (cmd == LINUX_REBOOT_CMD_RESTART2)
> > >>>> + if (strncpy_from_user(&buffer[0], arg, sizeof(buffer) - 1) < 0)
> > >>>> + return -EFAULT;
> > >>>> +
> > >>>> + /* If we are not in the initial pid namespace, we send a signal
> > >>>> + * to the parent of this init pid namespace, notifying a shutdown
> > >>>> + * occured */
> > >>>> + if (pid_ns != &init_pid_ns)
> > >>>> + pid_namespace_reboot(pid_ns, cmd, buffer);
> > >>> Should there be a return here?
> > >>> Or does pid_namespace_reboot() never return by submitting signal to
> > >>> parent?
> > >> Yes, it does not return a value, like 'do_notify_parent_cldstop'
> > > So execution flow continues reaching the whole "host reboot code"?
> > >
> > > That's not so good as it then prevents using CAP_SYS_BOOT inside PID namespace
> > > to limit access to rebooting the container from inside as giving a process
> > > inside container CAP_SYS_BOOT would cause host to reboot (and when not given
> > > process inside container would get -EPERM in all cases).
> > >
> > > Wouldn't the following be better?:
> > > ...
> > > +
> > > + /* We only trust the superuser with rebooting the system. */
> > > + if (!capable(CAP_SYS_BOOT))
> > > + return -EPERM;
> > > +
> > > + /* If we are not in the initial pid namespace, we send a signal
> > > + * to the parent of this init pid namespace, notifying a shutdown
> > > + * occured */
> > > + if (pid_ns != &init_pid_ns) {
> > > + pid_namespace_reboot(pid_ns, cmd, buffer);
> > > + return 0;
> > > + }
> > > +
> > > mutex_lock(&reboot_mutex);
> > > switch (cmd) {
> > > ...
> > >
> > >
> > > If I misunderstood, please correct me.
> >
> > Yep, this is what I did at the beginning but I realized I was closing
> > the door for future applications using the pid namespaces. The pid
> > namespace could be used by another kind of application, not a container,
> > running some administrative tasks so they may want to shutdown the host
> > from a different pid namespace.
> >
> > For this reason, to prevent this execution flow, the container has to
> > drop the CAP_SYS_BOOT in addition of taking care of the SIGCHLD signal
> > with CLDREBOOT.
>
> Ok, though for later source code readers to know adding/extending comment
> would be nice.
> Maybe something like
>
> + /* If we are not in the initial pid namespace, we send a signal
> + * to the parent of this init pid namespace, notifying a shutdown
> + * occured
> + * NOTE: if process has CAP_SYS_BOOT it will additionally have the
> + * same effect as if it was not namespaced */
>
>
> How would all of this integrate with the ongoing work on user namespaces?
> Maybe that one should later be the differentiator for who may or may not
> trigger the host reboot.
Right, then you'll be able to do:
if (ns_capable(current_pid_ns()->user_ns, CAP_SYS_BOOT)) {
// do container reboot stuff
}
if (!capable(CAP_SYS_BOOT)
return;
// do host reboot stuff
The first checks whether the task has privilege against the user ns
which owns his pid_ns.
The second one is a synonym for
if (!ns_capable(&init_user_ns, CAP_SYS_BOOT))
where init_user_ns is the owner of all physical resources.
Right now pid_ns doesn't yet have a user_ns owner (except in my patch over
here: http://kernel.ubuntu.com/git?p=serge/linux-syslogns.git;a=commit;h=63556e9a39bcd75ec4a88333425800905013c73e )
and, if it did, well you can't yet do enough in a user namespace to run
a container. But that'll be the ideal. Hopefully soon...
-serge
^ permalink raw reply [flat|nested] 10+ messages in thread
* [PATCH] add pid->user_ns
2011-08-11 18:10 ` Serge Hallyn
@ 2011-08-11 18:40 ` Serge Hallyn
0 siblings, 0 replies; 10+ messages in thread
From: Serge Hallyn @ 2011-08-11 18:40 UTC (permalink / raw)
To: Bruno Prémont
Cc: containers, LXC Development, Linux Kernel Mailing List, daniel,
eric
[ Here is the patch which you'd need to be able to add the boot
check against pid_ns ]
This will allow us to check whether a task has privilege over the
pid namespace.
Signed-off-by: Serge Hallyn <serge.hallyn@canonical.com>
---
include/linux/pid_namespace.h | 9 +++++++--
kernel/nsproxy.c | 2 +-
kernel/pid.c | 1 +
kernel/pid_namespace.c | 13 ++++++++++---
4 files changed, 19 insertions(+), 6 deletions(-)
diff --git a/include/linux/pid_namespace.h b/include/linux/pid_namespace.h
index 38d1032..c1b5a48 100644
--- a/include/linux/pid_namespace.h
+++ b/include/linux/pid_namespace.h
@@ -7,6 +7,9 @@
#include <linux/nsproxy.h>
#include <linux/kref.h>
+struct user_namespace;
+extern struct user_namespace init_user_ns;
+
struct pidmap {
atomic_t nr_free;
void *page;
@@ -30,6 +33,7 @@ struct pid_namespace {
#ifdef CONFIG_BSD_PROCESS_ACCT
struct bsd_acct_struct *bacct;
#endif
+ struct user_namespace *user_ns;
};
extern struct pid_namespace init_pid_ns;
@@ -42,7 +46,7 @@ static inline struct pid_namespace *get_pid_ns(struct pid_namespace *ns)
return ns;
}
-extern struct pid_namespace *copy_pid_ns(unsigned long flags, struct pid_namespace *ns);
+extern struct pid_namespace *copy_pid_ns(unsigned long flags, struct task_struct *tsk);
extern void free_pid_ns(struct kref *kref);
extern void zap_pid_ns_processes(struct pid_namespace *pid_ns);
@@ -61,8 +65,9 @@ static inline struct pid_namespace *get_pid_ns(struct pid_namespace *ns)
}
static inline struct pid_namespace *
-copy_pid_ns(unsigned long flags, struct pid_namespace *ns)
+copy_pid_ns(unsigned long flags, struct task_struct *tsk)
{
+ struct pid_namespace *ns = task_active_pid_ns(tsk);
if (flags & CLONE_NEWPID)
ns = ERR_PTR(-EINVAL);
return ns;
diff --git a/kernel/nsproxy.c b/kernel/nsproxy.c
index 9aeab4b..97e21ea 100644
--- a/kernel/nsproxy.c
+++ b/kernel/nsproxy.c
@@ -84,7 +84,7 @@ static struct nsproxy *create_new_namespaces(unsigned long flags,
goto out_ipc;
}
- new_nsp->pid_ns = copy_pid_ns(flags, task_active_pid_ns(tsk));
+ new_nsp->pid_ns = copy_pid_ns(flags, tsk);
if (IS_ERR(new_nsp->pid_ns)) {
err = PTR_ERR(new_nsp->pid_ns);
goto out_pid;
diff --git a/kernel/pid.c b/kernel/pid.c
index e432057..4a1e66f 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -78,6 +78,7 @@ struct pid_namespace init_pid_ns = {
.last_pid = 0,
.level = 0,
.child_reaper = &init_task,
+ .user_ns = &init_user_ns,
};
EXPORT_SYMBOL_GPL(init_pid_ns);
diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c
index e9c9adc..6818ea5 100644
--- a/kernel/pid_namespace.c
+++ b/kernel/pid_namespace.c
@@ -10,6 +10,7 @@
#include <linux/pid.h>
#include <linux/pid_namespace.h>
+#include <linux/user_namespace.h>
#include <linux/syscalls.h>
#include <linux/err.h>
#include <linux/acct.h>
@@ -69,7 +70,8 @@ err_alloc:
return NULL;
}
-static struct pid_namespace *create_pid_namespace(struct pid_namespace *parent_pid_ns)
+static struct pid_namespace *create_pid_namespace(struct task_struct *tsk,
+ struct pid_namespace *parent_pid_ns)
{
struct pid_namespace *ns;
unsigned int level = parent_pid_ns->level + 1;
@@ -97,6 +99,8 @@ static struct pid_namespace *create_pid_namespace(struct pid_namespace *parent_p
for (i = 1; i < PIDMAP_ENTRIES; i++)
atomic_set(&ns->pidmap[i].nr_free, BITS_PER_PAGE);
+ ns->user_ns = get_user_ns(task_cred_xxx(tsk, user)->user_ns);
+
err = pid_ns_prepare_proc(ns);
if (err)
goto out_put_parent_pid_ns;
@@ -122,13 +126,15 @@ static void destroy_pid_namespace(struct pid_namespace *ns)
kmem_cache_free(pid_ns_cachep, ns);
}
-struct pid_namespace *copy_pid_ns(unsigned long flags, struct pid_namespace *old_ns)
+struct pid_namespace *copy_pid_ns(unsigned long flags, struct task_struct *tsk)
{
+ struct pid_namespace *old_ns = task_active_pid_ns(tsk);
+
if (!(flags & CLONE_NEWPID))
return get_pid_ns(old_ns);
if (flags & (CLONE_THREAD|CLONE_PARENT))
return ERR_PTR(-EINVAL);
- return create_pid_namespace(old_ns);
+ return create_pid_namespace(tsk, old_ns);
}
void free_pid_ns(struct kref *kref)
@@ -139,6 +145,7 @@ void free_pid_ns(struct kref *kref)
parent = ns->parent;
destroy_pid_namespace(ns);
+ put_user_ns(ns->user_ns);
if (parent != NULL)
put_pid_ns(parent);
--
1.7.5.4
^ permalink raw reply related [flat|nested] 10+ messages in thread
* Re: [RFC] catching sys_reboot syscall
2011-08-08 21:14 [RFC] catching sys_reboot syscall Daniel Lezcano
2011-08-10 20:10 ` Bruno Prémont
@ 2011-08-20 11:03 ` Pavel Machek
1 sibling, 0 replies; 10+ messages in thread
From: Pavel Machek @ 2011-08-20 11:03 UTC (permalink / raw)
To: Daniel Lezcano; +Cc: Linux Kernel Mailing List, LXC Development
Hi!
> In this case, the container shutdowns correctly but, at the end, the
> init process is waiting indefinitely and we have the containers stuck
> with one process (the init process).
>
> In order to fix that, we used a hypervisor process, parent of the
> container's init process, watching for the container's utmp file and
> detecting when the runlevel changes. When this runlevel change is
> detected we wait for the container to have one process left and then we
> kill the container's init.
>
> That works well if we modify the distro configuration files, we make
> /var/run to not be a tmpfs and we remove all the files inside this
> directory when the container boots. *But* as soon as we upgrade the
> container distro, all the tweaks are lost. So this method works but at
> the cost of tweaking the containers configuration files again and again,
> each time there is an update, which is not tolerable in a production
> environment.
>
> This problem is easy to solve with a small hack in the kernel:
Hmm. If you just made sys_reboot() equivalent to exit() for container
case... perhaps patch would be even simpler..?
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
^ permalink raw reply [flat|nested] 10+ messages in thread
end of thread, other threads:[~2011-08-20 11:03 UTC | newest]
Thread overview: 10+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2011-08-08 21:14 [RFC] catching sys_reboot syscall Daniel Lezcano
2011-08-10 20:10 ` Bruno Prémont
2011-08-10 20:49 ` Daniel Lezcano
2011-08-11 16:30 ` Bruno Prémont
2011-08-11 16:49 ` Daniel Lezcano
2011-08-11 17:04 ` Bruno Prémont
2011-08-11 18:10 ` [lxc-devel] " Daniel Lezcano
2011-08-11 18:10 ` Serge Hallyn
2011-08-11 18:40 ` [PATCH] add pid->user_ns Serge Hallyn
2011-08-20 11:03 ` [RFC] catching sys_reboot syscall Pavel Machek
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox