* [PATCH 1/7][v8] Remove 'handler' parameter to tracehook functions
2009-02-19 3:02 [PATCH 0/7][v8] Container-init signal semantics Sukadev Bhattiprolu
@ 2009-02-19 3:05 ` Sukadev Bhattiprolu
2009-02-19 3:05 ` [PATCH 2/7][v8] Protect init from unwanted signals more Sukadev Bhattiprolu
` (7 subsequent siblings)
8 siblings, 0 replies; 29+ messages in thread
From: Sukadev Bhattiprolu @ 2009-02-19 3:05 UTC (permalink / raw)
To: Andrew Morton
Cc: Oleg Nesterov, roland, Eric W. Biederman, daniel, Containers,
linux-kernel
From: Oleg Nesterov <oleg@redhat.com>
Date: Wed, 24 Dec 2008 13:35:17 -0800
Subject: [PATCH 1/7][v8] Remove 'handler' parameter to tracehook functions
Based on an earlier patch submitted by Oleg Nesterov and comments
from Roland McGrath (http://lkml.org/lkml/2008/11/19/258).
The handler parameter is currently unused in the tracehook functions.
Besides, the tracehook functions are called with siglock held, so the
functions can check the handler if they later need to.
Removing the parameter simiplifies changes to sig_ignored() in a follow-on
patch.
Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
Acked-by: Roland McGrath <roland@redhat.com>
---
arch/x86/kernel/ptrace.c | 2 +-
include/linux/tracehook.h | 13 ++++---------
kernel/signal.c | 6 +++---
3 files changed, 8 insertions(+), 13 deletions(-)
diff --git a/arch/x86/kernel/ptrace.c b/arch/x86/kernel/ptrace.c
index 5a4c23d..2117492 100644
--- a/arch/x86/kernel/ptrace.c
+++ b/arch/x86/kernel/ptrace.c
@@ -1461,6 +1461,6 @@ asmregparm void syscall_trace_leave(struct pt_regs *regs)
* system call instruction.
*/
if (test_thread_flag(TIF_SINGLESTEP) &&
- tracehook_consider_fatal_signal(current, SIGTRAP, SIG_DFL))
+ tracehook_consider_fatal_signal(current, SIGTRAP))
send_sigtrap(current, regs, 0, TRAP_BRKPT);
}
diff --git a/include/linux/tracehook.h b/include/linux/tracehook.h
index c913155..c7aa154 100644
--- a/include/linux/tracehook.h
+++ b/include/linux/tracehook.h
@@ -388,17 +388,14 @@ static inline void tracehook_signal_handler(int sig, siginfo_t *info,
* tracehook_consider_ignored_signal - suppress short-circuit of ignored signal
* @task: task receiving the signal
* @sig: signal number being sent
- * @handler: %SIG_IGN or %SIG_DFL
*
* Return zero iff tracing doesn't care to examine this ignored signal,
* so it can short-circuit normal delivery and never even get queued.
- * Either @handler is %SIG_DFL and @sig's default is ignore, or it's %SIG_IGN.
*
* Called with @task->sighand->siglock held.
*/
static inline int tracehook_consider_ignored_signal(struct task_struct *task,
- int sig,
- void __user *handler)
+ int sig)
{
return (task_ptrace(task) & PT_PTRACED) != 0;
}
@@ -407,19 +404,17 @@ static inline int tracehook_consider_ignored_signal(struct task_struct *task,
* tracehook_consider_fatal_signal - suppress special handling of fatal signal
* @task: task receiving the signal
* @sig: signal number being sent
- * @handler: %SIG_DFL or %SIG_IGN
*
* Return nonzero to prevent special handling of this termination signal.
- * Normally @handler is %SIG_DFL. It can be %SIG_IGN if @sig is ignored,
- * in which case force_sig() is about to reset it to %SIG_DFL.
+ * Normally handler for signal is %SIG_DFL. It can be %SIG_IGN if @sig is
+ * ignored, in which case force_sig() is about to reset it to %SIG_DFL.
* When this returns zero, this signal might cause a quick termination
* that does not give the debugger a chance to intercept the signal.
*
* Called with or without @task->sighand->siglock held.
*/
static inline int tracehook_consider_fatal_signal(struct task_struct *task,
- int sig,
- void __user *handler)
+ int sig)
{
return (task_ptrace(task) & PT_PTRACED) != 0;
}
diff --git a/kernel/signal.c b/kernel/signal.c
index 2a74fe8..ee8a8f1 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -74,7 +74,7 @@ static int sig_ignored(struct task_struct *t, int sig)
/*
* Tracers may want to know about even ignored signals.
*/
- return !tracehook_consider_ignored_signal(t, sig, handler);
+ return !tracehook_consider_ignored_signal(t, sig);
}
/*
@@ -318,7 +318,7 @@ int unhandled_signal(struct task_struct *tsk, int sig)
return 1;
if (handler != SIG_IGN && handler != SIG_DFL)
return 0;
- return !tracehook_consider_fatal_signal(tsk, sig, handler);
+ return !tracehook_consider_fatal_signal(tsk, sig);
}
@@ -777,7 +777,7 @@ static void complete_signal(int sig, struct task_struct *p, int group)
!(signal->flags & (SIGNAL_UNKILLABLE | SIGNAL_GROUP_EXIT)) &&
!sigismember(&t->real_blocked, sig) &&
(sig == SIGKILL ||
- !tracehook_consider_fatal_signal(t, sig, SIG_DFL))) {
+ !tracehook_consider_fatal_signal(t, sig))) {
/*
* This signal will be fatal to the whole group.
*/
--
1.5.2.5
^ permalink raw reply related [flat|nested] 29+ messages in thread* [PATCH 2/7][v8] Protect init from unwanted signals more
2009-02-19 3:02 [PATCH 0/7][v8] Container-init signal semantics Sukadev Bhattiprolu
2009-02-19 3:05 ` [PATCH 1/7][v8] Remove 'handler' parameter to tracehook functions Sukadev Bhattiprolu
@ 2009-02-19 3:05 ` Sukadev Bhattiprolu
2009-02-19 3:06 ` [PATCH 3/7][v8] Add from_ancestor_ns parameter to send_signal() Sukadev Bhattiprolu
` (6 subsequent siblings)
8 siblings, 0 replies; 29+ messages in thread
From: Sukadev Bhattiprolu @ 2009-02-19 3:05 UTC (permalink / raw)
To: Andrew Morton
Cc: Oleg Nesterov, roland, Eric W. Biederman, daniel, Containers,
linux-kernel
From: Oleg Nesterov <oleg@redhat.com>
Date: Wed, 24 Dec 2008 13:35:23 -0800
Subject: [PATCH 2/7][v8] Protect init from unwanted signals more
(This is a modified version of the patch submitted by Oleg Nesterov
http://lkml.org/lkml/2008/11/18/249 and tries to address comments
that came up in that discussion)
init ignores the SIG_DFL signals but we queue them anyway, including
SIGKILL. This is mostly OK, the signal will be dropped silently when
dequeued, but the pending SIGKILL has 2 bad implications:
- it implies fatal_signal_pending(), so we confuse things
like wait_for_completion_killable/lock_page_killable.
- for the sub-namespace inits, the pending SIGKILL can
mask (legacy_queue) the subsequent SIGKILL from the
parent namespace which must kill cinit reliably.
(preparation, cinits don't have SIGNAL_UNKILLABLE yet)
The patch can't help when init is ptraced, but ptracing of init is
not "safe" anyway.
Changelog[v5]:
- (Oleg Nesterov) Remove SIG_IGN check in sig_task_ignored()
and let sig_handler_ignored() check SIG_IGN.
Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
Acked-by: Roland McGrath <roland@redhat.com>
---
kernel/signal.c | 16 +++++++++++++---
1 files changed, 13 insertions(+), 3 deletions(-)
diff --git a/kernel/signal.c b/kernel/signal.c
index ee8a8f1..8761e4c 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -55,10 +55,21 @@ static int sig_handler_ignored(void __user *handler, int sig)
(handler == SIG_DFL && sig_kernel_ignore(sig));
}
-static int sig_ignored(struct task_struct *t, int sig)
+static int sig_task_ignored(struct task_struct *t, int sig)
{
void __user *handler;
+ handler = sig_handler(t, sig);
+
+ if (unlikely(t->signal->flags & SIGNAL_UNKILLABLE) &&
+ handler == SIG_DFL)
+ return 1;
+
+ return sig_handler_ignored(handler, sig);
+}
+
+static int sig_ignored(struct task_struct *t, int sig)
+{
/*
* Blocked signals are never ignored, since the
* signal handler may change by the time it is
@@ -67,8 +78,7 @@ static int sig_ignored(struct task_struct *t, int sig)
if (sigismember(&t->blocked, sig) || sigismember(&t->real_blocked, sig))
return 0;
- handler = sig_handler(t, sig);
- if (!sig_handler_ignored(handler, sig))
+ if (!sig_task_ignored(t, sig))
return 0;
/*
--
1.5.2.5
^ permalink raw reply related [flat|nested] 29+ messages in thread* [PATCH 3/7][v8] Add from_ancestor_ns parameter to send_signal()
2009-02-19 3:02 [PATCH 0/7][v8] Container-init signal semantics Sukadev Bhattiprolu
2009-02-19 3:05 ` [PATCH 1/7][v8] Remove 'handler' parameter to tracehook functions Sukadev Bhattiprolu
2009-02-19 3:05 ` [PATCH 2/7][v8] Protect init from unwanted signals more Sukadev Bhattiprolu
@ 2009-02-19 3:06 ` Sukadev Bhattiprolu
2009-02-19 3:06 ` [PATCH 4/7][v8] Protect cinit from unblocked SIG_DFL signals Sukadev Bhattiprolu
` (5 subsequent siblings)
8 siblings, 0 replies; 29+ messages in thread
From: Sukadev Bhattiprolu @ 2009-02-19 3:06 UTC (permalink / raw)
To: Andrew Morton
Cc: Oleg Nesterov, roland, Eric W. Biederman, daniel, Containers,
linux-kernel
From: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
Date: Tue, 6 Jan 2009 17:28:05 -0800
Subject: [PATCH 3/7][v8] Add from_ancestor_ns parameter to send_signal()
send_signal() (or its helper) needs to determine the pid namespace
of the sender. But a signal sent via kill_pid_info_as_uid() comes
from within the kernel and send_signal() does not need to determine
the pid namespace of the sender. So define a helper for send_signal()
which takes an additional parameter, 'from_ancestor_ns' and have
kill_pid_info_as_uid() use that helper directly.
The 'from_ancestor_ns' parameter will be used in a follow-on patch.
Changelog[v6]:
- New patch added to this patchset, based on suggestions from
Roland McGrath and Oleg Nesterov.
Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
---
kernel/signal.c | 12 +++++++++---
1 files changed, 9 insertions(+), 3 deletions(-)
diff --git a/kernel/signal.c b/kernel/signal.c
index 8761e4c..ff1dff9 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -823,8 +823,8 @@ static inline int legacy_queue(struct sigpending *signals, int sig)
return (sig < SIGRTMIN) && sigismember(&signals->signal, sig);
}
-static int send_signal(int sig, struct siginfo *info, struct task_struct *t,
- int group)
+static int __send_signal(int sig, struct siginfo *info, struct task_struct *t,
+ int group, int from_ancestor_ns)
{
struct sigpending *pending;
struct sigqueue *q;
@@ -899,6 +899,12 @@ out_set:
return 0;
}
+static int send_signal(int sig, struct siginfo *info, struct task_struct *t,
+ int group)
+{
+ return __send_signal(sig, info, t, group, 0);
+}
+
int print_fatal_signals;
static void print_fatal_signal(struct pt_regs *regs, int signr)
@@ -1143,7 +1149,7 @@ int kill_pid_info_as_uid(int sig, struct siginfo *info, struct pid *pid,
if (sig && p->sighand) {
unsigned long flags;
spin_lock_irqsave(&p->sighand->siglock, flags);
- ret = __group_send_sig_info(sig, info, p);
+ ret = __send_signal(sig, info, p, 1, 0);
spin_unlock_irqrestore(&p->sighand->siglock, flags);
}
out_unlock:
--
1.5.2.5
^ permalink raw reply related [flat|nested] 29+ messages in thread* [PATCH 4/7][v8] Protect cinit from unblocked SIG_DFL signals
2009-02-19 3:02 [PATCH 0/7][v8] Container-init signal semantics Sukadev Bhattiprolu
` (2 preceding siblings ...)
2009-02-19 3:06 ` [PATCH 3/7][v8] Add from_ancestor_ns parameter to send_signal() Sukadev Bhattiprolu
@ 2009-02-19 3:06 ` Sukadev Bhattiprolu
2009-02-19 3:07 ` [PATCH 5/7][v8] zap_pid_ns_process() should use force_sig() Sukadev Bhattiprolu
` (4 subsequent siblings)
8 siblings, 0 replies; 29+ messages in thread
From: Sukadev Bhattiprolu @ 2009-02-19 3:06 UTC (permalink / raw)
To: Andrew Morton
Cc: Oleg Nesterov, roland, Eric W. Biederman, daniel, Containers,
linux-kernel
From: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
Date: Wed, 24 Dec 2008 14:03:57 -0800
Subject: [PATCH 4/7][v8] Protect cinit from unblocked SIG_DFL signals
Drop early any SIG_DFL or SIG_IGN signals to container-init from within
the same container. But queue SIGSTOP and SIGKILL to the container-init
if they are from an ancestor container.
Blocked, fatal signals (i.e when SIG_DFL is to terminate) from within the
container can still terminate the container-init. That will be addressed
in the next patch.
Note: To be bisect-safe, SIGNAL_UNKILLABLE will be set for container-inits
in a follow-on patch. Until then, this patch is just a preparatory
step.
Changelog[v7]:
- (Oleg Nesterov) Bugfix: send_sigqueue() should use from_ancestor_ns=0
- siginfo_from_user(), siginfo_from_ancestor_ns() are needed in only
one place. Remove them and move their logic into send_signal().
Seems to make code more easier to read :-)
Changelog[v6]:
- (Roland McGrath) Remove unnecessary helper signal_task_unkillable()
and fold checks into sig_task_ignored().
Changelog[v4]:
- (Oleg Nesterov) Remove SIGNAL_UNKILLABLE_FROM_NS and rename
'same_ns' to 'from_ancestor_ns'.
- SIGNAL_UNKILLABLE is not yet set for container-inits (will be
set in follow-on patch).
Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
---
kernel/signal.c | 28 +++++++++++++++++++---------
1 files changed, 19 insertions(+), 9 deletions(-)
diff --git a/kernel/signal.c b/kernel/signal.c
index ff1dff9..a3a1304 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -55,20 +55,21 @@ static int sig_handler_ignored(void __user *handler, int sig)
(handler == SIG_DFL && sig_kernel_ignore(sig));
}
-static int sig_task_ignored(struct task_struct *t, int sig)
+static int sig_task_ignored(struct task_struct *t, int sig,
+ int from_ancestor_ns)
{
void __user *handler;
handler = sig_handler(t, sig);
if (unlikely(t->signal->flags & SIGNAL_UNKILLABLE) &&
- handler == SIG_DFL)
+ handler == SIG_DFL && !from_ancestor_ns)
return 1;
return sig_handler_ignored(handler, sig);
}
-static int sig_ignored(struct task_struct *t, int sig)
+static int sig_ignored(struct task_struct *t, int sig, int from_ancestor_ns)
{
/*
* Blocked signals are never ignored, since the
@@ -78,7 +79,7 @@ static int sig_ignored(struct task_struct *t, int sig)
if (sigismember(&t->blocked, sig) || sigismember(&t->real_blocked, sig))
return 0;
- if (!sig_task_ignored(t, sig))
+ if (!sig_task_ignored(t, sig, from_ancestor_ns))
return 0;
/*
@@ -634,7 +635,7 @@ static int check_kill_permission(int sig, struct siginfo *info,
* Returns true if the signal should be actually delivered, otherwise
* it should be dropped.
*/
-static int prepare_signal(int sig, struct task_struct *p)
+static int prepare_signal(int sig, struct task_struct *p, int from_ancestor_ns)
{
struct signal_struct *signal = p->signal;
struct task_struct *t;
@@ -718,7 +719,7 @@ static int prepare_signal(int sig, struct task_struct *p)
}
}
- return !sig_ignored(p, sig);
+ return !sig_ignored(p, sig, from_ancestor_ns);
}
/*
@@ -832,7 +833,8 @@ static int __send_signal(int sig, struct siginfo *info, struct task_struct *t,
trace_sched_signal_send(sig, t);
assert_spin_locked(&t->sighand->siglock);
- if (!prepare_signal(sig, t))
+
+ if (!prepare_signal(sig, t, from_ancestor_ns))
return 0;
pending = group ? &t->signal->shared_pending : &t->pending;
@@ -902,7 +904,15 @@ out_set:
static int send_signal(int sig, struct siginfo *info, struct task_struct *t,
int group)
{
- return __send_signal(sig, info, t, group, 0);
+ int from_ancestor_ns = 0;
+
+#ifdef CONFIG_PID_NS
+ if (!is_si_special(info) && SI_FROMUSER(info) &&
+ task_pid_nr_ns(current, task_active_pid_ns(t)) <= 0)
+ from_ancestor_ns = 1;
+#endif
+
+ return __send_signal(sig, info, t, group, from_ancestor_ns);
}
int print_fatal_signals;
@@ -1336,7 +1346,7 @@ int send_sigqueue(struct sigqueue *q, struct task_struct *t, int group)
goto ret;
ret = 1; /* the signal is ignored */
- if (!prepare_signal(sig, t))
+ if (!prepare_signal(sig, t, 0))
goto out;
ret = 0;
--
1.5.2.5
^ permalink raw reply related [flat|nested] 29+ messages in thread* [PATCH 5/7][v8] zap_pid_ns_process() should use force_sig()
2009-02-19 3:02 [PATCH 0/7][v8] Container-init signal semantics Sukadev Bhattiprolu
` (3 preceding siblings ...)
2009-02-19 3:06 ` [PATCH 4/7][v8] Protect cinit from unblocked SIG_DFL signals Sukadev Bhattiprolu
@ 2009-02-19 3:07 ` Sukadev Bhattiprolu
2009-02-19 18:59 ` Oleg Nesterov
2009-02-19 3:07 ` [PATCH 6/7][v8] Protect cinit from blocked fatal signals Sukadev Bhattiprolu
` (3 subsequent siblings)
8 siblings, 1 reply; 29+ messages in thread
From: Sukadev Bhattiprolu @ 2009-02-19 3:07 UTC (permalink / raw)
To: Andrew Morton
Cc: Oleg Nesterov, roland, Eric W. Biederman, daniel, Containers,
linux-kernel
From: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
Date: Wed, 18 Feb 2009 15:12:30 -0800
Subject: [PATCH 5/7][v8] zap_pid_ns_process() should use force_sig()
send_signal() assumes that signals with SEND_SIG_PRIV are generated from
within the same namespace. So any nested container-init processes become
immune to the SIGKILL generated by kill_proc_info() in zap_pid_ns_processes().
Use force_sig() in zap_pid_ns_processes() instead - force_sig() clears the
SIGNAL_UNKILLABLE flag ensuring the signal is processed by container-inits.
Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
---
kernel/pid_namespace.c | 14 +++++++++++++-
1 files changed, 13 insertions(+), 1 deletions(-)
diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c
index fab8ea8..33815ae 100644
--- a/kernel/pid_namespace.c
+++ b/kernel/pid_namespace.c
@@ -152,6 +152,7 @@ void zap_pid_ns_processes(struct pid_namespace *pid_ns)
{
int nr;
int rc;
+ struct task_struct *task;
/*
* The last thread in the cgroup-init thread group is terminating.
@@ -169,7 +170,18 @@ void zap_pid_ns_processes(struct pid_namespace *pid_ns)
read_lock(&tasklist_lock);
nr = next_pidmap(pid_ns, 1);
while (nr > 0) {
- kill_proc_info(SIGKILL, SEND_SIG_PRIV, nr);
+ rcu_read_lock();
+
+ /*
+ * Use force_sig() since it clears SIGNAL_UNKILLABLE ensuring
+ * any nested-container's init processes don't ignore the
+ * signal
+ */
+ task = pid_task(find_vpid(nr), PIDTYPE_PID);
+ force_sig(SIGKILL, task);
+
+ rcu_read_unlock();
+
nr = next_pidmap(pid_ns, nr);
}
read_unlock(&tasklist_lock);
--
1.5.2.5
^ permalink raw reply related [flat|nested] 29+ messages in thread* Re: [PATCH 5/7][v8] zap_pid_ns_process() should use force_sig()
2009-02-19 3:07 ` [PATCH 5/7][v8] zap_pid_ns_process() should use force_sig() Sukadev Bhattiprolu
@ 2009-02-19 18:59 ` Oleg Nesterov
2009-02-19 20:26 ` Sukadev Bhattiprolu
0 siblings, 1 reply; 29+ messages in thread
From: Oleg Nesterov @ 2009-02-19 18:59 UTC (permalink / raw)
To: Sukadev Bhattiprolu
Cc: Andrew Morton, roland, Eric W. Biederman, daniel, Containers,
linux-kernel
On 02/18, Sukadev Bhattiprolu wrote:
>
> read_lock(&tasklist_lock);
> nr = next_pidmap(pid_ns, 1);
> while (nr > 0) {
> - kill_proc_info(SIGKILL, SEND_SIG_PRIV, nr);
> + rcu_read_lock();
> +
> + /*
> + * Use force_sig() since it clears SIGNAL_UNKILLABLE ensuring
> + * any nested-container's init processes don't ignore the
> + * signal
> + */
> + task = pid_task(find_vpid(nr), PIDTYPE_PID);
> + force_sig(SIGKILL, task);
Shouldn't we check task != NULL ?
Oleg.
^ permalink raw reply [flat|nested] 29+ messages in thread* Re: [PATCH 5/7][v8] zap_pid_ns_process() should use force_sig()
2009-02-19 18:59 ` Oleg Nesterov
@ 2009-02-19 20:26 ` Sukadev Bhattiprolu
0 siblings, 0 replies; 29+ messages in thread
From: Sukadev Bhattiprolu @ 2009-02-19 20:26 UTC (permalink / raw)
To: Oleg Nesterov
Cc: Andrew Morton, roland, Eric W. Biederman, daniel, Containers,
linux-kernel
Oleg Nesterov [oleg@redhat.com] wrote:
| On 02/18, Sukadev Bhattiprolu wrote:
| >
| > read_lock(&tasklist_lock);
| > nr = next_pidmap(pid_ns, 1);
| > while (nr > 0) {
| > - kill_proc_info(SIGKILL, SEND_SIG_PRIV, nr);
| > + rcu_read_lock();
| > +
| > + /*
| > + * Use force_sig() since it clears SIGNAL_UNKILLABLE ensuring
| > + * any nested-container's init processes don't ignore the
| > + * signal
| > + */
| > + task = pid_task(find_vpid(nr), PIDTYPE_PID);
| > + force_sig(SIGKILL, task);
|
| Shouldn't we check task != NULL ?
Yes. Here is the updated patch.
---
From: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
Date: Wed, 18 Feb 2009 15:12:30 -0800
Subject: [PATCH 5/7][v8] zap_pid_ns_process() should use force_sig()
send_signal() assumes that signals with SEND_SIG_PRIV are generated from
within the same namespace. So any nested container-init processes become
immune to the SIGKILL generated by kill_proc_info() in zap_pid_ns_processes().
Use force_sig() in zap_pid_ns_processes() instead - force_sig() clears the
SIGNAL_UNKILLABLE flag ensuring the signal is processed by container-inits.
Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
---
kernel/pid_namespace.c | 15 ++++++++++++++-
1 files changed, 14 insertions(+), 1 deletions(-)
diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c
index fab8ea8..2d1001b 100644
--- a/kernel/pid_namespace.c
+++ b/kernel/pid_namespace.c
@@ -152,6 +152,7 @@ void zap_pid_ns_processes(struct pid_namespace *pid_ns)
{
int nr;
int rc;
+ struct task_struct *task;
/*
* The last thread in the cgroup-init thread group is terminating.
@@ -169,7 +170,19 @@ void zap_pid_ns_processes(struct pid_namespace *pid_ns)
read_lock(&tasklist_lock);
nr = next_pidmap(pid_ns, 1);
while (nr > 0) {
- kill_proc_info(SIGKILL, SEND_SIG_PRIV, nr);
+ rcu_read_lock();
+
+ /*
+ * Use force_sig() since it clears SIGNAL_UNKILLABLE ensuring
+ * any nested-container's init processes don't ignore the
+ * signal
+ */
+ task = pid_task(find_vpid(nr), PIDTYPE_PID);
+ if (task)
+ force_sig(SIGKILL, task);
+
+ rcu_read_unlock();
+
nr = next_pidmap(pid_ns, nr);
}
read_unlock(&tasklist_lock);
--
1.5.2.5
^ permalink raw reply related [flat|nested] 29+ messages in thread
* [PATCH 6/7][v8] Protect cinit from blocked fatal signals
2009-02-19 3:02 [PATCH 0/7][v8] Container-init signal semantics Sukadev Bhattiprolu
` (4 preceding siblings ...)
2009-02-19 3:07 ` [PATCH 5/7][v8] zap_pid_ns_process() should use force_sig() Sukadev Bhattiprolu
@ 2009-02-19 3:07 ` Sukadev Bhattiprolu
2009-02-19 3:07 ` [PATCH 7/7][v8] SI_USER: Masquerade si_pid when crossing pid ns boundary Sukadev Bhattiprolu
` (2 subsequent siblings)
8 siblings, 0 replies; 29+ messages in thread
From: Sukadev Bhattiprolu @ 2009-02-19 3:07 UTC (permalink / raw)
To: Andrew Morton
Cc: Oleg Nesterov, roland, Eric W. Biederman, daniel, Containers,
linux-kernel
From: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
Date: Wed, 24 Dec 2008 14:04:24 -0800
Subject: [PATCH 6/7][v8] Protect cinit from blocked fatal signals
Normally SIG_DFL signals to global and container-init are dropped early.
But if a signal is blocked when it is posted, we cannot drop the signal
since the receiver may install a handler before unblocking the signal.
Once this signal is queued however, the receiver container-init has
no way of knowing if the signal was sent from an ancestor or descendant
namespace. This patch ensures that contianer-init drops all SIG_DFL
signals in get_signal_to_deliver() except SIGKILL/SIGSTOP.
If SIGSTOP/SIGKILL originate from a descendant of container-init they
are never queued (i.e dropped in sig_ignored() in an earler patch).
If SIGSTOP/SIGKILL originate from parent namespace, the signal is queued
and container-init processes the signal.
IOW, if get_signal_to_deliver() sees a sig_kernel_only() signal for global
or container-init, the signal must have been generated internally or must
have come from an ancestor ns and we process the signal.
Further, the signal_group_exit() check was needed to cover the case of
a multi-threaded init sending SIGKILL to other threads when doing an
exit() or exec(). But since the new sig_kernel_only() check covers the
SIGKILL, the signal_group_exit() check is no longer needed and can be
removed.
Finally, now that we have all pieces in place, set SIGNAL_UNKILLABLE for
container-inits.
Changelog[v6]:
- Add a note regarding the signal_group_exit() in patch description.
Changelog[v5]:
- (Oleg Nesterov) Drop signal_unkillable(), simplify check in
get_signal_to_deliver() and drop check for signal_group_exit()
since it is covered by sig_kernel_only().
Changelog[v4]:
- Rename sig_unkillable() to unkillable_by_sig()
- Remove SIGNAL_UNKILLABLE_FROM_NS flag and simplify (Oleg Nesterov)
- Set SIGNAL_UNKILLABLE for container-init in this patch.
Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
---
kernel/fork.c | 2 ++
kernel/signal.c | 9 ++++++++-
2 files changed, 10 insertions(+), 1 deletions(-)
diff --git a/kernel/fork.c b/kernel/fork.c
index 2b45bc5..ddfa413 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -841,6 +841,8 @@ static int copy_signal(unsigned long clone_flags, struct task_struct *tsk)
atomic_set(&sig->live, 1);
init_waitqueue_head(&sig->wait_chldexit);
sig->flags = 0;
+ if (clone_flags & CLONE_NEWPID)
+ sig->flags |= SIGNAL_UNKILLABLE;
sig->group_exit_code = 0;
sig->group_exit_task = NULL;
sig->group_stop_count = 0;
diff --git a/kernel/signal.c b/kernel/signal.c
index a3a1304..c94355b 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -1862,9 +1862,16 @@ relock:
/*
* Global init gets no signals it doesn't want.
+ * Container-init gets no signals it doesn't want from same
+ * container.
+ *
+ * Note that if global/container-init sees a sig_kernel_only()
+ * signal here, the signal must have been generated internally
+ * or must have come from an ancestor namespace. In either
+ * case, the signal cannot be dropped.
*/
if (unlikely(signal->flags & SIGNAL_UNKILLABLE) &&
- !signal_group_exit(signal))
+ !sig_kernel_only(signr))
continue;
if (sig_kernel_stop(signr)) {
--
1.5.2.5
^ permalink raw reply related [flat|nested] 29+ messages in thread* [PATCH 7/7][v8] SI_USER: Masquerade si_pid when crossing pid ns boundary
2009-02-19 3:02 [PATCH 0/7][v8] Container-init signal semantics Sukadev Bhattiprolu
` (5 preceding siblings ...)
2009-02-19 3:07 ` [PATCH 6/7][v8] Protect cinit from blocked fatal signals Sukadev Bhattiprolu
@ 2009-02-19 3:07 ` Sukadev Bhattiprolu
2009-02-19 16:11 ` Eric W. Biederman
2009-02-19 14:59 ` [PATCH 0/7][v8] Container-init signal semantics Daniel Lezcano
2009-02-19 20:53 ` Oleg Nesterov
8 siblings, 1 reply; 29+ messages in thread
From: Sukadev Bhattiprolu @ 2009-02-19 3:07 UTC (permalink / raw)
To: Andrew Morton
Cc: Oleg Nesterov, roland, Eric W. Biederman, daniel, Containers,
linux-kernel
From: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
Date: Wed, 24 Dec 2008 14:14:18 -0800
Subject: [PATCH 7/7][v8] SI_USER: Masquerade si_pid when crossing pid ns boundary
When sending a signal to a descendant namespace, set ->si_pid to 0 since
the sender does not have a pid in the receiver's namespace.
Note:
- If rt_sigqueueinfo() sets si_code to SI_USER when sending a
signal across a pid namespace boundary, the value in ->si_pid
will be cleared to 0.
Changelog[v5]:
- (Oleg Nesterov) Address both sys_kill() and sys_tkill() cases
in send_signal() to simplify code (this drops patch 7/7 from
earlier version of patchset).
Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
---
kernel/signal.c | 2 ++
1 files changed, 2 insertions(+), 0 deletions(-)
diff --git a/kernel/signal.c b/kernel/signal.c
index c94355b..a416d77 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -883,6 +883,8 @@ static int __send_signal(int sig, struct siginfo *info, struct task_struct *t,
break;
default:
copy_siginfo(&q->info, info);
+ if (from_ancestor_ns)
+ q->info.si_pid = 0;
break;
}
} else if (!is_si_special(info)) {
--
1.5.2.5
^ permalink raw reply related [flat|nested] 29+ messages in thread* Re: [PATCH 7/7][v8] SI_USER: Masquerade si_pid when crossing pid ns boundary
2009-02-19 3:07 ` [PATCH 7/7][v8] SI_USER: Masquerade si_pid when crossing pid ns boundary Sukadev Bhattiprolu
@ 2009-02-19 16:11 ` Eric W. Biederman
2009-02-19 18:51 ` Oleg Nesterov
0 siblings, 1 reply; 29+ messages in thread
From: Eric W. Biederman @ 2009-02-19 16:11 UTC (permalink / raw)
To: Sukadev Bhattiprolu
Cc: Andrew Morton, Oleg Nesterov, roland, daniel, Containers,
linux-kernel
Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com> writes:
> From: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
> Date: Wed, 24 Dec 2008 14:14:18 -0800
> Subject: [PATCH 7/7][v8] SI_USER: Masquerade si_pid when crossing pid ns
> boundary
>
> When sending a signal to a descendant namespace, set ->si_pid to 0 since
> the sender does not have a pid in the receiver's namespace.
>
> Note:
> - If rt_sigqueueinfo() sets si_code to SI_USER when sending a
> signal across a pid namespace boundary, the value in ->si_pid
> will be cleared to 0.
>
> Changelog[v5]:
> - (Oleg Nesterov) Address both sys_kill() and sys_tkill() cases
> in send_signal() to simplify code (this drops patch 7/7 from
> earlier version of patchset).
>
> Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
> ---
> kernel/signal.c | 2 ++
> 1 files changed, 2 insertions(+), 0 deletions(-)
>
> diff --git a/kernel/signal.c b/kernel/signal.c
> index c94355b..a416d77 100644
> --- a/kernel/signal.c
> +++ b/kernel/signal.c
> @@ -883,6 +883,8 @@ static int __send_signal(int sig, struct siginfo *info,
> struct task_struct *t,
> break;
> default:
> copy_siginfo(&q->info, info);
> + if (from_ancestor_ns)
> + q->info.si_pid = 0;
This is wrong. siginfo is a union and you need to inspect
code to see if si_pid is present in the current union.
Eric
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH 7/7][v8] SI_USER: Masquerade si_pid when crossing pid ns boundary
2009-02-19 16:11 ` Eric W. Biederman
@ 2009-02-19 18:51 ` Oleg Nesterov
2009-02-19 22:18 ` Eric W. Biederman
0 siblings, 1 reply; 29+ messages in thread
From: Oleg Nesterov @ 2009-02-19 18:51 UTC (permalink / raw)
To: Eric W. Biederman
Cc: Sukadev Bhattiprolu, Andrew Morton, roland, daniel, Containers,
linux-kernel
On 02/19, Eric W. Biederman wrote:
>
> Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com> writes:
>
> > From: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
> > Date: Wed, 24 Dec 2008 14:14:18 -0800
> > Subject: [PATCH 7/7][v8] SI_USER: Masquerade si_pid when crossing pid ns
> > boundary
> >
> > When sending a signal to a descendant namespace, set ->si_pid to 0 since
> > the sender does not have a pid in the receiver's namespace.
> >
> > Note:
> > - If rt_sigqueueinfo() sets si_code to SI_USER when sending a
> > signal across a pid namespace boundary, the value in ->si_pid
> > will be cleared to 0.
> >
> > Changelog[v5]:
> > - (Oleg Nesterov) Address both sys_kill() and sys_tkill() cases
> > in send_signal() to simplify code (this drops patch 7/7 from
> > earlier version of patchset).
> >
> > Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
> > ---
> > kernel/signal.c | 2 ++
> > 1 files changed, 2 insertions(+), 0 deletions(-)
> >
> > diff --git a/kernel/signal.c b/kernel/signal.c
> > index c94355b..a416d77 100644
> > --- a/kernel/signal.c
> > +++ b/kernel/signal.c
> > @@ -883,6 +883,8 @@ static int __send_signal(int sig, struct siginfo *info,
> > struct task_struct *t,
> > break;
> > default:
> > copy_siginfo(&q->info, info);
> > + if (from_ancestor_ns)
> > + q->info.si_pid = 0;
>
> This is wrong. siginfo is a union and you need to inspect
> code to see if si_pid is present in the current union.
SI_FROMUSER() == T, unless we have more (hopefully not) in-kernel
users which send SI_FROMUSER() signals, .si_pid must be valid?
kill_pid_info_as_uid() was the only known sender of SI_FROMUSER
signal, it was converted to use __send_signal(from_ancestor_ns => 0).
Oleg.
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH 7/7][v8] SI_USER: Masquerade si_pid when crossing pid ns boundary
2009-02-19 18:51 ` Oleg Nesterov
@ 2009-02-19 22:18 ` Eric W. Biederman
2009-02-19 22:31 ` Oleg Nesterov
0 siblings, 1 reply; 29+ messages in thread
From: Eric W. Biederman @ 2009-02-19 22:18 UTC (permalink / raw)
To: Oleg Nesterov
Cc: Sukadev Bhattiprolu, Andrew Morton, roland, daniel, Containers,
linux-kernel
Oleg Nesterov <oleg@redhat.com> writes:
> On 02/19, Eric W. Biederman wrote:
>>
>> Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com> writes:
>>
>> > From: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
>> > Date: Wed, 24 Dec 2008 14:14:18 -0800
>> > Subject: [PATCH 7/7][v8] SI_USER: Masquerade si_pid when crossing pid ns
>> > boundary
>> >
>> > When sending a signal to a descendant namespace, set ->si_pid to 0 since
>> > the sender does not have a pid in the receiver's namespace.
>> >
>> > Note:
>> > - If rt_sigqueueinfo() sets si_code to SI_USER when sending a
>> > signal across a pid namespace boundary, the value in ->si_pid
>> > will be cleared to 0.
>> >
>> > Changelog[v5]:
>> > - (Oleg Nesterov) Address both sys_kill() and sys_tkill() cases
>> > in send_signal() to simplify code (this drops patch 7/7 from
>> > earlier version of patchset).
>> >
>> > Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
>> > ---
>> > kernel/signal.c | 2 ++
>> > 1 files changed, 2 insertions(+), 0 deletions(-)
>> >
>> > diff --git a/kernel/signal.c b/kernel/signal.c
>> > index c94355b..a416d77 100644
>> > --- a/kernel/signal.c
>> > +++ b/kernel/signal.c
>> > @@ -883,6 +883,8 @@ static int __send_signal(int sig, struct siginfo *info,
>> > struct task_struct *t,
>> > break;
>> > default:
>> > copy_siginfo(&q->info, info);
>> > + if (from_ancestor_ns)
>> > + q->info.si_pid = 0;
>>
>> This is wrong. siginfo is a union and you need to inspect
>> code to see if si_pid is present in the current union.
>
> SI_FROMUSER() == T, unless we have more (hopefully not) in-kernel
> users which send SI_FROMUSER() signals, .si_pid must be valid?
So the argument is that while things such as force_sig_info(SIGSEGV)
don't have a si_pid we don't care because from_ancestor_ns == 0.
Interesting. Then I don't know if we have any kernel senders
that cross the namespace boundaries.
That said I still object to this code.
sys_kill(-pgrp, SIGUSR1)
kill_something_info(SIGUSR1, &info, 0)
__kill_pgrp_info(SIGUSR1, &info task_pgrp(current))
group_send_sig_info(SIGUSR1, &info, tsk)
__group_send_sig_info(SIGUSR1, &info, tsk)
send_signal(SIGUSR1, &info, tsk, 1)
__send_signal(SIGUSR1, &info, tsk, 1)
Process groups and sessions can have processes in multiple pid
namespaces, which is very useful for not messing up your controlling
terminal.
In which case sys_kill cannot possibly set the si_pid value correct
and from_ancestor_ns is not enough either.
So I see two valid policies with setting si_pid. Push the work
out to the callers of send_signal (kill_pgrp in this case). And
know you have a valid set of siginfo values. Or handle the work
in send_signal.
Given that except for process groups we don't send the same siginfo
to multiple processes simply generating the right siginfo values
from the start appears easy enough.
I am not current with the current rule: the caller of send_signal will
do all of the work except for sometimes. I don't see how we can figure
out which code path has the bug in it with a rule like that.
Eric
^ permalink raw reply [flat|nested] 29+ messages in thread* Re: [PATCH 7/7][v8] SI_USER: Masquerade si_pid when crossing pid ns boundary
2009-02-19 22:18 ` Eric W. Biederman
@ 2009-02-19 22:31 ` Oleg Nesterov
2009-02-19 23:21 ` Eric W. Biederman
0 siblings, 1 reply; 29+ messages in thread
From: Oleg Nesterov @ 2009-02-19 22:31 UTC (permalink / raw)
To: Eric W. Biederman
Cc: Sukadev Bhattiprolu, Andrew Morton, roland, daniel, Containers,
linux-kernel
On 02/19, Eric W. Biederman wrote:
>
> Oleg Nesterov <oleg@redhat.com> writes:
> >
> > SI_FROMUSER() == T, unless we have more (hopefully not) in-kernel
> > users which send SI_FROMUSER() signals, .si_pid must be valid?
>
> So the argument is that while things such as force_sig_info(SIGSEGV)
> don't have a si_pid we don't care because from_ancestor_ns == 0.
>
> Interesting. Then I don't know if we have any kernel senders
> that cross the namespace boundaries.
>
> That said I still object to this code.
>
> sys_kill(-pgrp, SIGUSR1)
> kill_something_info(SIGUSR1, &info, 0)
> __kill_pgrp_info(SIGUSR1, &info task_pgrp(current))
> group_send_sig_info(SIGUSR1, &info, tsk)
> __group_send_sig_info(SIGUSR1, &info, tsk)
> send_signal(SIGUSR1, &info, tsk, 1)
> __send_signal(SIGUSR1, &info, tsk, 1)
>
>
> Process groups and sessions can have processes in multiple pid
> namespaces, which is very useful for not messing up your controlling
> terminal.
>
> In which case sys_kill cannot possibly set the si_pid value correct
> and from_ancestor_ns is not enough either.
(I know, I shouldn't reply today because I am already sleeping ;)
Why? send_signal() should calculate the correct value of
from_parent and pass it to __send_signal(). If it is true, then
we clear .si_pid in the copied siginfo (which was already queued).
We don't mangle the original siginfo.
This happens for each process we send the signal.
Or I misunderstood you?
Oleg.
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH 7/7][v8] SI_USER: Masquerade si_pid when crossing pid ns boundary
2009-02-19 22:31 ` Oleg Nesterov
@ 2009-02-19 23:21 ` Eric W. Biederman
2009-02-19 23:51 ` Roland McGrath
2009-02-20 0:28 ` Oleg Nesterov
0 siblings, 2 replies; 29+ messages in thread
From: Eric W. Biederman @ 2009-02-19 23:21 UTC (permalink / raw)
To: Oleg Nesterov
Cc: Sukadev Bhattiprolu, Andrew Morton, roland, daniel, Containers,
linux-kernel
Oleg Nesterov <oleg@redhat.com> writes:
> On 02/19, Eric W. Biederman wrote:
>>
>> Oleg Nesterov <oleg@redhat.com> writes:
>> >
>> > SI_FROMUSER() == T, unless we have more (hopefully not) in-kernel
>> > users which send SI_FROMUSER() signals, .si_pid must be valid?
>>
>> So the argument is that while things such as force_sig_info(SIGSEGV)
>> don't have a si_pid we don't care because from_ancestor_ns == 0.
>>
>> Interesting. Then I don't know if we have any kernel senders
>> that cross the namespace boundaries.
>>
>> That said I still object to this code.
>>
>> sys_kill(-pgrp, SIGUSR1)
>> kill_something_info(SIGUSR1, &info, 0)
>> __kill_pgrp_info(SIGUSR1, &info task_pgrp(current))
>> group_send_sig_info(SIGUSR1, &info, tsk)
>> __group_send_sig_info(SIGUSR1, &info, tsk)
>> send_signal(SIGUSR1, &info, tsk, 1)
>> __send_signal(SIGUSR1, &info, tsk, 1)
>>
>>
>> Process groups and sessions can have processes in multiple pid
>> namespaces, which is very useful for not messing up your controlling
>> terminal.
>>
>> In which case sys_kill cannot possibly set the si_pid value correct
>> and from_ancestor_ns is not enough either.
>
> (I know, I shouldn't reply today because I am already sleeping ;)
>
> Why? send_signal() should calculate the correct value of
> from_parent and pass it to __send_signal(). If it is true, then
> we clear .si_pid in the copied siginfo (which was already queued).
> We don't mangle the original siginfo.
>
> This happens for each process we send the signal.
>
> Or I misunderstood you?
Suppose I have 3 processes in a process group in three separate pid
namespaces.
Looking from the init pid namespace I have:
pid pgrp ppid
10 10 1
11 10 10
12 10 11
Looking from the pid namespace of pid 11 I have:
pid pgrp ppid
0 0 0
1 0 0
2 0 1
Looking from the pid namespace of pid 12 I have:
pid pgrp ppid
0 0 0
0 0 0
1 0 0
So if the process with pid 12 in the initial pid namespace
sends to process group 0.
pid 10 should see si_pid 12.
pid 11 should see si_pid 2.
Neither should see si_pid 0, as from_ancestor_ns will not
be true.
Eric
^ permalink raw reply [flat|nested] 29+ messages in thread* Re: [PATCH 7/7][v8] SI_USER: Masquerade si_pid when crossing pid ns boundary
2009-02-19 23:21 ` Eric W. Biederman
@ 2009-02-19 23:51 ` Roland McGrath
2009-02-20 0:35 ` Eric W. Biederman
2009-02-20 0:28 ` Oleg Nesterov
1 sibling, 1 reply; 29+ messages in thread
From: Roland McGrath @ 2009-02-19 23:51 UTC (permalink / raw)
To: Eric W. Biederman
Cc: Oleg Nesterov, Sukadev Bhattiprolu, Andrew Morton, daniel,
Containers, linux-kernel
> Suppose I have 3 processes in a process group in three separate pid
> namespaces.
>
> Looking from the init pid namespace I have:
> pid pgrp ppid
> 10 10 1
> 11 10 10
> 12 10 11
>
> Looking from the pid namespace of pid 11 I have:
> pid pgrp ppid
> 0 0 0
> 1 0 0
> 2 0 1
>
> Looking from the pid namespace of pid 12 I have:
> pid pgrp ppid
> 0 0 0
> 0 0 0
> 1 0 0
>
> So if the process with pid 12 in the initial pid namespace
> sends to process group 0.
There is no "process group 0". 0 means "the sender's pgrp".
One possibility is that perhaps what people really want the pid_ns to mean
is that "the sender's pgrp" in the view of the sender does not include any
processes outside its pid_ns scope. That would be consistent with the
behavior of kill (kill_something_info) on -1; it's described as "all
processes", but in fact means "all processes within my pid_ns scope".
What I mean to describe there is changing kill_something_info, so that
e.g. killpg() inside the NS would affect only the NS init itself but e.g.
^Z (effectively an implicit killpg() that's always from the global NS)
would also go to that init's "mother" pgrp in the outer NS.
Another possibility is to decide that's just not worth having at all, and
CLONE_NEWNS should just implicitly reset pgrp to self. That is simple.
But perhaps today someone has a script running a pid_ns-world whose init is
gracefully killed by ^C of the whole script and we wouldn't want to break
that if it is actually useful now.
> pid 10 should see si_pid 12.
> pid 11 should see si_pid 2.
We indeed have this problem if we think it's useful to continue to have
a concept of pgrp for the sub-init that can see outside its own NS.
> Neither should see si_pid 0, as from_ancestor_ns will not be true.
Perhaps replace from_ancestor_ns with struct pid_namespace *sender_ns?
(I don't know if there was already a can of worms with such an idea before.)
Then si_pid could be translated as appropriate for each recipient.
(Or perhaps just struct pid *sender and reset si_pid from that.)
Thanks,
Roland
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH 7/7][v8] SI_USER: Masquerade si_pid when crossing pid ns boundary
2009-02-19 23:51 ` Roland McGrath
@ 2009-02-20 0:35 ` Eric W. Biederman
2009-02-20 1:06 ` Roland McGrath
0 siblings, 1 reply; 29+ messages in thread
From: Eric W. Biederman @ 2009-02-20 0:35 UTC (permalink / raw)
To: Roland McGrath
Cc: Oleg Nesterov, Sukadev Bhattiprolu, Andrew Morton, daniel,
Containers, linux-kernel
Roland McGrath <roland@redhat.com> writes:
>> Suppose I have 3 processes in a process group in three separate pid
>> namespaces.
>>
>> Looking from the init pid namespace I have:
>> pid pgrp ppid
>> 10 10 1
>> 11 10 10
>> 12 10 11
>>
>> Looking from the pid namespace of pid 11 I have:
>> pid pgrp ppid
>> 0 0 0
>> 1 0 0
>> 2 0 1
>>
>> Looking from the pid namespace of pid 12 I have:
>> pid pgrp ppid
>> 0 0 0
>> 0 0 0
>> 1 0 0
>>
>> So if the process with pid 12 in the initial pid namespace
>> sends to process group 0.
>
> There is no "process group 0". 0 means "the sender's pgrp".
Exactly. It just happens in this case that pid_nr_ns returns 0 for
the process group number as well as the process group the process is a
member of, that was created outside of the current pid namespace.
> One possibility is that perhaps what people really want the pid_ns to mean
> is that "the sender's pgrp" in the view of the sender does not include any
> processes outside its pid_ns scope. That would be consistent with the
> behavior of kill (kill_something_info) on -1; it's described as "all
> processes", but in fact means "all processes within my pid_ns scope".
>
> What I mean to describe there is changing kill_something_info, so that
> e.g. killpg() inside the NS would affect only the NS init itself but e.g.
> ^Z (effectively an implicit killpg() that's always from the global NS)
> would also go to that init's "mother" pgrp in the outer NS.
> Another possibility is to decide that's just not worth having at all, and
> CLONE_NEWNS should just implicitly reset pgrp to self. That is simple.
> But perhaps today someone has a script running a pid_ns-world whose init is
> gracefully killed by ^C of the whole script and we wouldn't want to break
> that if it is actually useful now.
It is especially useful, and this is a deliberate feature. Having
sessions and process groups extend across pid namespace borders means
you can share a tty and job control functions correctly. Very handy
for circumstances where you want a light weight temporary container,
and something I am actively using today. The practical benefit is
that you can upgrade from situations where you would previous use
chroot without extra hassle.
In practice I don't care about si_pid and I doubt I care about processes
sending signals outside of their pid namespace. But I do care about
sharing a tty and a session and having job control work.
>> pid 10 should see si_pid 12.
>> pid 11 should see si_pid 2.
>
> We indeed have this problem if we think it's useful to continue to have
> a concept of pgrp for the sub-init that can see outside its own NS.
>
>> Neither should see si_pid 0, as from_ancestor_ns will not be true.
>
> Perhaps replace from_ancestor_ns with struct pid_namespace *sender_ns?
> (I don't know if there was already a can of worms with such an idea before.)
> Then si_pid could be translated as appropriate for each recipient.
> (Or perhaps just struct pid *sender and reset si_pid from that.)
The last was my original line of thinking. I seem to recall Oleg
figuring the code gets pretty ugly when you add in the necessary test
to see if si_pid is actually present.
There are several other cases where we also signal a process outside
of our current pid namespace, where we have a pid inside the recipients
pid namespace. do_notify_parent is the easiest example. However those
cases can get the value right because they are unicast signals and
know their recipient when the set the si_pid originally.
My current line of thinking is either:
a) We pass in struct pid *sender and we reset si_pid in send_signal.
b) We make the rule that send_signal must receive a valid siginfo from
the caller and we only do the extra work for process groups.
Eric
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH 7/7][v8] SI_USER: Masquerade si_pid when crossing pid ns boundary
2009-02-20 0:35 ` Eric W. Biederman
@ 2009-02-20 1:06 ` Roland McGrath
2009-02-20 2:12 ` Eric W. Biederman
0 siblings, 1 reply; 29+ messages in thread
From: Roland McGrath @ 2009-02-20 1:06 UTC (permalink / raw)
To: Eric W. Biederman
Cc: Oleg Nesterov, Sukadev Bhattiprolu, Andrew Morton, daniel,
Containers, linux-kernel
> It is especially useful, and this is a deliberate feature.
Ok, I thought that might be so.
> In practice I don't care about si_pid and I doubt I care about processes
> sending signals outside of their pid namespace. But I do care about
> sharing a tty and a session and having job control work.
Understood.
> >> pid 10 should see si_pid 12.
> >> pid 11 should see si_pid 2.
> >
> > We indeed have this problem if we think it's useful to continue to have
> > a concept of pgrp for the sub-init that can see outside its own NS.
> >
> >> Neither should see si_pid 0, as from_ancestor_ns will not be true.
> >
> > Perhaps replace from_ancestor_ns with struct pid_namespace *sender_ns?
> > (I don't know if there was already a can of worms with such an idea before.)
> > Then si_pid could be translated as appropriate for each recipient.
> > (Or perhaps just struct pid *sender and reset si_pid from that.)
>
> The last was my original line of thinking. I seem to recall Oleg
> figuring the code gets pretty ugly when you add in the necessary test
> to see if si_pid is actually present.
Well, the existing test to set from_ancestor_ns is in one place and we
think that its logic is OK. What I had in mind was that when that logic
says "0", we pass NULL and the innards don't touch .si_pid (same as now);
when it says "1", we pass a pointer and the innards do rewrite it.
> There are several other cases where we also signal a process outside
> of our current pid namespace, where we have a pid inside the recipients
> pid namespace. do_notify_parent is the easiest example.
It's the only example that Oleg has mentioned. What others are there?
> a) We pass in struct pid *sender and we reset si_pid in send_signal.
> b) We make the rule that send_signal must receive a valid siginfo from
> the caller and we only do the extra work for process groups.
That's what I said. ;-) The a) option seems cleaner to me regardless, to
the extent that the "from_ancestor_ns" approach is a "clean" one. But I
think it would be best to fully elucidate what we think about desireable
semantics for the whole spectrum of cross-NS signal-sending cases before
actually choosing the implementation details.
Thanks,
Roland
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH 7/7][v8] SI_USER: Masquerade si_pid when crossing pid ns boundary
2009-02-20 1:06 ` Roland McGrath
@ 2009-02-20 2:12 ` Eric W. Biederman
2009-02-20 3:10 ` Roland McGrath
0 siblings, 1 reply; 29+ messages in thread
From: Eric W. Biederman @ 2009-02-20 2:12 UTC (permalink / raw)
To: Roland McGrath
Cc: Oleg Nesterov, Sukadev Bhattiprolu, Andrew Morton, daniel,
Containers, linux-kernel
Roland McGrath <roland@redhat.com> writes:
>> It is especially useful, and this is a deliberate feature.
>
> Ok, I thought that might be so.
>
>> In practice I don't care about si_pid and I doubt I care about processes
>> sending signals outside of their pid namespace. But I do care about
>> sharing a tty and a session and having job control work.
>
> Understood.
>
>> >> pid 10 should see si_pid 12.
>> >> pid 11 should see si_pid 2.
>> >
>> > We indeed have this problem if we think it's useful to continue to have
>> > a concept of pgrp for the sub-init that can see outside its own NS.
>> >
>> >> Neither should see si_pid 0, as from_ancestor_ns will not be true.
>> >
>> > Perhaps replace from_ancestor_ns with struct pid_namespace *sender_ns?
>> > (I don't know if there was already a can of worms with such an idea before.)
>> > Then si_pid could be translated as appropriate for each recipient.
>> > (Or perhaps just struct pid *sender and reset si_pid from that.)
>>
>> The last was my original line of thinking. I seem to recall Oleg
>> figuring the code gets pretty ugly when you add in the necessary test
>> to see if si_pid is actually present.
>
> Well, the existing test to set from_ancestor_ns is in one place and we
> think that its logic is OK. What I had in mind was that when that logic
> says "0", we pass NULL and the innards don't touch .si_pid (same as now);
> when it says "1", we pass a pointer and the innards do rewrite it.
>
>> There are several other cases where we also signal a process outside
>> of our current pid namespace, where we have a pid inside the recipients
>> pid namespace. do_notify_parent is the easiest example.
>
> It's the only example that Oleg has mentioned. What others are there?
>
>> a) We pass in struct pid *sender and we reset si_pid in send_signal.
>> b) We make the rule that send_signal must receive a valid siginfo from
>> the caller and we only do the extra work for process groups.
>
> That's what I said. ;-) The a) option seems cleaner to me regardless, to
> the extent that the "from_ancestor_ns" approach is a "clean" one. But I
> think it would be best to fully elucidate what we think about desireable
> semantics for the whole spectrum of cross-NS signal-sending cases before
> actually choosing the implementation details.
With respect to si_pid I think the value should be:
pid_nr_ns(sender, task_active_pid_ns(tsk));
With respect to init receiving signals.
- SIGKILL and SIGSTOP are ignored not coming from an ancestor pid
namespace.
For the other signals since init can set it's handlers I really
don't care how we handle them. If we treat them all normally
/sbin/init should work fine.
For signals from the kernel I'm inclined to always call those from an
ancestor pid namespace. It seems we don't have any legitimate reason
to special case signals ignoring that we either asked for or that if
we don't handle will cause an infinite signal handling loop in
/sbin/init.
But as long as we get si_pid set correctly and we get handling of SIGKILL
and SIGSTOP correct. I'm happy.
Eric
^ permalink raw reply [flat|nested] 29+ messages in thread* Re: [PATCH 7/7][v8] SI_USER: Masquerade si_pid when crossing pid ns boundary
2009-02-20 2:12 ` Eric W. Biederman
@ 2009-02-20 3:10 ` Roland McGrath
2009-02-20 4:05 ` Eric W. Biederman
0 siblings, 1 reply; 29+ messages in thread
From: Roland McGrath @ 2009-02-20 3:10 UTC (permalink / raw)
To: Eric W. Biederman
Cc: Oleg Nesterov, Sukadev Bhattiprolu, Andrew Morton, daniel,
Containers, linux-kernel
> > think it would be best to fully elucidate what we think about desireable
> > semantics for the whole spectrum of cross-NS signal-sending cases before
> > actually choosing the implementation details.
... and then you answered all the questions that are already well settled,
and did not address the new question that you had raised earlier today.
To which processes should a pgrp-wide signal sent from user mode inside a
pid_ns go? Should they go to a pgrp member in a different pid_ns, or not?
If your answer is that you don't care, my inclination is to leave it as it
is ("my pgrp" can include processes outside your pid_ns, which you could
not explicitly target in any other way). The way we are going just for the
sake of cleanliness happens to make the si_pid values all work out right
for this. Possibly the semantics are even what you want: If e.g. the
sub-init acts like many terminal apps and might use the tty in raw mode but
then handle something like ^Z by fiddling the tty and then kill(0,SIGTSTP)
to act like ^Z was hit in cooked mode, then this preserves the proper
effect of that suspending a whole script/pipeline.
> There are cases that happen, and it very much simplifies dealing with
> tty's if we allow it.
I don't call tty signals "sending a signal to outside the PID namespace".
I suspect Oleg did not intend it to be read that way either. To me, a tty
signal comes from "the device"--not from some process--and I presume all
such sources we'd consider to "live in the global namespace" (really they
just don't exist in any PID namespace, they aren't processes). So I don't
think Oleg's comment is meant to apply to such cases at all.
> Another case where we can send signals between namespaces is posix
> message queues. Implemented in ipc/mqueue.c. In that case because it
> is a unicast message we are generating the proper si_pid when we
> generate the signal.
Ah, this is the clear example of "any to any", since all the sender and
recipient have to share is the mqueue they each have a descriptor on.
But, as you say, it's got no problems because the sender is just
"current in mq_timedsend" to a single recipient, no different than
"current in sys_kill" when that is going to a single recipient.
> I think that is where we need to go, to be safe and to be certain
> weird things won't sneak up on us. We already handle half of the logic in
> send_signal anyway. We might as well handle the other half.
Agreed.
Thanks,
Roland
^ permalink raw reply [flat|nested] 29+ messages in thread* Re: [PATCH 7/7][v8] SI_USER: Masquerade si_pid when crossing pid ns boundary
2009-02-20 3:10 ` Roland McGrath
@ 2009-02-20 4:05 ` Eric W. Biederman
0 siblings, 0 replies; 29+ messages in thread
From: Eric W. Biederman @ 2009-02-20 4:05 UTC (permalink / raw)
To: Roland McGrath
Cc: Oleg Nesterov, Sukadev Bhattiprolu, Andrew Morton, daniel,
Containers, linux-kernel
Roland McGrath <roland@redhat.com> writes:
>> > think it would be best to fully elucidate what we think about desireable
>> > semantics for the whole spectrum of cross-NS signal-sending cases before
>> > actually choosing the implementation details.
>
> ... and then you answered all the questions that are already well settled,
> and did not address the new question that you had raised earlier today.
Oh sorry. I misunderstood what you were asking.
> To which processes should a pgrp-wide signal sent from user mode inside a
> pid_ns go? Should they go to a pgrp member in a different pid_ns, or not?
>
> If your answer is that you don't care, my inclination is to leave it as it
> is ("my pgrp" can include processes outside your pid_ns, which you could
> not explicitly target in any other way). The way we are going just for the
> sake of cleanliness happens to make the si_pid values all work out right
> for this. Possibly the semantics are even what you want: If e.g. the
> sub-init acts like many terminal apps and might use the tty in raw mode but
> then handle something like ^Z by fiddling the tty and then kill(0,SIGTSTP)
> to act like ^Z was hit in cooked mode, then this preserves the proper
> effect of that suspending a whole script/pipeline.
As it is are the easiest and most intuitive semantics to me. It is
simply weird to people expecting that signals will never exit out of a
pid namespace. Additionally I like having a prominent easy to create
case because it makes it much easier for people to realize it can
happen.
What I don't have is a compelling usage that means we must send to
every process in our process group if our process group spans multiple
pid namespaces. Your description of manually implementing ^Z sounds
as close as I can come to a compelling case.
I simply have a compelling case for process groups and sessions that
span pid namespaces where the tty sends the signals to all of the
processes.
A nearly compelling case for the current process group semantics
and comes from using pid namespaces as inescapable process groups.
In that use case I would find it very convenient to be able to
set SIGCHLD to SIG_IGN and exec an arbitrary program a pid namespace
leader.
Ouch!
I have just recalled a use case that will cause problems with the
current ignoring of signals in this patchset. Currently a container
init can not send SIGSTOP to itself. And I have been taking advantage
of that in usages such as supporting the bash suspend command or the
M-x suspend-emacs. And it is very handy for getting back to a shell
outside of a chroot like container. SIGTSTP will still work, but
SIGSTOP which I'm pretty certain bash sends itself will not.
So I have the question. How few special cases do we need to implement
to signal handling in a container init and still support running
programs written to be /sbin/init, on linux. Can we limit this our
special case to just ignoring SIGKILL and SIGSTOP when sent from other
process in the same pid namespace? Or do we actually need more?
>> Another case where we can send signals between namespaces is posix
>> message queues. Implemented in ipc/mqueue.c. In that case because it
>> is a unicast message we are generating the proper si_pid when we
>> generate the signal.
>
> Ah, this is the clear example of "any to any", since all the sender and
> recipient have to share is the mqueue they each have a descriptor on.
> But, as you say, it's got no problems because the sender is just
> "current in mq_timedsend" to a single recipient, no different than
> "current in sys_kill" when that is going to a single recipient.
I suspect there are others buried in the kernel somewhere or there
will be others in the future. We have a very similar pattern with
fcntl and SIGIO and SIGURG, but they all look they are coming from the
kernel. Everything except the tty code appears to slowly approach
the general case.
>> I think that is where we need to go, to be safe and to be certain
>> weird things won't sneak up on us. We already handle half of the logic in
>> send_signal anyway. We might as well handle the other half.
>
> Agreed.
Eric
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH 7/7][v8] SI_USER: Masquerade si_pid when crossing pid ns boundary
2009-02-19 23:21 ` Eric W. Biederman
2009-02-19 23:51 ` Roland McGrath
@ 2009-02-20 0:28 ` Oleg Nesterov
2009-02-20 1:16 ` Eric W. Biederman
1 sibling, 1 reply; 29+ messages in thread
From: Oleg Nesterov @ 2009-02-20 0:28 UTC (permalink / raw)
To: Eric W. Biederman
Cc: Sukadev Bhattiprolu, Andrew Morton, roland, daniel, Containers,
linux-kernel
On 02/19, Eric W. Biederman wrote:
>
> Oleg Nesterov <oleg@redhat.com> writes:
>
> > On 02/19, Eric W. Biederman wrote:
> >>
> >> Oleg Nesterov <oleg@redhat.com> writes:
> >> >
> >> > SI_FROMUSER() == T, unless we have more (hopefully not) in-kernel
> >> > users which send SI_FROMUSER() signals, .si_pid must be valid?
> >>
> >> So the argument is that while things such as force_sig_info(SIGSEGV)
> >> don't have a si_pid we don't care because from_ancestor_ns == 0.
> >>
> >> Interesting. Then I don't know if we have any kernel senders
> >> that cross the namespace boundaries.
> >>
> >> That said I still object to this code.
> >>
> >> sys_kill(-pgrp, SIGUSR1)
> >> kill_something_info(SIGUSR1, &info, 0)
> >> __kill_pgrp_info(SIGUSR1, &info task_pgrp(current))
> >> group_send_sig_info(SIGUSR1, &info, tsk)
> >> __group_send_sig_info(SIGUSR1, &info, tsk)
> >> send_signal(SIGUSR1, &info, tsk, 1)
> >> __send_signal(SIGUSR1, &info, tsk, 1)
> >>
> >>
> >> Process groups and sessions can have processes in multiple pid
> >> namespaces, which is very useful for not messing up your controlling
> >> terminal.
> >>
> >> In which case sys_kill cannot possibly set the si_pid value correct
> >> and from_ancestor_ns is not enough either.
> >
> > (I know, I shouldn't reply today because I am already sleeping ;)
> >
> > Why? send_signal() should calculate the correct value of
> > from_parent and pass it to __send_signal(). If it is true, then
> > we clear .si_pid in the copied siginfo (which was already queued).
> > We don't mangle the original siginfo.
> >
> > This happens for each process we send the signal.
> >
> > Or I misunderstood you?
>
> Suppose I have 3 processes in a process group in three separate pid
> namespaces.
>
> Looking from the init pid namespace I have:
> pid pgrp ppid
> 10 10 1
> 11 10 10
> 12 10 11
>
> Looking from the pid namespace of pid 11 I have:
> pid pgrp ppid
> 0 0 0
> 1 0 0
> 2 0 1
>
> Looking from the pid namespace of pid 12 I have:
> pid pgrp ppid
> 0 0 0
> 0 0 0
> 1 0 0
>
> So if the process with pid 12 in the initial pid namespace
> sends to process group 0.
But this is the different problem, it is not that we clear si_pid while
we shouldn't, just the .si_pid passed from kill_something_info() is not
right.
Personally, I think we should not allow to send signals outside our
namespace (except SIGCHLD on exit), this looks just wrong to me. And
some time ago copy_process(CLONE_PID) did "setsid".
Hmm... that was changed by your commit 5cd17569fd0eeca510735e63a6061291e3971bf6.
And while I agree with this commit, I think that cinit should do sys_setsid()
itself to detach itself from the parent namespace.
Or. We can fix the case you described. We can move "si_pid = task_tgid_vnr()"
from sys_kill/do_tkill/etc to send_signal(), it can calculate the correct
.si_pid looking at sender/receiver namespaces.
Oleg.
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH 7/7][v8] SI_USER: Masquerade si_pid when crossing pid ns boundary
2009-02-20 0:28 ` Oleg Nesterov
@ 2009-02-20 1:16 ` Eric W. Biederman
0 siblings, 0 replies; 29+ messages in thread
From: Eric W. Biederman @ 2009-02-20 1:16 UTC (permalink / raw)
To: Oleg Nesterov
Cc: Sukadev Bhattiprolu, Andrew Morton, roland, daniel, Containers,
linux-kernel
Oleg Nesterov <oleg@redhat.com> writes:
> On 02/19, Eric W. Biederman wrote:
>>
>> Oleg Nesterov <oleg@redhat.com> writes:
>>
>> > On 02/19, Eric W. Biederman wrote:
>> >>
>> >> Oleg Nesterov <oleg@redhat.com> writes:
>> >> >
>> >> > SI_FROMUSER() == T, unless we have more (hopefully not) in-kernel
>> >> > users which send SI_FROMUSER() signals, .si_pid must be valid?
>> >>
>> >> So the argument is that while things such as force_sig_info(SIGSEGV)
>> >> don't have a si_pid we don't care because from_ancestor_ns == 0.
>> >>
>> >> Interesting. Then I don't know if we have any kernel senders
>> >> that cross the namespace boundaries.
>> >>
>> >> That said I still object to this code.
>> >>
>> >> sys_kill(-pgrp, SIGUSR1)
>> >> kill_something_info(SIGUSR1, &info, 0)
>> >> __kill_pgrp_info(SIGUSR1, &info task_pgrp(current))
>> >> group_send_sig_info(SIGUSR1, &info, tsk)
>> >> __group_send_sig_info(SIGUSR1, &info, tsk)
>> >> send_signal(SIGUSR1, &info, tsk, 1)
>> >> __send_signal(SIGUSR1, &info, tsk, 1)
>> >>
>> >>
>> >> Process groups and sessions can have processes in multiple pid
>> >> namespaces, which is very useful for not messing up your controlling
>> >> terminal.
>> >>
>> >> In which case sys_kill cannot possibly set the si_pid value correct
>> >> and from_ancestor_ns is not enough either.
>> >
>> > (I know, I shouldn't reply today because I am already sleeping ;)
>> >
>> > Why? send_signal() should calculate the correct value of
>> > from_parent and pass it to __send_signal(). If it is true, then
>> > we clear .si_pid in the copied siginfo (which was already queued).
>> > We don't mangle the original siginfo.
>> >
>> > This happens for each process we send the signal.
>> >
>> > Or I misunderstood you?
>>
>> Suppose I have 3 processes in a process group in three separate pid
>> namespaces.
>>
>> Looking from the init pid namespace I have:
>> pid pgrp ppid
>> 10 10 1
>> 11 10 10
>> 12 10 11
>>
>> Looking from the pid namespace of pid 11 I have:
>> pid pgrp ppid
>> 0 0 0
>> 1 0 0
>> 2 0 1
>>
>> Looking from the pid namespace of pid 12 I have:
>> pid pgrp ppid
>> 0 0 0
>> 0 0 0
>> 1 0 0
>>
>> So if the process with pid 12 in the initial pid namespace
>> sends to process group 0.
>
> But this is the different problem, it is not that we clear si_pid while
> we shouldn't, just the .si_pid passed from kill_something_info() is not
> right.
>
> Personally, I think we should not allow to send signals outside our
> namespace (except SIGCHLD on exit), this looks just wrong to me. And
> some time ago copy_process(CLONE_PID) did "setsid".
There are cases that happen, and it very much simplifies dealing with
tty's if we allow it.
Another case where we can send signals between namespaces is posix
message queues. Implemented in ipc/mqueue.c. In that case because it
is a unicast message we are generating the proper si_pid when we
generate the signal.
When programmers get lazy it is all to easy to have signals that are
useful and cross namespace boundaries. In practice it is possible to
close those holes if don't want the confusion in your userspace
programs. For the people who take advantage of the flexibility of
namespace to mix things together I don't think we can get away in the
kernel assuming weird things won't happen.
> Hmm... that was changed by your commit 5cd17569fd0eeca510735e63a6061291e3971bf6.
> And while I agree with this commit, I think that cinit should do sys_setsid()
> itself to detach itself from the parent namespace.
In general that should indeed happen. There are specific cases where it is
advantageous not to call setsid(). It is all under control of the creator
of the pid namespace.
> Or. We can fix the case you described. We can move "si_pid = task_tgid_vnr()"
> from sys_kill/do_tkill/etc to send_signal(), it can calculate the correct
> .si_pid looking at sender/receiver namespaces.
I think that is where we need to go, to be safe and to be certain
weird things won't sneak up on us. We already handle half of the logic in
send_signal anyway. We might as well handle the other half.
Eric
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH 0/7][v8] Container-init signal semantics
2009-02-19 3:02 [PATCH 0/7][v8] Container-init signal semantics Sukadev Bhattiprolu
` (6 preceding siblings ...)
2009-02-19 3:07 ` [PATCH 7/7][v8] SI_USER: Masquerade si_pid when crossing pid ns boundary Sukadev Bhattiprolu
@ 2009-02-19 14:59 ` Daniel Lezcano
2009-03-07 19:04 ` Sukadev Bhattiprolu
2009-02-19 20:53 ` Oleg Nesterov
8 siblings, 1 reply; 29+ messages in thread
From: Daniel Lezcano @ 2009-02-19 14:59 UTC (permalink / raw)
To: Sukadev Bhattiprolu
Cc: Andrew Morton, linux-kernel, Eric W. Biederman, Containers,
Oleg Nesterov, roland, Greg Kurz
Sukadev Bhattiprolu wrote:
> Patch 5/7 is new in this set and fixes a bug. Remaining patches are
> just a forward-port from previous version and I believe they address
> all comments I have received.
>
> Oleg please sign-off/ack if you agree.
>
> ---
>
> Container-init must behave like global-init to processes within the
> container and hence it must be immune to unhandled fatal signals from
> within the container (i.e SIG_DFL signals that terminate the process).
>
> But the same container-init must behave like a normal process to
> processes in ancestor namespaces and so if it receives the same fatal
> signal from a process in ancestor namespace, the signal must be
> processed.
>
> Implementing these semantics requires that send_signal() determine pid
> namespace of the sender but since signals can originate from workqueues/
> interrupt-handlers, determining pid namespace of sender may not always
> be possible or safe.
>
> This patchset implements the design/simplified semantics suggested by
> Oleg Nesterov. The simplified semantics for container-init are:
>
> - container-init must never be terminated by a signal from a
> descendant process.
>
> - container-init must never be immune to SIGKILL from an ancestor
> namespace (so a process in parent namespace must always be able
> to terminate a descendant container).
>
> - container-init may be immune to unhandled fatal signals (like
> SIGUSR1) even if they are from ancestor namespace. SIGKILL/SIGSTOP
> are the only reliable signals to a container-init from ancestor
> namespace.
>
Hi Suka,
I agree with these semantics, they look good.
What is planned to have the init process to die when a system container
shuts down ?
Let's say we use the "shutdown" command, it will telinit to go to the
runlevel 0, and will kill -1.
At this point, the container finishes with a sys_reboot (we take care to
do nothing otherwise the real system shuts down). But the init process
will stay there and the launcher of the container will never know if the
container has stopped or not.
Gregory Kurz proposed a solution:
* when shutdown is called and we are not in the init pidns, then we
kill the process 1 of the pidnamespace.
* when reboot is called and we are not in the init pidns, then we
reexec the init process, using the same command line. I guess this one
could be easily retrieved if we are able to display /proc/1/cmdline ;)
IMHO, this is a good proposition because it is generic and intuitive, no ?
What do you thing ?
-- Daniel
^ permalink raw reply [flat|nested] 29+ messages in thread* Re: [PATCH 0/7][v8] Container-init signal semantics
2009-02-19 14:59 ` [PATCH 0/7][v8] Container-init signal semantics Daniel Lezcano
@ 2009-03-07 19:04 ` Sukadev Bhattiprolu
2009-03-07 19:43 ` Daniel Lezcano
0 siblings, 1 reply; 29+ messages in thread
From: Sukadev Bhattiprolu @ 2009-03-07 19:04 UTC (permalink / raw)
To: Daniel Lezcano
Cc: Andrew Morton, linux-kernel, Eric W. Biederman, Containers,
Oleg Nesterov, roland, Greg Kurz
> Gregory Kurz proposed a solution:
> * when shutdown is called and we are not in the init pidns, then we kill
> the process 1 of the pidnamespace.
> * when reboot is called and we are not in the init pidns, then we reexec
> the init process, using the same command line. I guess this one could be
> easily retrieved if we are able to display /proc/1/cmdline ;)
>
> IMHO, this is a good proposition because it is generic and intuitive, no ?
>
> What do you thing ?
Yes, I think it makes sense. Do we have any prototype patches that
implement this behavior ?
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH 0/7][v8] Container-init signal semantics
2009-03-07 19:04 ` Sukadev Bhattiprolu
@ 2009-03-07 19:43 ` Daniel Lezcano
2009-03-07 19:51 ` Greg Kurz
0 siblings, 1 reply; 29+ messages in thread
From: Daniel Lezcano @ 2009-03-07 19:43 UTC (permalink / raw)
To: Sukadev Bhattiprolu
Cc: Andrew Morton, linux-kernel, Eric W. Biederman, Containers,
Oleg Nesterov, roland, Greg Kurz
[-- Attachment #1: Type: text/plain, Size: 814 bytes --]
Sukadev Bhattiprolu wrote:
>> Gregory Kurz proposed a solution:
>> * when shutdown is called and we are not in the init pidns, then we kill
>> the process 1 of the pidnamespace.
>> * when reboot is called and we are not in the init pidns, then we reexec
>> the init process, using the same command line. I guess this one could be
>> easily retrieved if we are able to display /proc/1/cmdline ;)
>>
>> IMHO, this is a good proposition because it is generic and intuitive, no ?
>>
>> What do you thing ?
>>
>
> Yes, I think it makes sense. Do we have any prototype patches that
> implement this behavior ?
>
I have a simple prototype for the first case, added in attachment.
For the second case, I didn't had time to do it yet as it is not so
trivial because we force an exec of another process.
[-- Attachment #2: kill-cinit-at-shutdown.patch --]
[-- Type: text/x-diff, Size: 2785 bytes --]
Subject: kill the pid 1 process at shutdown
From: Daniel Lezcano <daniel.lezcano@free.fr>
This patch makes the pid 1 to be killed when we shutdown the host
and we are not in the init namespace.
Signed-off-by: Daniel Lezcano <daniel.lezcano@free.fr>
---
include/linux/pid_namespace.h | 5 +++++
kernel/pid_namespace.c | 17 +++++++++++++++++
kernel/sys.c | 17 +++++++++++------
3 files changed, 33 insertions(+), 6 deletions(-)
Index: 2.6.29-rc3/include/linux/pid_namespace.h
===================================================================
--- 2.6.29-rc3.orig/include/linux/pid_namespace.h
+++ 2.6.29-rc3/include/linux/pid_namespace.h
@@ -44,6 +44,7 @@ static inline struct pid_namespace *get_
extern struct pid_namespace *copy_pid_ns(unsigned long flags, struct pid_namespace *ns);
extern void free_pid_ns(struct kref *kref);
+extern int power_off_pid_ns(struct pid_namespace *ns);
extern void zap_pid_ns_processes(struct pid_namespace *pid_ns);
static inline void put_pid_ns(struct pid_namespace *ns)
@@ -72,6 +73,10 @@ static inline void put_pid_ns(struct pid
{
}
+static inline int power_off_pid_ns(struct pid_namespace *ns)
+{
+ return -EINVAL;
+}
static inline void zap_pid_ns_processes(struct pid_namespace *ns)
{
Index: 2.6.29-rc3/kernel/pid_namespace.c
===================================================================
--- 2.6.29-rc3.orig/kernel/pid_namespace.c
+++ 2.6.29-rc3/kernel/pid_namespace.c
@@ -148,6 +148,23 @@ void free_pid_ns(struct kref *kref)
put_pid_ns(parent);
}
+int power_off_pid_ns(struct pid_namespace *pid_ns)
+{
+ struct task_struct *task;
+
+ if (pid_ns == &init_pid_ns)
+ return -EINVAL;
+
+ rcu_read_lock();
+
+ task = pid_task(find_vpid(1), PIDTYPE_PID);
+ force_sig(SIGKILL, task);
+
+ rcu_read_unlock();
+
+ return 0;
+}
+
void zap_pid_ns_processes(struct pid_namespace *pid_ns)
{
int nr;
Index: 2.6.29-rc3/kernel/sys.c
===================================================================
--- 2.6.29-rc3.orig/kernel/sys.c
+++ 2.6.29-rc3/kernel/sys.c
@@ -34,6 +34,7 @@
#include <linux/seccomp.h>
#include <linux/cpu.h>
#include <linux/ptrace.h>
+#include <linux/pid_namespace.h>
#include <linux/compat.h>
#include <linux/syscalls.h>
@@ -393,15 +394,19 @@ SYSCALL_DEFINE4(reboot, int, magic1, int
break;
case LINUX_REBOOT_CMD_HALT:
- kernel_halt();
- unlock_kernel();
- do_exit(0);
+ if (power_off_pid_ns(current->nsproxy->pid_ns)) {
+ kernel_halt();
+ unlock_kernel();
+ do_exit(0);
+ }
break;
case LINUX_REBOOT_CMD_POWER_OFF:
- kernel_power_off();
- unlock_kernel();
- do_exit(0);
+ if (power_off_pid_ns(current->nsproxy->pid_ns)) {
+ kernel_power_off();
+ unlock_kernel();
+ do_exit(0);
+ }
break;
case LINUX_REBOOT_CMD_RESTART2:
^ permalink raw reply [flat|nested] 29+ messages in thread* Re: [PATCH 0/7][v8] Container-init signal semantics
2009-03-07 19:43 ` Daniel Lezcano
@ 2009-03-07 19:51 ` Greg Kurz
2009-03-07 19:59 ` Daniel Lezcano
0 siblings, 1 reply; 29+ messages in thread
From: Greg Kurz @ 2009-03-07 19:51 UTC (permalink / raw)
To: Daniel Lezcano
Cc: Sukadev Bhattiprolu, Andrew Morton, linux-kernel,
Eric W. Biederman, Containers, Oleg Nesterov, roland
On Sat, 2009-03-07 at 20:43 +0100, Daniel Lezcano wrote:
> case LINUX_REBOOT_CMD_HALT:
> - kernel_halt();
> - unlock_kernel();
> - do_exit(0);
> + if (power_off_pid_ns(current->nsproxy->pid_ns)) {
> + kernel_halt();
> + unlock_kernel();
> + do_exit(0);
> + }
Even if current will get SIGKILLed when zap_pid_ns_processes() is
called, I see no reason it doesn't call do_exit(0).
> break;
>
> case LINUX_REBOOT_CMD_POWER_OFF:
> - kernel_power_off();
> - unlock_kernel();
> - do_exit(0);
> + if (power_off_pid_ns(current->nsproxy->pid_ns)) {
> + kernel_power_off();
> + unlock_kernel();
> + do_exit(0);
> + }
Same.
> break;
>
--
Gregory Kurz gkurz@fr.ibm.com
Software Engineer @ IBM/Meiosys http://www.ibm.com
Tel +33 (0)534 638 479 Fax +33 (0)561 400 420
"Anarchy is about taking complete responsibility for yourself."
Alan Moore.
^ permalink raw reply [flat|nested] 29+ messages in thread* Re: [PATCH 0/7][v8] Container-init signal semantics
2009-03-07 19:51 ` Greg Kurz
@ 2009-03-07 19:59 ` Daniel Lezcano
0 siblings, 0 replies; 29+ messages in thread
From: Daniel Lezcano @ 2009-03-07 19:59 UTC (permalink / raw)
To: Greg Kurz
Cc: Sukadev Bhattiprolu, Andrew Morton, linux-kernel,
Eric W. Biederman, Containers, Oleg Nesterov, roland
Greg Kurz wrote:
> On Sat, 2009-03-07 at 20:43 +0100, Daniel Lezcano wrote:
>
>> case LINUX_REBOOT_CMD_HALT:
>> - kernel_halt();
>> - unlock_kernel();
>> - do_exit(0);
>> + if (power_off_pid_ns(current->nsproxy->pid_ns)) {
>> + kernel_halt();
>> + unlock_kernel();
>> + do_exit(0);
>> + }
>>
>
> Even if current will get SIGKILLed when zap_pid_ns_processes() is
> called, I see no reason it doesn't call do_exit(0).
Right and unlock_kernel too :)
>
>> break;
>>
>> case LINUX_REBOOT_CMD_POWER_OFF:
>> - kernel_power_off();
>> - unlock_kernel();
>> - do_exit(0);
>> + if (power_off_pid_ns(current->nsproxy->pid_ns)) {
>> + kernel_power_off();
>> + unlock_kernel();
>> + do_exit(0);
>> + }
>>
>
> Same.
>
>
>> break;
>>
>>
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH 0/7][v8] Container-init signal semantics
2009-02-19 3:02 [PATCH 0/7][v8] Container-init signal semantics Sukadev Bhattiprolu
` (7 preceding siblings ...)
2009-02-19 14:59 ` [PATCH 0/7][v8] Container-init signal semantics Daniel Lezcano
@ 2009-02-19 20:53 ` Oleg Nesterov
8 siblings, 0 replies; 29+ messages in thread
From: Oleg Nesterov @ 2009-02-19 20:53 UTC (permalink / raw)
To: Sukadev Bhattiprolu
Cc: Andrew Morton, roland, Eric W. Biederman, daniel, Containers,
linux-kernel
On 02/18, Sukadev Bhattiprolu wrote:
>
> Patch 5/7 is new in this set and fixes a bug.
To clarify, the current code is buggy, and the fix doesn't depend on
any other patch, afaics.
> Remaining patches are
> just a forward-port from previous version and I believe they address
> all comments I have received.
>
> Oleg please sign-off/ack if you agree.
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
^ permalink raw reply [flat|nested] 29+ messages in thread