[PATCH 0/9] Core pid namespace enhancements

All of lore.kernel.org
 help / color / mirror / Atom feed

* [PATCH 0/9] Core pid namespace enhancements
@ 2007-12-12 12:38 Eric W. Biederman
       [not found] ` <m13au8ytos.fsf-T1Yj925okcoyDheHMi7gv2pdwda3JcWeAL8bYrjMMd8@public.gmane.org>
  0 siblings, 1 reply; 52+ messages in thread
From: Eric W. Biederman @ 2007-12-12 12:38 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Linux Containers, Oleg Nesterov, Pavel Emelyanov

The following patchset updates the pid namespace infrastructure
so we don't constantly have to worry if we have been called
before or after exit_task_namespaces, by using the pid_namespace
obtained from a processes pid, handles the general case of setting
si_pid in struct sig_info, changes where we drop signals sent to init,
and enhances that changes to also work with the per namespace init.

Thus resolving most of the big gotchas with the current pid namespace
implementation.

Eric

^ permalink raw reply	[flat|nested] 52+ messages in thread

* [PATCH 1/9] sig: Fix mqueue pid
       [not found] ` <m13au8ytos.fsf-T1Yj925okcoyDheHMi7gv2pdwda3JcWeAL8bYrjMMd8@public.gmane.org>
@ 2007-12-12 12:40   ` Eric W. Biederman
       [not found]     ` <m1y7c0xezt.fsf-T1Yj925okcoyDheHMi7gv2pdwda3JcWeAL8bYrjMMd8@public.gmane.org>
  2007-12-18  0:52   ` [PATCH 0/9] Core pid namespace enhancements sukadev-r/Jw6+rmf7HQT0dZR+AlfA
  1 sibling, 1 reply; 52+ messages in thread
From: Eric W. Biederman @ 2007-12-12 12:40 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Linux Containers, Oleg Nesterov, Pavel Emelyanov


Currently in the sig info for posix message queues we are reporting the
task id and not the task group id.  Given that this is a posix interface
and that si_pid is defined by posix as returning the process id, we should
be reporting the task group id.

So fix this interface.

Signed-off-by: Eric W. Biederman <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
---
 ipc/mqueue.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/ipc/mqueue.c b/ipc/mqueue.c
index 7d1b8aa..d3feadf 100644
--- a/ipc/mqueue.c
+++ b/ipc/mqueue.c
@@ -510,7 +510,7 @@ static void __do_notify(struct mqueue_inode_info *info)
 			sig_i.si_errno = 0;
 			sig_i.si_code = SI_MESGQ;
 			sig_i.si_value = info->notify.sigev_value;
-			sig_i.si_pid = task_pid_vnr(current);
+			sig_i.si_pid = task_tgid_vnr(current);
 			sig_i.si_uid = current->uid;
 
 			kill_pid_info(info->notify.sigev_signo,
-- 
1.5.3.rc6.17.g1911

^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [PATCH 2/9] sig: Fix SI_USER si_pid
       [not found]     ` <m1y7c0xezt.fsf-T1Yj925okcoyDheHMi7gv2pdwda3JcWeAL8bYrjMMd8@public.gmane.org>
@ 2007-12-12 12:42       ` Eric W. Biederman
       [not found]         ` <m1tzmoxexb.fsf_-_-T1Yj925okcoyDheHMi7gv2pdwda3JcWeAL8bYrjMMd8@public.gmane.org>
  2007-12-12 13:24       ` [PATCH 1/9] sig: Fix mqueue pid Pavel Emelyanov
  1 sibling, 1 reply; 52+ messages in thread
From: Eric W. Biederman @ 2007-12-12 12:42 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Linux Containers, Oleg Nesterov, Pavel Emelyanov


Currently we have a few instances where we are generating signals
setting si_code to SI_USER and si_pid to the task id.

However in the case of SI_USER we are using a posix defined
interface, and posix defines si_pid as the sending process id.  Which
in linux is equivalent to the task group id.

So fix SI_USER to fill in the proper si_pid value.

Signed-off-by: Eric W. Biederman <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
---
 kernel/signal.c |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/signal.c b/kernel/signal.c
index 280bccb..694a643 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -694,7 +694,7 @@ static int send_signal(int sig, struct siginfo *info, struct task_struct *t,
 			q->info.si_signo = sig;
 			q->info.si_errno = 0;
 			q->info.si_code = SI_USER;
-			q->info.si_pid = task_pid_vnr(current);
+			q->info.si_pid = task_tgid_vnr(current);
 			q->info.si_uid = current->uid;
 			break;
 		case (unsigned long) SEND_SIG_PRIV:
@@ -1794,7 +1794,7 @@ relock:
 				info->si_signo = signr;
 				info->si_errno = 0;
 				info->si_code = SI_USER;
-				info->si_pid = task_pid_vnr(current->parent);
+				info->si_pid = task_tgid_vnr(current->parent);
 				info->si_uid = current->parent->uid;
 			}
 
-- 
1.5.3.rc6.17.g1911

^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [PATCH 3/9] pid: Implement ns_of_pid.
       [not found]         ` <m1tzmoxexb.fsf_-_-T1Yj925okcoyDheHMi7gv2pdwda3JcWeAL8bYrjMMd8@public.gmane.org>
@ 2007-12-12 12:44           ` Eric W. Biederman
       [not found]             ` <m1prxcxeum.fsf_-_-T1Yj925okcoyDheHMi7gv2pdwda3JcWeAL8bYrjMMd8@public.gmane.org>
  0 siblings, 1 reply; 52+ messages in thread
From: Eric W. Biederman @ 2007-12-12 12:44 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Linux Containers, Oleg Nesterov, Pavel Emelyanov

A current problem with the pid namespace is that it is
easy to do pid related work after exit_task_namespaces which
drops the nsproxy pointer.

However if we are doing pid namespace related work we are
always operating on some struct pid which retains the pid_namespace
pointer of the pid namespace it was allocated in.

So provide ns_of_pid which allows us to find the pid
namespace a pid was allocated in.

Using this we have the needed infrastructure to do pid
namespace related work at anytime we have a struct pid,
removing the chance of accidentally having a NULL
pointer dereference when accessing current->nsproxy.

Signed-off-by: Eric W. Biederman <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
---
 include/linux/pid.h |   11 +++++++++++
 1 files changed, 11 insertions(+), 0 deletions(-)

diff --git a/include/linux/pid.h b/include/linux/pid.h
index f84d532..c4b56c0 100644
--- a/include/linux/pid.h
+++ b/include/linux/pid.h
@@ -123,6 +123,17 @@ int next_pidmap(struct pid_namespace *pid_ns, int last);
 extern struct pid *alloc_pid(struct pid_namespace *ns);
 extern void FASTCALL(free_pid(struct pid *pid));

+/* ns_of_pid returns the pid namespace in which the specified
+ * pid was allocated.
+ */
+static inline struct pid_namespace *ns_of_pid(struct pid *pid)
+{
+	struct pid_namespace *ns = NULL;
+	if (pid)
+		ns = pid->numbers[pid->level].ns;
+	return ns;
+}
+
 /*
  * the helpers to get the pid's id seen from different namespaces
  *
-- 
1.5.3.rc6.17.g1911

^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [PATCH 4/9] pid: Generalize task_active_pid_ns
       [not found]             ` <m1prxcxeum.fsf_-_-T1Yj925okcoyDheHMi7gv2pdwda3JcWeAL8bYrjMMd8@public.gmane.org>
@ 2007-12-12 12:46               ` Eric W. Biederman
       [not found]                 ` <m1lk80xeps.fsf_-_-T1Yj925okcoyDheHMi7gv2pdwda3JcWeAL8bYrjMMd8@public.gmane.org>
  2007-12-13  0:59               ` [PATCH 3/9] pid: Implement ns_of_pid sukadev-r/Jw6+rmf7HQT0dZR+AlfA
  1 sibling, 1 reply; 52+ messages in thread
From: Eric W. Biederman @ 2007-12-12 12:46 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Linux Containers, Oleg Nesterov, Pavel Emelyanov


Currently task_active_pid_ns is not safe to call after a
task becomes a zombie and exit_task_namespaces is called,
as nsproxy becomes NULL.  By reading the pid namespace from
the pid of the task we can trivially solve this problem at
the cost of one extra memory read in what should be the
same cacheline as we read the namespace from.

When moving things around I have made task_active_pid_ns
out of line because keeping it in pid_namespace.h would
require adding includes of pid.h and sched.h that I
don't think we want.

This change does make task_active_pid_ns unsafe to call during
copy_process until we attach a pid on the task_struct which
seems to be a reasonable trade off.

Signed-off-by: Eric W. Biederman <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
---
 include/linux/pid_namespace.h |    5 +----
 kernel/fork.c                 |    4 ++--
 kernel/pid.c                  |    6 ++++++
 3 files changed, 9 insertions(+), 6 deletions(-)

diff --git a/include/linux/pid_namespace.h b/include/linux/pid_namespace.h
index fcd61fa..ab13faa 100644
--- a/include/linux/pid_namespace.h
+++ b/include/linux/pid_namespace.h
@@ -74,10 +74,7 @@ static inline void zap_pid_ns_processes(struct pid_namespace *ns)
 }
 #endif /* CONFIG_PID_NS */
 
-static inline struct pid_namespace *task_active_pid_ns(struct task_struct *tsk)
-{
-	return tsk->nsproxy->pid_ns;
-}
+extern struct pid_namespace *task_active_pid_ns(struct task_struct *tsk);
 
 static inline struct task_struct *task_child_reaper(struct task_struct *tsk)
 {
diff --git a/kernel/fork.c b/kernel/fork.c
index 32d75d8..a210860 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1164,12 +1164,12 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 
 	if (pid != &init_struct_pid) {
 		retval = -ENOMEM;
-		pid = alloc_pid(task_active_pid_ns(p));
+		pid = alloc_pid(p->nsproxy->pid_ns);
 		if (!pid)
 			goto bad_fork_cleanup_namespaces;
 
 		if (clone_flags & CLONE_NEWPID) {
-			retval = pid_ns_prepare_proc(task_active_pid_ns(p));
+			retval = pid_ns_prepare_proc(p->nsproxy->pid_ns);
 			if (retval < 0)
 				goto bad_fork_free_pid;
 		}
diff --git a/kernel/pid.c b/kernel/pid.c
index c507ca7..54d7634 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -472,6 +472,12 @@ pid_t task_session_nr_ns(struct task_struct *tsk, struct pid_namespace *ns)
 }
 EXPORT_SYMBOL(task_session_nr_ns);
 
+struct pid_namespace *task_active_pid_ns(struct task_struct *tsk)
+{
+	return ns_of_pid(task_pid(tsk));
+}
+EXPORT_SYMBOL_GPL(task_active_pid_ns);
+
 /*
  * Used by proc to find the first pid that is greater then or equal to nr.
  *
-- 
1.5.3.rc6.17.g1911

^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [PATCH 5/9] pid: Update pid_vnr to use task_active_pid_ns
       [not found]                 ` <m1lk80xeps.fsf_-_-T1Yj925okcoyDheHMi7gv2pdwda3JcWeAL8bYrjMMd8@public.gmane.org>
@ 2007-12-12 12:48                   ` Eric W. Biederman
       [not found]                     ` <m1hcioxenh.fsf_-_-T1Yj925okcoyDheHMi7gv2pdwda3JcWeAL8bYrjMMd8@public.gmane.org>
  2007-12-13 16:01                   ` [PATCH 4/9] pid: Generalize task_active_pid_ns Oleg Nesterov
  1 sibling, 1 reply; 52+ messages in thread
From: Eric W. Biederman @ 2007-12-12 12:48 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Linux Containers, Oleg Nesterov, Pavel Emelyanov


There is no real additional cost in using task_active_pid_ns and
by doing so it means we do not need to audit the callers of pid_vnr
closely, to ensure they are always called on living processes.

Signed-off-by: Eric W. Biederman <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
---
 kernel/pid.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/kernel/pid.c b/kernel/pid.c
index 54d7634..a7ecfce 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -444,7 +444,7 @@ pid_t pid_nr_ns(struct pid *pid, struct pid_namespace *ns)
 
 pid_t pid_vnr(struct pid *pid)
 {
-	return pid_nr_ns(pid, current->nsproxy->pid_ns);
+	return pid_nr_ns(pid, task_active_pid_ns(current));
 }
 EXPORT_SYMBOL_GPL(pid_vnr);
 
-- 
1.5.3.rc6.17.g1911

^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [PATCH 6/9] pid: Implement pid_in_pid_ns.
       [not found]                     ` <m1hcioxenh.fsf_-_-T1Yj925okcoyDheHMi7gv2pdwda3JcWeAL8bYrjMMd8@public.gmane.org>
@ 2007-12-12 12:49                       ` Eric W. Biederman
       [not found]                         ` <m1d4tcxelu.fsf_-_-T1Yj925okcoyDheHMi7gv2pdwda3JcWeAL8bYrjMMd8@public.gmane.org>
  2007-12-12 13:28                       ` [PATCH 5/9] pid: Update pid_vnr to use task_active_pid_ns Pavel Emelyanov
  1 sibling, 1 reply; 52+ messages in thread
From: Eric W. Biederman @ 2007-12-12 12:49 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Linux Containers, Oleg Nesterov, Pavel Emelyanov


Implement a simple check to see if a specified pid resides
in a specified pid namespace.  Generally we are smart enough
just to operate on single pid namespace and avoid this test.
However in the case of sending a signal to init we must check
to see if our sender is a child of init and so this test is needed.

Signed-off-by: Eric W. Biederman <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
---
 include/linux/pid.h |    3 +++
 kernel/pid.c        |    6 ++++++
 2 files changed, 9 insertions(+), 0 deletions(-)

diff --git a/include/linux/pid.h b/include/linux/pid.h
index c4b56c0..e409cc5 100644
--- a/include/linux/pid.h
+++ b/include/linux/pid.h
@@ -156,6 +156,9 @@ static inline pid_t pid_nr(struct pid *pid)
 pid_t pid_nr_ns(struct pid *pid, struct pid_namespace *ns);
 pid_t pid_vnr(struct pid *pid);
 
+/* Test to see if pid is present in pid namespace ns */
+extern int pid_in_pid_ns(struct pid *pid, struct pid_namespace *ns);
+
 #define do_each_pid_task(pid, type, task)				\
 	do {								\
 		struct hlist_node *pos___;				\
diff --git a/kernel/pid.c b/kernel/pid.c
index a7ecfce..873c00f 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -429,6 +429,12 @@ struct pid *find_get_pid(pid_t nr)
 	return pid;
 }
 
+int pid_in_pid_ns(struct pid *pid, struct pid_namespace *ns)
+{
+	return pid && ns && (ns->level <= pid->level) &&
+		pid->numbers[ns->level].ns == ns;
+}
+
 pid_t pid_nr_ns(struct pid *pid, struct pid_namespace *ns)
 {
 	struct upid *upid;
-- 
1.5.3.rc6.17.g1911

^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [PATCH 7/9] sig: Handle pid namespace crossing when sending signals.
       [not found]                         ` <m1d4tcxelu.fsf_-_-T1Yj925okcoyDheHMi7gv2pdwda3JcWeAL8bYrjMMd8@public.gmane.org>
@ 2007-12-12 12:52                           ` Eric W. Biederman
       [not found]                             ` <m18x40xeg6.fsf_-_-T1Yj925okcoyDheHMi7gv2pdwda3JcWeAL8bYrjMMd8@public.gmane.org>
  2007-12-12 13:33                           ` [PATCH 6/9] pid: Implement pid_in_pid_ns Pavel Emelyanov
  1 sibling, 1 reply; 52+ messages in thread
From: Eric W. Biederman @ 2007-12-12 12:52 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Linux Containers, Oleg Nesterov, Pavel Emelyanov


Setting si_pid correctly in the context of pid namespaces is tricky.
Currently with the special cases in do_notify_parent and
do_notify_parent_cldstop we handle all of the day to day cases
properly except sending a signal to a task in a child pid namespace.
For that case we need to pretend to be the kernel and set si_pid to 0.

There are also a few theoretical cases where we can trigger sending a
signal from a task in one pid namespace to a task in another pid
namespace.  With no necessary correlation between one or the other.
In those cases when the source pid namespace is a child of the
destination pid namespace we actually have a valid pid value
we can and should report to user space.

This patch modifies the code to handle the full general case for
setting si_pid.  The code is a little longer but occurs only once and
making it some easier to understand and verify it is correct.

I add a struct pid sender parameter to __group_send_sig_info, as that is
the only function called with si_pid != task_tgid_vnr(current).  So we can
correctly handle the sending of a signal to the parent of an arbitrary
task.

Signed-off-by: Eric W. Biederman <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
---
 drivers/char/tty_io.c     |    4 +-
 include/linux/signal.h    |    3 +-
 ipc/mqueue.c              |    2 +-
 kernel/posix-cpu-timers.c |    8 ++--
 kernel/signal.c           |   88 +++++++++++++++++++++++++++-----------------
 5 files changed, 63 insertions(+), 42 deletions(-)

diff --git a/drivers/char/tty_io.c b/drivers/char/tty_io.c
index 613ec81..c121cdb 100644
--- a/drivers/char/tty_io.c
+++ b/drivers/char/tty_io.c
@@ -1435,8 +1435,8 @@ static void do_tty_hangup(struct work_struct *work)
 				spin_unlock_irq(&p->sighand->siglock);
 				continue;
 			}
-			__group_send_sig_info(SIGHUP, SEND_SIG_PRIV, p);
-			__group_send_sig_info(SIGCONT, SEND_SIG_PRIV, p);
+			__group_send_sig_info(SIGHUP, SEND_SIG_PRIV, p, NULL);
+			__group_send_sig_info(SIGCONT, SEND_SIG_PRIV, p, NULL);
 			put_pid(p->signal->tty_old_pgrp);  /* A noop */
 			if (tty->pgrp)
 				p->signal->tty_old_pgrp = get_pid(tty->pgrp);
diff --git a/include/linux/signal.h b/include/linux/signal.h
index 42d2e0a..0a13489 100644
--- a/include/linux/signal.h
+++ b/include/linux/signal.h
@@ -234,7 +234,8 @@ static inline int valid_signal(unsigned long sig)
 
 extern int next_signal(struct sigpending *pending, sigset_t *mask);
 extern int group_send_sig_info(int sig, struct siginfo *info, struct task_struct *p);
-extern int __group_send_sig_info(int, struct siginfo *, struct task_struct *);
+extern int __group_send_sig_info(int, struct siginfo *, struct task_struct *,
+				 struct pid *sender);
 extern long do_sigpending(void __user *, unsigned long);
 extern int sigprocmask(int, sigset_t *, sigset_t *);
 extern int show_unhandled_signals;
diff --git a/ipc/mqueue.c b/ipc/mqueue.c
index d3feadf..b0bf0b0 100644
--- a/ipc/mqueue.c
+++ b/ipc/mqueue.c
@@ -510,7 +510,7 @@ static void __do_notify(struct mqueue_inode_info *info)
 			sig_i.si_errno = 0;
 			sig_i.si_code = SI_MESGQ;
 			sig_i.si_value = info->notify.sigev_value;
-			sig_i.si_pid = task_tgid_vnr(current);
+			sig_i.si_pid = 0;	/* Uses default current tgid */
 			sig_i.si_uid = current->uid;
 
 			kill_pid_info(info->notify.sigev_signo,
diff --git a/kernel/posix-cpu-timers.c b/kernel/posix-cpu-timers.c
index 68c9637..91f80b9 100644
--- a/kernel/posix-cpu-timers.c
+++ b/kernel/posix-cpu-timers.c
@@ -1109,7 +1109,7 @@ static void check_process_timers(struct task_struct *tsk,
 				sig->it_prof_expires = cputime_add(
 					sig->it_prof_expires, ptime);
 			}
-			__group_send_sig_info(SIGPROF, SEND_SIG_PRIV, tsk);
+			__group_send_sig_info(SIGPROF, SEND_SIG_PRIV, tsk, NULL);
 		}
 		if (!cputime_eq(sig->it_prof_expires, cputime_zero) &&
 		    (cputime_eq(prof_expires, cputime_zero) ||
@@ -1125,7 +1125,7 @@ static void check_process_timers(struct task_struct *tsk,
 				sig->it_virt_expires = cputime_add(
 					sig->it_virt_expires, utime);
 			}
-			__group_send_sig_info(SIGVTALRM, SEND_SIG_PRIV, tsk);
+			__group_send_sig_info(SIGVTALRM, SEND_SIG_PRIV, tsk, NULL);
 		}
 		if (!cputime_eq(sig->it_virt_expires, cputime_zero) &&
 		    (cputime_eq(virt_expires, cputime_zero) ||
@@ -1141,14 +1141,14 @@ static void check_process_timers(struct task_struct *tsk,
 			 * At the hard limit, we just die.
 			 * No need to calculate anything else now.
 			 */
-			__group_send_sig_info(SIGKILL, SEND_SIG_PRIV, tsk);
+			__group_send_sig_info(SIGKILL, SEND_SIG_PRIV, tsk, NULL);
 			return;
 		}
 		if (psecs >= sig->rlim[RLIMIT_CPU].rlim_cur) {
 			/*
 			 * At the soft limit, send a SIGXCPU every second.
 			 */
-			__group_send_sig_info(SIGXCPU, SEND_SIG_PRIV, tsk);
+			__group_send_sig_info(SIGXCPU, SEND_SIG_PRIV, tsk, NULL);
 			if (sig->rlim[RLIMIT_CPU].rlim_cur
 			    < sig->rlim[RLIMIT_CPU].rlim_max) {
 				sig->rlim[RLIMIT_CPU].rlim_cur++;
diff --git a/kernel/signal.c b/kernel/signal.c
index 694a643..c01e3cd 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -657,8 +657,40 @@ static void handle_stop_signal(int sig, struct task_struct *p)
 	}
 }
 
+static void set_sigqueue_pid(struct sigqueue *q, struct task_struct *t,
+			     struct pid *sender)
+{
+	struct pid_namespace *ns;
+
+	/* Set si_pid to the pid number of sender in the pid namespace of
+	 * our destination task for all siginfo types that support it.
+	 */
+	switch(q->info.si_code & __SI_MASK) {
+		/* siginfo without si_pid */
+	case __SI_TIMER:
+	case __SI_POLL:
+	case __SI_FAULT:
+		break;
+		/* siginfo with si_pid */
+	case __SI_KILL:
+	case __SI_CHLD:
+	case __SI_RT:
+	case __SI_MESGQ:
+	default:
+		/* si_pid for SI_KERNEL is always 0 */
+		if (q->info.si_code == SI_KERNEL)
+			break;
+		/* Is current not the sending task? */
+		if (!sender)
+			sender = task_tgid(current);
+		ns = task_active_pid_ns(t);
+		q->info.si_pid = pid_nr_ns(sender, ns);
+		break;
+	}
+}
+
 static int send_signal(int sig, struct siginfo *info, struct task_struct *t,
-			struct sigpending *signals)
+			struct sigpending *signals, struct pid *sender)
 {
 	struct sigqueue * q = NULL;
 	int ret = 0;
@@ -694,8 +726,9 @@ static int send_signal(int sig, struct siginfo *info, struct task_struct *t,
 			q->info.si_signo = sig;
 			q->info.si_errno = 0;
 			q->info.si_code = SI_USER;
-			q->info.si_pid = task_tgid_vnr(current);
+			q->info.si_pid = 0;	/* Uses current tgid */
 			q->info.si_uid = current->uid;
+			sender = task_tgid(current);
 			break;
 		case (unsigned long) SEND_SIG_PRIV:
 			q->info.si_signo = sig;
@@ -708,6 +741,7 @@ static int send_signal(int sig, struct siginfo *info, struct task_struct *t,
 			copy_siginfo(&q->info, info);
 			break;
 		}
+		set_sigqueue_pid(q, t, sender);
 	} else if (!is_si_special(info)) {
 		if (sig >= SIGRTMIN && info->si_code != SI_USER)
 		/*
@@ -775,7 +809,7 @@ specific_send_sig_info(int sig, struct siginfo *info, struct task_struct *t)
 	if (LEGACY_QUEUE(&t->pending, sig))
 		goto out;
 
-	ret = send_signal(sig, info, t, &t->pending);
+	ret = send_signal(sig, info, t, &t->pending, NULL);
 	if (!ret && !sigismember(&t->blocked, sig))
 		signal_wake_up(t, sig == SIGKILL);
 out:
@@ -922,7 +956,8 @@ __group_complete_signal(int sig, struct task_struct *p)
 }
 
 int
-__group_send_sig_info(int sig, struct siginfo *info, struct task_struct *p)
+__group_send_sig_info(int sig, struct siginfo *info, struct task_struct *p,
+		      struct pid *sender)
 {
 	int ret = 0;
 
@@ -942,7 +977,7 @@ __group_send_sig_info(int sig, struct siginfo *info, struct task_struct *p)
 	 * We always use the shared queue for process-wide signals,
 	 * to avoid several races.
 	 */
-	ret = send_signal(sig, info, p, &p->signal->shared_pending);
+	ret = send_signal(sig, info, p, &p->signal->shared_pending, sender);
 	if (unlikely(ret))
 		return ret;
 
@@ -1008,7 +1043,7 @@ int group_send_sig_info(int sig, struct siginfo *info, struct task_struct *p)
 	if (!ret && sig) {
 		ret = -ESRCH;
 		if (lock_task_sighand(p, &flags)) {
-			ret = __group_send_sig_info(sig, info, p);
+			ret = __group_send_sig_info(sig, info, p, NULL);
 			unlock_task_sighand(p, &flags);
 		}
 	}
@@ -1114,7 +1149,7 @@ int kill_pid_info_as_uid(int sig, struct siginfo *info, struct pid *pid,
 	if (sig && p->sighand) {
 		unsigned long flags;
 		spin_lock_irqsave(&p->sighand->siglock, flags);
-		ret = __group_send_sig_info(sig, info, p);
+		ret = __group_send_sig_info(sig, info, p, NULL);
 		spin_unlock_irqrestore(&p->sighand->siglock, flags);
 	}
 out_unlock:
@@ -1415,6 +1450,7 @@ void do_notify_parent(struct task_struct *tsk, int sig)
 	struct siginfo info;
 	unsigned long flags;
 	struct sighand_struct *psig;
+	struct pid *sender;
 
 	BUG_ON(sig == -1);
 
@@ -1424,24 +1460,11 @@ void do_notify_parent(struct task_struct *tsk, int sig)
 	BUG_ON(!tsk->ptrace &&
 	       (tsk->group_leader != tsk || !thread_group_empty(tsk)));
 
+	/* We are under tasklist_lock so no need to call get_pid */
+	sender = task_pid(tsk);
 	info.si_signo = sig;
 	info.si_errno = 0;
-	/*
-	 * we are under tasklist_lock here so our parent is tied to
-	 * us and cannot exit and release its namespace.
-	 *
-	 * the only it can is to switch its nsproxy with sys_unshare,
-	 * bu uncharing pid namespaces is not allowed, so we'll always
-	 * see relevant namespace
-	 *
-	 * write_lock() currently calls preempt_disable() which is the
-	 * same as rcu_read_lock(), but according to Oleg, this is not
-	 * correct to rely on this
-	 */
-	rcu_read_lock();
-	info.si_pid = task_pid_nr_ns(tsk, tsk->parent->nsproxy->pid_ns);
-	rcu_read_unlock();
-
+	info.si_pid = 0;	/* Filled in later from sender */
 	info.si_uid = tsk->uid;
 
 	/* FIXME: find out whether or not this is supposed to be c*time. */
@@ -1485,7 +1508,7 @@ void do_notify_parent(struct task_struct *tsk, int sig)
 			sig = 0;
 	}
 	if (valid_signal(sig) && sig > 0)
-		__group_send_sig_info(sig, &info, tsk->parent);
+		__group_send_sig_info(sig, &info, tsk->parent, sender);
 	__wake_up_parent(tsk, tsk->parent);
 	spin_unlock_irqrestore(&psig->siglock, flags);
 }
@@ -1496,6 +1519,7 @@ static void do_notify_parent_cldstop(struct task_struct *tsk, int why)
 	unsigned long flags;
 	struct task_struct *parent;
 	struct sighand_struct *sighand;
+	struct pid *sender;
 
 	if (tsk->ptrace & PT_PTRACED)
 		parent = tsk->parent;
@@ -1504,15 +1528,11 @@ static void do_notify_parent_cldstop(struct task_struct *tsk, int why)
 		parent = tsk->real_parent;
 	}
 
+	/* We are under tasklist_lock so no need to call get_pid */
+	sender = task_pid(tsk);
 	info.si_signo = SIGCHLD;
 	info.si_errno = 0;
-	/*
-	 * see comment in do_notify_parent() abot the following 3 lines
-	 */
-	rcu_read_lock();
-	info.si_pid = task_pid_nr_ns(tsk, tsk->parent->nsproxy->pid_ns);
-	rcu_read_unlock();
-
+	info.si_pid = 0;	/* Filled in later from sender */
 	info.si_uid = tsk->uid;
 
 	/* FIXME: find out whether or not this is supposed to be c*time. */
@@ -1538,7 +1558,7 @@ static void do_notify_parent_cldstop(struct task_struct *tsk, int why)
 	spin_lock_irqsave(&sighand->siglock, flags);
 	if (sighand->action[SIGCHLD-1].sa.sa_handler != SIG_IGN &&
 	    !(sighand->action[SIGCHLD-1].sa.sa_flags & SA_NOCLDSTOP))
-		__group_send_sig_info(SIGCHLD, &info, parent);
+		__group_send_sig_info(SIGCHLD, &info, parent, sender);
 	/*
 	 * Even if SIGCHLD is not generated, we must wake up wait4 calls.
 	 */
@@ -2223,7 +2243,7 @@ sys_kill(int pid, int sig)
 	info.si_signo = sig;
 	info.si_errno = 0;
 	info.si_code = SI_USER;
-	info.si_pid = task_tgid_vnr(current);
+	info.si_pid = 0;	/* Uses default current tgid */
 	info.si_uid = current->uid;
 
 	return kill_something_info(sig, &info, pid);
@@ -2239,7 +2259,7 @@ static int do_tkill(int tgid, int pid, int sig)
 	info.si_signo = sig;
 	info.si_errno = 0;
 	info.si_code = SI_TKILL;
-	info.si_pid = task_tgid_vnr(current);
+	info.si_pid = 0;	/* Uses default current tgid */
 	info.si_uid = current->uid;
 
 	read_lock(&tasklist_lock);
-- 
1.5.3.rc6.17.g1911

^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [PATCH 8/9] signal: Drop signals before sending them to init.
       [not found]                             ` <m18x40xeg6.fsf_-_-T1Yj925okcoyDheHMi7gv2pdwda3JcWeAL8bYrjMMd8@public.gmane.org>
@ 2007-12-12 12:57                               ` Eric W. Biederman
       [not found]                                 ` <m13au8xe8m.fsf_-_-T1Yj925okcoyDheHMi7gv2pdwda3JcWeAL8bYrjMMd8@public.gmane.org>
  0 siblings, 1 reply; 52+ messages in thread
From: Eric W. Biederman @ 2007-12-12 12:57 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Linux Containers, Oleg Nesterov, Pavel Emelyanov

Currently init drops all signals including SIGKILL that are sent to it
that it does not ignore and does have a handler for.  Extending this to
pid namespaces where we want to maintain this semantic for signals sent
from inside the pid namespace (descendents of init) but we want the
namespace init to appear as a normal process is problematic, as a
naive approach requires always tracking the sender of a signal.

By making the rule (for init dropping signals):
When sending a signal to init, the presence of a signal handler that
is not SIG_DFL allows the signal to be sent to init.  If the signal
is not sent it is silently dropped without becoming pending.

The only noticeable user space difference from todays init is that it
no longer needs to worry about signals becoming pending when it has
them marked as SIG_DFL and blocked.

This change by making the presence of a signal handler effectively
a permission check allows us to do all of the work before we enqueue
the signal, and there is no need for any fancy tracking of the
signal sender.

Which means that we can now allow force_sig_info to send signals to
init, that panic the kernel instead of going into an infinite busy
loop taking an exception sending a signal and then retaking the same
exception, eating all of the cpu time but accomplishing nothing.

This change also makes it possible to easily implement the desired
semantics of /sbin/init for pid namespaces where outer processes can
kill init but processes inside the pid namespace can not.

While it is now easy to remove the dropping of signals from individual
code paths such as force_sig_info this patch does not implement that,
to retain as much of the current behavior as possible.  The only
behavioral difference besides not queuing blocked SIG_DFL signals
are signals directly added with sigaddset.  In practice a threaded init
now receives SIGKILL sent from a core dump, a thread group exit, or an
exec shutting down extraneous threads.

This patch was inspired by a patch from Oleg and initial refined
in conversation with Suka and others on the containers list.

Signed-off-by: Eric W. Biederman <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
---
 kernel/signal.c |   47 ++++++++++++++++++++++++++++++++++++++---------
 1 files changed, 38 insertions(+), 9 deletions(-)

diff --git a/kernel/signal.c b/kernel/signal.c
index c01e3cd..029a45d 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -64,6 +64,25 @@ static int sig_ignored(struct task_struct *t, int sig)
 		(handler == SIG_DFL && sig_kernel_ignore(sig));
 }

+static int is_sig_init(struct task_struct *tsk)
+{
+	if (likely(!is_global_init(tsk->group_leader)))
+		return 0;
+
+	return 1;
+}
+
+static int sig_init_drop(struct task_struct *tsk, int sig)
+{
+	/* All signals for which init has a SIG_DFL handler are
+	 * silently dropped without being sent.
+	 */
+	if (!is_sig_init(tsk))
+		return 0;
+
+	return (tsk->sighand->action[sig-1].sa.sa_handler == SIG_DFL);
+}
+
 /*
  * Re-calculate pending state from the set of locally pending
  * signals, globally pending signals, and blocked signals.
@@ -834,6 +853,9 @@ force_sig_info(int sig, struct siginfo *info, struct task_struct *t)
 	struct k_sigaction *action;

 	spin_lock_irqsave(&t->sighand->siglock, flags);
+	ret = 0;
+	if (sig_init_drop(t, sig))
+		goto out;
 	action = &t->sighand->action[sig-1];
 	ignored = action->sa.sa_handler == SIG_IGN;
 	blocked = sigismember(&t->blocked, sig);
@@ -845,6 +867,7 @@ force_sig_info(int sig, struct siginfo *info, struct task_struct *t)
 		}
 	}
 	ret = specific_send_sig_info(sig, info, t);
+out:
 	spin_unlock_irqrestore(&t->sighand->siglock, flags);

 	return ret;
@@ -962,6 +985,9 @@ __group_send_sig_info(int sig, struct siginfo *info, struct task_struct *p,
 	int ret = 0;

 	assert_spin_locked(&p->sighand->siglock);
+	if (sig_init_drop(p, sig))
+		return ret;
+
 	handle_stop_signal(sig, p);

 	/* Short-circuit ignored signals.  */
@@ -1224,7 +1250,9 @@ send_sig_info(int sig, struct siginfo *info, struct task_struct *p)
 	 */
 	read_lock(&tasklist_lock);  
 	spin_lock_irqsave(&p->sighand->siglock, flags);
-	ret = specific_send_sig_info(sig, info, p);
+	ret = 0;
+	if (!sig_init_drop(p, sig))
+		ret = specific_send_sig_info(sig, info, p);
 	spin_unlock_irqrestore(&p->sighand->siglock, flags);
 	read_unlock(&tasklist_lock);
 	return ret;
@@ -1392,6 +1420,11 @@ send_group_sigqueue(int sig, struct sigqueue *q, struct task_struct *p)
 	read_lock(&tasklist_lock);
 	/* Since it_lock is held, p->sighand cannot be NULL. */
 	spin_lock_irqsave(&p->sighand->siglock, flags);
+	if (sig_init_drop(p, sig)) {
+		ret = 1;
+		goto out;
+	}
+
 	handle_stop_signal(sig, p);

 	/* Short-circuit ignored signals.  */
@@ -1844,12 +1877,6 @@ relock:
 		if (sig_kernel_ignore(signr)) /* Default is nothing. */
 			continue;

-		/*
-		 * Global init gets no signals it doesn't want.
-		 */
-		if (is_global_init(current))
-			continue;
-
 		if (sig_kernel_stop(signr)) {
 			/*
 			 * The default action is to stop all threads in
@@ -2272,8 +2299,10 @@ static int do_tkill(int tgid, int pid, int sig)
 		 */
 		if (!error && sig && p->sighand) {
 			spin_lock_irq(&p->sighand->siglock);
-			handle_stop_signal(sig, p);
-			error = specific_send_sig_info(sig, &info, p);
+			if (!sig_init_drop(p, sig)) {
+				handle_stop_signal(sig, p);
+				error = specific_send_sig_info(sig, &info, p);
+			}
 			spin_unlock_irq(&p->sighand->siglock);
 		}
 	}
-- 
1.5.3.rc6.17.g1911

^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [PATCH 9/9] signal: Ignore signals sent to the pid namespace init
       [not found]                                 ` <m13au8xe8m.fsf_-_-T1Yj925okcoyDheHMi7gv2pdwda3JcWeAL8bYrjMMd8@public.gmane.org>
@ 2007-12-12 12:58                                   ` Eric W. Biederman
       [not found]                                     ` <m1y7c0vzm4.fsf_-_-T1Yj925okcoyDheHMi7gv2pdwda3JcWeAL8bYrjMMd8@public.gmane.org>
  2007-12-12 19:00                                   ` [PATCH 8/9] signal: Drop signals before sending them to init Serge E. Hallyn
  2007-12-13 16:25                                   ` Oleg Nesterov
  2 siblings, 1 reply; 52+ messages in thread
From: Eric W. Biederman @ 2007-12-12 12:58 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Linux Containers, Oleg Nesterov, Pavel Emelyanov


This patch extends sit_init_drop to take a sender pid so
it can make decisions about dropping signals based on
the sender of the signal.

With knowledge of the signal sender sig_init_drop is extended
to drop signals meant for the pid namespace init processes
as well as the global init process.  However only signals
with senders contained in the pid namespace of the init
process are dropped.  Ensuring that outside of the pid
namespace the pid namespace init process continues to look
like an ordinary process.

Signed-off-by: Eric W. Biederman <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
---
 kernel/signal.c |   24 +++++++++++++++---------
 1 files changed, 15 insertions(+), 9 deletions(-)

diff --git a/kernel/signal.c b/kernel/signal.c
index 029a45d..074905f 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -64,20 +64,26 @@ static int sig_ignored(struct task_struct *t, int sig)
 		(handler == SIG_DFL && sig_kernel_ignore(sig));
 }
 
-static int is_sig_init(struct task_struct *tsk)
+static int is_sig_init(struct task_struct *init, struct pid *sender)
 {
-	if (likely(!is_global_init(tsk->group_leader)))
+	if (!is_container_init(init))
+		return 0;
+
+	if (!sender)
+		sender = task_tgid(current);
+
+	if (!pid_in_pid_ns(sender, task_active_pid_ns(init)))
 		return 0;
 
 	return 1;
 }
 
-static int sig_init_drop(struct task_struct *tsk, int sig)
+static int sig_init_drop(struct task_struct *tsk, int sig, struct pid *sender)
 {
 	/* All signals for which init has a SIG_DFL handler are
 	 * silently dropped without being sent.
 	 */
-	if (!is_sig_init(tsk))
+	if (!is_sig_init(tsk, sender))
 		return 0;
 
 	return (tsk->sighand->action[sig-1].sa.sa_handler == SIG_DFL);
@@ -854,7 +860,7 @@ force_sig_info(int sig, struct siginfo *info, struct task_struct *t)
 
 	spin_lock_irqsave(&t->sighand->siglock, flags);
 	ret = 0;
-	if (sig_init_drop(t, sig))
+	if (sig_init_drop(t, sig, NULL))
 		goto out;
 	action = &t->sighand->action[sig-1];
 	ignored = action->sa.sa_handler == SIG_IGN;
@@ -985,7 +991,7 @@ __group_send_sig_info(int sig, struct siginfo *info, struct task_struct *p,
 	int ret = 0;
 
 	assert_spin_locked(&p->sighand->siglock);
-	if (sig_init_drop(p, sig))
+	if (sig_init_drop(p, sig, sender))
 		return ret;
 
 	handle_stop_signal(sig, p);
@@ -1251,7 +1257,7 @@ send_sig_info(int sig, struct siginfo *info, struct task_struct *p)
 	read_lock(&tasklist_lock);  
 	spin_lock_irqsave(&p->sighand->siglock, flags);
 	ret = 0;
-	if (!sig_init_drop(p, sig))
+	if (!sig_init_drop(p, sig, NULL))
 		ret = specific_send_sig_info(sig, info, p);
 	spin_unlock_irqrestore(&p->sighand->siglock, flags);
 	read_unlock(&tasklist_lock);
@@ -1420,7 +1426,7 @@ send_group_sigqueue(int sig, struct sigqueue *q, struct task_struct *p)
 	read_lock(&tasklist_lock);
 	/* Since it_lock is held, p->sighand cannot be NULL. */
 	spin_lock_irqsave(&p->sighand->siglock, flags);
-	if (sig_init_drop(p, sig)) {
+	if (sig_init_drop(p, sig, NULL)) {
 		ret = 1;
 		goto out;
 	}
@@ -2299,7 +2305,7 @@ static int do_tkill(int tgid, int pid, int sig)
 		 */
 		if (!error && sig && p->sighand) {
 			spin_lock_irq(&p->sighand->siglock);
-			if (!sig_init_drop(p, sig)) {
+			if (!sig_init_drop(p, sig, NULL)) {
 				handle_stop_signal(sig, p);
 				error = specific_send_sig_info(sig, &info, p);
 			}
-- 
1.5.3.rc6.17.g1911

^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [PATCH 0/4] pid namespace infrastructure cleanups
       [not found]                                     ` <m1y7c0vzm4.fsf_-_-T1Yj925okcoyDheHMi7gv2pdwda3JcWeAL8bYrjMMd8@public.gmane.org>
@ 2007-12-12 13:09                                       ` Eric W. Biederman
       [not found]                                         ` <m1odcwvz3d.fsf_-_-T1Yj925okcoyDheHMi7gv2pdwda3JcWeAL8bYrjMMd8@public.gmane.org>
  2007-12-13 16:28                                       ` [PATCH 9/9] signal: Ignore signals sent to the pid namespace init Oleg Nesterov
  1 sibling, 1 reply; 52+ messages in thread
From: Eric W. Biederman @ 2007-12-12 13:09 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Linux Containers, Oleg Nesterov, Pavel Emelyanov

These next 4 patches are just generally nice cleanups to the pid
namespace infrastructure.  There is a slight context dependence on the
previous patchset but nothing fundamental.

These patches remove yet another special case from de_thread.
Remove special cases from proc_get_sb
Kill proc_mnt
Move the pid_namespace logic out of the fork fast path.

Just generally making those code paths much nicer to live with.

Eric

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 1/9] sig: Fix mqueue pid
       [not found]     ` <m1y7c0xezt.fsf-T1Yj925okcoyDheHMi7gv2pdwda3JcWeAL8bYrjMMd8@public.gmane.org>
  2007-12-12 12:42       ` [PATCH 2/9] sig: Fix SI_USER si_pid Eric W. Biederman
@ 2007-12-12 13:24       ` Pavel Emelyanov
  1 sibling, 0 replies; 52+ messages in thread
From: Pavel Emelyanov @ 2007-12-12 13:24 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: Linux Containers, Andrew Morton, Oleg Nesterov

Eric W. Biederman wrote:
> Currently in the sig info for posix message queues we are reporting the
> task id and not the task group id.  Given that this is a posix interface
> and that si_pid is defined by posix as returning the process id, we should
> be reporting the task group id.
> 
> So fix this interface.
> 
> Signed-off-by: Eric W. Biederman <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>

Acked-by: Pavel Emelyanov <xemul-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org>

> ---
>  ipc/mqueue.c |    2 +-
>  1 files changed, 1 insertions(+), 1 deletions(-)
> 
> diff --git a/ipc/mqueue.c b/ipc/mqueue.c
> index 7d1b8aa..d3feadf 100644
> --- a/ipc/mqueue.c
> +++ b/ipc/mqueue.c
> @@ -510,7 +510,7 @@ static void __do_notify(struct mqueue_inode_info *info)
>  			sig_i.si_errno = 0;
>  			sig_i.si_code = SI_MESGQ;
>  			sig_i.si_value = info->notify.sigev_value;
> -			sig_i.si_pid = task_pid_vnr(current);
> +			sig_i.si_pid = task_tgid_vnr(current);
>  			sig_i.si_uid = current->uid;
>  
>  			kill_pid_info(info->notify.sigev_signo,

^ permalink raw reply	[flat|nested] 52+ messages in thread

* [PATCH 1/4] pidns: Remove the child_reaper special case from de_thread.
       [not found]                                         ` <m1odcwvz3d.fsf_-_-T1Yj925okcoyDheHMi7gv2pdwda3JcWeAL8bYrjMMd8@public.gmane.org>
@ 2007-12-12 13:27                                           ` Eric W. Biederman
       [not found]                                             ` <m1ir34vyaj.fsf_-_-T1Yj925okcoyDheHMi7gv2pdwda3JcWeAL8bYrjMMd8@public.gmane.org>
  0 siblings, 1 reply; 52+ messages in thread
From: Eric W. Biederman @ 2007-12-12 13:27 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Linux Containers, Oleg Nesterov, Pavel Emelyanov


This patch inspired by Oleg's leader_pid change for the timer code now
tracks the child_reaper by pid instead of by task_struct, allowing us
to not need to update anything when de_thread is run.

The logic in de_thread that is supposed to let a threaded init call
exec has bit rotted.  Since the signal drop code does not look at
child_reaper changing child_reaper early is of no help.  Since
zap_other_threads uses sigaddset my previous patch to drop signals
sent to init before queueing them actually fixes this case.

Since the child_reaper never exits we don't have to worry about
reference counting the pid.

By tracking the child_reaper of a pid namespace by it's task group id,
we don't need to do anything when the leading task in the task group
exits.

Signed-off-by: Eric W. Biederman <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
---
 fs/exec.c                     |    8 --------
 include/linux/pid_namespace.h |    9 ++-------
 init/main.c                   |    2 +-
 kernel/fork.c                 |    2 +-
 kernel/pid.c                  |   10 +++++++++-
 5 files changed, 13 insertions(+), 18 deletions(-)

diff --git a/fs/exec.c b/fs/exec.c
index f73edfc..96b4822 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -770,14 +770,6 @@ static int de_thread(struct task_struct *tsk)
 		return -EAGAIN;
 	}
 
-	/*
-	 * child_reaper ignores SIGKILL, change it now.
-	 * Reparenting needs write_lock on tasklist_lock,
-	 * so it is safe to do it under read_lock.
-	 */
-	if (unlikely(tsk->group_leader == task_child_reaper(tsk)))
-		task_active_pid_ns(tsk)->child_reaper = tsk;
-
 	sig->group_exit_task = tsk;
 	zap_other_threads(tsk);
 	read_unlock(&tasklist_lock);
diff --git a/include/linux/pid_namespace.h b/include/linux/pid_namespace.h
index ab13faa..2e67033 100644
--- a/include/linux/pid_namespace.h
+++ b/include/linux/pid_namespace.h
@@ -18,7 +18,7 @@ struct pid_namespace {
 	struct kref kref;
 	struct pidmap pidmap[PIDMAP_ENTRIES];
 	int last_pid;
-	struct task_struct *child_reaper;
+	struct pid *child_reaper;
 	struct kmem_cache *pid_cachep;
 	int level;
 	struct pid_namespace *parent;
@@ -75,11 +75,6 @@ static inline void zap_pid_ns_processes(struct pid_namespace *ns)
 #endif /* CONFIG_PID_NS */
 
 extern struct pid_namespace *task_active_pid_ns(struct task_struct *tsk);
-
-static inline struct task_struct *task_child_reaper(struct task_struct *tsk)
-{
-	BUG_ON(tsk != current);
-	return tsk->nsproxy->pid_ns->child_reaper;
-}
+extern struct task_struct *task_child_reaper(struct task_struct *tsk);
 
 #endif /* _LINUX_PID_NS_H */
diff --git a/init/main.c b/init/main.c
index 5c6df6b..57404b6 100644
--- a/init/main.c
+++ b/init/main.c
@@ -828,7 +828,7 @@ static int __init kernel_init(void * unused)
 	 * assumptions about where in the task array this
 	 * can be found.
 	 */
-	init_pid_ns.child_reaper = current;
+	init_pid_ns.child_reaper = task_pid(current);
 
 	cad_pid = task_pid(current);
 
diff --git a/kernel/fork.c b/kernel/fork.c
index a210860..20d26a5 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1305,7 +1305,7 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 
 		if (thread_group_leader(p)) {
 			if (clone_flags & CLONE_NEWPID)
-				p->nsproxy->pid_ns->child_reaper = p;
+				p->nsproxy->pid_ns->child_reaper = pid;
 
 			p->signal->leader_pid = pid;
 			p->signal->tty = current->signal->tty;
diff --git a/kernel/pid.c b/kernel/pid.c
index 873c00f..294fc28 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -76,7 +76,7 @@ struct pid_namespace init_pid_ns = {
 	},
 	.last_pid = 0,
 	.level = 0,
-	.child_reaper = &init_task,
+	.child_reaper = &init_struct_pid,
 };
 EXPORT_SYMBOL_GPL(init_pid_ns);
 
@@ -484,6 +484,14 @@ struct pid_namespace *task_active_pid_ns(struct task_struct *tsk)
 }
 EXPORT_SYMBOL_GPL(task_active_pid_ns);
 
+struct task_struct *task_child_reaper(struct task_struct *tsk)
+{
+	struct pid_namespace *ns;
+	BUG_ON(tsk != current);
+	ns = task_active_pid_ns(tsk);
+	return pid_task(ns->child_reaper, PIDTYPE_PID);
+}
+
 /*
  * Used by proc to find the first pid that is greater then or equal to nr.
  *
-- 
1.5.3.rc6.17.g1911

^ permalink raw reply related	[flat|nested] 52+ messages in thread

* Re: [PATCH 5/9] pid: Update pid_vnr to use task_active_pid_ns
       [not found]                     ` <m1hcioxenh.fsf_-_-T1Yj925okcoyDheHMi7gv2pdwda3JcWeAL8bYrjMMd8@public.gmane.org>
  2007-12-12 12:49                       ` [PATCH 6/9] pid: Implement pid_in_pid_ns Eric W. Biederman
@ 2007-12-12 13:28                       ` Pavel Emelyanov
       [not found]                         ` <475FE201.7060104-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org>
  1 sibling, 1 reply; 52+ messages in thread
From: Pavel Emelyanov @ 2007-12-12 13:28 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: Linux Containers, Andrew Morton, Oleg Nesterov

Eric W. Biederman wrote:
> There is no real additional cost in using task_active_pid_ns and
> by doing so it means we do not need to audit the callers of pid_vnr
> closely, to ensure they are always called on living processes.
> 
> Signed-off-by: Eric W. Biederman <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
> ---
>  kernel/pid.c |    2 +-
>  1 files changed, 1 insertions(+), 1 deletions(-)
> 
> diff --git a/kernel/pid.c b/kernel/pid.c
> index 54d7634..a7ecfce 100644
> --- a/kernel/pid.c
> +++ b/kernel/pid.c
> @@ -444,7 +444,7 @@ pid_t pid_nr_ns(struct pid *pid, struct pid_namespace *ns)
>  
>  pid_t pid_vnr(struct pid *pid)
>  {
> -	return pid_nr_ns(pid, current->nsproxy->pid_ns);
> +	return pid_nr_ns(pid, task_active_pid_ns(current));

NAK. Your version will make more dereferences and one
condition check (if (pid != NULL)). Besides, dead process
can never call this thing - it is even never scheduled 
for execution.

>  }
>  EXPORT_SYMBOL_GPL(pid_vnr);
>  

^ permalink raw reply	[flat|nested] 52+ messages in thread

* [PATCH 2/4] proc: Simplify proc_get_sb.
       [not found]                                             ` <m1ir34vyaj.fsf_-_-T1Yj925okcoyDheHMi7gv2pdwda3JcWeAL8bYrjMMd8@public.gmane.org>
@ 2007-12-12 13:30                                               ` Eric W. Biederman
       [not found]                                                 ` <m1ejdsvy54.fsf_-_-T1Yj925okcoyDheHMi7gv2pdwda3JcWeAL8bYrjMMd8@public.gmane.org>
  0 siblings, 1 reply; 52+ messages in thread
From: Eric W. Biederman @ 2007-12-12 13:30 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Linux Containers, Oleg Nesterov, Pavel Emelyanov


The idle_thread now has a struct pid, so we can always find out know
the pid of the child_reaper before we mount proc.

Therefore we can remove the special cases for getting the pid of the
child_reaper from proc_get_sb.

Signed-off-by: Eric W. Biederman <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
---
 fs/proc/root.c |   17 +----------------
 1 files changed, 1 insertions(+), 16 deletions(-)

diff --git a/fs/proc/root.c b/fs/proc/root.c
index 81f99e6..f442967 100644
--- a/fs/proc/root.c
+++ b/fs/proc/root.c
@@ -46,17 +46,6 @@ static int proc_get_sb(struct file_system_type *fs_type,
 	struct pid_namespace *ns;
 	struct proc_inode *ei;
 
-	if (proc_mnt) {
-		/* Seed the root directory with a pid so it doesn't need
-		 * to be special in base.c.  I would do this earlier but
-		 * the only task alive when /proc is mounted the first time
-		 * is the init_task and it doesn't have any pids.
-		 */
-		ei = PROC_I(proc_mnt->mnt_sb->s_root->d_inode);
-		if (!ei->pid)
-			ei->pid = find_get_pid(1);
-	}
-
 	if (flags & MS_KERNMOUNT)
 		ns = (struct pid_namespace *)data;
 	else
@@ -76,11 +65,7 @@ static int proc_get_sb(struct file_system_type *fs_type,
 		}
 
 		ei = PROC_I(sb->s_root->d_inode);
-		if (!ei->pid) {
-			rcu_read_lock();
-			ei->pid = get_pid(find_pid_ns(1, ns));
-			rcu_read_unlock();
-		}
+		ei->pid = get_pid(ns->child_reaper);
 
 		sb->s_flags |= MS_ACTIVE;
 		ns->proc_mnt = mnt;
-- 
1.5.3.rc6.17.g1911

^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [PATCH 3/4] proc: Remove the unnecessary global proc_mnt
       [not found]                                                 ` <m1ejdsvy54.fsf_-_-T1Yj925okcoyDheHMi7gv2pdwda3JcWeAL8bYrjMMd8@public.gmane.org>
@ 2007-12-12 13:31                                                   ` Eric W. Biederman
       [not found]                                                     ` <m1abogvy39.fsf_-_-T1Yj925okcoyDheHMi7gv2pdwda3JcWeAL8bYrjMMd8@public.gmane.org>
  2007-12-12 13:42                                                   ` [PATCH 2/4] proc: Simplify proc_get_sb Pavel Emelyanov
  1 sibling, 1 reply; 52+ messages in thread
From: Eric W. Biederman @ 2007-12-12 13:31 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Linux Containers, Oleg Nesterov, Pavel Emelyanov


There are no longer any users of the proc_mnt global
and having it suggests that we have a single instance
of proc, so just kill it.

Signed-off-by: Eric W. Biederman <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
---
 fs/proc/inode.c         |    2 --
 fs/proc/root.c          |    5 ++---
 include/linux/proc_fs.h |    1 -
 3 files changed, 2 insertions(+), 6 deletions(-)

diff --git a/fs/proc/inode.c b/fs/proc/inode.c
index 82b3a1b..96446ac 100644
--- a/fs/proc/inode.c
+++ b/fs/proc/inode.c
@@ -71,8 +71,6 @@ static void proc_delete_inode(struct inode *inode)
 	clear_inode(inode);
 }
 
-struct vfsmount *proc_mnt;
-
 static struct kmem_cache * proc_inode_cachep;
 
 static struct inode *proc_alloc_inode(struct super_block *sb)
diff --git a/fs/proc/root.c b/fs/proc/root.c
index f442967..e4b7c5a 100644
--- a/fs/proc/root.c
+++ b/fs/proc/root.c
@@ -97,9 +97,8 @@ void __init proc_root_init(void)
 	err = register_filesystem(&proc_fs_type);
 	if (err)
 		return;
-	proc_mnt = kern_mount_data(&proc_fs_type, &init_pid_ns);
-	err = PTR_ERR(proc_mnt);
-	if (IS_ERR(proc_mnt)) {
+	err = pid_ns_prepare_proc(&init_pid_ns);
+	if (err) {
 		unregister_filesystem(&proc_fs_type);
 		return;
 	}
diff --git a/include/linux/proc_fs.h b/include/linux/proc_fs.h
index a5ab8ad..40da820 100644
--- a/include/linux/proc_fs.h
+++ b/include/linux/proc_fs.h
@@ -128,7 +128,6 @@ extern struct proc_dir_entry *create_proc_entry(const char *name, mode_t mode,
 						struct proc_dir_entry *parent);
 extern void remove_proc_entry(const char *name, struct proc_dir_entry *parent);
 
-extern struct vfsmount *proc_mnt;
 struct pid_namespace;
 extern int proc_fill_super(struct super_block *);
 extern struct inode *proc_get_inode(struct super_block *, unsigned int, struct proc_dir_entry *);
-- 
1.5.3.rc6.17.g1911

^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [PATCH 4/4] pid: Move all of the pid_namespace logic into copy_pid_ns.
       [not found]                                                     ` <m1abogvy39.fsf_-_-T1Yj925okcoyDheHMi7gv2pdwda3JcWeAL8bYrjMMd8@public.gmane.org>
@ 2007-12-12 13:33                                                       ` Eric W. Biederman
       [not found]                                                         ` <m163z4vxzs.fsf_-_-T1Yj925okcoyDheHMi7gv2pdwda3JcWeAL8bYrjMMd8@public.gmane.org>
  0 siblings, 1 reply; 52+ messages in thread
From: Eric W. Biederman @ 2007-12-12 13:33 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Linux Containers, Oleg Nesterov, Pavel Emelyanov


Currently we have a bunch of pid namespace logic unnecessarily
scattered through fork, this patch moves it into copy_pid_ns
where it is easy to find and review.

To do this the prototype of copy_pid_ns is changed to take
a task parameter and the pid passed into copy_process
is passed in the task_pid field of that task.

After this cleanup things are actually clean enough we can contemplate
what an unshare of the pid namespace would mean.

Signed-off-by: Eric W. Biederman <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
---
 include/linux/pid_namespace.h |    5 +++--
 kernel/fork.c                 |   16 +++++-----------
 kernel/nsproxy.c              |    2 +-
 kernel/pid_namespace.c        |   31 ++++++++++++++++++++++++++++---
 4 files changed, 37 insertions(+), 17 deletions(-)

diff --git a/include/linux/pid_namespace.h b/include/linux/pid_namespace.h
index 2e67033..8e2e346 100644
--- a/include/linux/pid_namespace.h
+++ b/include/linux/pid_namespace.h
@@ -37,7 +37,7 @@ static inline struct pid_namespace *get_pid_ns(struct pid_namespace *ns)
 	return ns;
 }
 
-extern struct pid_namespace *copy_pid_ns(unsigned long flags, struct pid_namespace *ns);
+extern struct pid_namespace *copy_pid_ns(unsigned long flags, struct task_struct *tsk);
 extern void free_pid_ns(struct kref *kref);
 extern void zap_pid_ns_processes(struct pid_namespace *pid_ns);
 
@@ -56,8 +56,9 @@ static inline struct pid_namespace *get_pid_ns(struct pid_namespace *ns)
 }
 
 static inline struct pid_namespace *
-copy_pid_ns(unsigned long flags, struct pid_namespace *ns)
+copy_pid_ns(unsigned long flags, struct task_struct *task)
 {
+	struct pid_namespace *ns = task->nsproxy->pid_ns;
 	if (flags & CLONE_NEWPID)
 		ns = ERR_PTR(-EINVAL);
 	return ns;
diff --git a/kernel/fork.c b/kernel/fork.c
index 20d26a5..bbd8bcd 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -50,7 +50,6 @@
 #include <linux/taskstats_kern.h>
 #include <linux/random.h>
 #include <linux/tty.h>
-#include <linux/proc_fs.h>
 #include <linux/memcontrol.h>
 
 #include <asm/pgtable.h>
@@ -1013,6 +1012,9 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 	if (!p)
 		goto fork_out;
 
+	/* Remember the pid */
+	p->pids[PIDTYPE_PID].pid = pid;
+
 	rt_mutex_init_task(p);
 
 #ifdef CONFIG_TRACE_IRQFLAGS
@@ -1162,17 +1164,12 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 	if (retval)
 		goto bad_fork_cleanup_namespaces;
 
-	if (pid != &init_struct_pid) {
+	pid = task_pid(p);
+	if (!pid) {
 		retval = -ENOMEM;
 		pid = alloc_pid(p->nsproxy->pid_ns);
 		if (!pid)
 			goto bad_fork_cleanup_namespaces;
-
-		if (clone_flags & CLONE_NEWPID) {
-			retval = pid_ns_prepare_proc(p->nsproxy->pid_ns);
-			if (retval < 0)
-				goto bad_fork_free_pid;
-		}
 	}
 
 	p->pid = pid_nr(pid);
@@ -1304,9 +1301,6 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 			__ptrace_link(p, current->parent);
 
 		if (thread_group_leader(p)) {
-			if (clone_flags & CLONE_NEWPID)
-				p->nsproxy->pid_ns->child_reaper = pid;
-
 			p->signal->leader_pid = pid;
 			p->signal->tty = current->signal->tty;
 			set_task_pgrp(p, task_pgrp_nr(current));
diff --git a/kernel/nsproxy.c b/kernel/nsproxy.c
index f5d332c..7b6d2dc 100644
--- a/kernel/nsproxy.c
+++ b/kernel/nsproxy.c
@@ -75,7 +75,7 @@ static struct nsproxy *create_new_namespaces(unsigned long flags,
 		goto out_ipc;
 	}
 
-	new_nsp->pid_ns = copy_pid_ns(flags, task_active_pid_ns(tsk));
+	new_nsp->pid_ns = copy_pid_ns(flags, tsk);
 	if (IS_ERR(new_nsp->pid_ns)) {
 		err = PTR_ERR(new_nsp->pid_ns);
 		goto out_pid;
diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c
index 6d792b6..739a72b 100644
--- a/kernel/pid_namespace.c
+++ b/kernel/pid_namespace.c
@@ -12,6 +12,7 @@
 #include <linux/pid_namespace.h>
 #include <linux/syscalls.h>
 #include <linux/err.h>
+#include <linux/proc_fs.h>
 
 #define BITS_PER_PAGE		(PAGE_SIZE*8)
 
@@ -115,9 +116,12 @@ static void destroy_pid_namespace(struct pid_namespace *ns)
 	kmem_cache_free(pid_ns_cachep, ns);
 }
 
-struct pid_namespace *copy_pid_ns(unsigned long flags, struct pid_namespace *old_ns)
+struct pid_namespace *copy_pid_ns(unsigned long flags, struct task_struct *tsk)
 {
+	struct pid_namespace *old_ns = tsk->nsproxy->pid_ns;
 	struct pid_namespace *new_ns;
+	struct pid *pid = NULL;
+	int err;
 
 	BUG_ON(!old_ns);
 	new_ns = get_pid_ns(old_ns);
@@ -129,13 +133,34 @@ struct pid_namespace *copy_pid_ns(unsigned long flags, struct pid_namespace *old
 		goto out_put;
 
 	new_ns = create_pid_namespace(old_ns->level + 1);
-	if (!IS_ERR(new_ns))
-		new_ns->parent = get_pid_ns(old_ns);
+	if (IS_ERR(new_ns))
+		goto out_put;
+
+	new_ns->parent = get_pid_ns(old_ns);
+	pid = task_pid(tsk);
+	if (!pid) {
+		err = -ENOMEM;
+		tsk->pids[PIDTYPE_PID].pid = pid = alloc_pid(new_ns);
+		if (!pid)
+			goto out_put_new;
+	}
+
+	new_ns->child_reaper = pid;
+	err = pid_ns_prepare_proc(new_ns);
+	if (err)
+		goto out_put_new;
 
 out_put:
 	put_pid_ns(old_ns);
 out:
 	return new_ns;
+
+out_put_new:
+	if (pid)
+		free_pid(pid);
+	put_pid_ns(new_ns);
+	new_ns = ERR_PTR(err);
+	goto out_put;
 }
 
 void free_pid_ns(struct kref *kref)
-- 
1.5.3.rc6.17.g1911

^ permalink raw reply related	[flat|nested] 52+ messages in thread

* Re: [PATCH 6/9] pid: Implement pid_in_pid_ns.
       [not found]                         ` <m1d4tcxelu.fsf_-_-T1Yj925okcoyDheHMi7gv2pdwda3JcWeAL8bYrjMMd8@public.gmane.org>
  2007-12-12 12:52                           ` [PATCH 7/9] sig: Handle pid namespace crossing when sending signals Eric W. Biederman
@ 2007-12-12 13:33                           ` Pavel Emelyanov
  1 sibling, 0 replies; 52+ messages in thread
From: Pavel Emelyanov @ 2007-12-12 13:33 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: Linux Containers, Andrew Morton, Oleg Nesterov

Eric W. Biederman wrote:
> Implement a simple check to see if a specified pid resides
> in a specified pid namespace.  Generally we are smart enough
> just to operate on single pid namespace and avoid this test.
> However in the case of sending a signal to init we must check
> to see if our sender is a child of init and so this test is needed.
> 
> Signed-off-by: Eric W. Biederman <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
> ---
>  include/linux/pid.h |    3 +++
>  kernel/pid.c        |    6 ++++++
>  2 files changed, 9 insertions(+), 0 deletions(-)
> 
> diff --git a/include/linux/pid.h b/include/linux/pid.h
> index c4b56c0..e409cc5 100644
> --- a/include/linux/pid.h
> +++ b/include/linux/pid.h
> @@ -156,6 +156,9 @@ static inline pid_t pid_nr(struct pid *pid)
>  pid_t pid_nr_ns(struct pid *pid, struct pid_namespace *ns);
>  pid_t pid_vnr(struct pid *pid);
>  
> +/* Test to see if pid is present in pid namespace ns */
> +extern int pid_in_pid_ns(struct pid *pid, struct pid_namespace *ns);
> +
>  #define do_each_pid_task(pid, type, task)				\
>  	do {								\
>  		struct hlist_node *pos___;				\
> diff --git a/kernel/pid.c b/kernel/pid.c
> index a7ecfce..873c00f 100644
> --- a/kernel/pid.c
> +++ b/kernel/pid.c
> @@ -429,6 +429,12 @@ struct pid *find_get_pid(pid_t nr)
>  	return pid;
>  }
>  
> +int pid_in_pid_ns(struct pid *pid, struct pid_namespace *ns)

Can we give it a better name? Like pid_in_ns() if we do want it
to be as short as possible, or pid_visible_in_ns() if we want
it ti reflect its nature.

> +{
> +	return pid && ns && (ns->level <= pid->level) &&
> +		pid->numbers[ns->level].ns == ns;
> +}
> +
>  pid_t pid_nr_ns(struct pid *pid, struct pid_namespace *ns)
>  {
>  	struct upid *upid;

Thanks,
Pavel

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 2/4] proc: Simplify proc_get_sb.
       [not found]                                                 ` <m1ejdsvy54.fsf_-_-T1Yj925okcoyDheHMi7gv2pdwda3JcWeAL8bYrjMMd8@public.gmane.org>
  2007-12-12 13:31                                                   ` [PATCH 3/4] proc: Remove the unnecessary global proc_mnt Eric W. Biederman
@ 2007-12-12 13:42                                                   ` Pavel Emelyanov
       [not found]                                                     ` <475FE53D.6050408-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org>
  1 sibling, 1 reply; 52+ messages in thread
From: Pavel Emelyanov @ 2007-12-12 13:42 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: Linux Containers, Andrew Morton, Oleg Nesterov

Eric W. Biederman wrote:
> The idle_thread now has a struct pid, so we can always find out know
> the pid of the child_reaper before we mount proc.
> 
> Therefore we can remove the special cases for getting the pid of the
> child_reaper from proc_get_sb.
> 
> Signed-off-by: Eric W. Biederman <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
> ---
>  fs/proc/root.c |   17 +----------------
>  1 files changed, 1 insertions(+), 16 deletions(-)
> 
> diff --git a/fs/proc/root.c b/fs/proc/root.c
> index 81f99e6..f442967 100644
> --- a/fs/proc/root.c
> +++ b/fs/proc/root.c
> @@ -46,17 +46,6 @@ static int proc_get_sb(struct file_system_type *fs_type,
>  	struct pid_namespace *ns;
>  	struct proc_inode *ei;
>  
> -	if (proc_mnt) {
> -		/* Seed the root directory with a pid so it doesn't need
> -		 * to be special in base.c.  I would do this earlier but
> -		 * the only task alive when /proc is mounted the first time
> -		 * is the init_task and it doesn't have any pids.
> -		 */
> -		ei = PROC_I(proc_mnt->mnt_sb->s_root->d_inode);
> -		if (!ei->pid)
> -			ei->pid = find_get_pid(1);
> -	}
> -
>  	if (flags & MS_KERNMOUNT)
>  		ns = (struct pid_namespace *)data;
>  	else
> @@ -76,11 +65,7 @@ static int proc_get_sb(struct file_system_type *fs_type,
>  		}
>  
>  		ei = PROC_I(sb->s_root->d_inode);
> -		if (!ei->pid) {
> -			rcu_read_lock();
> -			ei->pid = get_pid(find_pid_ns(1, ns));
> -			rcu_read_unlock();
> -		}
> +		ei->pid = get_pid(ns->child_reaper);

That's not git-bisect safe - you move the child_reaper initialization
before the call to prepare_proc only in the 4th patch.

>  		sb->s_flags |= MS_ACTIVE;
>  		ns->proc_mnt = mnt;

^ permalink raw reply	[flat|nested] 52+ messages in thread

* [PATCH 0/4] Properly handle talking to all processes in a pid namespace
       [not found]                                                         ` <m163z4vxzs.fsf_-_-T1Yj925okcoyDheHMi7gv2pdwda3JcWeAL8bYrjMMd8@public.gmane.org>
@ 2007-12-12 13:46                                                           ` Eric W. Biederman
       [not found]                                                             ` <m11w9svxeb.fsf_-_-T1Yj925okcoyDheHMi7gv2pdwda3JcWeAL8bYrjMMd8@public.gmane.org>
  0 siblings, 1 reply; 52+ messages in thread
From: Eric W. Biederman @ 2007-12-12 13:46 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Linux Containers, Oleg Nesterov, Pavel Emelyanov


Currently when using -1 to mean all processes or all threads
we don't localize that to the pid namespace.  This patchset
corrects the instances I know of.

Eric

^ permalink raw reply	[flat|nested] 52+ messages in thread

* [PATCH 1/4] signal: Introduce kill_pid_ns_info
       [not found]                                                             ` <m11w9svxeb.fsf_-_-T1Yj925okcoyDheHMi7gv2pdwda3JcWeAL8bYrjMMd8@public.gmane.org>
@ 2007-12-12 13:49                                                               ` Eric W. Biederman
       [not found]                                                                 ` <m1ve74uio4.fsf_-_-T1Yj925okcoyDheHMi7gv2pdwda3JcWeAL8bYrjMMd8@public.gmane.org>
  0 siblings, 1 reply; 52+ messages in thread
From: Eric W. Biederman @ 2007-12-12 13:49 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Linux Containers, Oleg Nesterov, Pavel Emelyanov


Implement the basic helper function that walks all of the processes in
a pid namespace and sends them all a signal. 

Both locations that could use this functions are also updated to use
this function.

I use find_ge_pid instead of for_each_process because it has a chance
of not touching every process in the system.

Signed-off-by: Eric W. Biederman <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
---
 include/linux/sched.h  |    2 +
 kernel/pid_namespace.c |   17 +--------------
 kernel/signal.c        |   54 +++++++++++++++++++++++++++++++++++------------
 3 files changed, 43 insertions(+), 30 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index cda9ee9..eef3c22 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1665,6 +1665,8 @@ extern void release_task(struct task_struct * p);
 extern int send_sig_info(int, struct siginfo *, struct task_struct *);
 extern int force_sigsegv(int, struct task_struct *);
 extern int force_sig_info(int, struct siginfo *, struct task_struct *);
+extern int __kill_pid_ns_info(int sig, struct siginfo *info, struct pid_namespace *ns);
+extern int kill_pid_ns_info(int sig, struct siginfo *info, struct pid_namespace *ns);
 extern int __kill_pgrp_info(int sig, struct siginfo *info, struct pid *pgrp);
 extern int kill_pgrp_info(int sig, struct siginfo *info, struct pid *pgrp);
 extern int kill_pid_info(int sig, struct siginfo *info, struct pid *pid);
diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c
index 739a72b..67ba069 100644
--- a/kernel/pid_namespace.c
+++ b/kernel/pid_namespace.c
@@ -178,29 +178,14 @@ void free_pid_ns(struct kref *kref)
 
 void zap_pid_ns_processes(struct pid_namespace *pid_ns)
 {
-	int nr;
 	int rc;
 
 	/*
 	 * The last thread in the cgroup-init thread group is terminating.
 	 * Find remaining pid_ts in the namespace, signal and wait for them
 	 * to exit.
-	 *
-	 * Note:  This signals each threads in the namespace - even those that
-	 * 	  belong to the same thread group, To avoid this, we would have
-	 * 	  to walk the entire tasklist looking a processes in this
-	 * 	  namespace, but that could be unnecessarily expensive if the
-	 * 	  pid namespace has just a few processes. Or we need to
-	 * 	  maintain a tasklist for each pid namespace.
-	 *
 	 */
-	read_lock(&tasklist_lock);
-	nr = next_pidmap(pid_ns, 1);
-	while (nr > 0) {
-		kill_proc_info(SIGKILL, SEND_SIG_PRIV, nr);
-		nr = next_pidmap(pid_ns, nr);
-	}
-	read_unlock(&tasklist_lock);
+	kill_pid_ns_info(SIGKILL, SEND_SIG_PRIV, pid_ns);
 
 	do {
 		clear_thread_flag(TIF_SIGPENDING);
diff --git a/kernel/signal.c b/kernel/signal.c
index 074905f..1eb0661 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -1083,6 +1083,45 @@ int group_send_sig_info(int sig, struct siginfo *info, struct task_struct *p)
 	return ret;
 }
 
+int __kill_pid_ns_info(int sig, struct siginfo *info, struct pid_namespace *ns)
+{
+	int retval = 0, count = 0;
+	struct task_struct *p;
+	struct pid *pid;
+	int nr;
+
+	/* Since there isn't a pid namespace list of tasks use the closet
+	 * approximation we have: find_ge_pid.
+	 */
+	nr = 0;
+	while ((pid = find_ge_pid(nr + 1, ns))) {
+		int err;
+
+		nr = pid_nr_ns(pid, ns);
+		p = pid_task(pid, PIDTYPE_PID);
+		if (!p || (nr <= 1) || !thread_group_leader(p) ||
+		    same_thread_group(p, current))
+			continue;
+
+		err = group_send_sig_info(sig, info, p);
+		++count;
+		if (err != -EPERM)
+			retval = err;
+	}
+	return count ? retval : -ESRCH;
+}
+
+int kill_pid_ns_info(int sig, struct siginfo *info, struct pid_namespace *ns)
+{
+	int retval;
+
+	read_lock(&tasklist_lock);
+	retval = __kill_pid_ns_info(sig, info, ns);
+	read_unlock(&tasklist_lock);
+
+	return retval;
+}
+
 /*
  * kill_pgrp_info() sends a signal to a process group: this is what the tty
  * control characters do (^C, ^Z etc)
@@ -1204,20 +1243,7 @@ static int kill_something_info(int sig, struct siginfo *info, int pid)
 	if (!pid) {
 		ret = kill_pgrp_info(sig, info, task_pgrp(current));
 	} else if (pid == -1) {
-		int retval = 0, count = 0;
-		struct task_struct * p;
-
-		read_lock(&tasklist_lock);
-		for_each_process(p) {
-			if (p->pid > 1 && !same_thread_group(p, current)) {
-				int err = group_send_sig_info(sig, info, p);
-				++count;
-				if (err != -EPERM)
-					retval = err;
-			}
-		}
-		read_unlock(&tasklist_lock);
-		ret = count ? retval : -ESRCH;
+		ret = kill_pid_ns_info(sig, info, task_active_pid_ns(current));
 	} else if (pid < 0) {
 		ret = kill_pgrp_info(sig, info, find_vpid(-pid));
 	} else {
-- 
1.5.3.rc6.17.g1911

^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [PATCH 2/4] pid: Make next_pidmap static again
       [not found]                                                                 ` <m1ve74uio4.fsf_-_-T1Yj925okcoyDheHMi7gv2pdwda3JcWeAL8bYrjMMd8@public.gmane.org>
@ 2007-12-12 13:50                                                                   ` Eric W. Biederman
       [not found]                                                                     ` <m1r6hsuime.fsf_-_-T1Yj925okcoyDheHMi7gv2pdwda3JcWeAL8bYrjMMd8@public.gmane.org>
  2007-12-12 16:09                                                                   ` [PATCH 1/4] signal: Introduce kill_pid_ns_info Pavel Emelyanov
  1 sibling, 1 reply; 52+ messages in thread
From: Eric W. Biederman @ 2007-12-12 13:50 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Linux Containers, Oleg Nesterov, Pavel Emelyanov


Now that we are using kill_pid_ns_info in zap_pid_ns_processes
the only user of next_pidmap is pid.c so we can make
next_pidmap static.

Signed-off-by: Eric W. Biederman <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
---
 include/linux/pid.h |    1 -
 kernel/pid.c        |    2 +-
 2 files changed, 1 insertions(+), 2 deletions(-)

diff --git a/include/linux/pid.h b/include/linux/pid.h
index e409cc5..b18c15c 100644
--- a/include/linux/pid.h
+++ b/include/linux/pid.h
@@ -118,7 +118,6 @@ extern struct pid *find_pid(int nr);
  */
 extern struct pid *find_get_pid(int nr);
 extern struct pid *find_ge_pid(int nr, struct pid_namespace *);
-int next_pidmap(struct pid_namespace *pid_ns, int last);
 
 extern struct pid *alloc_pid(struct pid_namespace *ns);
 extern void FASTCALL(free_pid(struct pid *pid));
diff --git a/kernel/pid.c b/kernel/pid.c
index 294fc28..5a1f0d0 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -180,7 +180,7 @@ static int alloc_pidmap(struct pid_namespace *pid_ns)
 	return -1;
 }
 
-int next_pidmap(struct pid_namespace *pid_ns, int last)
+static int next_pidmap(struct pid_namespace *pid_ns, int last)
 {
 	int offset;
 	struct pidmap *map, *end;
-- 
1.5.3.rc6.17.g1911

^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [PATCH 3/4] Fix the indentation in cap_set_all to use tabs.
       [not found]                                                                     ` <m1r6hsuime.fsf_-_-T1Yj925okcoyDheHMi7gv2pdwda3JcWeAL8bYrjMMd8@public.gmane.org>
@ 2007-12-12 13:52                                                                       ` Eric W. Biederman
       [not found]                                                                         ` <m1mysguijx.fsf_-_-T1Yj925okcoyDheHMi7gv2pdwda3JcWeAL8bYrjMMd8@public.gmane.org>
  0 siblings, 1 reply; 52+ messages in thread
From: Eric W. Biederman @ 2007-12-12 13:52 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Linux Containers, Oleg Nesterov, Pavel Emelyanov


So my next patch is readable...

Signed-off-by: Eric W. Biederman <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
---
 kernel/capability.c |   34 +++++++++++++++++-----------------
 1 files changed, 17 insertions(+), 17 deletions(-)

diff --git a/kernel/capability.c b/kernel/capability.c
index 39e8193..652a2c5 100644
--- a/kernel/capability.c
+++ b/kernel/capability.c
@@ -198,24 +198,24 @@ static inline int cap_set_all(kernel_cap_t *effective,
 			       kernel_cap_t *inheritable,
 			       kernel_cap_t *permitted)
 {
-     struct task_struct *g, *target;
-     int ret = -EPERM;
-     int found = 0;
-
-     do_each_thread(g, target) {
-             if (target == current || is_container_init(target->group_leader))
-                     continue;
-             found = 1;
-	     if (security_capset_check(target, effective, inheritable,
+	struct task_struct *g, *target;
+	int ret = -EPERM;
+	int found = 0;
+
+	do_each_thread(g, target) {
+		if (target == current || is_container_init(target->group_leader))
+			continue;
+		found = 1;
+		if (security_capset_check(target, effective, inheritable,
 						permitted))
-		     continue;
-	     ret = 0;
-	     security_capset_set(target, effective, inheritable, permitted);
-     } while_each_thread(g, target);
-
-     if (!found)
-	     ret = 0;
-     return ret;
+			continue;
+		ret = 0;
+		security_capset_set(target, effective, inheritable, permitted);
+	} while_each_thread(g, target);
+
+	if (!found)
+		ret = 0;
+	return ret;
 }
 
 /**
-- 
1.5.3.rc6.17.g1911

^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [PATCH 4/4] pid: Limit cap_set_all to the current pid namespace
       [not found]                                                                         ` <m1mysguijx.fsf_-_-T1Yj925okcoyDheHMi7gv2pdwda3JcWeAL8bYrjMMd8@public.gmane.org>
@ 2007-12-12 13:56                                                                           ` Eric W. Biederman
  0 siblings, 0 replies; 52+ messages in thread
From: Eric W. Biederman @ 2007-12-12 13:56 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Linux Containers, Oleg Nesterov, Pavel Emelyanov


Use find_ge_pid in cap_set_all we only consider tasks in the current
pid namespace. 

This doesn't walk the task_list so on a good day in a
pid namespace it should be more scalable, more importantly
this is the same idiom used in proc and kill_pid_ns_info so it
should be reasonably maintainable.

Signed-off-by: Eric W. Biederman <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
---
 kernel/capability.c |   15 +++++++++++----
 1 files changed, 11 insertions(+), 4 deletions(-)

diff --git a/kernel/capability.c b/kernel/capability.c
index 652a2c5..a11bb24 100644
--- a/kernel/capability.c
+++ b/kernel/capability.c
@@ -198,12 +198,19 @@ static inline int cap_set_all(kernel_cap_t *effective,
 			       kernel_cap_t *inheritable,
 			       kernel_cap_t *permitted)
 {
-	struct task_struct *g, *target;
+	struct pid_namespace *ns = task_active_pid_ns(current);
+	struct task_struct *target;
 	int ret = -EPERM;
 	int found = 0;
+	struct pid *pid;
+	int nr;
 
-	do_each_thread(g, target) {
-		if (target == current || is_container_init(target->group_leader))
+	nr = 0;
+	while ((pid = find_ge_pid(nr + 1, ns))) {
+		nr = pid_nr_ns(pid, ns);
+		target = pid_task(pid, PIDTYPE_PID);
+
+		if (target == current || task_tgid(target) == ns->child_reaper)
 			continue;
 		found = 1;
 		if (security_capset_check(target, effective, inheritable,
@@ -211,7 +218,7 @@ static inline int cap_set_all(kernel_cap_t *effective,
 			continue;
 		ret = 0;
 		security_capset_set(target, effective, inheritable, permitted);
-	} while_each_thread(g, target);
+	}
 
 	if (!found)
 		ret = 0;
-- 
1.5.3.rc6.17.g1911

^ permalink raw reply related	[flat|nested] 52+ messages in thread

* Re: [PATCH 2/4] proc: Simplify proc_get_sb.
       [not found]                                                     ` <475FE53D.6050408-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org>
@ 2007-12-12 14:06                                                       ` Eric W. Biederman
  0 siblings, 0 replies; 52+ messages in thread
From: Eric W. Biederman @ 2007-12-12 14:06 UTC (permalink / raw)
  To: Pavel Emelyanov; +Cc: Linux Containers, Andrew Morton, Oleg Nesterov

Pavel Emelyanov <xemul-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org> writes:

> Eric W. Biederman wrote:
>> The idle_thread now has a struct pid, so we can always find out know
>> the pid of the child_reaper before we mount proc.
>> 
>> Therefore we can remove the special cases for getting the pid of the
>> child_reaper from proc_get_sb.
>> 
>> Signed-off-by: Eric W. Biederman <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
>> ---
>>  fs/proc/root.c |   17 +----------------
>>  1 files changed, 1 insertions(+), 16 deletions(-)
>> 
>> diff --git a/fs/proc/root.c b/fs/proc/root.c
>> index 81f99e6..f442967 100644
>> --- a/fs/proc/root.c
>> +++ b/fs/proc/root.c
>> @@ -46,17 +46,6 @@ static int proc_get_sb(struct file_system_type *fs_type,
>>  	struct pid_namespace *ns;
>>  	struct proc_inode *ei;
>>  
>> -	if (proc_mnt) {
>> -		/* Seed the root directory with a pid so it doesn't need
>> -		 * to be special in base.c.  I would do this earlier but
>> -		 * the only task alive when /proc is mounted the first time
>> -		 * is the init_task and it doesn't have any pids.
>> -		 */
>> -		ei = PROC_I(proc_mnt->mnt_sb->s_root->d_inode);
>> -		if (!ei->pid)
>> -			ei->pid = find_get_pid(1);
>> -	}
>> -
>>  	if (flags & MS_KERNMOUNT)
>>  		ns = (struct pid_namespace *)data;
>>  	else
>> @@ -76,11 +65,7 @@ static int proc_get_sb(struct file_system_type *fs_type,
>>  		}
>>  
>>  		ei = PROC_I(sb->s_root->d_inode);
>> -		if (!ei->pid) {
>> -			rcu_read_lock();
>> -			ei->pid = get_pid(find_pid_ns(1, ns));
>> -			rcu_read_unlock();
>> -		}
>> +		ei->pid = get_pid(ns->child_reaper);
>
> That's not git-bisect safe - you move the child_reaper initialization
> before the call to prepare_proc only in the 4th patch.

Bugger.  I had overlooked that dependency.  So patch 4 really should have been
patch 2.

Eric

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 5/9] pid: Update pid_vnr to use task_active_pid_ns
       [not found]                         ` <475FE201.7060104-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org>
@ 2007-12-12 14:20                           ` Eric W. Biederman
  0 siblings, 0 replies; 52+ messages in thread
From: Eric W. Biederman @ 2007-12-12 14:20 UTC (permalink / raw)
  To: Pavel Emelyanov; +Cc: Linux Containers, Andrew Morton, Oleg Nesterov

> NAK. Your version will make more dereferences and one
> condition check (if (pid != NULL)). Besides, dead process
> can never call this thing - it is even never scheduled 
> for execution.

Andrew please disregard this patch, it should not affect
the rest of the patch series.

Well EXIT_ZOMBIE is enough to cause us problems, not EXIT_DEAD,
so we can be scheduled.

Still we don't have any users currently and I actually don't expect us
to develop any, so this patch was just overkill right now. 

I was thinking that there was a path to send_signal when
we were zombie with SEND_SIG_NOINFO, but that doesn't seem
to exist, and if there isn't one now we aren't likely
to grow one, and my signal patches remove the possibility
of that path being a problem.

Hmm.  As to the cost argument.
Both pid->level and pid->numbers[pid->level].ns should be
on the same cache line so it should not be any real
extra memory reads.  The branch should be predicted not
to happen so again it should not have a cost except a small
bit of code size.

So while I don't expect any real world cost this is the
common case so we don't want to put to many hoops in our
way either.

Eric

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 1/4] signal: Introduce kill_pid_ns_info
       [not found]                                                                 ` <m1ve74uio4.fsf_-_-T1Yj925okcoyDheHMi7gv2pdwda3JcWeAL8bYrjMMd8@public.gmane.org>
  2007-12-12 13:50                                                                   ` [PATCH 2/4] pid: Make next_pidmap static again Eric W. Biederman
@ 2007-12-12 16:09                                                                   ` Pavel Emelyanov
  1 sibling, 0 replies; 52+ messages in thread
From: Pavel Emelyanov @ 2007-12-12 16:09 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: Linux Containers, Andrew Morton, Oleg Nesterov

Eric W. Biederman wrote:
> Implement the basic helper function that walks all of the processes in
> a pid namespace and sends them all a signal. 
> 
> Both locations that could use this functions are also updated to use
> this function.
> 
> I use find_ge_pid instead of for_each_process because it has a chance
> of not touching every process in the system.
> 
> Signed-off-by: Eric W. Biederman <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>

Looks sane.

Acked-by: Pavel Emelyanov <xemul-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org>

but one comment inside.

[snip]

> diff --git a/kernel/signal.c b/kernel/signal.c
> index 074905f..1eb0661 100644
> --- a/kernel/signal.c
> +++ b/kernel/signal.c
> @@ -1083,6 +1083,45 @@ int group_send_sig_info(int sig, struct siginfo *info, struct task_struct *p)
>  	return ret;
>  }
>  
> +int __kill_pid_ns_info(int sig, struct siginfo *info, struct pid_namespace *ns)

Shouldn't it be static?

> +{
> +	int retval = 0, count = 0;
> +	struct task_struct *p;
> +	struct pid *pid;
> +	int nr;
> +
> +	/* Since there isn't a pid namespace list of tasks use the closet
> +	 * approximation we have: find_ge_pid.
> +	 */
> +	nr = 0;
> +	while ((pid = find_ge_pid(nr + 1, ns))) {
> +		int err;
> +
> +		nr = pid_nr_ns(pid, ns);
> +		p = pid_task(pid, PIDTYPE_PID);
> +		if (!p || (nr <= 1) || !thread_group_leader(p) ||
> +		    same_thread_group(p, current))
> +			continue;
> +
> +		err = group_send_sig_info(sig, info, p);
> +		++count;
> +		if (err != -EPERM)
> +			retval = err;
> +	}
> +	return count ? retval : -ESRCH;
> +}
> +

[snip]

Thanks,
Pavel

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 8/9] signal: Drop signals before sending them to init.
       [not found]                                 ` <m13au8xe8m.fsf_-_-T1Yj925okcoyDheHMi7gv2pdwda3JcWeAL8bYrjMMd8@public.gmane.org>
  2007-12-12 12:58                                   ` [PATCH 9/9] signal: Ignore signals sent to the pid namespace init Eric W. Biederman
@ 2007-12-12 19:00                                   ` Serge E. Hallyn
       [not found]                                     ` <20071212190042.GA22469-6s5zFf/epYLPQpwDFJZrxKsjOiXwFzmk@public.gmane.org>
  2007-12-13 16:25                                   ` Oleg Nesterov
  2 siblings, 1 reply; 52+ messages in thread
From: Serge E. Hallyn @ 2007-12-12 19:00 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Linux Containers, Andrew Morton, Oleg Nesterov, Pavel Emelyanov

Quoting Eric W. Biederman (ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org):
> 
> Currently init drops all signals including SIGKILL that are sent to it
> that it does not ignore and does have a handler for.  Extending this to

Description is crucial in this patchset, so should the above not be:

	"does not have a handler for"

?

Otherwise, this whole patchset looks good to me (deferring to Pavel for
his NAK on patch 5 of course).

Acked-by: Serge Hallyn <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>

thanks,
-serge

> pid namespaces where we want to maintain this semantic for signals sent
> from inside the pid namespace (descendents of init) but we want the
> namespace init to appear as a normal process is problematic, as a
> naive approach requires always tracking the sender of a signal.
> 
> By making the rule (for init dropping signals):
> When sending a signal to init, the presence of a signal handler that
> is not SIG_DFL allows the signal to be sent to init.  If the signal
> is not sent it is silently dropped without becoming pending.
> 
> The only noticeable user space difference from todays init is that it
> no longer needs to worry about signals becoming pending when it has
> them marked as SIG_DFL and blocked.
> 
> This change by making the presence of a signal handler effectively
> a permission check allows us to do all of the work before we enqueue
> the signal, and there is no need for any fancy tracking of the
> signal sender.
> 
> Which means that we can now allow force_sig_info to send signals to
> init, that panic the kernel instead of going into an infinite busy
> loop taking an exception sending a signal and then retaking the same
> exception, eating all of the cpu time but accomplishing nothing.
> 
> This change also makes it possible to easily implement the desired
> semantics of /sbin/init for pid namespaces where outer processes can
> kill init but processes inside the pid namespace can not.
> 
> While it is now easy to remove the dropping of signals from individual
> code paths such as force_sig_info this patch does not implement that,
> to retain as much of the current behavior as possible.  The only
> behavioral difference besides not queuing blocked SIG_DFL signals
> are signals directly added with sigaddset.  In practice a threaded init
> now receives SIGKILL sent from a core dump, a thread group exit, or an
> exec shutting down extraneous threads.
> 
> This patch was inspired by a patch from Oleg and initial refined
> in conversation with Suka and others on the containers list.
> 
> Signed-off-by: Eric W. Biederman <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
> ---
>  kernel/signal.c |   47 ++++++++++++++++++++++++++++++++++++++---------
>  1 files changed, 38 insertions(+), 9 deletions(-)
> 
> diff --git a/kernel/signal.c b/kernel/signal.c
> index c01e3cd..029a45d 100644
> --- a/kernel/signal.c
> +++ b/kernel/signal.c
> @@ -64,6 +64,25 @@ static int sig_ignored(struct task_struct *t, int sig)
>  		(handler == SIG_DFL && sig_kernel_ignore(sig));
>  }
> 
> +static int is_sig_init(struct task_struct *tsk)
> +{
> +	if (likely(!is_global_init(tsk->group_leader)))
> +		return 0;
> +
> +	return 1;
> +}
> +
> +static int sig_init_drop(struct task_struct *tsk, int sig)
> +{
> +	/* All signals for which init has a SIG_DFL handler are
> +	 * silently dropped without being sent.
> +	 */
> +	if (!is_sig_init(tsk))
> +		return 0;
> +
> +	return (tsk->sighand->action[sig-1].sa.sa_handler == SIG_DFL);
> +}
> +
>  /*
>   * Re-calculate pending state from the set of locally pending
>   * signals, globally pending signals, and blocked signals.
> @@ -834,6 +853,9 @@ force_sig_info(int sig, struct siginfo *info, struct task_struct *t)
>  	struct k_sigaction *action;
> 
>  	spin_lock_irqsave(&t->sighand->siglock, flags);
> +	ret = 0;
> +	if (sig_init_drop(t, sig))
> +		goto out;
>  	action = &t->sighand->action[sig-1];
>  	ignored = action->sa.sa_handler == SIG_IGN;
>  	blocked = sigismember(&t->blocked, sig);
> @@ -845,6 +867,7 @@ force_sig_info(int sig, struct siginfo *info, struct task_struct *t)
>  		}
>  	}
>  	ret = specific_send_sig_info(sig, info, t);
> +out:
>  	spin_unlock_irqrestore(&t->sighand->siglock, flags);
> 
>  	return ret;
> @@ -962,6 +985,9 @@ __group_send_sig_info(int sig, struct siginfo *info, struct task_struct *p,
>  	int ret = 0;
> 
>  	assert_spin_locked(&p->sighand->siglock);
> +	if (sig_init_drop(p, sig))
> +		return ret;
> +
>  	handle_stop_signal(sig, p);
> 
>  	/* Short-circuit ignored signals.  */
> @@ -1224,7 +1250,9 @@ send_sig_info(int sig, struct siginfo *info, struct task_struct *p)
>  	 */
>  	read_lock(&tasklist_lock);  
>  	spin_lock_irqsave(&p->sighand->siglock, flags);
> -	ret = specific_send_sig_info(sig, info, p);
> +	ret = 0;
> +	if (!sig_init_drop(p, sig))
> +		ret = specific_send_sig_info(sig, info, p);
>  	spin_unlock_irqrestore(&p->sighand->siglock, flags);
>  	read_unlock(&tasklist_lock);
>  	return ret;
> @@ -1392,6 +1420,11 @@ send_group_sigqueue(int sig, struct sigqueue *q, struct task_struct *p)
>  	read_lock(&tasklist_lock);
>  	/* Since it_lock is held, p->sighand cannot be NULL. */
>  	spin_lock_irqsave(&p->sighand->siglock, flags);
> +	if (sig_init_drop(p, sig)) {
> +		ret = 1;
> +		goto out;
> +	}
> +
>  	handle_stop_signal(sig, p);
> 
>  	/* Short-circuit ignored signals.  */
> @@ -1844,12 +1877,6 @@ relock:
>  		if (sig_kernel_ignore(signr)) /* Default is nothing. */
>  			continue;
> 
> -		/*
> -		 * Global init gets no signals it doesn't want.
> -		 */
> -		if (is_global_init(current))
> -			continue;
> -
>  		if (sig_kernel_stop(signr)) {
>  			/*
>  			 * The default action is to stop all threads in
> @@ -2272,8 +2299,10 @@ static int do_tkill(int tgid, int pid, int sig)
>  		 */
>  		if (!error && sig && p->sighand) {
>  			spin_lock_irq(&p->sighand->siglock);
> -			handle_stop_signal(sig, p);
> -			error = specific_send_sig_info(sig, &info, p);
> +			if (!sig_init_drop(p, sig)) {
> +				handle_stop_signal(sig, p);
> +				error = specific_send_sig_info(sig, &info, p);
> +			}
>  			spin_unlock_irq(&p->sighand->siglock);
>  		}
>  	}
> -- 
> 1.5.3.rc6.17.g1911

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 8/9] signal: Drop signals before sending them to init.
       [not found]                                     ` <20071212190042.GA22469-6s5zFf/epYLPQpwDFJZrxKsjOiXwFzmk@public.gmane.org>
@ 2007-12-12 19:33                                       ` Eric W. Biederman
  0 siblings, 0 replies; 52+ messages in thread
From: Eric W. Biederman @ 2007-12-12 19:33 UTC (permalink / raw)
  To: Serge E. Hallyn
  Cc: Linux Containers, Andrew Morton, Oleg Nesterov, Pavel Emelyanov

"Serge E. Hallyn" <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> writes:

> Quoting Eric W. Biederman (ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org):
>> 
>> Currently init drops all signals including SIGKILL that are sent to it
>> that it does not ignore and does have a handler for.  Extending this to
>
> Description is crucial in this patchset, so should the above not be:
>
> 	"does not have a handler for"

Yes.  It should say "does not have a handler for."  Instead of
"does have a handler for."  Oops.  I thought that was what I wrote!

>
> ?

Eric

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 3/9] pid: Implement ns_of_pid.
       [not found]             ` <m1prxcxeum.fsf_-_-T1Yj925okcoyDheHMi7gv2pdwda3JcWeAL8bYrjMMd8@public.gmane.org>
  2007-12-12 12:46               ` [PATCH 4/9] pid: Generalize task_active_pid_ns Eric W. Biederman
@ 2007-12-13  0:59               ` sukadev-r/Jw6+rmf7HQT0dZR+AlfA
       [not found]                 ` <20071213005945.GB27896-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
  1 sibling, 1 reply; 52+ messages in thread
From: sukadev-r/Jw6+rmf7HQT0dZR+AlfA @ 2007-12-13  0:59 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Linux Containers, Andrew Morton, Oleg Nesterov, Pavel Emelyanov

Eric W. Biederman [ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org] wrote:
| 
| A current problem with the pid namespace is that it is
| easy to do pid related work after exit_task_namespaces which
| drops the nsproxy pointer.
| 
| However if we are doing pid namespace related work we are
| always operating on some struct pid which retains the pid_namespace
| pointer of the pid namespace it was allocated in.
| 
| So provide ns_of_pid which allows us to find the pid
| namespace a pid was allocated in.
| 
| Using this we have the needed infrastructure to do pid
| namespace related work at anytime we have a struct pid,
| removing the chance of accidentally having a NULL
| pointer dereference when accessing current->nsproxy.

Yep.

| 
| Signed-off-by: Eric W. Biederman <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
| ---
|  include/linux/pid.h |   11 +++++++++++
|  1 files changed, 11 insertions(+), 0 deletions(-)
| 
| diff --git a/include/linux/pid.h b/include/linux/pid.h
| index f84d532..c4b56c0 100644
| --- a/include/linux/pid.h
| +++ b/include/linux/pid.h
| @@ -123,6 +123,17 @@ int next_pidmap(struct pid_namespace *pid_ns, int last);
|  extern struct pid *alloc_pid(struct pid_namespace *ns);
|  extern void FASTCALL(free_pid(struct pid *pid));
| 
| +/* ns_of_pid returns the pid namespace in which the specified
| + * pid was allocated.
| + */
| +static inline struct pid_namespace *ns_of_pid(struct pid *pid)

My patch refers to this function as pid_active_pid_ns() - I have
been meaning to send that out on top of your signals patch.
Since a pid has many namespaces, we have been using 'active pid ns'
to refer to this ns.

Even your next patch modifies task_active_pid_ns() to use this.
So can we rename this functio to pid_active_pid_ns() ?

| +{
| +	struct pid_namespace *ns = NULL;
| +	if (pid)
| +		ns = pid->numbers[pid->level].ns;
| +	return ns;
| +}
| +
|  /*
|   * the helpers to get the pid's id seen from different namespaces
|   *
| -- 
| 1.5.3.rc6.17.g1911

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 3/9] pid: Implement ns_of_pid.
       [not found]                 ` <20071213005945.GB27896-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
@ 2007-12-13  1:25                   ` Eric W. Biederman
       [not found]                     ` <m1ve73s7vr.fsf-T1Yj925okcoyDheHMi7gv2pdwda3JcWeAL8bYrjMMd8@public.gmane.org>
  0 siblings, 1 reply; 52+ messages in thread
From: Eric W. Biederman @ 2007-12-13  1:25 UTC (permalink / raw)
  To: sukadev-r/Jw6+rmf7HQT0dZR+AlfA
  Cc: Linux Containers, Andrew Morton, Oleg Nesterov, Pavel Emelyanov

sukadev-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org writes:

>
> My patch refers to this function as pid_active_pid_ns() - I have
> been meaning to send that out on top of your signals patch.
> Since a pid has many namespaces, we have been using 'active pid ns'
> to refer to this ns.

Currently we don't ask for any of the others, and the namespace
the pid came from is special.  That fundamentally is the namespace
of the pid.  The rest byproducts of being in that pid namespace,
as we could derive them by walking the namespace's parent list.

> Even your next patch modifies task_active_pid_ns() to use this.
> So can we rename this functio to pid_active_pid_ns() ?

I'd be more inclined to rename task_active_pid_ns to task_pid_ns.

And to rename pid_in_pid_ns that Pavel has issues with to pid_in_ns.

When I read active_pid_ns I wonder what the other namespaces are
that we are distinguishing this from.  They do exist in the
implementation but so far it is a complete don't care.

So I expect being as terse as we can while still conveying all of the
relevant information is the most maintainable long term.

Eric

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 3/9] pid: Implement ns_of_pid.
       [not found]                     ` <m1ve73s7vr.fsf-T1Yj925okcoyDheHMi7gv2pdwda3JcWeAL8bYrjMMd8@public.gmane.org>
@ 2007-12-13  3:28                       ` sukadev-r/Jw6+rmf7HQT0dZR+AlfA
       [not found]                         ` <20071213032827.GA1433-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
  0 siblings, 1 reply; 52+ messages in thread
From: sukadev-r/Jw6+rmf7HQT0dZR+AlfA @ 2007-12-13  3:28 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Linux Containers, Andrew Morton, Oleg Nesterov, Pavel Emelyanov

Eric W. Biederman [ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org] wrote:
| sukadev-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org writes:
| 
| >
| > My patch refers to this function as pid_active_pid_ns() - I have
| > been meaning to send that out on top of your signals patch.
| > Since a pid has many namespaces, we have been using 'active pid ns'
| > to refer to this ns.
| 
| Currently we don't ask for any of the others, and the namespace
| the pid came from is special.  That fundamentally is the namespace
| of the pid.  The rest byproducts of being in that pid namespace,
| as we could derive them by walking the namespace's parent list.
| 
| > Even your next patch modifies task_active_pid_ns() to use this.
| > So can we rename this functio to pid_active_pid_ns() ?
| 
| I'd be more inclined to rename task_active_pid_ns to task_pid_ns.
| 
| And to rename pid_in_pid_ns that Pavel has issues with to pid_in_ns.
| 
| When I read active_pid_ns I wonder what the other namespaces are
| that we are distinguishing this from.  They do exist in the
| implementation but so far it is a complete don't care.

Well, there are interfaces like pid_nr_ns() and pid_in_ns() and
task_in_pid_ns() that imply existence of other namespaces and
for that reason we added 'active' in the name. 

But I am fine with the terse name and of course we should remove
the the 'active' in task_active_pid_ns() also.

| 
| So I expect being as terse as we can while still conveying all of the
| relevant information is the most maintainable long term.
| 
| Eric

I did some initial testing on your patchset (minus patch 5) and noticed
that it seems to be missing the patch to address kill -1 semantics
(here is the earlier version). 

---

This patch implements task_in_pid_ns and uses it to limit cap_set_all
and sys_kill(-1,) to only those tasks in the current pid namespace.

Without this we have a setup for a very nasty surprise.

Signed-off-by: Eric W. Biederman <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
---
 include/linux/pid_namespace.h |    2 ++
 kernel/capability.c           |    3 +++
 kernel/pid.c                  |   11 +++++++++++
 kernel/signal.c               |    5 ++++-
 4 files changed, 20 insertions(+), 1 deletions(-)

diff --git a/include/linux/pid_namespace.h b/include/linux/pid_namespace.h
index 0227e68..b454678 100644
--- a/include/linux/pid_namespace.h
+++ b/include/linux/pid_namespace.h
@@ -78,4 +78,6 @@ static inline struct task_struct *task_child_reaper(struct task_struct *tsk)
 	return tsk->nsproxy->pid_ns->child_reaper;
 }

+extern int task_in_pid_ns(struct task_struct *tsk, struct pid_namespace *ns);
+
 #endif /* _LINUX_PID_NS_H */
diff --git a/kernel/capability.c b/kernel/capability.c
index efbd9cd..a801016 100644
--- a/kernel/capability.c
+++ b/kernel/capability.c
@@ -125,6 +125,7 @@ static inline int cap_set_all(kernel_cap_t *effective,
 			       kernel_cap_t *inheritable,
 			       kernel_cap_t *permitted)
 {
+     struct pid_namespace *pid_ns = current->nsproxy->pid_ns;
      struct task_struct *g, *target;
      int ret = -EPERM;
      int found = 0;
@@ -132,6 +133,8 @@ static inline int cap_set_all(kernel_cap_t *effective,
      do_each_thread(g, target) {
              if (target == current || is_container_init(target->group_leader))
                      continue;
+             if (!task_in_pid_ns(target, pid_ns))
+		     continue;
              found = 1;
 	     if (security_capset_check(target, effective, inheritable,
 						permitted))
diff --git a/kernel/pid.c b/kernel/pid.c
index f815455..1c332ca 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -430,6 +430,17 @@ struct pid *find_get_pid(pid_t nr)
 	return pid;
 }

+static int pid_in_pid_ns(struct pid *pid, struct pid_namespace *ns)
+{
+	return pid && (ns->level <= pid->level) &&
+		pid->numbers[ns->level].ns == ns;
+}
+
+int task_in_pid_ns(struct task_struct *task, struct pid_namespace *ns)
+{
+	return pid_in_pid_ns(task_pid(task), ns);
+}
+
 pid_t pid_nr_ns(struct pid *pid, struct pid_namespace *ns)
 {
 	struct upid *upid;
diff --git a/kernel/signal.c b/kernel/signal.c
index 1200630..8f5a31f 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -1147,10 +1147,13 @@ static int kill_something_info(int sig, struct siginfo *info, int pid)
 	} else if (pid == -1) {
 		int retval = 0, count = 0;
 		struct task_struct * p;
+		struct pid_namespace *ns = current->nsproxy->pid_ns;

 		read_lock(&tasklist_lock);
 		for_each_process(p) {
-			if (p->pid > 1 && !same_thread_group(p, current)) {
+			if (!is_container_init(p) &&
+			    !same_thread_group(p, current) &&
+			    task_in_pid_ns(p, ns)) {
 				int err = group_send_sig_info(sig, info, p);
 				++count;
 				if (err != -EPERM)
-- 
1.5.3.rc6.17.g1911

^ permalink raw reply related	[flat|nested] 52+ messages in thread

* Re: [PATCH 4/9] pid: Generalize task_active_pid_ns
       [not found]                 ` <m1lk80xeps.fsf_-_-T1Yj925okcoyDheHMi7gv2pdwda3JcWeAL8bYrjMMd8@public.gmane.org>
  2007-12-12 12:48                   ` [PATCH 5/9] pid: Update pid_vnr to use task_active_pid_ns Eric W. Biederman
@ 2007-12-13 16:01                   ` Oleg Nesterov
       [not found]                     ` <20071213160128.GA219-6lXkIZvqkOAvJsYlp49lxw@public.gmane.org>
  1 sibling, 1 reply; 52+ messages in thread
From: Oleg Nesterov @ 2007-12-13 16:01 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: Linux Containers, Andrew Morton, Pavel Emelyanov

Sorry for the delay, and sorry, can't read this series carefully now.
A couple of question though.

On 12/12, Eric W. Biederman wrote:
>
> Currently task_active_pid_ns is not safe to call after a
> task becomes a zombie and exit_task_namespaces is called,
> as nsproxy becomes NULL.  By reading the pid namespace from
> the pid of the task we can trivially solve this problem

Confused. If the task becomes a zombie, we can't assume it has a valid
->pids[].pid. The parent can release us as soon as exit_notify() drops
tasklist.

?

Oleg.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 4/9] pid: Generalize task_active_pid_ns
       [not found]                     ` <20071213160128.GA219-6lXkIZvqkOAvJsYlp49lxw@public.gmane.org>
@ 2007-12-13 16:22                       ` Eric W. Biederman
       [not found]                         ` <m1mysesgxc.fsf-T1Yj925okcoyDheHMi7gv2pdwda3JcWeAL8bYrjMMd8@public.gmane.org>
  0 siblings, 1 reply; 52+ messages in thread
From: Eric W. Biederman @ 2007-12-13 16:22 UTC (permalink / raw)
  To: Oleg Nesterov; +Cc: Linux Containers, Andrew Morton, Pavel Emelyanov

Oleg Nesterov <oleg-6lXkIZvqkOAvJsYlp49lxw@public.gmane.org> writes:

> Sorry for the delay, and sorry, can't read this series carefully now.
> A couple of question though.
>
> On 12/12, Eric W. Biederman wrote:
>>
>> Currently task_active_pid_ns is not safe to call after a
>> task becomes a zombie and exit_task_namespaces is called,
>> as nsproxy becomes NULL.  By reading the pid namespace from
>> the pid of the task we can trivially solve this problem
>
> Confused. If the task becomes a zombie, we can't assume it has a valid
> ->pids[].pid. The parent can release us as soon as exit_notify() drops
> tasklist.

Where this really matters is in the signal sending code.  By the time
I have acquired sighand lock I know release_task has executed and
thus my ->pids[].pid is valid, because __exit_signal has not completed
and thus __unhash_process has not yet run.

When release_task_gets called we are EXIT_DEAD unhashed and unfindable
and I don't care.  I do however care about finding my pid namespace
as long as the task is on hashed.

So as long as we have tasklist_lock or sighand lock and we can find
the task we are good.

What this allows me to do (as seen later in the patchset) is to send
to call pid_nr_ns and deliver a signal to a task group without caring
if the element of the task group I am talking to is a zombie or not.
Which is rather important when the task group leader has exited
and a zombie, but yet it is the task all of the signals are sent
to and group_send_siginfo is called on.

Eric

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 8/9] signal: Drop signals before sending them to init.
       [not found]                                 ` <m13au8xe8m.fsf_-_-T1Yj925okcoyDheHMi7gv2pdwda3JcWeAL8bYrjMMd8@public.gmane.org>
  2007-12-12 12:58                                   ` [PATCH 9/9] signal: Ignore signals sent to the pid namespace init Eric W. Biederman
  2007-12-12 19:00                                   ` [PATCH 8/9] signal: Drop signals before sending them to init Serge E. Hallyn
@ 2007-12-13 16:25                                   ` Oleg Nesterov
       [not found]                                     ` <20071213162502.GB219-6lXkIZvqkOAvJsYlp49lxw@public.gmane.org>
  2 siblings, 1 reply; 52+ messages in thread
From: Oleg Nesterov @ 2007-12-13 16:25 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: Linux Containers, Andrew Morton, Pavel Emelyanov

On 12/12, Eric W. Biederman wrote:
> 
> By making the rule (for init dropping signals):
> When sending a signal to init, the presence of a signal handler that
> is not SIG_DFL allows the signal to be sent to init.  If the signal
> is not sent it is silently dropped without becoming pending.

But isn't it better to modify sig_ignore() and handle_stop_signal()
instead? This way we seem to need less changes,

	http://marc.info/?l=linux-kernel&m=118753610515859

(the patch above itself is not complete and a bit obsolete)

> The only noticeable user space difference from todays init is that it
> no longer needs to worry about signals becoming pending when it has
> them marked as SIG_DFL and blocked.

Ugh. I have to apologize again. I got a fever, and it turns out I just
can't read English.

So, do you mean we can ignore the problems with the signals which are
currently blocked by /sbin/init?

I personally agree, but I'm not sure I understand this right.

> +static int sig_init_drop(struct task_struct *tsk, int sig)
> +{
> +	/* All signals for which init has a SIG_DFL handler are
> +	 * silently dropped without being sent.
> +	 */
> +	if (!is_sig_init(tsk))
> +		return 0;
> +
> +	return (tsk->sighand->action[sig-1].sa.sa_handler == SIG_DFL);
> +}

What if /sbin/init has a handler, but before this signal is delivered
/sbin/init does signal(SIG_DFL) ? We should modify so_sigaction() to
prevent this. Note again the patch above.

Oleg.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 9/9] signal: Ignore signals sent to the pid namespace init
       [not found]                                     ` <m1y7c0vzm4.fsf_-_-T1Yj925okcoyDheHMi7gv2pdwda3JcWeAL8bYrjMMd8@public.gmane.org>
  2007-12-12 13:09                                       ` [PATCH 0/4] pid namespace infrastructure cleanups Eric W. Biederman
@ 2007-12-13 16:28                                       ` Oleg Nesterov
       [not found]                                         ` <20071213162811.GC219-6lXkIZvqkOAvJsYlp49lxw@public.gmane.org>
  1 sibling, 1 reply; 52+ messages in thread
From: Oleg Nesterov @ 2007-12-13 16:28 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: Linux Containers, Andrew Morton, Pavel Emelyanov

On 12/12, Eric W. Biederman wrote:
> 
> -static int is_sig_init(struct task_struct *tsk)
> +static int is_sig_init(struct task_struct *init, struct pid *sender)
>  {
> -	if (likely(!is_global_init(tsk->group_leader)))
> +	if (!is_container_init(init))
> +		return 0;
> +
> +	if (!sender)
> +		sender = task_tgid(current);

What if this signal is sent from the interrupt and sender == NULL?

> +
> +	if (!pid_in_pid_ns(sender, task_active_pid_ns(init)))
>  		return 0;

In that case the result of the above check can be wrong, no?

Oleg.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 4/9] pid: Generalize task_active_pid_ns
       [not found]                         ` <m1mysesgxc.fsf-T1Yj925okcoyDheHMi7gv2pdwda3JcWeAL8bYrjMMd8@public.gmane.org>
@ 2007-12-13 17:07                           ` Oleg Nesterov
  0 siblings, 0 replies; 52+ messages in thread
From: Oleg Nesterov @ 2007-12-13 17:07 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: Linux Containers, Andrew Morton, Pavel Emelyanov

On 12/13, Eric W. Biederman wrote:
>
> Oleg Nesterov <oleg-6lXkIZvqkOAvJsYlp49lxw@public.gmane.org> writes:
> 
> > On 12/12, Eric W. Biederman wrote:
> >>
> >> Currently task_active_pid_ns is not safe to call after a
> >> task becomes a zombie and exit_task_namespaces is called,
> >> as nsproxy becomes NULL.  By reading the pid namespace from
> >> the pid of the task we can trivially solve this problem
> >
> > Confused. If the task becomes a zombie, we can't assume it has a valid
> > ->pids[].pid. The parent can release us as soon as exit_notify() drops
> > tasklist.
> 
> 
> What this allows me to do (as seen later in the patchset) is to send
> to call pid_nr_ns and deliver a signal to a task group without caring
> if the element of the task group I am talking to is a zombie or not.

Yes I see, thanks Eric. I was confused by changelog, and I missed the
subsequent changes which need task_active_pid_ns() of the reciever.

Oleg.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 8/9] signal: Drop signals before sending them to init.
       [not found]                                     ` <20071213162502.GB219-6lXkIZvqkOAvJsYlp49lxw@public.gmane.org>
@ 2007-12-13 17:50                                       ` Eric W. Biederman
       [not found]                                         ` <m1bq8uscu4.fsf-T1Yj925okcoyDheHMi7gv2pdwda3JcWeAL8bYrjMMd8@public.gmane.org>
  0 siblings, 1 reply; 52+ messages in thread
From: Eric W. Biederman @ 2007-12-13 17:50 UTC (permalink / raw)
  To: Oleg Nesterov; +Cc: Linux Containers, Andrew Morton, Pavel Emelyanov

Oleg Nesterov <oleg-6lXkIZvqkOAvJsYlp49lxw@public.gmane.org> writes:

> On 12/12, Eric W. Biederman wrote:
>> 
>> By making the rule (for init dropping signals):
>> When sending a signal to init, the presence of a signal handler that
>> is not SIG_DFL allows the signal to be sent to init.  If the signal
>> is not sent it is silently dropped without becoming pending.
>
> But isn't it better to modify sig_ignore() and handle_stop_signal()
> instead? This way we seem to need less changes,
>
> 	http://marc.info/?l=linux-kernel&m=118753610515859
>
> (the patch above itself is not complete and a bit obsolete)

No.  That way leads to semantic disaster.

In particular if it is traced or blocked your patch did not
ignore the signal, the signal would be queued up.  We would
not know who the sender is and thus not know the proper
disposition for the signal.

For a clean definition that we can maintain into the future
either we must either drop the signal before it is placed on the
queue or otherwise processed.  Or we must track the sender.

Tracking the sender is expensive.

>> The only noticeable user space difference from todays init is that it
>> no longer needs to worry about signals becoming pending when it has
>> them marked as SIG_DFL and blocked.
>
> Ugh. I have to apologize again. I got a fever, and it turns out I just
> can't read English.
>
> So, do you mean we can ignore the problems with the signals which are
> currently blocked by /sbin/init?

Yes.  Further I am saying those signals will never become pending if
we do not have a signal handler installed.

> I personally agree, but I'm not sure I understand this right.
>
>> +static int sig_init_drop(struct task_struct *tsk, int sig)
>> +{
>> +	/* All signals for which init has a SIG_DFL handler are
>> +	 * silently dropped without being sent.
>> +	 */
>> +	if (!is_sig_init(tsk))
>> +		return 0;
>> +
>> +	return (tsk->sighand->action[sig-1].sa.sa_handler == SIG_DFL);
>> +}
>
> What if /sbin/init has a handler, but before this signal is delivered
> /sbin/init does signal(SIG_DFL) ? We should modify so_sigaction() to
> prevent this. Note again the patch above.

No.  We should treat signals that we process for /sbin/init completely
normally.

If we try and do the right thing with pending signals we have the question
which pid namespaces sent the signal.  If the signal came from a decedent
of /sbin/init we should drop the signal.  If the signal came from outside
the pid namespace of init we should keep the signal.  There is no right
correct if you define what happens that way.

Therefore we make a small change to the definition of where we drop
signals to init.  We always unconditionally drop them in the sender
if init has the signal handler set to SIG_DFL.  Otherwise the signals
are processed normally.

This gives /sbin/init completely normal signal handling if the signal is
ever enqueued.  Something trivial to implement and explain.

Theoretically the new definition can deliver a few more signals to
/sbin/init and see them processed.  In practice I do not expect this
to matter.  Especially since /sbin/init did ask for the signals at
one time.

Eric

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 9/9] signal: Ignore signals sent to the pid namespace init
       [not found]                                         ` <20071213162811.GC219-6lXkIZvqkOAvJsYlp49lxw@public.gmane.org>
@ 2007-12-13 18:16                                           ` Eric W. Biederman
       [not found]                                             ` <m1aboesbnu.fsf-T1Yj925okcoyDheHMi7gv2pdwda3JcWeAL8bYrjMMd8@public.gmane.org>
  0 siblings, 1 reply; 52+ messages in thread
From: Eric W. Biederman @ 2007-12-13 18:16 UTC (permalink / raw)
  To: Oleg Nesterov; +Cc: Linux Containers, Andrew Morton, Pavel Emelyanov

Oleg Nesterov <oleg-6lXkIZvqkOAvJsYlp49lxw@public.gmane.org> writes:

> On 12/12, Eric W. Biederman wrote:
>> 
>> -static int is_sig_init(struct task_struct *tsk)
>> +static int is_sig_init(struct task_struct *init, struct pid *sender)
>>  {
>> -	if (likely(!is_global_init(tsk->group_leader)))
>> +	if (!is_container_init(init))
>> +		return 0;
>> +
>> +	if (!sender)
>> +		sender = task_tgid(current);
>
> What if this signal is sent from the interrupt and sender == NULL?
>
>> +
>> +	if (!pid_in_pid_ns(sender, task_active_pid_ns(init)))
>>  		return 0;
>
> In that case the result of the above check can be wrong, no?

Yes.  If we are not in process context (in_interrupt) we do infer
the sender incorrectly.  Duh.  I saw something in earlier patches
people had posted didn't understand it, and didn't get an answer
when I asked about it.  I guess I should have thought about that
corner case and looked a little harder.

In this case the sender would always be the kernel.  Oleg do you
know which signals are sent from interrupt context and through
which signal sending entry points?

I'm inclined to say:
if (!sender && in_interrupt())
	sender = &init_struct_pid;

Which will cause the signal to be dropped if SIG_DFL for the real init
(same pid namespace), and otherwise cause the signal to be sent.  Which
sight unseen feels like the right thing.

It worries me that we are likely going through check_kill_permission and
all of the rest.

I expect instead of testing for in_interrupt I should be doing something
like the siginfo test in check_kill_permssion.  And just lumping all of
the kernel related signals into the same bin.

I place this one on the backburner of my thoughts and come back to it in
a bit.  I am close enough that it should be a simple straight forward code
change whatever the outcome.

Eric

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 8/9] signal: Drop signals before sending them to init.
       [not found]                                         ` <m1bq8uscu4.fsf-T1Yj925okcoyDheHMi7gv2pdwda3JcWeAL8bYrjMMd8@public.gmane.org>
@ 2007-12-13 18:18                                           ` Oleg Nesterov
       [not found]                                             ` <20071213181802.GA486-6lXkIZvqkOAvJsYlp49lxw@public.gmane.org>
  0 siblings, 1 reply; 52+ messages in thread
From: Oleg Nesterov @ 2007-12-13 18:18 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: Linux Containers, Andrew Morton, Pavel Emelyanov

On 12/13, Eric W. Biederman wrote:
>
> Oleg Nesterov <oleg-6lXkIZvqkOAvJsYlp49lxw@public.gmane.org> writes:
> 
> > So, do you mean we can ignore the problems with the signals which are
> > currently blocked by /sbin/init?
> 
> Yes.  Further I am saying those signals will never become pending if
> we do not have a signal handler installed.

OK, if we change the semantics for /sbin/init signals we can avoid
a lot of problems,

> > I personally agree, but I'm not sure I understand this right.
> >
> >> +static int sig_init_drop(struct task_struct *tsk, int sig)
> >> +{
> >> +	/* All signals for which init has a SIG_DFL handler are
> >> +	 * silently dropped without being sent.
> >> +	 */
> >> +	if (!is_sig_init(tsk))
> >> +		return 0;
> >> +
> >> +	return (tsk->sighand->action[sig-1].sa.sa_handler == SIG_DFL);
> >> +}
> >
> > What if /sbin/init has a handler, but before this signal is delivered
> > /sbin/init does signal(SIG_DFL) ? We should modify so_sigaction() to
> > prevent this. Note again the patch above.
> 
> No.  We should treat signals that we process for /sbin/init completely
> normally.

... including this one. I am not arguing.

> This gives /sbin/init completely normal signal handling if the signal is
> ever enqueued.  Something trivial to implement and explain.

Well, I am not sure about "explain" though. Unless I missed something
this makes the semantics a bit special.

Suppose that init does sigtimedwait() but the handler == SIG_DFL.

Oleg.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 9/9] signal: Ignore signals sent to the pid namespace init
       [not found]                                             ` <m1aboesbnu.fsf-T1Yj925okcoyDheHMi7gv2pdwda3JcWeAL8bYrjMMd8@public.gmane.org>
@ 2007-12-13 18:33                                               ` Eric W. Biederman
       [not found]                                                 ` <m13au6savt.fsf-T1Yj925okcoyDheHMi7gv2pdwda3JcWeAL8bYrjMMd8@public.gmane.org>
  0 siblings, 1 reply; 52+ messages in thread
From: Eric W. Biederman @ 2007-12-13 18:33 UTC (permalink / raw)
  To: Oleg Nesterov; +Cc: Linux Containers, Andrew Morton, Pavel Emelyanov

ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org (Eric W. Biederman) writes:
>
> Yes.  If we are not in process context (in_interrupt) we do infer
> the sender incorrectly.  Duh.  I saw something in earlier patches
> people had posted didn't understand it, and didn't get an answer
> when I asked about it. 

Grr. Rather I didn't see a reply when I asked about it.  I just
spotted Suka's old reply telling me it was to attempt to prevent
init from getting sigio.    Clearly I have a big fat blind spot
here you could drive a truck through.

Kernel generated signals are a pain because they are neither fish nor
fowl. They are sent privileged, but we don't necessarily want to give
them all curtsies and deliver them.  Grr.

Eric

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 8/9] signal: Drop signals before sending them to init.
       [not found]                                             ` <20071213181802.GA486-6lXkIZvqkOAvJsYlp49lxw@public.gmane.org>
@ 2007-12-13 18:50                                               ` Eric W. Biederman
       [not found]                                                 ` <m1y7byqvj2.fsf-T1Yj925okcoyDheHMi7gv2pdwda3JcWeAL8bYrjMMd8@public.gmane.org>
  0 siblings, 1 reply; 52+ messages in thread
From: Eric W. Biederman @ 2007-12-13 18:50 UTC (permalink / raw)
  To: Oleg Nesterov; +Cc: Linux Containers, Andrew Morton, Pavel Emelyanov

Oleg Nesterov <oleg-6lXkIZvqkOAvJsYlp49lxw@public.gmane.org> writes:

> OK, if we change the semantics for /sbin/init signals we can avoid
> a lot of problems,

Yes.  Otherwise we must track the source of the signals.

>> No.  We should treat signals that we process for /sbin/init completely
>> normally.
>
> ... including this one. I am not arguing.
>
>> This gives /sbin/init completely normal signal handling if the signal is
>> ever enqueued.  Something trivial to implement and explain.
>
> Well, I am not sure about "explain" though. Unless I missed something
> this makes the semantics a bit special.

Well the semantics are a bit special for init period.  I just
make them special in a slightly different way.

> Suppose that init does sigtimedwait() but the handler == SIG_DFL.

Yes that is a bit surprising.  However it is still easy to explain.
The signal is never enqueued so sigtimedwait never gets the chance
to do anything with it.  Interestingly enough this is not a problem
for the current sysvinit.

sysvinit does this at start up:
        /*
         *      Ignore all signals.
         */
        for(f = 1; f <= NSIG; f++)
                SETSIG(sa, f, SIG_IGN, SA_RESTART);

So everything is initialized to SIG_IGN by userspace, in the common case.
Which means none of this special case logic will actually kick in, except
for SIGKILL and SIGSTOP.  The signals we can't change.

Eric

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 3/9] pid: Implement ns_of_pid.
       [not found]                         ` <20071213032827.GA1433-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
@ 2007-12-15  0:35                           ` sukadev-r/Jw6+rmf7HQT0dZR+AlfA
  0 siblings, 0 replies; 52+ messages in thread
From: sukadev-r/Jw6+rmf7HQT0dZR+AlfA @ 2007-12-15  0:35 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Linux Containers, Andrew Morton, Oleg Nesterov, Pavel Emelyanov

sukadev-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org [sukadev-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org] wrote:

| I did some initial testing on your patchset (minus patch 5) and noticed
| that it seems to be missing the patch to address kill -1 semantics
| (here is the earlier version). 

I thought I double checked, but still missed the other patchset that
addressed this. Sorry. Will review/test with that also.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 8/9] signal: Drop signals before sending them to init.
       [not found]                                                 ` <m1y7byqvj2.fsf-T1Yj925okcoyDheHMi7gv2pdwda3JcWeAL8bYrjMMd8@public.gmane.org>
@ 2007-12-16 15:52                                                   ` Oleg Nesterov
       [not found]                                                     ` <20071216155244.GA216-6lXkIZvqkOAvJsYlp49lxw@public.gmane.org>
  0 siblings, 1 reply; 52+ messages in thread
From: Oleg Nesterov @ 2007-12-16 15:52 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: Linux Containers, Andrew Morton, Pavel Emelyanov

On 12/13, Eric W. Biederman wrote:
>
> Oleg Nesterov <oleg-6lXkIZvqkOAvJsYlp49lxw@public.gmane.org> writes:
> 
> > OK, if we change the semantics for /sbin/init signals we can avoid
> > a lot of problems,
> 
> Yes.  Otherwise we must track the source of the signals.

Yes.

> > Well, I am not sure about "explain" though. Unless I missed something
> > this makes the semantics a bit special.
> 
> Well the semantics are a bit special for init period.  I just
> make them special in a slightly different way.

Yes.

But still I personally can't agree with the new behaviour.

> > Suppose that init does sigtimedwait() but the handler == SIG_DFL.
>
> Yes that is a bit surprising.  However it is still easy to explain.
> The signal is never enqueued so sigtimedwait never gets the chance
> to do anything with it.

But this precisely means that sigtimedwait() is just broken for init.

Another example. /sbin/init execs, and tries to make sure it doesn't
miss the important signal (say, SIGCHLD). So it blocks SIGCHLD, execs,
and the new program installs the handler. However, the handler == SIG_DFL
right after exec, so signal could be missed.

Eric, I think the blocked signal should be enqueued. If application
blocks the signal, this is the strong indication it does care about
it, we shouldn't throw it out.

Yes, this also has surprises. If init unblocks the signal without
installing the handler, it could be killed. But this can happen with
your patch as well. sig_init_drop() returns false if we have a handler,
but this races with sys_rt_sigaction() which can set SIG_DFL, so init
could be killed.

IOW, I still have a strong feeling that this patch

	http://marc.info/?l=linux-kernel&m=118753610515859

is better, and more correct. That said, this all is very subjective,
I can't "prove" this of course.

> Interestingly enough this is not a problem
> for the current sysvinit.
> 
> sysvinit does this at start up:
>         /*
>          *      Ignore all signals.
>          */
>         for(f = 1; f <= NSIG; f++)
>                 SETSIG(sa, f, SIG_IGN, SA_RESTART);

Yes sure, the "good" init shouldn't have any problems with any approach.

Oleg.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 0/9] Core pid namespace enhancements
       [not found] ` <m13au8ytos.fsf-T1Yj925okcoyDheHMi7gv2pdwda3JcWeAL8bYrjMMd8@public.gmane.org>
  2007-12-12 12:40   ` [PATCH 1/9] sig: Fix mqueue pid Eric W. Biederman
@ 2007-12-18  0:52   ` sukadev-r/Jw6+rmf7HQT0dZR+AlfA
  1 sibling, 0 replies; 52+ messages in thread
From: sukadev-r/Jw6+rmf7HQT0dZR+AlfA @ 2007-12-18  0:52 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Linux Containers, Andrew Morton, Oleg Nesterov, Pavel Emelyanov

Eric W. Biederman [ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org] wrote:
| 
| The following patchset updates the pid namespace infrastructure
| so we don't constantly have to worry if we have been called
| before or after exit_task_namespaces, by using the pid_namespace
| obtained from a processes pid, handles the general case of setting
| si_pid in struct sig_info, changes where we drop signals sent to init,
| and enhances that changes to also work with the per namespace init.
| 
| Thus resolving most of the big gotchas with the current pid namespace
| implementation.
| 
| Eric

The patchset looks good to me.  My only testcase from the previous set
that broke was the case that Oleg pointed out

	- container-init sets a handler for SIGUSR1
	- container-init blocks SIGUSR1
	- a descendant of container-init posts SIGUSR1 to container-init
	- container-init sets SIGUSR1 to SIG_DFL and unblocks and takes the
	  fatal signal.

While that discussion can continue...

<Acked-by>: Sukadev Bhattiprolu <sukadev-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 8/9] signal: Drop signals before sending them to init.
       [not found]                                                     ` <20071216155244.GA216-6lXkIZvqkOAvJsYlp49lxw@public.gmane.org>
@ 2007-12-18  4:06                                                       ` Eric W. Biederman
       [not found]                                                         ` <m1ir2wd4tf.fsf-T1Yj925okcoyDheHMi7gv2pdwda3JcWeAL8bYrjMMd8@public.gmane.org>
  0 siblings, 1 reply; 52+ messages in thread
From: Eric W. Biederman @ 2007-12-18  4:06 UTC (permalink / raw)
  To: Oleg Nesterov; +Cc: Linux Containers, Andrew Morton, Pavel Emelyanov

Oleg Nesterov <oleg-6lXkIZvqkOAvJsYlp49lxw@public.gmane.org> writes:

> On 12/13, Eric W. Biederman wrote:
>>
>> Oleg Nesterov <oleg-6lXkIZvqkOAvJsYlp49lxw@public.gmane.org> writes:
>> 
>> > OK, if we change the semantics for /sbin/init signals we can avoid
>> > a lot of problems,
>> 
>> Yes.  Otherwise we must track the source of the signals.
>
> Yes.

Good.  We have fundamental agreement that we should slightly
change the definition of how signals to init are dropped.

For me the really important part is that we have a clean
definition so that we can fix bugs in the implementation.
Rather then declaring a particular implementation is the
definition and getting lost there.

>> > Well, I am not sure about "explain" though. Unless I missed something
>> > this makes the semantics a bit special.
>> 
>> Well the semantics are a bit special for init period.  I just
>> make them special in a slightly different way.
>
> Yes.
>
> But still I personally can't agree with the new behaviour.

You argument that a program that blocks signals wants them is
compelling, especially since blocking signals is not a common
thing to do.

So I would have no problem with a definition said signals
will be dropped when sent to init if at the time they are
sent the signal is SIG_DFL and unblocked.

>> > Suppose that init does sigtimedwait() but the handler == SIG_DFL.
>>
>> Yes that is a bit surprising.  However it is still easy to explain.
>> The signal is never enqueued so sigtimedwait never gets the chance
>> to do anything with it.
>
> But this precisely means that sigtimedwait() is just broken for init.
>
> Another example. /sbin/init execs, and tries to make sure it doesn't
> miss the important signal (say, SIGCHLD). So it blocks SIGCHLD, execs,
> and the new program installs the handler. However, the handler == SIG_DFL
> right after exec, so signal could be missed.
>
> Eric, I think the blocked signal should be enqueued. If application
> blocks the signal, this is the strong indication it does care about
> it, we shouldn't throw it out.

A reasonable argument.

> Yes, this also has surprises. If init unblocks the signal without
> installing the handler, it could be killed. But this can happen with
> your patch as well. sig_init_drop() returns false if we have a handler,
> but this races with sys_rt_sigaction() which can set SIG_DFL, so init
> could be killed.

I am checking under the sighand lock so we should not race,
at least not internally to the kernel.

> IOW, I still have a strong feeling that this patch
>
> 	http://marc.info/?l=linux-kernel&m=118753610515859
>
> is better, and more correct. That said, this all is very subjective,
> I can't "prove" this of course.

My fundamental problem with that patch is that it drops signals
after we have started processing them, and it modifies the code
of an optimization.

To have a clean definition and clean semantics I think we need
to drop the signal earlier in the path.  Which is what I
really object to in your patch.

> Yes sure, the "good" init shouldn't have any problems with any approach.

So the only day to day case we care about are programs that are not
built to be init like /bin/bash running as init.  For cases like
this I can see a real advantage in not dropping blocked signals going
to init, they might even use sigtimedwait.

For day to day usage I don't see any practical difference in
dropping or not dropping blocked signals, because the default
is unblocked.

Eric

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 9/9] signal: Ignore signals sent to the pid namespace init
       [not found]                                                 ` <m13au6savt.fsf-T1Yj925okcoyDheHMi7gv2pdwda3JcWeAL8bYrjMMd8@public.gmane.org>
@ 2007-12-18  8:37                                                   ` Eric W. Biederman
  0 siblings, 0 replies; 52+ messages in thread
From: Eric W. Biederman @ 2007-12-18  8:37 UTC (permalink / raw)
  To: Oleg Nesterov; +Cc: Linux Containers, Andrew Morton, Pavel Emelyanov

ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org (Eric W. Biederman) writes:

> ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org (Eric W. Biederman) writes:
>>
>> Yes.  If we are not in process context (in_interrupt) we do infer
>> the sender incorrectly.  Duh.  I saw something in earlier patches
>> people had posted didn't understand it, and didn't get an answer
>> when I asked about it. 
>
> Grr. Rather I didn't see a reply when I asked about it.  I just
> spotted Suka's old reply telling me it was to attempt to prevent
> init from getting sigio.    Clearly I have a big fat blind spot
> here you could drive a truck through.
>
> Kernel generated signals are a pain because they are neither fish nor
> fowl. They are sent privileged, but we don't necessarily want to give
> them all curtsies and deliver them.  Grr.

I solved this in my thinking earlier but I should really mention it.
There are two kinds of kernel generated signals.

- Services like timers, SIGIO, etc where the kernel is doing something
  on behalf of the process.
- Process management where we send SIGKILL to make a process go away.

For services we clearly want to ignore the signal.
For the kernel flexing's it's muscle we want to accept the signal,
and let init die and accept the consequences.  Unfortunately we
don't do that today.

Since most signals are services and since we want to preserve today
semantics we want to drop kernel generated signals.

For those cases where it matters we can have the kernel send a signal
by a different path, that doesn't give init the option of ignoring
the signal.

Eric

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 8/9] signal: Drop signals before sending them to init.
       [not found]                                                         ` <m1ir2wd4tf.fsf-T1Yj925okcoyDheHMi7gv2pdwda3JcWeAL8bYrjMMd8@public.gmane.org>
@ 2007-12-18 12:22                                                           ` Oleg Nesterov
       [not found]                                                             ` <20071218122241.GA307-6lXkIZvqkOAvJsYlp49lxw@public.gmane.org>
  0 siblings, 1 reply; 52+ messages in thread
From: Oleg Nesterov @ 2007-12-18 12:22 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: Linux Containers, Andrew Morton, Pavel Emelyanov

On 12/17, Eric W. Biederman wrote:
>
> So I would have no problem with a definition said signals
> will be dropped when sent to init if at the time they are
> sent the signal is SIG_DFL and unblocked.

Great!

> > But this can happen with
> > your patch as well. sig_init_drop() returns false if we have a handler,
> > but this races with sys_rt_sigaction() which can set SIG_DFL, so init
> > could be killed.
> 
> I am checking under the sighand lock so we should not race,
> at least not internally to the kernel.

Yes, but as soon as we drop ->siglock /sbin/init can set SIG_DFL before
noticing the signal.

> > IOW, I still have a strong feeling that this patch
> >
> > 	http://marc.info/?l=linux-kernel&m=118753610515859
> >
> > is better, and more correct. That said, this all is very subjective,
> > I can't "prove" this of course.
> 
> My fundamental problem with that patch is that it drops signals
> after we have started processing them, and it modifies the code
> of an optimization.
> 
> To have a clean definition and clean semantics I think we need
> to drop the signal earlier in the path.  Which is what I
> really object to in your patch.

Hmm. Could you look at this patch again? I'm attaching it at the end.
(re-diffed against the current code)

It modifies sig_ignored(), so we drop the signal before we started
processing. And in fact it is more "optimized", because we don't need
to check sa_handler twice.

Btw. I don't think we should change force_sig_info(). Suppose that init
blocks/ignores SIGSEGV and do_page_fault() does force_sig_info_fault().
In that case it is better to die explicitely than go into the endless
loop.

Oleg.

--- t/kernel/signal.c~INITSIGS	2007-08-19 14:39:35.000000000 +0400
+++ t/kernel/signal.c	2007-08-19 19:00:27.000000000 +0400
@@ -39,11 +39,33 @@
 
 static struct kmem_cache *sigqueue_cachep;
 
+static int sig_init_ignore(struct task_struct *tsk)
+{
+	if (likely(!is_container_init(tsk->group_leader)))
+		return 0;
+
+	// ---------------- Multiple pid namespaces ----------------
+	// if (current is from tsk's parent pid_ns && !in_interrupt())
+	//	return 0;
+
+	return 1;
+}
+
+static int sig_task_ignore(struct task_struct *tsk, int sig)
+{
+	void __user * handler = tsk->sighand->action[sig-1].sa.sa_handler;
+
+	if (handler == SIG_IGN)
+		return 1;
+
+	if (handler != SIG_DFL)
+		return 0;
+
+	return sig_kernel_ignore(sig) || sig_init_ignore(tsk);
+}
 
 static int sig_ignored(struct task_struct *t, int sig)
 {
-	void __user * handler;
-
 	/*
 	 * Tracers always want to know about signals..
 	 */
@@ -58,10 +82,7 @@ static int sig_ignored(struct task_struc
 	if (sigismember(&t->blocked, sig) || sigismember(&t->real_blocked, sig))
 		return 0;
 
-	/* Is it explicitly or implicitly ignored? */
-	handler = t->sighand->action[sig-1].sa.sa_handler;
-	return   handler == SIG_IGN ||
-		(handler == SIG_DFL && sig_kernel_ignore(sig));
+	return sig_task_ignore(t, sig);
 }
 
 /*
@@ -566,6 +587,9 @@ static void handle_stop_signal(int sig, 
 		 */
 		return;
 
+	if (sig_init_ignore(p))
+		return;
+
 	if (sig_kernel_stop(sig)) {
 		/*
 		 * This is a stop signal.  Remove SIGCONT from all queues.
@@ -1786,12 +1810,6 @@ relock:
 		if (sig_kernel_ignore(signr)) /* Default is nothing. */
 			continue;
 
-		/*
-		 * Global init gets no signals it doesn't want.
-		 */
-		if (is_global_init(current))
-			continue;
-
 		if (sig_kernel_stop(signr)) {
 			/*
 			 * The default action is to stop all threads in
@@ -2303,8 +2316,7 @@ int do_sigaction(int sig, struct k_sigac
 		 *   (for example, SIGCHLD), shall cause the pending signal to
 		 *   be discarded, whether or not it is blocked"
 		 */
-		if (act->sa.sa_handler == SIG_IGN ||
-		   (act->sa.sa_handler == SIG_DFL && sig_kernel_ignore(sig))) {
+		if (sig_task_ignore(current, sig)) {
 			struct task_struct *t = current;
 			sigemptyset(&mask);
 			sigaddset(&mask, sig);

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 8/9] signal: Drop signals before sending them to init.
       [not found]                                                             ` <20071218122241.GA307-6lXkIZvqkOAvJsYlp49lxw@public.gmane.org>
@ 2007-12-18 13:36                                                               ` Eric W. Biederman
       [not found]                                                                 ` <m1prx49lag.fsf-T1Yj925okcoyDheHMi7gv2pdwda3JcWeAL8bYrjMMd8@public.gmane.org>
  0 siblings, 1 reply; 52+ messages in thread
From: Eric W. Biederman @ 2007-12-18 13:36 UTC (permalink / raw)
  To: Oleg Nesterov; +Cc: Linux Containers, Andrew Morton, Pavel Emelyanov

Oleg Nesterov <oleg-6lXkIZvqkOAvJsYlp49lxw@public.gmane.org> writes:

> On 12/17, Eric W. Biederman wrote:
>>
>> So I would have no problem with a definition said signals
>> will be dropped when sent to init if at the time they are
>> sent the signal is SIG_DFL and unblocked.
>
> Great!

My only issue is that with including the blocked signals in
the definition is that it makes things a little harder to
explain and understand.

>> > But this can happen with
>> > your patch as well. sig_init_drop() returns false if we have a handler,
>> > but this races with sys_rt_sigaction() which can set SIG_DFL, so init
>> > could be killed.
>> 
>> I am checking under the sighand lock so we should not race,
>> at least not internally to the kernel.
>
> Yes, but as soon as we drop ->siglock /sbin/init can set SIG_DFL before
> noticing the signal.

I guess I don't see this as a race.  The definition in my head is
all about permission to send a signal to init.  That permission happens
if init shows interest in the incoming signal.  So it is an 
instant in time decision.

At another instant the permissions may have changed and we are
allowed to send the signal.

You seem to be implying that something is wrong.  If something
is wrong if something can be wrong either we have the definition wrong
or the implementation wrong.  If we can not implement things to be
correct with our new definition then it is the wrong definition.
So races are totally NOT ok.

I don't think you have made the switch to the permission check
mentality that I am proposing in my definition.

Can you restate from the perspective of allowing or denying
a signal how looking at the blocked state of the signal
is better then just looking at SIG_DFL?

>> > IOW, I still have a strong feeling that this patch
>> >
>> > 	http://marc.info/?l=linux-kernel&m=118753610515859
>> >
>> > is better, and more correct. That said, this all is very subjective,
>> > I can't "prove" this of course.
>> 
>> My fundamental problem with that patch is that it drops signals
>> after we have started processing them, and it modifies the code
>> of an optimization.
>> 
>> To have a clean definition and clean semantics I think we need
>> to drop the signal earlier in the path.  Which is what I
>> really object to in your patch.
>
> Hmm. Could you look at this patch again? I'm attaching it at the end.
> (re-diffed against the current code)
>
> It modifies sig_ignored(), so we drop the signal before we started
> processing. And in fact it is more "optimized", because we don't need
> to check sa_handler twice.

Yes.  It is more "optimized" but from what I can tell less correct.
It makes it really easy to get the definition wrong.   The big
problem is you allow all signals through in the case of ptrace.
Which is so totally wrong.  For an optimization (which sig_ignored is)
that is fine.  Since this new behavior is not an optimization we
just can't do that.

The definition needs to be something that guarantees the signal
is not sent to init (not ignored by init) if the conditions are
correct.

Semantically not sent is hugely different from ignored.

Doing it in sig_ignored is just a little to late in the signal sending
pathway, and I believe it will be a maintenance nightmare.

> Btw. I don't think we should change force_sig_info(). Suppose that init
> blocks/ignores SIGSEGV and do_page_fault() does force_sig_info_fault().
> In that case it is better to die explicitely than go into the endless
> loop.

Totally and my patch would be smaller if we did.  However I think we
should change which set of signals are dropped in a separate patch,
which is why 
definition of where signals are dropped and then
we should explicitly let signals through in a separate patch.

>
> Oleg.
>
> --- t/kernel/signal.c~INITSIGS	2007-08-19 14:39:35.000000000 +0400
> +++ t/kernel/signal.c	2007-08-19 19:00:27.000000000 +0400
> @@ -39,11 +39,33 @@
>  
>  static struct kmem_cache *sigqueue_cachep;
>  
> +static int sig_init_ignore(struct task_struct *tsk)
> +{
> +	if (likely(!is_container_init(tsk->group_leader)))
> +		return 0;
> +
> +	// ---------------- Multiple pid namespaces ----------------
> +	// if (current is from tsk's parent pid_ns && !in_interrupt())
> +	//	return 0;

We need to test siginfo to see if the signal is a signal from
the kernel not in_interrupt().  So (before handling namespaces
this would be)

static int sig_init_ignrore(struct task_struct *tsk, siginfo_t *info,
			    struct pid *sender)
{
        /* Grumble we should look at the TGID and not need to
         * pass in group_leader.
         */
	if (likely(!is_container_init(tsk->group_leader)))
        	return 0;

	/* Ignore signals from the kernel */
	if ((!is_si_special(info) && SI_FROM_KERNEL(info)) ||
	    (info != SEND_SIG_NOINFO))
		return 1;

	/* If the kernel didn't send the signal figure out who did */
  	if (!sender)
		sender = task_tgid(current);

	/* Don't drop user mode signals from an outer pid
	 * namespace.
	 */
	if (!pid_in_ns(sender, task_active_pid_ns(task)))
		return 0;

	return 1;
}

> +	return 1;
> +}
> +
> +static int sig_task_ignore(struct task_struct *tsk, int sig)
> +{
> +	void __user * handler = tsk->sighand->action[sig-1].sa.sa_handler;
> +
> +	if (handler == SIG_IGN)
> +		return 1;
> +
> +	if (handler != SIG_DFL)
> +		return 0;
> +
> +	return sig_kernel_ignore(sig) || sig_init_ignore(tsk);
> +}
>  
>  static int sig_ignored(struct task_struct *t, int sig)
>  {
> -	void __user * handler;
> -
>  	/*
>  	 * Tracers always want to know about signals..
>  	 */

Since we are dropping the signal before it is sent, tracers
should never see the signal.

> @@ -58,10 +82,7 @@ static int sig_ignored(struct task_struc
>  	if (sigismember(&t->blocked, sig) || sigismember(&t->real_blocked, sig))
>  		return 0;
>  
> -	/* Is it explicitly or implicitly ignored? */
> -	handler = t->sighand->action[sig-1].sa.sa_handler;
> -	return   handler == SIG_IGN ||
> -		(handler == SIG_DFL && sig_kernel_ignore(sig));
> +	return sig_task_ignore(t, sig);
>  }
>  
>  /*
> @@ -566,6 +587,9 @@ static void handle_stop_signal(int sig, 
>  		 */
>  		return;
>  
> +	if (sig_init_ignore(p))
> +		return;
> +
>  	if (sig_kernel_stop(sig)) {
>  		/*
>  		 * This is a stop signal.  Remove SIGCONT from all queues.
> @@ -1786,12 +1810,6 @@ relock:
>  		if (sig_kernel_ignore(signr)) /* Default is nothing. */
>  			continue;
>  
> -		/*
> -		 * Global init gets no signals it doesn't want.
> -		 */
> -		if (is_global_init(current))
> -			continue;
> -
>  		if (sig_kernel_stop(signr)) {
>  			/*
>  			 * The default action is to stop all threads in
> @@ -2303,8 +2316,7 @@ int do_sigaction(int sig, struct k_sigac
>  		 *   (for example, SIGCHLD), shall cause the pending signal to
>  		 *   be discarded, whether or not it is blocked"
>  		 */
> -		if (act->sa.sa_handler == SIG_IGN ||
> -		   (act->sa.sa_handler == SIG_DFL && sig_kernel_ignore(sig))) {
> +		if (sig_task_ignore(current, sig)) {
>  			struct task_struct *t = current;
>  			sigemptyset(&mask);
>  			sigaddset(&mask, sig);

Any time you start ignoring the signal here you are not thinking
in terms of never sending the signal or not.  Once a signal is sent
we must treat it like normal (to have a clean definition).

In the namespace case we can not look at a pending signal and decide
if we should drop it or not.  So changing sigaction is impossible.

Eric

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 8/9] signal: Drop signals before sending them to init.
       [not found]                                                                 ` <m1prx49lag.fsf-T1Yj925okcoyDheHMi7gv2pdwda3JcWeAL8bYrjMMd8@public.gmane.org>
@ 2007-12-18 15:30                                                                   ` Oleg Nesterov
       [not found]                                                                     ` <20071218153007.GA437-6lXkIZvqkOAvJsYlp49lxw@public.gmane.org>
  0 siblings, 1 reply; 52+ messages in thread
From: Oleg Nesterov @ 2007-12-18 15:30 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: Linux Containers, Andrew Morton, Pavel Emelyanov

On 12/18, Eric W. Biederman wrote:
>
> Oleg Nesterov <oleg-6lXkIZvqkOAvJsYlp49lxw@public.gmane.org> writes:
> 
> > On 12/17, Eric W. Biederman wrote:
> >>
> >> So I would have no problem with a definition said signals
> >> will be dropped when sent to init if at the time they are
> >> sent the signal is SIG_DFL and unblocked.
> >
> > Great!
> 
> My only issue is that with including the blocked signals in
> the definition is that it makes things a little harder to
> explain and understand.

Perhaps yes.

> >> > But this can happen with
> >> > your patch as well. sig_init_drop() returns false if we have a handler,
> >> > but this races with sys_rt_sigaction() which can set SIG_DFL, so init
> >> > could be killed.
> >> 
> >> I am checking under the sighand lock so we should not race,
> >> at least not internally to the kernel.
> >
> > Yes, but as soon as we drop ->siglock /sbin/init can set SIG_DFL before
> > noticing the signal.
> 
> I guess I don't see this as a race.  The definition in my head is
> all about permission to send a signal to init.  That permission happens
> if init shows interest in the incoming signal.  So it is an 
> instant in time decision.

Agreed.

> You seem to be implying that something is wrong.  If something
> is wrong if something can be wrong either we have the definition wrong
> or the implementation wrong.

I think that the resulting behaviour is not right. /sbin/init should not
die after signal(SIG_DFL) if it has a pending !sig_kernel_ignore() signal.

Of course this doesn't depend on "should we look at the blocked state or not"
issue.

> >> My fundamental problem with that patch is that it drops signals
> >> after we have started processing them, and it modifies the code
> >> of an optimization.
> >> 
> >> To have a clean definition and clean semantics I think we need
> >> to drop the signal earlier in the path.  Which is what I
> >> really object to in your patch.
> >
> > Hmm. Could you look at this patch again? I'm attaching it at the end.
> > (re-diffed against the current code)
> >
> > It modifies sig_ignored(), so we drop the signal before we started
> > processing. And in fact it is more "optimized", because we don't need
> > to check sa_handler twice.
> 
> Yes.  It is more "optimized" but from what I can tell less correct.
> It makes it really easy to get the definition wrong.   The big
> problem is you allow all signals through in the case of ptrace.
> Which is so totally wrong.

Well yes, but I don't think this is "totally wrong". The global init can't
be ptraced, so this doesn't matter. Ptracing of the sub-namespace init from
the parent namespace is special anyway, and currently is not fully supported.

> > --- t/kernel/signal.c~INITSIGS	2007-08-19 14:39:35.000000000 +0400
> > +++ t/kernel/signal.c	2007-08-19 19:00:27.000000000 +0400
> > @@ -39,11 +39,33 @@
> >  
> >  static struct kmem_cache *sigqueue_cachep;
> >  
> > +static int sig_init_ignore(struct task_struct *tsk)
> > +{
> > +	if (likely(!is_container_init(tsk->group_leader)))
> > +		return 0;
> > +
> > +	// ---------------- Multiple pid namespaces ----------------
> > +	// if (current is from tsk's parent pid_ns && !in_interrupt())
> > +	//	return 0;
> 
> We need to test siginfo to see if the signal is a signal from
> the kernel not in_interrupt().  So (before handling namespaces
> this would be)

We can siginfo to the list of arguments. But this is another issue.
in_interrupt() just means we should not check namespace.

> static int sig_init_ignrore(struct task_struct *tsk, siginfo_t *info,
> 			    struct pid *sender)
> {
>         /* Grumble we should look at the TGID and not need to
>          * pass in group_leader.
>          */
> 	if (likely(!is_container_init(tsk->group_leader)))
>         	return 0;
> 
> 
> 	/* Ignore signals from the kernel */
> 	if ((!is_si_special(info) && SI_FROM_KERNEL(info)) ||
> 	    (info != SEND_SIG_NOINFO))
> 		return 1;
> 
> 	/* If the kernel didn't send the signal figure out who did */
>   	if (!sender)
> 		sender = task_tgid(current);

I don't really understand why do we need this "sender". It is always
current, unless in_interrupt(). do_notify_parent() is special, yes,
but I don't see any problems here, we still can use current, no?

> >  static int sig_ignored(struct task_struct *t, int sig)
> >  {
> > -	void __user * handler;
> > -
> >  	/*
> >  	 * Tracers always want to know about signals..
> >  	 */
> 
> Since we are dropping the signal before it is sent, tracers
> should never see the signal.

please see above.

> > @@ -2303,8 +2316,7 @@ int do_sigaction(int sig, struct k_sigac
> >  		 *   (for example, SIGCHLD), shall cause the pending signal to
> >  		 *   be discarded, whether or not it is blocked"
> >  		 */
> > -		if (act->sa.sa_handler == SIG_IGN ||
> > -		   (act->sa.sa_handler == SIG_DFL && sig_kernel_ignore(sig))) {
> > +		if (sig_task_ignore(current, sig)) {
> >  			struct task_struct *t = current;
> >  			sigemptyset(&mask);
> >  			sigaddset(&mask, sig);
> 
> Any time you start ignoring the signal here you are not thinking
> in terms of never sending the signal or not.  Once a signal is sent
> we must treat it like normal (to have a clean definition).

We don't agree here.

> In the namespace case we can not look at a pending signal and decide
> if we should drop it or not.  So changing sigaction is impossible.

You mean that it is possible that this signal has come from the parent
namespace, and so we should die but not just discard the signal.

I think we can ignore this problem. If we had a handler before (when
the signal was sent), this is - imho - the correct behaviour. If not
then yes, /sbin/init can "accidentally" survive. But the parent namespace
can always use SIGKILL to really kill us.

But yes I agree, this changes one corner case to another. And let me
repeat, I don't claim that "I am right and you are not", and I can't
really prove that my approach is "technically" better. Just a personal
feeling about the "better" tradeoff. And I already said my taste is
perverted ;)

Oleg.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 8/9] signal: Drop signals before sending them to init.
       [not found]                                                                     ` <20071218153007.GA437-6lXkIZvqkOAvJsYlp49lxw@public.gmane.org>
@ 2007-12-18 21:34                                                                       ` Eric W. Biederman
       [not found]                                                                         ` <m18x3radr1.fsf-T1Yj925okcoyDheHMi7gv2pdwda3JcWeAL8bYrjMMd8@public.gmane.org>
  0 siblings, 1 reply; 52+ messages in thread
From: Eric W. Biederman @ 2007-12-18 21:34 UTC (permalink / raw)
  To: Oleg Nesterov; +Cc: Linux Containers, Andrew Morton, Pavel Emelyanov

Oleg Nesterov <oleg-6lXkIZvqkOAvJsYlp49lxw@public.gmane.org> writes:
>
>> > @@ -2303,8 +2316,7 @@ int do_sigaction(int sig, struct k_sigac
>> >  		 *   (for example, SIGCHLD), shall cause the pending signal to
>> >  		 *   be discarded, whether or not it is blocked"
>> >  		 */
>> > -		if (act->sa.sa_handler == SIG_IGN ||
>> > -		   (act->sa.sa_handler == SIG_DFL && sig_kernel_ignore(sig))) {
>> > +		if (sig_task_ignore(current, sig)) {
>> >  			struct task_struct *t = current;
>> >  			sigemptyset(&mask);
>> >  			sigaddset(&mask, sig);
>> 
>> Any time you start ignoring the signal here you are not thinking
>> in terms of never sending the signal or not.  Once a signal is sent
>> we must treat it like normal (to have a clean definition).
>
> We don't agree here.

Which seems to be the heart of the matter.

>> In the namespace case we can not look at a pending signal and decide
>> if we should drop it or not.  So changing sigaction is impossible.
>
> You mean that it is possible that this signal has come from the parent
> namespace, and so we should die but not just discard the signal.

Yes.

> I think we can ignore this problem. If we had a handler before (when
> the signal was sent), this is - imho - the correct behaviour. If not
> then yes, /sbin/init can "accidentally" survive. But the parent namespace
> can always use SIGKILL to really kill us.

Only because we can't change SIGKILL to SIG_DFL.

Think of force_siginfo and what happens when we stop dropping signals
on that path.  We send the signal and then before we process it
user space does signal(SIG_DFL), and we drop SIGSEGV. Ouch!

> But yes I agree, this changes one corner case to another. And let me
> repeat, I don't claim that "I am right and you are not", and I can't
> really prove that my approach is "technically" better. Just a personal
> feeling about the "better" tradeoff. And I already said my taste is
> perverted ;)

In this instance I can prove that my choice is better.

When the code is called into question and we must decided if
a code behavior is a bug or not we require a definition of
what the code is supposed to do.

Given our technical constraints of not being able to track
the source of the signal, and needing to appear as a normal
process to signal senders outside of the pid namespace we
don't have many choices of definition.  The definition that
I can see is:

   Signals sent to init will be silently dropped without
   ever being sent to init, when init has the signal
   handler set to SIG_DFL.

With that definition then any time we process a signal
in handle_stop_signal or allow the signal to be processed
in because of ptrace or anything else.  We are doing the
wrong thing.

That is why I drop the signal earlier.  That is why I don't
do things in sigaction.

Oleg if you can show me a definition that permits the behavior
in your patch we can look at it.  Currently I don't believe
that is possible.

The behavior of your code seems to be about picking convenient spots
in the code today and throwing away the signal there.  Doing what is
most convenient now tends to fails miserably when you try for a
completely different implementation.

My basic contention is that without a solid definition the code
is unmaintainable, because we can't tell bugs from features.

Oleg does picking a definition and sticking to it make sense to you?
Do you at least see where I am coming from?

Eric

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 8/9] signal: Drop signals before sending them to init.
       [not found]                                                                         ` <m18x3radr1.fsf-T1Yj925okcoyDheHMi7gv2pdwda3JcWeAL8bYrjMMd8@public.gmane.org>
@ 2007-12-19 13:42                                                                           ` Oleg Nesterov
  0 siblings, 0 replies; 52+ messages in thread
From: Oleg Nesterov @ 2007-12-19 13:42 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: Linux Containers, Andrew Morton, Pavel Emelyanov

On 12/18, Eric W. Biederman wrote:
>
> Oleg Nesterov <oleg-6lXkIZvqkOAvJsYlp49lxw@public.gmane.org> writes:
> >
> >> In the namespace case we can not look at a pending signal and decide
> >> if we should drop it or not.  So changing sigaction is impossible.
> >
> > You mean that it is possible that this signal has come from the parent
> > namespace, and so we should die but not just discard the signal.
>
> Yes.
>
> > I think we can ignore this problem. If we had a handler before (when
> > the signal was sent), this is - imho - the correct behaviour. If not
> > then yes, /sbin/init can "accidentally" survive. But the parent namespace
> > can always use SIGKILL to really kill us.
>
> Only because we can't change SIGKILL to SIG_DFL.
>
> Think of force_siginfo and what happens when we stop dropping signals
> on that path.  We send the signal and then before we process it
> user space does signal(SIG_DFL), and we drop SIGSEGV. Ouch!

Not a problem.

First of all, this has nothing to do with init's problems, any application
can can do this with signal(SIG_IGN).

The most important case is SIGSEGV sent from do_trap/do_page_fault/etc.
Another sub-thread can "steal" the signal, but this is harmless. The signal
will be re-generated when application returns from the exception and restarts
the faulting instruction.

> > But yes I agree, this changes one corner case to another. And let me
> > repeat, I don't claim that "I am right and you are not", and I can't
> > really prove that my approach is "technically" better. Just a personal
> > feeling about the "better" tradeoff. And I already said my taste is
> > perverted ;)
>
> In this instance I can prove that my choice is better.
>
> When the code is called into question and we must decided if
> a code behavior is a bug or not we require a definition of
> what the code is supposed to do.
>
> Given our technical constraints of not being able to track
> the source of the signal, and needing to appear as a normal
> process to signal senders outside of the pid namespace we
> don't have many choices of definition.  The definition that
> I can see is:
>
>    Signals sent to init will be silently dropped without
>    ever being sent to init, when init has the signal
>    handler set to SIG_DFL.
>
> With that definition then any time we process a signal
> in handle_stop_signal or allow the signal to be processed
> in because of ptrace or anything else.  We are doing the
> wrong thing.

I never argued, you propose the very simple and understandable definition.

But this simple rule leads to non-obvious and not good consequences.
Imho, of course.

sigtimedwait() is broken, init can lost the signal during exec,
signal(sighandler) is safe but signal(SIG_DFL) is not.

And speaking about ptrace, it is very special anyway. Just look at
get_signal_to_deliver() which re-sends the signal after ptrace_stop().

> That is why I drop the signal earlier.

I can't understand this part of the discussion. What do you mean "earlier" ?
Note that the patch I showed changes handle_stop_signal(). Because I believe
it should be changed anyway to filter out kernel threads at least.

Regardless of which rules we use to drop the signal, I think it is more
natural to modify sig_ignored(), this also makes the patch smaller.

> Oleg if you can show me a definition that permits the behavior
> in your patch we can look at it.  Currently I don't believe
> that is possible.
>
> My basic contention is that without a solid definition the code
> is unmaintainable, because we can't tell bugs from features.

No, I can't show.

Eric, let's stop here. I don't believe we can convince each other.
This happens, and of course it is OK to have different opinions.
And in any case, I am happy with this discussion ;)

Let's go with your approach. In any case it solves the real problems
we have.

Oleg.

^ permalink raw reply	[flat|nested] 52+ messages in thread

end of thread, other threads:[~2007-12-19 13:42 UTC | newest]

Thread overview: 52+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2007-12-12 12:38 [PATCH 0/9] Core pid namespace enhancements Eric W. Biederman
     [not found] ` <m13au8ytos.fsf-T1Yj925okcoyDheHMi7gv2pdwda3JcWeAL8bYrjMMd8@public.gmane.org>
2007-12-12 12:40   ` [PATCH 1/9] sig: Fix mqueue pid Eric W. Biederman
     [not found]     ` <m1y7c0xezt.fsf-T1Yj925okcoyDheHMi7gv2pdwda3JcWeAL8bYrjMMd8@public.gmane.org>
2007-12-12 12:42       ` [PATCH 2/9] sig: Fix SI_USER si_pid Eric W. Biederman
     [not found]         ` <m1tzmoxexb.fsf_-_-T1Yj925okcoyDheHMi7gv2pdwda3JcWeAL8bYrjMMd8@public.gmane.org>
2007-12-12 12:44           ` [PATCH 3/9] pid: Implement ns_of_pid Eric W. Biederman
     [not found]             ` <m1prxcxeum.fsf_-_-T1Yj925okcoyDheHMi7gv2pdwda3JcWeAL8bYrjMMd8@public.gmane.org>
2007-12-12 12:46               ` [PATCH 4/9] pid: Generalize task_active_pid_ns Eric W. Biederman
     [not found]                 ` <m1lk80xeps.fsf_-_-T1Yj925okcoyDheHMi7gv2pdwda3JcWeAL8bYrjMMd8@public.gmane.org>
2007-12-12 12:48                   ` [PATCH 5/9] pid: Update pid_vnr to use task_active_pid_ns Eric W. Biederman
     [not found]                     ` <m1hcioxenh.fsf_-_-T1Yj925okcoyDheHMi7gv2pdwda3JcWeAL8bYrjMMd8@public.gmane.org>
2007-12-12 12:49                       ` [PATCH 6/9] pid: Implement pid_in_pid_ns Eric W. Biederman
     [not found]                         ` <m1d4tcxelu.fsf_-_-T1Yj925okcoyDheHMi7gv2pdwda3JcWeAL8bYrjMMd8@public.gmane.org>
2007-12-12 12:52                           ` [PATCH 7/9] sig: Handle pid namespace crossing when sending signals Eric W. Biederman
     [not found]                             ` <m18x40xeg6.fsf_-_-T1Yj925okcoyDheHMi7gv2pdwda3JcWeAL8bYrjMMd8@public.gmane.org>
2007-12-12 12:57                               ` [PATCH 8/9] signal: Drop signals before sending them to init Eric W. Biederman
     [not found]                                 ` <m13au8xe8m.fsf_-_-T1Yj925okcoyDheHMi7gv2pdwda3JcWeAL8bYrjMMd8@public.gmane.org>
2007-12-12 12:58                                   ` [PATCH 9/9] signal: Ignore signals sent to the pid namespace init Eric W. Biederman
     [not found]                                     ` <m1y7c0vzm4.fsf_-_-T1Yj925okcoyDheHMi7gv2pdwda3JcWeAL8bYrjMMd8@public.gmane.org>
2007-12-12 13:09                                       ` [PATCH 0/4] pid namespace infrastructure cleanups Eric W. Biederman
     [not found]                                         ` <m1odcwvz3d.fsf_-_-T1Yj925okcoyDheHMi7gv2pdwda3JcWeAL8bYrjMMd8@public.gmane.org>
2007-12-12 13:27                                           ` [PATCH 1/4] pidns: Remove the child_reaper special case from de_thread Eric W. Biederman
     [not found]                                             ` <m1ir34vyaj.fsf_-_-T1Yj925okcoyDheHMi7gv2pdwda3JcWeAL8bYrjMMd8@public.gmane.org>
2007-12-12 13:30                                               ` [PATCH 2/4] proc: Simplify proc_get_sb Eric W. Biederman
     [not found]                                                 ` <m1ejdsvy54.fsf_-_-T1Yj925okcoyDheHMi7gv2pdwda3JcWeAL8bYrjMMd8@public.gmane.org>
2007-12-12 13:31                                                   ` [PATCH 3/4] proc: Remove the unnecessary global proc_mnt Eric W. Biederman
     [not found]                                                     ` <m1abogvy39.fsf_-_-T1Yj925okcoyDheHMi7gv2pdwda3JcWeAL8bYrjMMd8@public.gmane.org>
2007-12-12 13:33                                                       ` [PATCH 4/4] pid: Move all of the pid_namespace logic into copy_pid_ns Eric W. Biederman
     [not found]                                                         ` <m163z4vxzs.fsf_-_-T1Yj925okcoyDheHMi7gv2pdwda3JcWeAL8bYrjMMd8@public.gmane.org>
2007-12-12 13:46                                                           ` [PATCH 0/4] Properly handle talking to all processes in a pid namespace Eric W. Biederman
     [not found]                                                             ` <m11w9svxeb.fsf_-_-T1Yj925okcoyDheHMi7gv2pdwda3JcWeAL8bYrjMMd8@public.gmane.org>
2007-12-12 13:49                                                               ` [PATCH 1/4] signal: Introduce kill_pid_ns_info Eric W. Biederman
     [not found]                                                                 ` <m1ve74uio4.fsf_-_-T1Yj925okcoyDheHMi7gv2pdwda3JcWeAL8bYrjMMd8@public.gmane.org>
2007-12-12 13:50                                                                   ` [PATCH 2/4] pid: Make next_pidmap static again Eric W. Biederman
     [not found]                                                                     ` <m1r6hsuime.fsf_-_-T1Yj925okcoyDheHMi7gv2pdwda3JcWeAL8bYrjMMd8@public.gmane.org>
2007-12-12 13:52                                                                       ` [PATCH 3/4] Fix the indentation in cap_set_all to use tabs Eric W. Biederman
     [not found]                                                                         ` <m1mysguijx.fsf_-_-T1Yj925okcoyDheHMi7gv2pdwda3JcWeAL8bYrjMMd8@public.gmane.org>
2007-12-12 13:56                                                                           ` [PATCH 4/4] pid: Limit cap_set_all to the current pid namespace Eric W. Biederman
2007-12-12 16:09                                                                   ` [PATCH 1/4] signal: Introduce kill_pid_ns_info Pavel Emelyanov
2007-12-12 13:42                                                   ` [PATCH 2/4] proc: Simplify proc_get_sb Pavel Emelyanov
     [not found]                                                     ` <475FE53D.6050408-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org>
2007-12-12 14:06                                                       ` Eric W. Biederman
2007-12-13 16:28                                       ` [PATCH 9/9] signal: Ignore signals sent to the pid namespace init Oleg Nesterov
     [not found]                                         ` <20071213162811.GC219-6lXkIZvqkOAvJsYlp49lxw@public.gmane.org>
2007-12-13 18:16                                           ` Eric W. Biederman
     [not found]                                             ` <m1aboesbnu.fsf-T1Yj925okcoyDheHMi7gv2pdwda3JcWeAL8bYrjMMd8@public.gmane.org>
2007-12-13 18:33                                               ` Eric W. Biederman
     [not found]                                                 ` <m13au6savt.fsf-T1Yj925okcoyDheHMi7gv2pdwda3JcWeAL8bYrjMMd8@public.gmane.org>
2007-12-18  8:37                                                   ` Eric W. Biederman
2007-12-12 19:00                                   ` [PATCH 8/9] signal: Drop signals before sending them to init Serge E. Hallyn
     [not found]                                     ` <20071212190042.GA22469-6s5zFf/epYLPQpwDFJZrxKsjOiXwFzmk@public.gmane.org>
2007-12-12 19:33                                       ` Eric W. Biederman
2007-12-13 16:25                                   ` Oleg Nesterov
     [not found]                                     ` <20071213162502.GB219-6lXkIZvqkOAvJsYlp49lxw@public.gmane.org>
2007-12-13 17:50                                       ` Eric W. Biederman
     [not found]                                         ` <m1bq8uscu4.fsf-T1Yj925okcoyDheHMi7gv2pdwda3JcWeAL8bYrjMMd8@public.gmane.org>
2007-12-13 18:18                                           ` Oleg Nesterov
     [not found]                                             ` <20071213181802.GA486-6lXkIZvqkOAvJsYlp49lxw@public.gmane.org>
2007-12-13 18:50                                               ` Eric W. Biederman
     [not found]                                                 ` <m1y7byqvj2.fsf-T1Yj925okcoyDheHMi7gv2pdwda3JcWeAL8bYrjMMd8@public.gmane.org>
2007-12-16 15:52                                                   ` Oleg Nesterov
     [not found]                                                     ` <20071216155244.GA216-6lXkIZvqkOAvJsYlp49lxw@public.gmane.org>
2007-12-18  4:06                                                       ` Eric W. Biederman
     [not found]                                                         ` <m1ir2wd4tf.fsf-T1Yj925okcoyDheHMi7gv2pdwda3JcWeAL8bYrjMMd8@public.gmane.org>
2007-12-18 12:22                                                           ` Oleg Nesterov
     [not found]                                                             ` <20071218122241.GA307-6lXkIZvqkOAvJsYlp49lxw@public.gmane.org>
2007-12-18 13:36                                                               ` Eric W. Biederman
     [not found]                                                                 ` <m1prx49lag.fsf-T1Yj925okcoyDheHMi7gv2pdwda3JcWeAL8bYrjMMd8@public.gmane.org>
2007-12-18 15:30                                                                   ` Oleg Nesterov
     [not found]                                                                     ` <20071218153007.GA437-6lXkIZvqkOAvJsYlp49lxw@public.gmane.org>
2007-12-18 21:34                                                                       ` Eric W. Biederman
     [not found]                                                                         ` <m18x3radr1.fsf-T1Yj925okcoyDheHMi7gv2pdwda3JcWeAL8bYrjMMd8@public.gmane.org>
2007-12-19 13:42                                                                           ` Oleg Nesterov
2007-12-12 13:33                           ` [PATCH 6/9] pid: Implement pid_in_pid_ns Pavel Emelyanov
2007-12-12 13:28                       ` [PATCH 5/9] pid: Update pid_vnr to use task_active_pid_ns Pavel Emelyanov
     [not found]                         ` <475FE201.7060104-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org>
2007-12-12 14:20                           ` Eric W. Biederman
2007-12-13 16:01                   ` [PATCH 4/9] pid: Generalize task_active_pid_ns Oleg Nesterov
     [not found]                     ` <20071213160128.GA219-6lXkIZvqkOAvJsYlp49lxw@public.gmane.org>
2007-12-13 16:22                       ` Eric W. Biederman
     [not found]                         ` <m1mysesgxc.fsf-T1Yj925okcoyDheHMi7gv2pdwda3JcWeAL8bYrjMMd8@public.gmane.org>
2007-12-13 17:07                           ` Oleg Nesterov
2007-12-13  0:59               ` [PATCH 3/9] pid: Implement ns_of_pid sukadev-r/Jw6+rmf7HQT0dZR+AlfA
     [not found]                 ` <20071213005945.GB27896-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
2007-12-13  1:25                   ` Eric W. Biederman
     [not found]                     ` <m1ve73s7vr.fsf-T1Yj925okcoyDheHMi7gv2pdwda3JcWeAL8bYrjMMd8@public.gmane.org>
2007-12-13  3:28                       ` sukadev-r/Jw6+rmf7HQT0dZR+AlfA
     [not found]                         ` <20071213032827.GA1433-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
2007-12-15  0:35                           ` sukadev-r/Jw6+rmf7HQT0dZR+AlfA
2007-12-12 13:24       ` [PATCH 1/9] sig: Fix mqueue pid Pavel Emelyanov
2007-12-18  0:52   ` [PATCH 0/9] Core pid namespace enhancements sukadev-r/Jw6+rmf7HQT0dZR+AlfA

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.