public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
* Re: Signals to cinit
       [not found] ` <20081110173839.GA11121@redhat.com>
@ 2008-11-10 19:32   ` Oleg Nesterov
  2008-11-10 23:27     ` sukadev
                       ` (3 more replies)
  0 siblings, 4 replies; 13+ messages in thread
From: Oleg Nesterov @ 2008-11-10 19:32 UTC (permalink / raw)
  To: sukadev
  Cc: Eric W. Biederman, Pavel Emelyanov, daniel, Nadia Derbey, serue,
	clg, Containers, sukadev, linux-kernel

(lkml cced because containers list's archive is not useable)

On 11/10, Oleg Nesterov wrote:
>
> On 11/01, sukadev@linux.vnet.ibm.com wrote:
> >
> > Other approaches to try ?
>
> I think we should try to do something simple, even if not perfect. Because
> most users do not care about this problem since they do not use containers
> at all. It would be very sad to add intrusive changes to the code.
>
> I think we should fix another problem first. send_signal()->copy_siginfo()
> path must be changed anyway, when the signal comes from the parent ns we
> report the "wrong" si_code/si_pid, yes? So, somehow send_signal() must
> have "bool from_parent_ns" (or whatever) annyway.
>
> Now, let's forget forget for a moment that send_signal()->__sigqueue_alloc()
> can fail.
>
> I think we should encode this "from_parent_ns" into "struct siginfo". I do
> not think it is good idea to extend this structure, I think we can introduce
> SI_FROM_PARENT_NS or we perhaps can use "SI_FROMUSER(info) && info->si_pid == 0".
> Or something. yes, sys_rt_sigqueueinfo() is problematic...
>
> Now, copy_process(CLONE_NEWPID) sets child->signal |= SIGNAL_UNKILLABLE, this
> protects cinit from unwanted signals. Then we change get_signal_to_deliver()
>
> 	-	if (unlikely(signal->flags & SIGNAL_UNKILLABLE) &&
> 	+	if (unlikely(signal->flags & SIGNAL_UNKILLABLE) && !siginfo_from_parent_ns(info)
>
> and now we can kill cinit from parent ns. This needs more checks if we want
> to stop/strace it, but perhaps this is enough for the start. Note that we
> do not need to change complete_signal(), at least for now, the code under
> "if (sig_fatal(p, sig)" is just optimization.
>
>
> So, afaics, the only real problem is how we can handle the case when
> __sigqueue_alloc() fails. I think for the start we can just return
> -ENOMEM in this case (when from_parent_ns == T). Then we can improve
> this behaviour. We can change complete_signal() to ensure that the
> fatal signal from the upper ns always kills cinit, and in this case
> we ignore the the failed __sigqueue_alloc(). This way at least SIGKILL
> always works.
>
> Yes, this is not perfect, and it is very possible I missed something
> else. But simple.

But how can send_signal() know that the signal comes from the upper ns?
This is not trivial, we can't blindly use current to check. The signal
can be sent from irq/workqueue/etc.

Perhaps we can start with something like the patch below. Not that I like
it very much though. We should really place this code under
CONFIG_I_DO_CARE_ABOUT_NAMESPACES ;)

Oleg.

--- K-IS/kernel/signal.c~T	2008-11-10 19:21:17.000000000 +0100
+++ K-IS/kernel/signal.c	2008-11-10 20:31:24.000000000 +0100
@@ -798,11 +798,19 @@ static inline int legacy_queue(struct si
 	return (sig < SIGRTMIN) && sigismember(&signals->signal, sig);
 }
 
+#define SIG_FROM_USER	INT_MIN		/* MSB */
+
 static int send_signal(int sig, struct siginfo *info, struct task_struct *t,
 			int group)
 {
 	struct sigpending *pending;
 	struct sigqueue *q;
+	int from_ancestor_ns;
+
+	from_ancestor_ns = !is_si_special(info) &&
+		(info->si_signo | SIG_FROM_USER) &&
+		/* if t can't see us we are from parent ns */
+		task_pid_nr_ns(current, t->current->nsproxy->pid_ns) <= 0;
 
 	trace_sched_signal_send(sig, t);
 
@@ -855,6 +863,7 @@ static int send_signal(int sig, struct s
 			break;
 		default:
 			copy_siginfo(&q->info, info);
+			q->info.si_signo &= ~SIG_FROM_USER;
 			break;
 		}
 	} else if (!is_si_special(info)) {
@@ -2207,7 +2216,7 @@ sys_kill(pid_t pid, int sig)
 {
 	struct siginfo info;
 
-	info.si_signo = sig;
+	info.si_signo = sig | SIG_FROM_USER;
 	info.si_errno = 0;
 	info.si_code = SI_USER;
 	info.si_pid = task_tgid_vnr(current);
@@ -2224,7 +2233,7 @@ static int do_tkill(pid_t tgid, pid_t pi
 	unsigned long flags;
 
 	error = -ESRCH;
-	info.si_signo = sig;
+	info.si_signo = sig | SIG_FROM_USER;
 	info.si_errno = 0;
 	info.si_code = SI_TKILL;
 	info.si_pid = task_tgid_vnr(current);
@@ -2296,7 +2305,7 @@ sys_rt_sigqueueinfo(pid_t pid, int sig, 
 	   Nor can they impersonate a kill(), which adds source info.  */
 	if (info.si_code >= 0)
 		return -EPERM;
-	info.si_signo = sig;
+	info.si_signo = sig | SIG_FROM_USER;
 
 	/* POSIX.1b doesn't mention process groups.  */
 	return kill_proc_info(sig, &info, pid);


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Signals to cinit
  2008-11-10 19:32   ` Signals to cinit Oleg Nesterov
@ 2008-11-10 23:27     ` sukadev
  2008-11-12 14:52       ` Oleg Nesterov
  2008-11-11  2:24     ` sukadev
                       ` (2 subsequent siblings)
  3 siblings, 1 reply; 13+ messages in thread
From: sukadev @ 2008-11-10 23:27 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Eric W. Biederman, Pavel Emelyanov, daniel, Nadia Derbey, serue,
	clg, Containers, sukadev, linux-kernel

Oleg Nesterov [oleg@redhat.com] wrote:
| (lkml cced because containers list's archive is not useable)

Hmm. what do you mean by not usable ? I see your email here:
https://lists.linux-foundation.org/pipermail/containers/2008-November/014152.html

| 
| On 11/10, Oleg Nesterov wrote:
| >
| > On 11/01, sukadev@linux.vnet.ibm.com wrote:
| > >
| > > Other approaches to try ?
| >
| > I think we should try to do something simple, even if not perfect. Because
| > most users do not care about this problem since they do not use containers
| > at all. It would be very sad to add intrusive changes to the code.
| >
| > I think we should fix another problem first. send_signal()->copy_siginfo()
| > path must be changed anyway, when the signal comes from the parent ns we
| > report the "wrong" si_code/si_pid, yes? So, somehow send_signal() must
| > have "bool from_parent_ns" (or whatever) annyway.
| >
| > Now, let's forget forget for a moment that send_signal()->__sigqueue_alloc()
| > can fail.
| >
| > I think we should encode this "from_parent_ns" into "struct siginfo". I do
| > not think it is good idea to extend this structure, I think we can introduce
| > SI_FROM_PARENT_NS or we perhaps can use "SI_FROMUSER(info) && info->si_pid == 0".

Yes, afaics, we just need to pass one extra bit of information per signal
(whether or not sender is in ancestor-ns) from sender to receiver.

| > Or something. yes, sys_rt_sigqueueinfo() is problematic...

Yes, if user-space sets si_pid to 0.

Can we change sys_rt_sigqueueinfo() to:

	if (!info->si_pid)
		info->si_pid = getpid();

or would that change semantics adversely ? How about putting this
under CONFIG_PID_NS or your CONFIG_I_DO_CARE_ABOUT_NAMESPACES ;)

| >
| > Now, copy_process(CLONE_NEWPID) sets child->signal |= SIGNAL_UNKILLABLE, this
| > protects cinit from unwanted signals. Then we change get_signal_to_deliver()
| >
| > 	-	if (unlikely(signal->flags & SIGNAL_UNKILLABLE) &&
| > 	+	if (unlikely(signal->flags & SIGNAL_UNKILLABLE) && !siginfo_from_parent_ns(info)
| >
| > and now we can kill cinit from parent ns. This needs more checks if we want
| > to stop/strace it, but perhaps this is enough for the start. Note that we
| > do not need to change complete_signal(), at least for now, the code under
| > "if (sig_fatal(p, sig)" is just optimization.
| >
| >
| > So, afaics, the only real problem is how we can handle the case when
| > __sigqueue_alloc() fails. I think for the start we can just return
| > -ENOMEM in this case (when from_parent_ns == T). Then we can improve
| > this behaviour. We can change complete_signal() to ensure that the
| > fatal signal from the upper ns always kills cinit, and in this case
| > we ignore the the failed __sigqueue_alloc(). This way at least SIGKILL
| > always works.
| >
| > Yes, this is not perfect, and it is very possible I missed something
| > else. But simple.

I agree 
| 
| But how can send_signal() know that the signal comes from the upper ns?
| This is not trivial, we can't blindly use current to check. The signal
| can be sent from irq/workqueue/etc.

You mean the in_interrupt() check we had in earlier patchset would
not be enough ?

| 
| Perhaps we can start with something like the patch below. Not that I like
| it very much though. We should really place this code under
| CONFIG_I_DO_CARE_ABOUT_NAMESPACES ;)

CONFIG_PID_NS ?

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Signals to cinit
  2008-11-10 19:32   ` Signals to cinit Oleg Nesterov
  2008-11-10 23:27     ` sukadev
@ 2008-11-11  2:24     ` sukadev
  2008-11-12 15:05       ` Oleg Nesterov
  2008-11-12 16:53     ` Serge E. Hallyn
  2008-11-13 19:10     ` Sukadev Bhattiprolu
  3 siblings, 1 reply; 13+ messages in thread
From: sukadev @ 2008-11-11  2:24 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Eric W. Biederman, Pavel Emelyanov, daniel, Nadia Derbey, serue,
	clg, Containers, sukadev, linux-kernel

Oleg Nesterov [oleg@redhat.com] wrote:
| (lkml cced because containers list's archive is not useable)
| 
| On 11/10, Oleg Nesterov wrote:
| >
| > On 11/01, sukadev@linux.vnet.ibm.com wrote:
| > >
| > > Other approaches to try ?
| >
| > I think we should try to do something simple, even if not perfect. Because
| > most users do not care about this problem since they do not use containers
| > at all. It would be very sad to add intrusive changes to the code.
| >
| > I think we should fix another problem first. send_signal()->copy_siginfo()
| > path must be changed anyway, when the signal comes from the parent ns we
| > report the "wrong" si_code/si_pid, yes? So, somehow send_signal() must
| > have "bool from_parent_ns" (or whatever) annyway.

Yes, this was in both the patchsets we reviewed last year :-) I can send
this fix out independently.

| >
| > Now, let's forget forget for a moment that send_signal()->__sigqueue_alloc()
| > can fail.
| >
| > I think we should encode this "from_parent_ns" into "struct siginfo". I do
| > not think it is good idea to extend this structure, I think we can introduce
| > SI_FROM_PARENT_NS or we perhaps can use "SI_FROMUSER(info) && info->si_pid == 0".
| > Or something. yes, sys_rt_sigqueueinfo() is problematic...

Also, what happens if a fatal signal is first received from a descendant 
and while that is still pending, the same signal is received from ancestor
ns ?  Won't the second one be ignored by legacy_queue() for the non-rt case ?

Of course, this is a new scenario, specific to containers, and we may be
able to define the policy without changing semantics.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Signals to cinit
  2008-11-10 23:27     ` sukadev
@ 2008-11-12 14:52       ` Oleg Nesterov
  2008-11-12 16:12         ` Oleg Nesterov
  2008-11-12 16:49         ` Serge E. Hallyn
  0 siblings, 2 replies; 13+ messages in thread
From: Oleg Nesterov @ 2008-11-12 14:52 UTC (permalink / raw)
  To: sukadev
  Cc: Eric W. Biederman, Pavel Emelyanov, daniel, Nadia Derbey, serue,
	clg, Containers, sukadev, linux-kernel

On 11/10, sukadev@linux.vnet.ibm.com wrote:
>
> Oleg Nesterov [oleg@redhat.com] wrote:
> | (lkml cced because containers list's archive is not useable)
>
> Hmm. what do you mean by not usable ? I see your email here:
> https://lists.linux-foundation.org/pipermail/containers/2008-November/014152.html

Yes, but I failed to find our previous discussions via google, and actually
I prefer to see them all on marc.info, so I can quickly find them...

> | > Or something. yes, sys_rt_sigqueueinfo() is problematic...
>
> Yes, if user-space sets si_pid to 0.
>
> Can we change sys_rt_sigqueueinfo() to:
>
> 	if (!info->si_pid)
> 		info->si_pid = getpid();

I doubt very much we can do this. This can break the existing applications
which can overload ->si_pid. I think it is better to pass ->si_pid as is.
If user-space sends siginfo_t so sub-namespace, it must know what it does.
I don't think the kernel can help, it just can't know what ->si_pid actually
means. Unless this is documented somewhere, but I don't know.

> | But how can send_signal() know that the signal comes from the upper ns?
> | This is not trivial, we can't blindly use current to check. The signal
> | can be sent from irq/workqueue/etc.
>
> You mean the in_interrupt() check we had in earlier patchset would
> not be enough ?

I don't think we can rely on in_interrupt() check. Thnk about some device
drivers which can send the notification from workqueue, or from kernel
thread... Say, can you see when drivers/usb/core/devio.c does
async_completed() ? And note that SI_ASYNCIO is SI_FROMUSER.

Even _if_ it is safe to use in_interrupt() right now, I don't think we
can rely on this fact.

> | Perhaps we can start with something like the patch below. Not that I like
> | it very much though. We should really place this code under
> | CONFIG_I_DO_CARE_ABOUT_NAMESPACES ;)
>
> CONFIG_PID_NS ?

Ah yes, we have it ;)

Oleg.


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Signals to cinit
  2008-11-11  2:24     ` sukadev
@ 2008-11-12 15:05       ` Oleg Nesterov
  2008-11-12 19:04         ` Sukadev Bhattiprolu
  0 siblings, 1 reply; 13+ messages in thread
From: Oleg Nesterov @ 2008-11-12 15:05 UTC (permalink / raw)
  To: sukadev
  Cc: Eric W. Biederman, Pavel Emelyanov, daniel, Nadia Derbey, serue,
	clg, Containers, sukadev, linux-kernel

On 11/10, sukadev@linux.vnet.ibm.com wrote:
>
> Also, what happens if a fatal signal is first received from a descendant
> and while that is still pending, the same signal is received from ancestor
> ns ?  Won't the second one be ignored by legacy_queue() for the non-rt case ?

Please see my another email:

	We must also change sig_ignored() to drop SIGKILL/SIGSTOP early when
	it comes from the same ns. Otherwise, it can mask the next SIGKILL
	from the parent ns.

	But this perhaps makes sense anyway, even without containers.
	Currently, when the global init has the pending SIGKILL, we can't
	trust __wait_event_killable/etc, and this is actually wrong.

We can drop other SIG_DFL signals from the same namespace early as well.
I seem to already did something like sig_init_ignored(), but I forgot.

Or, we can just ignore this (imho) minor problem. The ancestor ns
must know it can't reliably kill cinit with (say) SIGTERM. It can
be ignored, or it can have have a handler, and it can be lost because
SIGTERM is already pending. Only SIGKILL is special.

Actually. I personally think that if we manage to achieve that

	- the sub-namespace can't kill its init

	- the ancestor can always kill cinit with SIGKILL

then imho we should not worry very much about other issues ;)

Oleg.


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Signals to cinit
  2008-11-12 14:52       ` Oleg Nesterov
@ 2008-11-12 16:12         ` Oleg Nesterov
  2008-11-12 16:49         ` Serge E. Hallyn
  1 sibling, 0 replies; 13+ messages in thread
From: Oleg Nesterov @ 2008-11-12 16:12 UTC (permalink / raw)
  To: sukadev
  Cc: Eric W. Biederman, Pavel Emelyanov, daniel, Nadia Derbey, serue,
	clg, Containers, sukadev, linux-kernel

On 11/12, Oleg Nesterov wrote:
>
> On 11/10, sukadev@linux.vnet.ibm.com wrote:
> >
> > Oleg Nesterov [oleg@redhat.com] wrote:
> > | > Or something. yes, sys_rt_sigqueueinfo() is problematic...
> >
> > Yes, if user-space sets si_pid to 0.
> >
> > Can we change sys_rt_sigqueueinfo() to:
> >
> > 	if (!info->si_pid)
> > 		info->si_pid = getpid();
>
> I doubt very much we can do this. This can break the existing applications
> which can overload ->si_pid. I think it is better to pass ->si_pid as is.
> If user-space sends siginfo_t so sub-namespace, it must know what it does.
> I don't think the kernel can help, it just can't know what ->si_pid actually
> means. Unless this is documented somewhere, but I don't know.

On the second thought, I think perhaps we should do the following.

if sys_rt_sigqueueinfo() sends the signal to the sub-namespace, then clear
always ->sid_pid. Otherwise do not touch it.

This way we can't break the existing apps, and this simplifies send_signal()
which should take "is_it_from_ancestor_ns" into account.

What do you think?

Oleg.


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Signals to cinit
  2008-11-12 14:52       ` Oleg Nesterov
  2008-11-12 16:12         ` Oleg Nesterov
@ 2008-11-12 16:49         ` Serge E. Hallyn
  2008-11-12 18:12           ` Sukadev Bhattiprolu
  1 sibling, 1 reply; 13+ messages in thread
From: Serge E. Hallyn @ 2008-11-12 16:49 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: sukadev, Eric W. Biederman, Pavel Emelyanov, daniel, Nadia Derbey,
	clg, Containers, sukadev, linux-kernel

Quoting Oleg Nesterov (oleg@redhat.com):
> > | Perhaps we can start with something like the patch below. Not that I like
> > | it very much though. We should really place this code under
> > | CONFIG_I_DO_CARE_ABOUT_NAMESPACES ;)
> >
> > CONFIG_PID_NS ?
> 
> Ah yes, we have it ;)

Except I believe all distros at this point enable CONFIG_PID_NS, so
I'm not sure it's the right thing to use.

-serge

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Signals to cinit
  2008-11-10 19:32   ` Signals to cinit Oleg Nesterov
  2008-11-10 23:27     ` sukadev
  2008-11-11  2:24     ` sukadev
@ 2008-11-12 16:53     ` Serge E. Hallyn
  2008-11-13 19:10     ` Sukadev Bhattiprolu
  3 siblings, 0 replies; 13+ messages in thread
From: Serge E. Hallyn @ 2008-11-12 16:53 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: sukadev, Eric W. Biederman, Pavel Emelyanov, daniel, Nadia Derbey,
	clg, Containers, sukadev, linux-kernel

Quoting Oleg Nesterov (oleg@redhat.com):
> --- K-IS/kernel/signal.c~T	2008-11-10 19:21:17.000000000 +0100
> +++ K-IS/kernel/signal.c	2008-11-10 20:31:24.000000000 +0100
> @@ -798,11 +798,19 @@ static inline int legacy_queue(struct si
>  	return (sig < SIGRTMIN) && sigismember(&signals->signal, sig);
>  }
> 
> +#define SIG_FROM_USER	INT_MIN		/* MSB */
> +
>  static int send_signal(int sig, struct siginfo *info, struct task_struct *t,
>  			int group)
>  {
>  	struct sigpending *pending;
>  	struct sigqueue *q;
> +	int from_ancestor_ns;
> +
> +	from_ancestor_ns = !is_si_special(info) &&
> +		(info->si_signo | SIG_FROM_USER) &&

I assume you mean '&'?

> +		/* if t can't see us we are from parent ns */
> +		task_pid_nr_ns(current, t->current->nsproxy->pid_ns) <= 0;

This doesn't look so bad...  (but still looking through followup emails)

-serge

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Signals to cinit
  2008-11-12 16:49         ` Serge E. Hallyn
@ 2008-11-12 18:12           ` Sukadev Bhattiprolu
  2008-11-12 19:06             ` Serge E. Hallyn
  0 siblings, 1 reply; 13+ messages in thread
From: Sukadev Bhattiprolu @ 2008-11-12 18:12 UTC (permalink / raw)
  To: Serge E. Hallyn
  Cc: Oleg Nesterov, Eric W. Biederman, Pavel Emelyanov, daniel,
	Nadia Derbey, clg, Containers, sukadev, linux-kernel

Serge E. Hallyn [serue@us.ibm.com] wrote:
| Quoting Oleg Nesterov (oleg@redhat.com):
| > > | Perhaps we can start with something like the patch below. Not that I like
| > > | it very much though. We should really place this code under
| > > | CONFIG_I_DO_CARE_ABOUT_NAMESPACES ;)
| > >
| > > CONFIG_PID_NS ?
| > 
| > Ah yes, we have it ;)
| 
| Except I believe all distros at this point enable CONFIG_PID_NS, so
| I'm not sure it's the right thing to use.

But if they do enable CONFIG_PID_NS they would want the signals to
behave correctly ? IIUC, the reason we want to the hide the code
is that it is not clean i.e if its not experimental or error-prone,
are there other reasons someone with CONFIG_PID_NS=y want to hide it ?

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Signals to cinit
  2008-11-12 15:05       ` Oleg Nesterov
@ 2008-11-12 19:04         ` Sukadev Bhattiprolu
  2008-11-14 17:26           ` Oleg Nesterov
  0 siblings, 1 reply; 13+ messages in thread
From: Sukadev Bhattiprolu @ 2008-11-12 19:04 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Eric W. Biederman, Pavel Emelyanov, daniel, Nadia Derbey, serue,
	clg, Containers, sukadev, linux-kernel

Oleg Nesterov [oleg@redhat.com] wrote:
| On 11/10, sukadev@linux.vnet.ibm.com wrote:
| >
| > Also, what happens if a fatal signal is first received from a descendant
| > and while that is still pending, the same signal is received from ancestor
| > ns ?  Won't the second one be ignored by legacy_queue() for the non-rt case ?

On second thoughts, cinit is a normal process in its ancestor ns so it
might very well ignore the second instance of the signal (as long as it
does not ignore SIGKILL/SIGSTOP)

| 
| Please see my another email:
| 
| 	We must also change sig_ignored() to drop SIGKILL/SIGSTOP early when
| 	it comes from the same ns. Otherwise, it can mask the next SIGKILL
| 	from the parent ns.

Ok.

| 
| 	But this perhaps makes sense anyway, even without containers.
| 	Currently, when the global init has the pending SIGKILL, we can't
| 	trust __wait_event_killable/etc, and this is actually wrong.
| 
| We can drop other SIG_DFL signals from the same namespace early as well.

I think Eric's patchset did this and iirc, we ran into the problem of
blocked SIG_DFL signals ?

| I seem to already did something like sig_init_ignored(), but I forgot.

Yes, I think we had that in the patchset but that was not merged.

| 
| Or, we can just ignore this (imho) minor problem.

I think so too.

| The ancestor ns
| must know it can't reliably kill cinit with (say) SIGTERM. It can
| be ignored, or it can have have a handler, and it can be lost because
| SIGTERM is already pending. Only SIGKILL is special.
| 
| Actually. I personally think that if we manage to achieve that
| 
| 	- the sub-namespace can't kill its init
| 
| 	- the ancestor can always kill cinit with SIGKILL

Yep.

| 
| then imho we should not worry very much about other issues ;)
| 
| Oleg.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Signals to cinit
  2008-11-12 18:12           ` Sukadev Bhattiprolu
@ 2008-11-12 19:06             ` Serge E. Hallyn
  0 siblings, 0 replies; 13+ messages in thread
From: Serge E. Hallyn @ 2008-11-12 19:06 UTC (permalink / raw)
  To: Sukadev Bhattiprolu
  Cc: Oleg Nesterov, Eric W. Biederman, Pavel Emelyanov, daniel,
	Nadia Derbey, clg, Containers, sukadev, linux-kernel

Quoting Sukadev Bhattiprolu (sukadev@linux.vnet.ibm.com):
> Serge E. Hallyn [serue@us.ibm.com] wrote:
> | Quoting Oleg Nesterov (oleg@redhat.com):
> | > > | Perhaps we can start with something like the patch below. Not that I like
> | > > | it very much though. We should really place this code under
> | > > | CONFIG_I_DO_CARE_ABOUT_NAMESPACES ;)
> | > >
> | > > CONFIG_PID_NS ?
> | > 
> | > Ah yes, we have it ;)
> | 
> | Except I believe all distros at this point enable CONFIG_PID_NS, so
> | I'm not sure it's the right thing to use.
> 
> But if they do enable CONFIG_PID_NS they would want the signals to
> behave correctly ? IIUC, the reason we want to the hide the code
> is that it is not clean i.e if its not experimental or error-prone,
> are there other reasons someone with CONFIG_PID_NS=y want to hide it ?

I was going to argue yes, but again following my reasoning to its
logical conclusion leads us to a config parameter being bad anyway.

So yeah, never mind.

-serge

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Signals to cinit
  2008-11-10 19:32   ` Signals to cinit Oleg Nesterov
                       ` (2 preceding siblings ...)
  2008-11-12 16:53     ` Serge E. Hallyn
@ 2008-11-13 19:10     ` Sukadev Bhattiprolu
  3 siblings, 0 replies; 13+ messages in thread
From: Sukadev Bhattiprolu @ 2008-11-13 19:10 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Eric W. Biederman, Pavel Emelyanov, daniel, Nadia Derbey, serue,
	clg, Containers, sukadev, linux-kernel

Oleg Nesterov [oleg@redhat.com] wrote:
| (lkml cced because containers list's archive is not useable)
| 
| On 11/10, Oleg Nesterov wrote:
| >
| > On 11/01, sukadev@linux.vnet.ibm.com wrote:
| > >
| > > Other approaches to try ?
| >
| > I think we should try to do something simple, even if not perfect. Because
| > most users do not care about this problem since they do not use containers
| > at all. It would be very sad to add intrusive changes to the code.
| >
| > I think we should fix another problem first. send_signal()->copy_siginfo()
| > path must be changed anyway, when the signal comes from the parent ns we
| > report the "wrong" si_code/si_pid, yes? So, somehow send_signal() must
| > have "bool from_parent_ns" (or whatever) annyway.
| >
| > Now, let's forget forget for a moment that send_signal()->__sigqueue_alloc()
| > can fail.
| >
| > I think we should encode this "from_parent_ns" into "struct siginfo". I do
| > not think it is good idea to extend this structure, I think we can introduce
| > SI_FROM_PARENT_NS or we perhaps can use "SI_FROMUSER(info) && info->si_pid == 0".
| > Or something. yes, sys_rt_sigqueueinfo() is problematic...
| >
| > Now, copy_process(CLONE_NEWPID) sets child->signal |= SIGNAL_UNKILLABLE, this
| > protects cinit from unwanted signals. Then we change get_signal_to_deliver()
| >
| > 	-	if (unlikely(signal->flags & SIGNAL_UNKILLABLE) &&
| > 	+	if (unlikely(signal->flags & SIGNAL_UNKILLABLE) && !siginfo_from_parent_ns(info)
| >
| > and now we can kill cinit from parent ns. This needs more checks if we want
| > to stop/strace it, but perhaps this is enough for the start. Note that we
| > do not need to change complete_signal(), at least for now, the code under
| > "if (sig_fatal(p, sig)" is just optimization.
| >
| >
| > So, afaics, the only real problem is how we can handle the case when
| > __sigqueue_alloc() fails. I think for the start we can just return
| > -ENOMEM in this case (when from_parent_ns == T). Then we can improve
| > this behaviour. We can change complete_signal() to ensure that the
| > fatal signal from the upper ns always kills cinit, and in this case
| > we ignore the the failed __sigqueue_alloc(). This way at least SIGKILL
| > always works.
| >
| > Yes, this is not perfect, and it is very possible I missed something
| > else. But simple.
| 
| But how can send_signal() know that the signal comes from the upper ns?
| This is not trivial, we can't blindly use current to check. The signal
| can be sent from irq/workqueue/etc.
| 
| Perhaps we can start with something like the patch below. Not that I like
| it very much though. We should really place this code under
| CONFIG_I_DO_CARE_ABOUT_NAMESPACES ;)
| 
| Oleg.
| 
| --- K-IS/kernel/signal.c~T	2008-11-10 19:21:17.000000000 +0100
| +++ K-IS/kernel/signal.c	2008-11-10 20:31:24.000000000 +0100
| @@ -798,11 +798,19 @@ static inline int legacy_queue(struct si
|  	return (sig < SIGRTMIN) && sigismember(&signals->signal, sig);
|  }
| 
| +#define SIG_FROM_USER	INT_MIN		/* MSB */
| +

Not necessarily for the problem at hand, but in the long run, is it
worth isolating kernel's siginfo from user's siginfo_t (which is pretty
much carved in stone).

Like the existing 'struct k_sigaction' and 'struct sigaction' have
a 'struct k_siginfo' that is a superset of 'siginfo_t' ?

That might give us more flexibility in passing any additional flags/
values we need in the kernel ?

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Signals to cinit
  2008-11-12 19:04         ` Sukadev Bhattiprolu
@ 2008-11-14 17:26           ` Oleg Nesterov
  0 siblings, 0 replies; 13+ messages in thread
From: Oleg Nesterov @ 2008-11-14 17:26 UTC (permalink / raw)
  To: Sukadev Bhattiprolu
  Cc: Eric W. Biederman, Pavel Emelyanov, daniel, Nadia Derbey, serue,
	clg, Containers, sukadev, linux-kernel

On 11/12, Sukadev Bhattiprolu wrote:
>
> Oleg Nesterov [oleg@redhat.com] wrote:
>
> | On 11/10, sukadev@linux.vnet.ibm.com wrote:
> | >
> | > Also, what happens if a fatal signal is first received from a descendant
> | > and while that is still pending, the same signal is received from ancestor
> | > ns ?  Won't the second one be ignored by legacy_queue() for the non-rt case ?
>
> On second thoughts, cinit is a normal process in its ancestor ns so it
> might very well ignore the second instance of the signal (as long as it
> does not ignore SIGKILL/SIGSTOP)
>
> |
> | Please see my another email:
> |
> | 	We must also change sig_ignored() to drop SIGKILL/SIGSTOP early when
> | 	it comes from the same ns. Otherwise, it can mask the next SIGKILL
> | 	from the parent ns.
>
> Ok.
>
> |
> | 	But this perhaps makes sense anyway, even without containers.
> | 	Currently, when the global init has the pending SIGKILL, we can't
> | 	trust __wait_event_killable/etc, and this is actually wrong.
> |
> | We can drop other SIG_DFL signals from the same namespace early as well.
>
> I think Eric's patchset did this and iirc, we ran into the problem of
> blocked SIG_DFL signals ?

Yes sure, I meant unblocked SIG_DFL signals. But SIGKILL can't be
blocked fortunately.

Again, the parent ns can't rely on, say, SIGTERM. It can be missed
if cinit has a handler, we can do nothing in this case. And if it
is blocked, most probably cinit already has a handler, or it will
set it later, say, after exec. Or it can be just ignored.

> | Or, we can just ignore this (imho) minor problem.
>
> I think so too.

Great, so perhaps we can ignore the problem for now, and fix it
later if the need arises.

Oleg.


^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2008-11-14 16:26 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
     [not found] <20081101180505.GA24268@us.ibm.com>
     [not found] ` <20081110173839.GA11121@redhat.com>
2008-11-10 19:32   ` Signals to cinit Oleg Nesterov
2008-11-10 23:27     ` sukadev
2008-11-12 14:52       ` Oleg Nesterov
2008-11-12 16:12         ` Oleg Nesterov
2008-11-12 16:49         ` Serge E. Hallyn
2008-11-12 18:12           ` Sukadev Bhattiprolu
2008-11-12 19:06             ` Serge E. Hallyn
2008-11-11  2:24     ` sukadev
2008-11-12 15:05       ` Oleg Nesterov
2008-11-12 19:04         ` Sukadev Bhattiprolu
2008-11-14 17:26           ` Oleg Nesterov
2008-11-12 16:53     ` Serge E. Hallyn
2008-11-13 19:10     ` Sukadev Bhattiprolu

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox