[PATCH] kill_something_info: don't take tasklist

public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed

* [PATCH] kill_something_info: don't take tasklist_lock for pid==-1 case
@ 2008-03-25  4:27 Atsushi Tsuji
  2008-03-25 13:56 ` Oleg Nesterov
  0 siblings, 1 reply; 10+ messages in thread
From: Atsushi Tsuji @ 2008-03-25  4:27 UTC (permalink / raw)
  To: Oleg Nesterov; +Cc: linux-kernel

Hi Oleg,

Thanks for some patches about tasklist_lock. The avoidance of
tasklist_lock is very important for us.  And now, I found another
avoidable tasklist_lock, and made the patch. Could you please have a
look?

This patch avoid taking tasklist_lock in kill_something_info() when
the pid is -1. It can convert to rcu_read_lock() for this case because
group_send_sig_info() doesn't need tasklist_lock.

This patch is for 2.6.25-rc5-mm1.

Signed-off-by: Atsushi Tsuji <a-tsuji@bk.jp.nec.com>
---
diff --git a/kernel/signal.c b/kernel/signal.c
index 3edbfd4..a888c58 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -1089,14 +1089,16 @@ static int kill_something_info(int sig, struct siginfo 
*info, int pid)
  		return ret;
  	}

-	read_lock(&tasklist_lock);
  	if (pid != -1) {
+		read_lock(&tasklist_lock);
  		ret = __kill_pgrp_info(sig, info,
  				pid ? find_vpid(-pid) : task_pgrp(current));
+		read_unlock(&tasklist_lock);
  	} else {
  		int retval = 0, count = 0;
  		struct task_struct * p;

+		rcu_read_lock();
  		for_each_process(p) {
  			if (p->pid > 1 && !same_thread_group(p, current)) {
  				int err = group_send_sig_info(sig, info, p);
@@ -1106,8 +1108,8 @@ static int kill_something_info(int sig, struct siginfo 
*info, int pid)
  			}
  		}
  		ret = count ? retval : -ESRCH;
+		rcu_read_unlock();
  	}
-	read_unlock(&tasklist_lock);

  	return ret;
  }

^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: [PATCH] kill_something_info: don't take tasklist_lock for pid==-1 case
  2008-03-25  4:27 [PATCH] kill_something_info: don't take tasklist_lock for pid==-1 case Atsushi Tsuji
@ 2008-03-25 13:56 ` Oleg Nesterov
  2008-05-21  1:48   ` Atsushi Tsuji
  2008-05-21  3:47   ` Eric W. Biederman
  0 siblings, 2 replies; 10+ messages in thread
From: Oleg Nesterov @ 2008-03-25 13:56 UTC (permalink / raw)
  To: Atsushi Tsuji; +Cc: linux-kernel, Eric W. Biederman, Roland McGrath

On 03/25, Atsushi Tsuji wrote:
> 
> This patch avoid taking tasklist_lock in kill_something_info() when
> the pid is -1. It can convert to rcu_read_lock() for this case because
> group_send_sig_info() doesn't need tasklist_lock.
> 
> This patch is for 2.6.25-rc5-mm1.
> 
> Signed-off-by: Atsushi Tsuji <a-tsuji@bk.jp.nec.com>
> ---
> diff --git a/kernel/signal.c b/kernel/signal.c
> index 3edbfd4..a888c58 100644
> --- a/kernel/signal.c
> +++ b/kernel/signal.c
> @@ -1089,14 +1089,16 @@ static int kill_something_info(int sig, struct 
> siginfo *info, int pid)
>  		return ret;
>  	}
> 
> -	read_lock(&tasklist_lock);
>  	if (pid != -1) {
> +		read_lock(&tasklist_lock);
>  		ret = __kill_pgrp_info(sig, info,
>  				pid ? find_vpid(-pid) : task_pgrp(current));
> +		read_unlock(&tasklist_lock);
>  	} else {
>  		int retval = 0, count = 0;
>  		struct task_struct * p;
> 
> +		rcu_read_lock();
>  		for_each_process(p) {
>  			if (p->pid > 1 && !same_thread_group(p, current)) {
>  				int err = group_send_sig_info(sig, info, p);
> @@ -1106,8 +1108,8 @@ static int kill_something_info(int sig, struct 
> siginfo *info, int pid)
>  			}
>  		}
>  		ret = count ? retval : -ESRCH;
> +		rcu_read_unlock();
>  	}
> -	read_unlock(&tasklist_lock);
> 
>  	return ret;
>  }

Hmm. Yes, group_send_sig_info() doesn't need tasklist_lock. But we
take tasklist_lock to "freeze" the tasks list, so that we can't miss
a new forked process.

Same for __kill_pgrp_info(), we take tasklist to kill the whole group
"atomically".

However. Is it really needed? copy_process() returns -ERESTARTNOINTR
if signal_pending(), and the new task is always placed at the tail
of the list. Looks like nobody can escape the signal, at least fatal
or SIGSTOP.

If the signal is blocked/ignored or has a handler, we can miss a forked
child, but this looks OK, we can pretend it was forked after we dropped
tasklist_lock.

Note also that copy_process() does list_add_tail_rcu(p->tasks) under
->siglock, this means kill_something_info() must see the new childs
after group_send_sig_info() drops ->siglock.

Except: We don't send the signal to /sbin/init. This means that (say)
kill(-1, SIGKILL) can miss the task forked by init. Note that this
task could be forked even before we start kill_something_info(), but
without tasklist there is no guarantee we will see it on the ->tasks
list.

I think this is the only problem with this change.

Eric, Roland?

(Unfortunately, attach_pid() adds the task to the head of hlist, this
 means we can't avoid tasklist for __kill_pgrp_info).

Oleg.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] kill_something_info: don't take tasklist_lock for pid==-1 case
  2008-03-25 13:56 ` Oleg Nesterov
@ 2008-05-21  1:48   ` Atsushi Tsuji
  2008-05-21  2:53     ` Eric W. Biederman
  2008-05-21  3:47   ` Eric W. Biederman
  1 sibling, 1 reply; 10+ messages in thread
From: Atsushi Tsuji @ 2008-05-21  1:48 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: linux-kernel, Eric W. Biederman, Roland McGrath, Andrew Morton

Oleg Nesterov wrote:
> On 03/25, Atsushi Tsuji wrote:
>> This patch avoid taking tasklist_lock in kill_something_info() when
>> the pid is -1. It can convert to rcu_read_lock() for this case because
>> group_send_sig_info() doesn't need tasklist_lock.
>>
>> This patch is for 2.6.25-rc5-mm1.
>>
>> Signed-off-by: Atsushi Tsuji <a-tsuji@bk.jp.nec.com>
>
> Except: We don't send the signal to /sbin/init. This means that (say)
> kill(-1, SIGKILL) can miss the task forked by init. Note that this
> task could be forked even before we start kill_something_info(), but
> without tasklist there is no guarantee we will see it on the ->tasks
> list.
> 
> I think this is the only problem with this change.

Sorry for late reply and thank you for your comment. I understood the
mechanism that kill(-1, SIGKILL) can miss the tasks forked by init
(and the thread group of the current process, because we don't also
send the signal to them). If kill(-1, SIGKILL) finish before the
forking init process does list_add_tail_rcu(p->tasks) in
copy_process(), the process forked by init appears on the ->tasks list
after that.  Is that right?

If so, I think this problem can happen without my patch.
Because even if kill(-1, SIGKILL) take read_lock(&tasklist_lock) in
kill_something_info(), it can finish before init process take
write_lock(&tasklist_lock) in copy_process(). So the forked process
appears after that, too.

Now, I noticed the important problem. I found the tasklist lock in
kill_something_info() can cause stall when some processes execute
kill(-1,SIGCONT) concurrently. It can happen even if a system has
only 4 CPUs (and even if a user is not privileged (not root)).  This is
because the writer cannot take the tasklist lock when a lot of readers
exist and keep holding it.

This allows a local DoS. So we have to avoid that stall. The
conversion from the tasklist lock to rcu_read_lock() can solve this
problem. I think my patch doesn't make the new problem because the
problem that kill can miss the tasks have originally occurred without
my one. If there is no problem, could you ack it?

Thanks,
-Atsushi Tsuji

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] kill_something_info: don't take tasklist_lock for pid==-1 case
  2008-05-21  1:48   ` Atsushi Tsuji
@ 2008-05-21  2:53     ` Eric W. Biederman
  0 siblings, 0 replies; 10+ messages in thread
From: Eric W. Biederman @ 2008-05-21  2:53 UTC (permalink / raw)
  To: Atsushi Tsuji; +Cc: Oleg Nesterov, linux-kernel, Roland McGrath, Andrew Morton

Atsushi Tsuji <a-tsuji@bk.jp.nec.com> writes:

> Sorry for late reply and thank you for your comment. I understood the
> mechanism that kill(-1, SIGKILL) can miss the tasks forked by init
> (and the thread group of the current process, because we don't also
> send the signal to them). If kill(-1, SIGKILL) finish before the
> forking init process does list_add_tail_rcu(p->tasks) in
> copy_process(), the process forked by init appears on the ->tasks list
> after that.  Is that right?

No because of this from fork.c:copy_process()
       /*
         * Process group and session signals need to be delivered to just the
         * parent before the fork or both the parent and the child after the
         * fork. Restart if a signal comes in before we add the new process to
         * it's process group.
         * A fatal signal pending means that current will exit, so the new
         * thread can't slip out of an OOM kill (or normal SIGKILL).
         */
        recalc_sigpending();
        if (signal_pending(current)) {
                spin_unlock(&current->sighand->siglock);
                write_unlock_irq(&tasklist_lock);
                retval = -ERESTARTNOINTR;
                goto bad_fork_free_pid;
        }

We closed that whole a while ago, and in doing so reviewed the semantics
and verify that the behavior is required.
>
> Now, I noticed the important problem. I found the tasklist lock in
> kill_something_info() can cause stall when some processes execute
> kill(-1,SIGCONT) concurrently. It can happen even if a system has
> only 4 CPUs (and even if a user is not privileged (not root)).  This is
> because the writer cannot take the tasklist lock when a lot of readers
> exist and keep holding it.
>
> This allows a local DoS. So we have to avoid that stall. The
> conversion from the tasklist lock to rcu_read_lock() can solve this
> problem. I think my patch doesn't make the new problem because the
> problem that kill can miss the tasks have originally occurred without
> my one. If there is no problem, could you ack it?

There are problems.  It would be nice to avoid the local DOS.  How is
a good question, given the atomic definition of signal delivery.

Eric

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] kill_something_info: don't take tasklist_lock for pid==-1 case
  2008-03-25 13:56 ` Oleg Nesterov
  2008-05-21  1:48   ` Atsushi Tsuji
@ 2008-05-21  3:47   ` Eric W. Biederman
  2008-05-26  7:03     ` Atsushi TSUJI
  2008-05-31 16:55     ` Oleg Nesterov
  1 sibling, 2 replies; 10+ messages in thread
From: Eric W. Biederman @ 2008-05-21  3:47 UTC (permalink / raw)
  To: Oleg Nesterov; +Cc: Atsushi Tsuji, linux-kernel, Roland McGrath

Sorry for the very delayed response.  I have been moving and overwhelmed
with everything.

Oleg Nesterov <oleg@tv-sign.ru> writes:

> On 03/25, Atsushi Tsuji wrote:
>> 
>> This patch avoid taking tasklist_lock in kill_something_info() when
>> the pid is -1. It can convert to rcu_read_lock() for this case because
>> group_send_sig_info() doesn't need tasklist_lock.
>> 
>> This patch is for 2.6.25-rc5-mm1.
>> 

> Hmm. Yes, group_send_sig_info() doesn't need tasklist_lock. But we
> take tasklist_lock to "freeze" the tasks list, so that we can't miss
> a new forked process.
>
> Same for __kill_pgrp_info(), we take tasklist to kill the whole group
> "atomically".
>
>
> However. Is it really needed? copy_process() returns -ERESTARTNOINTR
> if signal_pending(), and the new task is always placed at the tail
> of the list. Looks like nobody can escape the signal, at least fatal
> or SIGSTOP.

Call me paranoid but I don't think there is any guarantee without a lock
that we will hit the -ERESTARTNOITR check for new processes.  I think we
have a slight race where the fork process may not have received the signal
(because it is near the tail of the list) but the new process would be
added to the list immediately after we read it's pointer.

> Note also that copy_process() does list_add_tail_rcu(p->tasks) under
> ->siglock, this means kill_something_info() must see the new childs
> after group_send_sig_info() drops ->siglock.

That is subtle.  Switching to the per task siglock for protection.

> Except: We don't send the signal to /sbin/init. This means that (say)
> kill(-1, SIGKILL) can miss the task forked by init. Note that this
> task could be forked even before we start kill_something_info(), but
> without tasklist there is no guarantee we will see it on the ->tasks
> list.

Actually we do sent the signal to init but we shouldn't, if we want
our semantics straight.  And we drop the signal early enough we might
not see it.

> I think this is the only problem with this change.
>
> Eric, Roland?

That is a very subtle idea.  I wish I could convince myself it would
work and be maintainable.

> (Unfortunately, attach_pid() adds the task to the head of hlist, this
>  means we can't avoid tasklist for __kill_pgrp_info).

Probably.  If there wasn't a chance of sending a signal twice we
could rescan the list if the head changed.

What we might be able to do is to switch the tasklist_lock into a seq_lock.
like was done for the dcache.  The challenge is how to avoid resending
a signal when we retry.  Store the sequence number in the sighand_struct?

If we had a magic place that children could check. To see if they belonged
to a group of processes that was exiting that might help.

Grr.  I just can't see any solution that is cheaper and that I can
verify will be correct.

Eric

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] kill_something_info: don't take tasklist_lock for pid==-1 case
  2008-05-21  3:47   ` Eric W. Biederman
@ 2008-05-26  7:03     ` Atsushi TSUJI
  2008-05-28 15:03       ` Eric W. Biederman
  2008-05-31 16:55     ` Oleg Nesterov
  1 sibling, 1 reply; 10+ messages in thread
From: Atsushi TSUJI @ 2008-05-26  7:03 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: Oleg Nesterov, linux-kernel, Roland McGrath

Hi Eric,

Thank you for your comments.

Eric W. Biederman wrote:
> Sorry for the very delayed response.  I have been moving and overwhelmed
> with everything.
> 
> Oleg Nesterov <oleg@tv-sign.ru> writes:
> 
>> On 03/25, Atsushi Tsuji wrote:
>>> This patch avoid taking tasklist_lock in kill_something_info() when
>>> the pid is -1. It can convert to rcu_read_lock() for this case because
>>> group_send_sig_info() doesn't need tasklist_lock.
>>>
>>> This patch is for 2.6.25-rc5-mm1.
>>>
> 
>> Hmm. Yes, group_send_sig_info() doesn't need tasklist_lock. But we
>> take tasklist_lock to "freeze" the tasks list, so that we can't miss
>> a new forked process.
>>
>> Same for __kill_pgrp_info(), we take tasklist to kill the whole group
>> "atomically".
>>
>>
>> However. Is it really needed? copy_process() returns -ERESTARTNOINTR
>> if signal_pending(), and the new task is always placed at the tail
>> of the list. Looks like nobody can escape the signal, at least fatal
>> or SIGSTOP.
> 
> 
> Call me paranoid but I don't think there is any guarantee without a lock
> that we will hit the -ERESTARTNOITR check for new processes.  I think we
> have a slight race where the fork process may not have received the signal
> (because it is near the tail of the list) but the new process would be
> added to the list immediately after we read it's pointer.

I know it might happen some races, but, as Oleg say, it is no problem
on the user side. Users cannot realize whether the process forked
during kill or after. We can pretend it was forked after kill
finished. So I think the change to convert tasklist_lock to
rcu_read_lock is reasonable way to avoid the local DOS for kill(-1,sig) case.

>> (Unfortunately, attach_pid() adds the task to the head of hlist, this
>>  means we can't avoid tasklist for __kill_pgrp_info).
> 
> Probably.  If there wasn't a chance of sending a signal twice we
> could rescan the list if the head changed.
> 
> What we might be able to do is to switch the tasklist_lock into a seq_lock.
> like was done for the dcache.  The challenge is how to avoid resending
> a signal when we retry.  Store the sequence number in the sighand_struct?
> 
> If we had a magic place that children could check. To see if they belonged
> to a group of processes that was exiting that might help.

I think this idea is good for __kill_pgrp_info(), but it is a big change.

Thanks,
-Atsushi Tsuji

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] kill_something_info: don't take tasklist_lock for pid==-1 case
  2008-05-26  7:03     ` Atsushi TSUJI
@ 2008-05-28 15:03       ` Eric W. Biederman
  0 siblings, 0 replies; 10+ messages in thread
From: Eric W. Biederman @ 2008-05-28 15:03 UTC (permalink / raw)
  To: Atsushi TSUJI; +Cc: Oleg Nesterov, linux-kernel, Roland McGrath

Atsushi TSUJI <a-tsuji@bk.jp.nec.com> writes:

>> Call me paranoid but I don't think there is any guarantee without a lock
>> that we will hit the -ERESTARTNOITR check for new processes.  I think we
>> have a slight race where the fork process may not have received the signal
>> (because it is near the tail of the list) but the new process would be
>> added to the list immediately after we read it's pointer.
>
> I know it might happen some races, but, as Oleg say, it is no problem
> on the user side. Users cannot realize whether the process forked
> during kill or after. We can pretend it was forked after kill
> finished. So I think the change to convert tasklist_lock to
> rcu_read_lock is reasonable way to avoid the local DOS for kill(-1,sig) case.

We can only pretend that if the parent lives.  If the parent is killed
then we can not so pretend.

Which in a lot of ways is the problem.  kill(-1,SIGKIL) should
kill everything except for init and the process that sent the
signal.  If anything else survives we have a broken the shutdown
scripts.

Since the race would rarely hit it will take ages for someone
to trace back to a kernel change.

If I could convince myself that Oleg is correct and that what
Oleg is proposing will always work I don't have a problem.

Eric

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] kill_something_info: don't take tasklist_lock for pid==-1 case
  2008-05-21  3:47   ` Eric W. Biederman
  2008-05-26  7:03     ` Atsushi TSUJI
@ 2008-05-31 16:55     ` Oleg Nesterov
  2008-05-31 23:55       ` Eric W. Biederman
  1 sibling, 1 reply; 10+ messages in thread
From: Oleg Nesterov @ 2008-05-31 16:55 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: Atsushi Tsuji, linux-kernel, Roland McGrath

Sorry, sorry for the delay,

On 05/20, Eric W. Biederman wrote:
> 
> Oleg Nesterov <oleg@tv-sign.ru> writes:
> 
> > On 03/25, Atsushi Tsuji wrote:
> >> 
> >> This patch avoid taking tasklist_lock in kill_something_info() when
> >> the pid is -1. It can convert to rcu_read_lock() for this case because
> >> group_send_sig_info() doesn't need tasklist_lock.
> >> 
> >> This patch is for 2.6.25-rc5-mm1.
> >> 
> 
> > Hmm. Yes, group_send_sig_info() doesn't need tasklist_lock. But we
> > take tasklist_lock to "freeze" the tasks list, so that we can't miss
> > a new forked process.
> >
> > Same for __kill_pgrp_info(), we take tasklist to kill the whole group
> > "atomically".
> >
> >
> > However. Is it really needed? copy_process() returns -ERESTARTNOINTR
> > if signal_pending(), and the new task is always placed at the tail
> > of the list. Looks like nobody can escape the signal, at least fatal
> > or SIGSTOP.
> 
> 
> Call me paranoid but I don't think there is any guarantee without a lock
> that we will hit the -ERESTARTNOITR check for new processes.  I think we
> have a slight race where the fork process may not have received the signal
> (because it is near the tail of the list) but the new process would be
> added to the list immediately after we read it's pointer.

Hmm. could you clarify? I tend to always trust you, just can't understand
the text above...

However, I think this patch adds another subtle race which I missed before.

Let's suppose that the task has two threads, A (== main thread) and B. A has
already exited, B does exec.

In that case it is possible that (without tasklist_lock) kill_something_info()
sends the signal to the old leader (A), but before group_send_sig_info(A)
takes ->siglock B switches the leader and does release_task(A). In that
group_send_sig_info()->lock_task_sighand() fails and we miss the process.

> That is subtle.  Switching to the per task siglock for protection.
> 
> > Except: We don't send the signal to /sbin/init. This means that (say)
> > kill(-1, SIGKILL) can miss the task forked by init. Note that this
> > task could be forked even before we start kill_something_info(), but
> > without tasklist there is no guarantee we will see it on the ->tasks
> > list.
> 
> Actually we do sent the signal to init but we shouldn't,

Note the (broken) "p->pid > 1" check, kill_something_info() skips init.
Not that it matters though.

Oleg.


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] kill_something_info: don't take tasklist_lock for pid==-1 case
  2008-05-31 16:55     ` Oleg Nesterov
@ 2008-05-31 23:55       ` Eric W. Biederman
  2008-06-01 16:29         ` Oleg Nesterov
  0 siblings, 1 reply; 10+ messages in thread
From: Eric W. Biederman @ 2008-05-31 23:55 UTC (permalink / raw)
  To: Oleg Nesterov; +Cc: Atsushi Tsuji, linux-kernel, Roland McGrath

Oleg Nesterov <oleg@tv-sign.ru> writes:

> Sorry, sorry for the delay,
>
> On 05/20, Eric W. Biederman wrote:
>> 
>> Oleg Nesterov <oleg@tv-sign.ru> writes:
>> 
>> > On 03/25, Atsushi Tsuji wrote:
>> >> 
>> >> This patch avoid taking tasklist_lock in kill_something_info() when
>> >> the pid is -1. It can convert to rcu_read_lock() for this case because
>> >> group_send_sig_info() doesn't need tasklist_lock.
>> >> 
>> >> This patch is for 2.6.25-rc5-mm1.
>> >> 
>> 
>> > Hmm. Yes, group_send_sig_info() doesn't need tasklist_lock. But we
>> > take tasklist_lock to "freeze" the tasks list, so that we can't miss
>> > a new forked process.
>> >
>> > Same for __kill_pgrp_info(), we take tasklist to kill the whole group
>> > "atomically".
>> >
>> >
>> > However. Is it really needed? copy_process() returns -ERESTARTNOINTR
>> > if signal_pending(), and the new task is always placed at the tail
>> > of the list. Looks like nobody can escape the signal, at least fatal
>> > or SIGSTOP.
>> 
>> 
>> Call me paranoid but I don't think there is any guarantee without a lock
>> that we will hit the -ERESTARTNOITR check for new processes.  I think we
>> have a slight race where the fork process may not have received the signal
>> (because it is near the tail of the list) but the new process would be
>> added to the list immediately after we read it's pointer.
>
> Hmm. could you clarify? I tend to always trust you, just can't understand
> the text above...

We can read old next values when walking the task list under the rcu
lock.  So I don't believe we are guaranteed to see additions that
happen while we hold the rcu lock.

If a new process spawns, passes the check for the parent having the
signal, the signal is delivered the signal, and then appends to the
task list.  We might miss it.  I'm not certain, but that feels right.

> However, I think this patch adds another subtle race which I missed before.
>
> Let's suppose that the task has two threads, A (== main thread) and B. A has
> already exited, B does exec.
>
> In that case it is possible that (without tasklist_lock) kill_something_info()
> sends the signal to the old leader (A), but before group_send_sig_info(A)
> takes ->siglock B switches the leader and does release_task(A). In that
> group_send_sig_info()->lock_task_sighand() fails and we miss the process.

Hmm.  Does that problem affect normal signal deliver.  I seem to recall
being careful and doing something to make that case work.

Does that fix only apply when we have a specific pid, not when we have
a task and are walking the task_list.  Because kill_pid_info can retry
if we receive -ESRCH?

To fix the pid namespace case we need to start walking the list of
pids, not the task list for kill -1.

>> That is subtle.  Switching to the per task siglock for protection.
>> 
>> > Except: We don't send the signal to /sbin/init. This means that (say)
>> > kill(-1, SIGKILL) can miss the task forked by init. Note that this
>> > task could be forked even before we start kill_something_info(), but
>> > without tasklist there is no guarantee we will see it on the ->tasks
>> > list.
>> 
>> Actually we do sent the signal to init but we shouldn't,
>
> Note the (broken) "p->pid > 1" check, kill_something_info() skips init.
> Not that it matters though.

Oh right.  I had forgotten about that special case.  Grr Special cases
suck!

We have a hole with init spawning new children, during kill -1.
Ugh.  Or are those tasks indistinguishable from children spawned by
init just after the signal was sent?

A practical question.  I need to rework the signal delivery for
the case of kill -1 to be based on find_ge_pid.  So that it
works with pid namespaces.  That kills our nice in order list
property. :(  A rework does give the opportunity to investigate
these properties in detail, and see if there is a way to drop
the tasklist_lock.  Especially since we are already discussing it.

The guarantee is that the signal delivery needs to appear atomic
so that we don't miss any processes in the group of processes
the signal is sent to.  The rcu list gives us an effectively atomic
snapshot of the list.  However new processes can be added while we
walk that list.

To prevent the current visible DOS we need to only hold processes
in fork for the duration of our traversal of the set of all processes
in the system and then let them go.

The -ERESTARTNOINTR trick ensures the signal is delivered to the
parent before a fork executes.

Init only receives signals that it wants (the rest are effectively
ignored and discared) so it can be said to have forked after signal
delivery when there is a race.

We grab sighand lock on each process that we are delivering the signal
to.

This almost feels like a recipe for atomic signal delivery.  Without
an ordered list.  I just don't see it yet.  I will keep picking at it
since it sounds like we are close to the point where the tasklist is
a problem even on small systems.

Eric

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] kill_something_info: don't take tasklist_lock for pid==-1 case
  2008-05-31 23:55       ` Eric W. Biederman
@ 2008-06-01 16:29         ` Oleg Nesterov
  0 siblings, 0 replies; 10+ messages in thread
From: Oleg Nesterov @ 2008-06-01 16:29 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: Atsushi Tsuji, linux-kernel, Roland McGrath

On 05/31, Eric W. Biederman wrote:
>
> We can read old next values when walking the task list under the rcu
> lock.  So I don't believe we are guaranteed to see additions that
> happen while we hold the rcu lock.
>
> If a new process spawns, passes the check for the parent having the
> signal, the signal is delivered the signal, and then appends to the
> task list.  We might miss it.  I'm not certain, but that feels right.

I don't think we can miss it. To simplify, let's consider kill(-1, SIGKILL)
and the forking process is P.

When P forks, copy_process() adds the child to the end of the init_task.tasks
list under ->siglock.

When kill_something_info()->group_send_sig_info(P) suceeds, we must see the
child, because we locked the same ->siglock and thus we have the necessary
barrier. (more precisely, we must see the new next values once we locked
->siglock). And P can't fork again.

Yes, kill(-1, /*say*/ SIGCONT) is different, we can miss the child which was
forked _after_ P has recieved/handled the signal, but probably this is OK,
we can pretend it was forked after kill(-1) has returned.

The more interesting case: P forks and exits _before_ we send the signal
to it. Can we miss the child? I don't think so, but I'm not sure.
fork() + exit() means list_add_rcu() + wmb() + list_del_rcu().

If we see the result of list_del_rcu() (ie, we don't see P on list), we
must see the result of list_add_rcu(), because of smp_read_barrier_depends()
in next_task().

But again, I'm not sure.

> > However, I think this patch adds another subtle race which I missed before.
> >
> > Let's suppose that the task has two threads, A (== main thread) and B. A has
> > already exited, B does exec.
> >
> > In that case it is possible that (without tasklist_lock) kill_something_info()
> > sends the signal to the old leader (A), but before group_send_sig_info(A)
> > takes ->siglock B switches the leader and does release_task(A). In that
> > group_send_sig_info()->lock_task_sighand() fails and we miss the process.
>
> Hmm.  Does that problem affect normal signal deliver.  I seem to recall
> being careful and doing something to make that case work.
>
> Does that fix only apply when we have a specific pid, not when we have
> a task and are walking the task_list.  Because kill_pid_info can retry
> if we receive -ESRCH?

Yes, kill_pid_info() is fine, and other users call group_send_sig_info()
under tasklist.

> To fix the pid namespace case we need to start walking the list of
> pids, not the task list for kill -1.

Or, we can make somthing like

	/* needs rcu lock */
	int kill_group(int sig, struct siginfo *info, struct task_struct *g)
	{
		struct task_struct *p = g;

		do {
			ret = group_send_sig_info(sig, info, p);
			if (ret != -ESRCH)
				break;
		} while_each_thread(g, p);

		return ret;
	}

> > Note the (broken) "p->pid > 1" check, kill_something_info() skips init.
> > Not that it matters though.
>
> Oh right.  I had forgotten about that special case.  Grr Special cases
> suck!
>
> We have a hole with init spawning new children, during kill -1.

Yes, I was wondering about this too.

> Ugh.  Or are those tasks indistinguishable from children spawned by
> init just after the signal was sent?

Oh, I don't know what is supposed semantics. Perhaps this works in
practice during shutdown? we can change the state of /sbin/init so
that it won't spawn the new tasks, and then we can do kill(-1).
I don't know.

> A practical question.  I need to rework the signal delivery for
> the case of kill -1 to be based on find_ge_pid.  So that it
> works with pid namespaces.

Hmm... not sure I understand why do we need find_ge_pid(). In fact,
I can't see which problems we have with kill_something_info() wrt
pid namespaces. I mean it should be changed of course, but these
changes should be relatively simple/straightforward? However I didn't
really think about this, may be wrong.

Oleg.

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2008-06-01 16:28 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2008-03-25  4:27 [PATCH] kill_something_info: don't take tasklist_lock for pid==-1 case Atsushi Tsuji
2008-03-25 13:56 ` Oleg Nesterov
2008-05-21  1:48   ` Atsushi Tsuji
2008-05-21  2:53     ` Eric W. Biederman
2008-05-21  3:47   ` Eric W. Biederman
2008-05-26  7:03     ` Atsushi TSUJI
2008-05-28 15:03       ` Eric W. Biederman
2008-05-31 16:55     ` Oleg Nesterov
2008-05-31 23:55       ` Eric W. Biederman
2008-06-01 16:29         ` Oleg Nesterov

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox