Re: [rfc] "fair" rw spinlocks - Eric W. Biederman

All of lore.kernel.org
 help / color / mirror / Atom feed

From: ebiederm@xmission.com (Eric W. Biederman)
To: Oleg Nesterov <oleg@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>,
	Thomas Gleixner <tglx@linutronix.de>,
	Peter Zijlstra <peterz@infradead.org>,
	Ingo Molnar <mingo@elte.hu>,
	Christoph Hellwig <hch@infradead.org>,
	Nick Piggin <npiggin@suse.de>,
	Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
	"Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Subject: Re: [rfc] "fair" rw spinlocks
Date: Wed, 09 Dec 2009 19:36:26 -0800	[thread overview]
Message-ID: <m1pr6npkjp.fsf@fess.ebiederm.org> (raw)
In-Reply-To: <20091209153709.GA13192@redhat.com> (Oleg Nesterov's message of "Wed\, 9 Dec 2009 16\:37\:09 +0100")

Oleg Nesterov <oleg@redhat.com> writes:

> On 12/07, Eric W. Biederman wrote:
>>
>> Oleg Nesterov <oleg@redhat.com> writes:
>>
>> > On 12/05, Eric W. Biederman wrote:
>> >>
>> >> Atomically sending signal to every member of a process group, is the
>> >> big fly in the ointment I am aware of.  Last time I looked I could
>> >> not see how to convert it rcu.
>> >
>> > I am not sure, but iirc we can do this lockless (under rcu_lock).
>> > We need to modify pid_link to use list_entry and attach_pid() should
>> > add the new task to the end. Of course we need more changes, but
>> > (again iirc) this is not too hard.
>>
>> The problem is that even adding to the end of the list, we could run
>> into a deleted entry and not see the new end of the list.
>>
>> Suppose when we start iterating the list we have:
>>
>>   A -> B -> C -> D
>>
>> Then someone deletes some of the entries while we are iterating the list.
>>
>> A ->
>>  B' -> C' -> D'
>>
>> We will continue on traversing through the deleted entries.
>>
>> Then someone adds a new entry to the end of the list.
>>
>> A-> N
>>
>> Since we are at B', C' or D' we will never see the new entry on the
>> end of the list.
>
> Yes, but who can add the new entry?
>
> Let's forget about setpgrp/etc for the moment, I think we have "races"
> with or without tasklist. Say, setpgrp() can add the new process to the
> already "killed" pgrp.

Agreed. A setpgrp call that we miss can be considered to have happened
after the signal was sent to the process group.

> Then, I think the only important case is SIGKILL/SIGSTOP (or other
> signals which can't be blockes/ignored). We must kill/stop the entire
> pgrp, we must not race with fork() and miss a child.
>
> In this case I _think_ rcu_read_lock() is enough,
>
> 	rcu_read_lock()
>
> 	list_for_each_entry_rcu(task, pid->tasks[PIDTYPE_PGID)
> 		group_send_sig_info(sig, task);
>
> 	rcu_read_unlock();
>
> except group_send_sig_info() can race with mt-exec, but this is simple
> to fix.

After we change the code to always add new entries to the end of
the rcu list you are correct.

The danger I saw is having a new process ( that we must handle ) show
up while we have gotten into a rcu cul-de-sac of the process list,
where we do not see new processes added to the end.

Once a process has gotten a signal we don't care, any children it spawns
happen after the signal was sent.

So we only care about children for processes that we have not delivered
the signal to.  If we add children to the end of the list they will be
visible if you traverse through the parent.  Since we have not traversed
through the parent and delivered the signal (by definition) then we don't
care.

This works for all signals, and especially for SIGKILL and SIGSTOP.

I am tempted to apply the test that if user space can prove the ordering
is not what is expected than there is a problem.  However signals have
not always arrived strongly orders with other user space events, and
the tasklist lock today does not appear to preclude a task that has
received a signal from talking to a task that has not yet received a
signal via pipes, etc.  

So as long as we can guarantee that if you traverse a parent you will
also traverse all of the children forked before you traversed the
parent, then we should be fine.


> If we send a signal (not necessary SIGKILL) to a process P, we must see
> all childs which were forked by P, both send_signal() and copy_process()
> take the same ->siglock, we must see the result of list_add_tail_rcu().
> And, after we sent SIGKILL/SIGSTOP, it can't fork the new child.

Sounds right.

> If list_for_each_entry() does not see the exited process P, this means
> we see the result of list_del_rcu(). But this also means we must the
> the result of the previous list_add_rcu().
>
> IOW, fork+exit means list_add_rcu() + wmb() + list_del_rcu(), if we
> don't see the new entry on list, we must see the new one, right?
>
> (I am ignoring the case when list_for_each_entry_rcu() sees a process
>  P but lock_task_sighand(P) fails, I think this is the same as if we
>  we missed P)
>
> Now suppose a signal is blocked/ignored or has a handler. In this case
> we can miss a child, but I think this is OK, we can pretend the new
> child was forked after kill_pgrp() completes. Say, this child C was
> forked by some process P. We can miss C only if it was forked after
> we already sent the signal to P.

I don't see how we can miss a child that matters.

> However. I do not pretend the reasoning above is "complete", and
> perhaps I missed something else.

I can not find fault with your idea.

Eric

next prev parent reply	other threads:[~2009-12-10  3:36 UTC|newest]

Thread overview: 58+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2009-11-23 14:54 [rfc] "fair" rw spinlocks Nick Piggin
2009-11-24 20:19 ` David Miller
2009-11-25  6:52   ` Nick Piggin
2009-11-25  8:49   ` Andi Kleen
2009-11-25  8:56     ` Nick Piggin
2009-11-24 20:47 ` Andi Kleen
2009-11-25  6:54   ` Nick Piggin
2009-11-25  8:48     ` Andi Kleen
2009-11-25 13:09       ` Arnd Bergmann
2009-11-28  2:07 ` Paul E. McKenney
2009-11-28 11:15   ` Andi Kleen
2009-11-28 15:20     ` Paul E. McKenney
2009-11-28 17:30 ` Linus Torvalds
2009-11-29 18:51   ` Paul E. McKenney
2009-11-30  7:57     ` Nick Piggin
2009-11-30  7:55   ` Nick Piggin
2009-11-30 15:22     ` Linus Torvalds
2009-11-30 15:40       ` Nick Piggin
2009-11-30 16:07         ` Linus Torvalds
2009-11-30 16:17           ` Nick Piggin
2009-11-30 16:39           ` Paul E. McKenney
2009-11-30 17:05             ` Linus Torvalds
2009-11-30 17:13               ` Nick Piggin
2009-11-30 17:18                 ` Linus Torvalds
2009-12-01 17:03                   ` Arnd Bergmann
2009-12-01 17:15                     ` Linus Torvalds
2009-11-30 18:29                 ` Paul E. McKenney
2009-11-30 16:20     ` Paul E. McKenney
2009-11-30 10:00   ` Christoph Hellwig
2009-11-30 15:52     ` Linus Torvalds
2009-11-30 17:46       ` Ingo Molnar
2009-11-30 21:12         ` Thomas Gleixner
2009-11-30 21:27           ` Peter Zijlstra
2009-11-30 22:02             ` Thomas Gleixner
2009-11-30 22:11               ` Linus Torvalds
2009-11-30 22:37                 ` Thomas Gleixner
2009-11-30 22:49                   ` Linus Torvalds
2009-12-01 17:37                     ` [PATCH] audit: Call tty_audit_push_task() outside preempt disabled region Thomas Gleixner
2009-12-01 18:22                       ` Oleg Nesterov
2009-12-01 19:53                         ` Thomas Gleixner
2009-12-06  3:12                     ` [rfc] "fair" rw spinlocks Eric W. Biederman
2009-12-07 18:18                       ` Paul E. McKenney
2009-12-07 22:24                         ` Eric W. Biederman
2009-12-07 22:35                           ` Andi Kleen
2009-12-07 23:19                             ` Eric W. Biederman
2009-12-08  1:39                               ` Paul E. McKenney
2009-12-08  2:11                                 ` Eric W. Biederman
2009-12-08  2:37                                   ` Paul E. McKenney
2009-12-07 18:32                       ` Oleg Nesterov
2009-12-07 20:38                         ` Peter Zijlstra
2009-12-09 15:55                           ` Oleg Nesterov
2009-12-07 22:10                         ` Eric W. Biederman
2009-12-09 15:37                           ` Oleg Nesterov
2009-12-10  3:36                             ` Eric W. Biederman [this message]
2009-12-10  6:22                             ` Paul E. McKenney
2009-12-10 10:31                               ` Eric W. Biederman
2009-12-10 16:41                                 ` Paul E. McKenney
2009-12-01 19:01 ` Mathieu Desnoyers

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=m1pr6npkjp.fsf@fess.ebiederm.org \
    --to=ebiederm@xmission.com \
    --cc=hch@infradead.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mingo@elte.hu \
    --cc=npiggin@suse.de \
    --cc=oleg@redhat.com \
    --cc=paulmck@linux.vnet.ibm.com \
    --cc=peterz@infradead.org \
    --cc=tglx@linutronix.de \
    --cc=torvalds@linux-foundation.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.