From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <linux-kernel-owner@vger.kernel.org>
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S1753485AbcIBMGl (ORCPT <rfc822;w@1wt.eu>);
        Fri, 2 Sep 2016 08:06:41 -0400
Received: from mx1.redhat.com ([209.132.183.28]:58834 "EHLO mx1.redhat.com"
        rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
        id S1752073AbcIBMGk (ORCPT <rfc822;linux-kernel@vger.kernel.org>);
        Fri, 2 Sep 2016 08:06:40 -0400
Date: Fri, 2 Sep 2016 14:06:02 +0200
From: Oleg Nesterov <oleg@redhat.com>
To: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>, Al Viro <viro@ZenIV.linux.org.uk>,
        Bart Van Assche <bvanassche@acm.org>,
        Johannes Weiner <hannes@cmpxchg.org>,
        Linus Torvalds <torvalds@linux-foundation.org>,
        Neil Brown <neilb@suse.de>, linux-kernel@vger.kernel.org
Subject: Re: [PATCH 2/2] sched/wait: avoid abort_exclusive_wait() in
        __wait_on_bit_lock()
Message-ID: <20160902120601.GA26495@redhat.com>
References: <20160826124453.GA28894@redhat.com> <20160826124552.GB28904@redhat.com> <20160901190141.GJ10138@twins.programming.kicks-ass.net>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20160901190141.GJ10138@twins.programming.kicks-ass.net>
User-Agent: Mutt/1.5.18 (2008-05-17)
X-Greylist: Sender IP whitelisted, not delayed by milter-greylist-4.5.16 (mx1.redhat.com [10.5.110.38]); Fri, 02 Sep 2016 12:06:39 +0000 (UTC)
Sender: linux-kernel-owner@vger.kernel.org
List-ID: <linux-kernel.vger.kernel.org>
X-Mailing-List: linux-kernel@vger.kernel.org

On 09/01, Peter Zijlstra wrote:
>
> On Fri, Aug 26, 2016 at 02:45:52PM +0200, Oleg Nesterov wrote:
>
> > We do not need anything tricky to avoid the race,
>
> The race being:
>
> CPU0			CPU1			CPU2
>
> 			__wait_on_bit_lock()
> 			  bit_wait_io()
> 			    io_schedule()
>
> clear_bit_unlock()
> __wake_up_common(.nr_exclusive=1)
>   list_for_each_entry()
>     if (curr->func() && --nr_exclusive)
>       break
>
> 						signal()
>
> 			    if (signal_pending_state()) == TRUE
> 			      return -EINTR
>
> And no progress because CPU1 exits without acquiring the lock and CPU0
> thinks its done because it woke someone.

Yes,

> > we can just call finish_wait() if action() fails.
>
> That would be bit_wait*() returning -EINTR because sigpending.

Hmm. Not sure I understand... Let me reply just in case, even if
I am sure you get it right.

Yes, in the likely case we are going to fail with -EINTR, but only
if test-and-set after thar fails.

> Sure, you can always call that, first thing through the loop does
> prepare again, so no harm. That however does not connect to your
> condition,.. /me puzzled

If ->action() fails we will abort the loop in any case, prepare
won't be called. So in this case finish_wait() does the right thing.

> > test_and_set_bit() implies mb() so
> > the lockless list_empty_careful() case is fine, we can not miss the
> > condition if we race with unlock_page().
>
> You're talking about this ordering?:
>
> 	finish_wait()			clear_bit_unlock();
> 	  list_empty_careful()
>
> 	/* MB implied */		smp_mb__after_atomic();
> 	test_and_set_bit()		wake_up_page()
> 					  ...
> 					    autoremove_wake_function()
> 					      list_del_init();
>
>
> That could do with spelling out I feel.. :-)

Yes, yes.

> >  __wait_on_bit_lock(wait_queue_head_t *wq, struct wait_bit_queue *q,
> >  			wait_bit_action_f *action, unsigned mode)
> >  {
> > +	int ret = 0;
> >
> > +	for (;;) {
> >  		prepare_to_wait_exclusive(wq, &q->wait, mode);
> > +		if (test_bit(q->key.bit_nr, q->key.flags)) {
> > +			ret = action(&q->key, mode);
> > +			/*
> > +			 * Ensure that clear_bit() + wake_up() right after
> > +			 * test_and_set_bit() below can't see us; it should
> > +			 * wake up another exclusive waiter if we fail.
> > +			 */
> > +			if (ret)
> > +				finish_wait(wq, &q->wait);
> > +		}
> > +		if (!test_and_set_bit(q->key.bit_nr, q->key.flags)) {
>
> So this is the actual difference, instead of failing the lock and
> aborting on signal, we acquire the lock if possible. If its not
> possible, someone else has it, which guarantees that someone else will
> do an unlock which implies another wakeup and life goes on.

Yes. This way we eliminate the need for the additional wake_up.

Oleg.