Re: [RFC][PATCH 3/3] locking/qspinlock: Optimize for x86

public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed

From: Andrea Parri <andrea.parri@amarulasolutions.com>
To: Peter Zijlstra <peterz@infradead.org>
Cc: will.deacon@arm.com, mingo@kernel.org,
	linux-kernel@vger.kernel.org, longman@redhat.com,
	tglx@linutronix.de
Subject: Re: [RFC][PATCH 3/3] locking/qspinlock: Optimize for x86
Date: Wed, 26 Sep 2018 22:52:08 +0200	[thread overview]
Message-ID: <20180926205208.GA4864@andrea> (raw)
In-Reply-To: <20180926111307.513429499@infradead.org>

On Wed, Sep 26, 2018 at 01:01:20PM +0200, Peter Zijlstra wrote:
> On x86 we cannot do fetch_or with a single instruction and end up
> using a cmpxchg loop, this reduces determinism. Replace the fetch_or
> with a very tricky composite xchg8 + load.
> 
> The basic idea is that we use xchg8 to test-and-set the pending bit
> (when it is a byte) and then a load to fetch the whole word. Using
> two instructions of course opens a window we previously did not have.
> In particular the ordering between pending and tail is of interrest,
> because that is where the split happens.
> 
> The claim is that if we order them, it all works out just fine. There
> are two specific cases where the pending,tail state changes:
> 
>  - when the 3rd lock(er) comes in and finds pending set, it'll queue
>    and set tail; since we set tail while pending is set, the ordering
>    is split is not important (and not fundamentally different form
>    fetch_or). [*]
> 
>  - when the last queued lock holder acquires the lock (uncontended),
>    we clear the tail and set the lock byte. By first setting the
>    pending bit this cmpxchg will fail and the later load must then
>    see the remaining tail.
> 
> Another interesting scenario is where there are only 2 threads:
> 
> 	lock := (0,0,0)
> 
> 	CPU 0			CPU 1
> 
> 	lock()			lock()
> 	  trylock(-> 0,0,1)       trylock() /* fail */
> 	    return;               xchg_relaxed(pending, 1) (-> 0,1,1)
> 				  mb()
> 				  val = smp_load_acquire(*lock);
> 
> Where, without the mb() the load would've been allowed to return 0 for
> the locked byte.

If this were true, we would have a violation of "coherence":

C peterz

{}

P0(atomic_t *val)
{
	int r0;

	/* in queued_spin_trylock() */
	r0 = atomic_cmpxchg_acquire(val, 0, 1);
}

P1(atomic_t *val)
{
	int r0;
	int r1;

	/* in queued_spin_trylock() */
	r0 = atomic_cmpxchg_acquire(val, 0, 1); /* or atomic_read(val) */

	/* in queued_spin_lock_slowpath)() -- set_pending_fetch_acquire() */
	r1 = atomic_read_acquire(val);
}

exists (0:r0=0 /\ ~1:r0=0 /\ 1:r1=0)

  Andrea


> 
> [*] there is a fun scenario where the 3rd locker observes the pending
> bit, but before it sets the tail, the 1st lock (owner) unlocks and the
> 2nd locker acquires the lock, leaving us with a temporary 0,0,1 state.
> But this is not different between fetch_or or this new method.
> 
> Suggested-by: Will Deacon <will.deacon@arm.com>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---
>  arch/x86/include/asm/qspinlock.h |    2 +
>  kernel/locking/qspinlock.c       |   56 ++++++++++++++++++++++++++++++++++++++-
>  2 files changed, 57 insertions(+), 1 deletion(-)
> 
> --- a/arch/x86/include/asm/qspinlock.h
> +++ b/arch/x86/include/asm/qspinlock.h
> @@ -9,6 +9,8 @@
>  
>  #define _Q_PENDING_LOOPS	(1 << 9)
>  
> +#define _Q_NO_FETCH_OR
> +
>  #ifdef CONFIG_PARAVIRT_SPINLOCKS
>  extern void native_queued_spin_lock_slowpath(struct qspinlock *lock, u32 val);
>  extern void __pv_init_lock_hash(void);
> --- a/kernel/locking/qspinlock.c
> +++ b/kernel/locking/qspinlock.c
> @@ -232,6 +232,60 @@ static __always_inline u32 xchg_tail(str
>  #endif /* _Q_PENDING_BITS == 8 */
>  
>  /**
> + * set_pending_fetch_acquire - fetch the whole lock value and set pending and locked
> + * @lock : Pointer to queued spinlock structure
> + * Return: The previous lock value
> + *
> + * *,*,* -> *,1,*
> + */
> +static __always_inline u32 set_pending_fetch_acquire(struct qspinlock *lock)
> +{
> +#if defined(_Q_NO_FETCH_OR) && _Q_PENDING_BITS == 8
> +	u32 val;
> +
> +	/*
> +	 * For the !LL/SC archs that do not have a native atomic_fetch_or
> +	 * but do have a native xchg (x86) we can do trickery and avoid the
> +	 * cmpxchg-loop based fetch-or to improve determinism.
> +	 *
> +	 * We first xchg the pending byte to set PENDING, and then issue a load
> +	 * for the rest of the word and mask out the pending byte to compute a
> +	 * 'full' value.
> +	 */
> +	val = xchg_relaxed(&lock->pending, 1) << _Q_PENDING_OFFSET;
> +	/*
> +	 * Ensures the tail load happens after the xchg().
> +	 *
> +	 *	   lock  unlock    (a)
> +	 *   xchg ---------------.
> +	 *    (b)  lock  unlock  +----- fetch_or
> +	 *   load ---------------'
> +	 *	   lock  unlock    (c)
> +	 *
> +	 * For both lock and unlock, (a) and (c) are the same as fetch_or(),
> +	 * since either are fully before or after. The only new case is (b),
> +	 * where a concurrent lock or unlock can interleave with the composite
> +	 * operation.
> +	 *
> +	 * In either case, it is the queueing case that is of interrest, otherwise
> +	 * the whole operation is covered by the xchg() and the tail will be 0.
> +	 *
> +	 * For lock-(b); we only care if we set the PENDING bit or not. If we lost
> +	 * the PENDING race, we queue. Otherwise we'd observe the tail anyway.
> +	 *
> +	 * For unlock-(b); since we'll have set PENDING, the uncontended claim
> +	 * will fail and we'll observe a !0 tail.
> +	 */
> +	smp_mb__after_atomic();
> +	val |= atomic_read_acquire(&lock->val) & ~_Q_PENDING_MASK;
> +
> +	return val;
> +#else
> +	return atomic_fetch_or_acquire(_Q_PENDING_VAL, &lock->val);
> +#endif
> +}
> +
> +/**
>   * set_locked - Set the lock bit and own the lock
>   * @lock: Pointer to queued spinlock structure
>   *
> @@ -328,7 +382,7 @@ void queued_spin_lock_slowpath(struct qs
>  	 *
>  	 * 0,0,* -> 0,1,* -> 0,0,1 pending, trylock
>  	 */
> -	val = atomic_fetch_or_acquire(_Q_PENDING_VAL, &lock->val);
> +	val = set_pending_fetch_acquire(lock);
>  
>  	/*
>  	 * If we observe contention, there was a concurrent lock.
> 
>

next prev parent reply	other threads:[~2018-09-26 20:52 UTC|newest]

Thread overview: 32+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2018-09-26 11:01 [RFC][PATCH 0/3] locking/qspinlock: Improve determinism for x86 Peter Zijlstra
2018-09-26 11:01 ` [RFC][PATCH 1/3] locking/qspinlock: Re-order code Peter Zijlstra
2018-10-01 17:17   ` Will Deacon
2018-09-26 11:01 ` [RFC][PATCH 2/3] locking/qspinlock: Rework some comments Peter Zijlstra
2018-10-01 17:17   ` Will Deacon
2018-10-01 19:10     ` Peter Zijlstra
2018-10-02 13:20       ` Will Deacon
2018-10-02 13:43         ` Peter Zijlstra
2018-09-26 11:01 ` [RFC][PATCH 3/3] locking/qspinlock: Optimize for x86 Peter Zijlstra
2018-09-26 16:30   ` Waiman Long
2018-09-26 17:54     ` Peter Zijlstra
2018-09-27  7:29       ` Peter Zijlstra
2018-09-26 20:52   ` Andrea Parri [this message]
2018-09-27  7:17     ` Peter Zijlstra
2018-09-27  7:47       ` Andrea Parri
2018-09-27  7:59         ` Peter Zijlstra
2018-09-27  8:13           ` Andrea Parri
2018-09-27  8:57             ` Peter Zijlstra
2018-09-27 12:16   ` David Laight
2018-10-01 17:17   ` Will Deacon
2018-10-01 20:00     ` Peter Zijlstra
2018-10-02 13:19       ` Will Deacon
2018-10-02 14:14         ` Peter Zijlstra
2018-10-02 12:31     ` Andrea Parri
2018-10-02 13:22       ` Will Deacon
2018-10-02 13:44         ` Andrea Parri
2018-09-26 15:01 ` [RFC][PATCH 0/3] locking/qspinlock: Improve determinism " Sebastian Andrzej Siewior
2018-09-26 15:08   ` Thomas Gleixner
2018-09-26 15:38     ` Sebastian Andrzej Siewior
2018-09-26 16:20 ` Waiman Long
2018-09-26 17:51   ` Peter Zijlstra
2018-09-26 23:21     ` Waiman Long

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20180926205208.GA4864@andrea \
    --to=andrea.parri@amarulasolutions.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=longman@redhat.com \
    --cc=mingo@kernel.org \
    --cc=peterz@infradead.org \
    --cc=tglx@linutronix.de \
    --cc=will.deacon@arm.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox