public inbox for linux-kernel@vger.kernel.org
From: Waiman Long <waiman.long@hp.com>
To: Alexander Fyodorov <halcy@yandex.ru>
Cc: linux-kernel <linux-kernel@vger.kernel.org>,
	"Chandramouleeswaran, Aswin" <aswin@hp.com>,
	"Norton, Scott J" <scott.norton@hp.com>,
	Peter Zijlstra <peterz@infradead.org>,
	Steven Rostedt <rostedt@goodmis.org>,
	Thomas Gleixner <tglx@linutronix.de>,
	Ingo Molnar <mingo@redhat.com>
Subject: Re: [PATCH RFC v2 1/2] qspinlock: Introducing a 4-byte queue spinlock implementation
Date: Wed, 21 Aug 2013 21:04:14 -0400	[thread overview]
Message-ID: <5215638E.5020702@hp.com> (raw)
In-Reply-To: <336901377100289@web16f.yandex.ru>

On 08/21/2013 11:51 AM, Alexander Fyodorov wrote:
> 21.08.2013, 07:01, "Waiman Long"<waiman.long@hp.com>:
>> On 08/20/2013 11:31 AM, Alexander Fyodorov wrote:
>>> Isn't a race possible if another thread acquires the spinlock in the window
>>> between setting lock->locked to 0 and issuing smp_wmb()? Writes from
>>> the critical section from our thread might be delayed behind the write to
>>> lock->locked if the corresponding cache bank is busy.
>> The purpose of smp_wmb() is to make sure that content in the cache will
>> be flushed out to memory in case the cache coherency protocol cannot
>> guarantee a single view of memory content on all processors.
> The Linux kernel does not support architectures without cache coherency.
> Using memory barriers merely to flush write buffers sooner on
> cache-coherent processors might help performance on some architectures,
> but it will hurt it on others, so it must not be done in
> architecture-independent code.
>
>> In other
>> word, smp_wmb() is used to make other processors see that the lock has
>> been freed ASAP. If another processor see that before smp_wmb(), it will
>> be even better as the latency is reduced. As the lock holder is the only
>> one that can release the lock, there is no race condition here.
> No, I was talking about the window between freeing lock and issuing smp_wmb(). What I meant is:
> 1) A = 0
> 2) CPU0 locks the spinlock protecting A.
> 3) CPU0 writes 1 to A, but the write gets stalled because the corresponding cache bank is busy.
> 4) CPU0 calls spin_unlock() and sets lock->locked to 0.
>
> If CPU1 does a spin_lock() right now, it will succeed (since lock->locked == 0). But the write to A hasn't reached cache yet, so CPU1 will see A == 0.
>
> More examples on this are in Documentation/memory-barriers.txt

In this case, we should have smp_wmb() before freeing the lock. The 
question is whether we need a full mb() instead. The x86 ticket 
spinlock unlock code is just a regular add instruction except on some 
exotic processors, so it is a compiler barrier but not really a memory 
fence. However, we may need a full memory fence for some other 
processors.

>> That is a legitimate question. I don't think it is a problem on x86 as
>> the x86 spinlock code doesn't do a full mb() in the lock and unlock
>> paths.
> It does, because the "lock" prefix implies a full memory barrier.
>
>> The smp_mb() will be conditionalized depending on the ARCH_QSPINLOCK
>> setting. The smp_wmb() may not be needed, but a compiler barrier should
>> still be there.
> Do you mean because of inlining? That shouldn't be a problem because smp_mb() prohibits the compiler from doing any optimizations across the barrier (thanks to the "volatile" keyword).

What I mean is that I don't want any delay in issuing the unlock 
instruction because of compiler rearrangement. So there should be at 
least a barrier() call at the end of the unlock function.

At this point, I am inclined to have either a smp_wmb() or smp_mb() at 
the beginning of the unlock function and a barrier() at the end.

> More on this in Documentation/memory-barriers.txt
>
>>>   Also I don't understand why there are so many uses of ACCESS_ONCE()
>>> macro. It does not guarantee memory ordering with regard to other CPUs,
>>> so probably most of the uses can be removed (with exception of
>>> lock_is_contended(), where it prohibits gcc from optimizing the loop away).
>> All the lock/unlock code can be inlined and we don't know what the
>> compiler will do to optimize code. The ACCESS_ONCE() macro is used to
>> make sure that the compiler won't optimize away the actual fetch or
>> write of the memory. Even if the compiler won't optimize away the memory
>> access, adding the ACCESS_ONCE() macro won't have any extra overhead. So
>> a more liberal use of it won't hurt performance.
> If the compiler optimized a memory access away, it would be a bug. And I'm not so sure about the overhead: on some VLIW architectures, for example, ACCESS_ONCE() might prohibit the compiler from mixing other instructions into the unlock.

As the lock/unlock functions can be inlined, it is possible that a 
memory variable was accessed earlier in the calling function and that 
the stale copy is used in the inlined lock/unlock code instead of a 
fresh fetch. That is why I prefer a more liberal use of 
ACCESS_ONCE() for safety purposes.

