* Re: [PATCH RFC v2 1/2] qspinlock: Introducing a 4-byte queue spinlock implementation
  [not found] <15321377012704@web8h.yandex.ru>
@ 2013-08-21  3:01 ` Waiman Long
  2013-08-21 15:51   ` Alexander Fyodorov
  0 siblings, 1 reply; 18+ messages in thread
From: Waiman Long @ 2013-08-21 3:01 UTC (permalink / raw)
  To: Alexander Fyodorov
  Cc: linux-kernel, Chandramouleeswaran, Aswin, Norton, Scott J,
      Peter Zijlstra, Steven Rostedt, Thomas Gleixner, Ingo Molnar

On 08/20/2013 11:31 AM, Alexander Fyodorov wrote:
> Hi Waiman,
>
> I'm not subscribed to the lkml so I'm writing to you instead. In your patch
> you put smp_wmb() at the end of spin_unlock():
>
> +static __always_inline void queue_spin_unlock(struct qspinlock *lock)
> +{
> +	/*
> +	 * This unlock function may get inlined into the caller. The
> +	 * compiler barrier is needed to make sure that the lock will
> +	 * not be released before changes to the critical section is done.
> +	 * The ending smp_wmb() allows queue head process to get the lock ASAP.
> +	 */
> +	barrier();
> +#ifndef QSPINLOCK_MANYCPUS
> +	{
> +		u16 qcode = ACCESS_ONCE(lock->locked);
> +		if (unlikely(qcode != QSPINLOCK_LOCKED))
> +			queue_spin_unlock_slowpath(lock, qcode);
> +	}
> +#endif
> +	ACCESS_ONCE(lock->locked) = 0;
> +	smp_wmb();
> +}
>
> Isn't a race possible if another thread acquires the spinlock in the window
> between setting lock->locked to 0 and issuing smp_wmb()? Writes from the
> critical section from our thread might be delayed behind the write to
> lock->locked if the corresponding cache bank is busy.

The purpose of smp_wmb() is to make sure that content in the cache will be
flushed out to memory in case the cache coherency protocol cannot guarantee
a single view of memory content on all processors. In other words, smp_wmb()
is used to make other processors see that the lock has been freed ASAP. If
another processor sees that before smp_wmb(), it will be even better as the
latency is reduced. As the lock holder is the only one that can release the
lock, there is no race condition here.

> And shouldn't it be a full memory barrier (smp_mb()) instead? Because we
> want loads from the critical section to finish too before giving the
> spinlock to another thread (which might change memory contents).

That is a legitimate question. I don't think it is a problem on x86 as the
x86 spinlock code doesn't do a full mb() in the lock and unlock paths.
However, other architectures like ARM do require a full mb() in the lock and
unlock paths. A full mb() would be costly on x86. A possible solution is to
make ARCH_QSPINLOCK a tri-state variable whose value can be off, on with mb,
or on without mb. The QSPINLOCK parameter will then become a hidden one.

> Comment in queue_spin_unlock() says: "The ending smp_wmb() allows queue
> head process to get the lock ASAP". But memory barriers can be costly on
> some architectures, so issuing one just to empty write buffers might result
> in a performance loss on them. Such things should be left to hardware IMHO.
>
> So I think unlock() should look like this:
>
> +static __always_inline void queue_spin_unlock(struct qspinlock *lock)
> +{
> +	/*
> +	 * Wait for the critical section to finish memory accesses.
> +	 */
> +	smp_mb();
> +
> +#ifndef QSPINLOCK_MANYCPUS
> +	{
> +		u16 qcode = ACCESS_ONCE(lock->locked);
> +		if (unlikely(qcode != QSPINLOCK_LOCKED))
> +			queue_spin_unlock_slowpath(lock, qcode);
> +	}
> +#endif
> +	ACCESS_ONCE(lock->locked) = 0;
> +}

The smp_mb() will be conditionalized depending on the ARCH_QSPINLOCK
setting. The smp_wmb() may not be needed, but a compiler barrier should
still be there.

> Also I don't understand why there are so many uses of the ACCESS_ONCE()
> macro. It does not guarantee memory ordering with regard to other CPUs, so
> probably most of the uses can be removed (with the exception of
> lock_is_contended(), where it prohibits gcc from optimizing the loop away).

All the lock/unlock code can be inlined and we don't know what the compiler
will do to optimize the code. The ACCESS_ONCE() macro is used to make sure
that the compiler won't optimize away the actual fetch or write of the
memory location. Even if the compiler wouldn't optimize away the memory
access anyway, adding ACCESS_ONCE() has no extra overhead, so a more liberal
use of it won't hurt performance.

Regards,
Longman

^ permalink raw reply	[flat|nested] 18+ messages in thread
* Re: [PATCH RFC v2 1/2] qspinlock: Introducing a 4-byte queue spinlock implementation
  2013-08-21  3:01 ` [PATCH RFC v2 1/2] qspinlock: Introducing a 4-byte queue spinlock implementation Waiman Long
@ 2013-08-21 15:51   ` Alexander Fyodorov
  2013-08-22  1:04     ` Waiman Long
  0 siblings, 1 reply; 18+ messages in thread
From: Alexander Fyodorov @ 2013-08-21 15:51 UTC (permalink / raw)
  To: Waiman Long
  Cc: linux-kernel, Chandramouleeswaran, Aswin, Norton, Scott J,
      Peter Zijlstra, Steven Rostedt, Thomas Gleixner, Ingo Molnar

21.08.2013, 07:01, "Waiman Long" <waiman.long@hp.com>:
> On 08/20/2013 11:31 AM, Alexander Fyodorov wrote:
>> Isn't a race possible if another thread acquires the spinlock in the window
>> between setting lock->locked to 0 and issuing smp_wmb()? Writes from
>> the critical section from our thread might be delayed behind the write to
>> lock->locked if the corresponding cache bank is busy.
>
> The purpose of smp_wmb() is to make sure that content in the cache will
> be flushed out to memory in case the cache coherency protocol cannot
> guarantee a single view of memory content on all processors.

The Linux kernel does not support architectures without cache coherency,
and while using memory barriers just to flush write buffers ASAP on
cache-coherent processors might benefit performance on some architectures,
it will hurt performance on others. So it must not be done in
architecture-independent code.

> In other words, smp_wmb() is used to make other processors see that the
> lock has been freed ASAP. If another processor sees that before smp_wmb(),
> it will be even better as the latency is reduced. As the lock holder is
> the only one that can release the lock, there is no race condition here.

No, I was talking about the window between freeing the lock and issuing
smp_wmb(). What I meant is:

1) A = 0
2) CPU0 locks the spinlock protecting A.
3) CPU0 writes 1 to A, but the write gets stalled because the corresponding
   cache bank is busy.
4) CPU0 calls spin_unlock() and sets lock->locked to 0.

If CPU1 does a spin_lock() right now, it will succeed (since
lock->locked == 0). But the write to A hasn't reached the cache yet, so
CPU1 will see A == 0.

More examples on this are in Documentation/memory-barriers.txt

> That is a legitimate question. I don't think it is a problem on x86 as
> the x86 spinlock code doesn't do a full mb() in the lock and unlock
> paths.

It does, because the "lock" prefix implies a full memory barrier.

> The smp_mb() will be conditionalized depending on the ARCH_QSPINLOCK
> setting. The smp_wmb() may not be needed, but a compiler barrier should
> still be there.

Do you mean because of inlining? That shouldn't be a problem because
smp_mb() prohibits the compiler from doing any optimizations across the
barrier (thanks to the "volatile" keyword). More on this in
Documentation/memory-barriers.txt

>> Also I don't understand why there are so many uses of the ACCESS_ONCE()
>> macro. It does not guarantee memory ordering with regard to other CPUs,
>> so probably most of the uses can be removed (with the exception of
>> lock_is_contended(), where it prohibits gcc from optimizing the loop away).
>
> All the lock/unlock code can be inlined and we don't know what the
> compiler will do to optimize the code. The ACCESS_ONCE() macro is used to
> make sure that the compiler won't optimize away the actual fetch or
> write of the memory location. Even if the compiler wouldn't optimize away
> the memory access, adding ACCESS_ONCE() has no extra overhead. So
> a more liberal use of it won't hurt performance.

If the compiler optimized a memory access away it would be a bug. And I'm
not so sure about overhead... For example, on some VLIW architectures
ACCESS_ONCE() might prohibit the compiler from mixing other instructions
into the unlock.

^ permalink raw reply	[flat|nested] 18+ messages in thread
* Re: [PATCH RFC v2 1/2] qspinlock: Introducing a 4-byte queue spinlock implementation
  2013-08-21 15:51   ` Alexander Fyodorov
@ 2013-08-22  1:04     ` Waiman Long
  2013-08-22 13:28       ` Alexander Fyodorov
  0 siblings, 1 reply; 18+ messages in thread
From: Waiman Long @ 2013-08-22 1:04 UTC (permalink / raw)
  To: Alexander Fyodorov
  Cc: linux-kernel, Chandramouleeswaran, Aswin, Norton, Scott J,
      Peter Zijlstra, Steven Rostedt, Thomas Gleixner, Ingo Molnar

On 08/21/2013 11:51 AM, Alexander Fyodorov wrote:
> 21.08.2013, 07:01, "Waiman Long" <waiman.long@hp.com>:
>> On 08/20/2013 11:31 AM, Alexander Fyodorov wrote:
>>> Isn't a race possible if another thread acquires the spinlock in the window
>>> between setting lock->locked to 0 and issuing smp_wmb()? Writes from
>>> the critical section from our thread might be delayed behind the write to
>>> lock->locked if the corresponding cache bank is busy.
>>
>> The purpose of smp_wmb() is to make sure that content in the cache will
>> be flushed out to memory in case the cache coherency protocol cannot
>> guarantee a single view of memory content on all processors.
>
> The Linux kernel does not support architectures without cache coherency,
> and while using memory barriers just to flush write buffers ASAP on
> cache-coherent processors might benefit performance on some architectures,
> it will hurt performance on others. So it must not be done in
> architecture-independent code.
>
>> In other words, smp_wmb() is used to make other processors see that the
>> lock has been freed ASAP. If another processor sees that before smp_wmb(),
>> it will be even better as the latency is reduced. As the lock holder is
>> the only one that can release the lock, there is no race condition here.
>
> No, I was talking about the window between freeing the lock and issuing
> smp_wmb(). What I meant is:
>
> 1) A = 0
> 2) CPU0 locks the spinlock protecting A.
> 3) CPU0 writes 1 to A, but the write gets stalled because the
>    corresponding cache bank is busy.
> 4) CPU0 calls spin_unlock() and sets lock->locked to 0.
>
> If CPU1 does a spin_lock() right now, it will succeed (since
> lock->locked == 0). But the write to A hasn't reached the cache yet, so
> CPU1 will see A == 0.
>
> More examples on this are in Documentation/memory-barriers.txt

In this case, we should have an smp_wmb() before freeing the lock. The
question is whether we need a full mb() instead. The x86 ticket spinlock
unlock code is just a regular add instruction except on some exotic
processors. So it is a compiler barrier but not really a memory fence.
However, we may need a full memory fence on some other processors.

>> That is a legitimate question. I don't think it is a problem on x86 as
>> the x86 spinlock code doesn't do a full mb() in the lock and unlock
>> paths.
>
> It does, because the "lock" prefix implies a full memory barrier.
>
>> The smp_mb() will be conditionalized depending on the ARCH_QSPINLOCK
>> setting. The smp_wmb() may not be needed, but a compiler barrier should
>> still be there.
>
> Do you mean because of inlining? That shouldn't be a problem because
> smp_mb() prohibits the compiler from doing any optimizations across the
> barrier (thanks to the "volatile" keyword).

What I mean is that I don't want any delay in issuing the unlock
instruction because of compiler rearrangement. So there should be at least
a barrier() call at the end of the unlock function. At this point, I am
inclined to have either an smp_wmb() or smp_mb() at the beginning of the
unlock function and a barrier() at the end.

> More on this in Documentation/memory-barriers.txt
>
>>> Also I don't understand why there are so many uses of the ACCESS_ONCE()
>>> macro. It does not guarantee memory ordering with regard to other CPUs,
>>> so probably most of the uses can be removed (with the exception of
>>> lock_is_contended(), where it prohibits gcc from optimizing the loop away).
>>
>> All the lock/unlock code can be inlined and we don't know what the
>> compiler will do to optimize the code. The ACCESS_ONCE() macro is used to
>> make sure that the compiler won't optimize away the actual fetch or
>> write of the memory location. Even if the compiler wouldn't optimize away
>> the memory access, adding ACCESS_ONCE() has no extra overhead. So
>> a more liberal use of it won't hurt performance.
>
> If the compiler optimized a memory access away it would be a bug. And I'm
> not so sure about overhead... For example, on some VLIW architectures
> ACCESS_ONCE() might prohibit the compiler from mixing other instructions
> into the unlock.

As the lock/unlock functions can be inlined, it is possible that a memory
variable is accessed earlier in the calling function and the stale copy is
used in the inlined lock/unlock function instead of a freshly fetched one.
That is why I prefer a more liberal use of ACCESS_ONCE() for safety
purposes.

^ permalink raw reply	[flat|nested] 18+ messages in thread
* Re: [PATCH RFC v2 1/2] qspinlock: Introducing a 4-byte queue spinlock implementation
  2013-08-22  1:04     ` Waiman Long
@ 2013-08-22 13:28       ` Alexander Fyodorov
  2013-08-26 20:14         ` Waiman Long
  0 siblings, 1 reply; 18+ messages in thread
From: Alexander Fyodorov @ 2013-08-22 13:28 UTC (permalink / raw)
  To: Waiman Long
  Cc: linux-kernel, Chandramouleeswaran, Aswin, Norton, Scott J,
      Peter Zijlstra, Steven Rostedt, Thomas Gleixner, Ingo Molnar

22.08.2013, 05:04, "Waiman Long" <waiman.long@hp.com>:
> On 08/21/2013 11:51 AM, Alexander Fyodorov wrote:
> In this case, we should have an smp_wmb() before freeing the lock. The
> question is whether we need a full mb() instead. The x86 ticket
> spinlock unlock code is just a regular add instruction except on some
> exotic processors. So it is a compiler barrier but not really a memory
> fence. However, we may need a full memory fence on some other
> processors.

The thing is that the x86 ticket spinlock code does have full memory
barriers both in lock() and unlock(): the "add" instruction there has a
"lock" prefix which implies a full memory barrier. So it is better to use
smp_mb() and let each architecture define it.

> At this point, I am inclined to have either an smp_wmb() or smp_mb() at
> the beginning of the unlock function and a barrier() at the end.
>
> As the lock/unlock functions can be inlined, it is possible that a
> memory variable is accessed earlier in the calling function and the
> stale copy is used in the inlined lock/unlock function instead of a
> freshly fetched one. That is why I prefer a more liberal use of
> ACCESS_ONCE() for safety purposes.

That is impossible: both lock() and unlock() must have either a full memory
barrier or an atomic operation that returns a value. Both of these prohibit
optimizations, and the compiler cannot reuse any global variable. So this
usage of ACCESS_ONCE() is unneeded.

You can read more on this in Documentation/volatile-considered-harmful.txt

And although I already suggested it, have you read
Documentation/memory-barriers.txt? There is a lot of valuable information
there.

^ permalink raw reply	[flat|nested] 18+ messages in thread
* Re: [PATCH RFC v2 1/2] qspinlock: Introducing a 4-byte queue spinlock implementation
  2013-08-22 13:28       ` Alexander Fyodorov
@ 2013-08-26 20:14         ` Waiman Long
  2013-08-27 12:09           ` Alexander Fyodorov
  0 siblings, 1 reply; 18+ messages in thread
From: Waiman Long @ 2013-08-26 20:14 UTC (permalink / raw)
  To: Alexander Fyodorov
  Cc: linux-kernel, Chandramouleeswaran, Aswin, Norton, Scott J,
      Peter Zijlstra, Steven Rostedt, Thomas Gleixner, Ingo Molnar

On 08/22/2013 09:28 AM, Alexander Fyodorov wrote:
> 22.08.2013, 05:04, "Waiman Long" <waiman.long@hp.com>:
>> On 08/21/2013 11:51 AM, Alexander Fyodorov wrote:
>> In this case, we should have an smp_wmb() before freeing the lock. The
>> question is whether we need a full mb() instead. The x86 ticket
>> spinlock unlock code is just a regular add instruction except on some
>> exotic processors. So it is a compiler barrier but not really a memory
>> fence. However, we may need a full memory fence on some other
>> processors.
>
> The thing is that the x86 ticket spinlock code does have full memory
> barriers both in lock() and unlock(): the "add" instruction there has a
> "lock" prefix which implies a full memory barrier. So it is better to use
> smp_mb() and let each architecture define it.

I also thought that the x86 spinlock unlock path was an atomic add. It only
came to my realization recently that this is not the case. The
UNLOCK_LOCK_PREFIX will be mapped to "" except on some old 32-bit x86
processors.

>> At this point, I am inclined to have either an smp_wmb() or smp_mb() at
>> the beginning of the unlock function and a barrier() at the end.
>>
>> As the lock/unlock functions can be inlined, it is possible that a
>> memory variable is accessed earlier in the calling function and the
>> stale copy is used in the inlined lock/unlock function instead of a
>> freshly fetched one. That is why I prefer a more liberal use of
>> ACCESS_ONCE() for safety purposes.
>
> That is impossible: both lock() and unlock() must have either a full
> memory barrier or an atomic operation that returns a value. Both of these
> prohibit optimizations, and the compiler cannot reuse any global variable.
> So this usage of ACCESS_ONCE() is unneeded.
>
> You can read more on this in Documentation/volatile-considered-harmful.txt
>
> And although I already suggested it, have you read
> Documentation/memory-barriers.txt? There is a lot of valuable information
> there.

I did read Documentation/memory-barriers.txt. I will read
volatile-considered-harmful.txt.

Regards,
Longman

^ permalink raw reply	[flat|nested] 18+ messages in thread
* Re: [PATCH RFC v2 1/2] qspinlock: Introducing a 4-byte queue spinlock implementation
  2013-08-26 20:14         ` Waiman Long
@ 2013-08-27 12:09           ` Alexander Fyodorov
       [not found]             ` <20130827091436.3d5971a0@gandalf.local.home>
  2013-08-29 15:24             ` Waiman Long
  0 siblings, 2 replies; 18+ messages in thread
From: Alexander Fyodorov @ 2013-08-27 12:09 UTC (permalink / raw)
  To: Waiman Long
  Cc: linux-kernel, Chandramouleeswaran, Aswin, Norton, Scott J,
      Peter Zijlstra, Steven Rostedt, Thomas Gleixner, Ingo Molnar

> I also thought that the x86 spinlock unlock path was an atomic add. It
> only came to my realization recently that this is not the case. The
> UNLOCK_LOCK_PREFIX will be mapped to "" except on some old 32-bit x86
> processors.

Hmm, I didn't know that. Looking through Google I found these rules for x86
memory ordering:

* Loads are not reordered with other loads.
* Stores are not reordered with other stores.
* Stores are not reordered with older loads.

So the x86 memory model is rather strict and a memory barrier is really not
needed in the unlock path - xadd is a store and thus behaves like a memory
barrier, and since only the lock's owner modifies "ticket.head", the "add"
instruction need not be atomic.

But this is true only for x86; other architectures have more relaxed memory
ordering. Maybe we should allow arch code to redefine queue_spin_unlock()?
And define a version without smp_mb() for x86?

^ permalink raw reply	[flat|nested] 18+ messages in thread
[parent not found: <20130827091436.3d5971a0@gandalf.local.home>]
* Re: [PATCH RFC v2 1/2] qspinlock: Introducing a 4-byte queue spinlock implementation
  [not found] ` <20130827091436.3d5971a0@gandalf.local.home>
@ 2013-08-27 13:53   ` Peter Zijlstra
  2013-08-28  1:21     ` Paul E. McKenney
  0 siblings, 1 reply; 18+ messages in thread
From: Peter Zijlstra @ 2013-08-27 13:53 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Alexander Fyodorov, Waiman Long, linux-kernel, Chandramouleeswaran,
      Aswin, Norton, Scott J, Thomas Gleixner, Ingo Molnar,
      Paul E. McKenney

On Tue, Aug 27, 2013 at 09:14:36AM -0400, Steven Rostedt wrote:

> I just had this conversation with Paul McKenney. Should there be a
> smp_mb_after_spin_unlock()?

Depends on the benefits I suppose :-) Oleg and Linus did recently add
smp_mb__before_spinlock();

> Although we blew it off as adding too many extensions to smp_mb(). But
> it may be better than reimplementing something as complex as a lock.

Locks should be as lightweight as possible and never implement anything
heavier than the ACQUISITION / RELEASE barriers if at all possible. We
should certainly not re-implement spinlocks just to get full barriers out
of them, that's crazy.

^ permalink raw reply	[flat|nested] 18+ messages in thread
* Re: [PATCH RFC v2 1/2] qspinlock: Introducing a 4-byte queue spinlock implementation
  2013-08-27 13:53   ` Peter Zijlstra
@ 2013-08-28  1:21     ` Paul E. McKenney
  2013-08-28  8:19       ` Peter Zijlstra
  0 siblings, 1 reply; 18+ messages in thread
From: Paul E. McKenney @ 2013-08-28 1:21 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Steven Rostedt, Alexander Fyodorov, Waiman Long, linux-kernel,
      Chandramouleeswaran, Aswin, Norton, Scott J, Thomas Gleixner,
      Ingo Molnar

On Tue, Aug 27, 2013 at 03:53:10PM +0200, Peter Zijlstra wrote:
> On Tue, Aug 27, 2013 at 09:14:36AM -0400, Steven Rostedt wrote:
>
> > I just had this conversation with Paul McKenney. Should there be a
> > smp_mb_after_spin_unlock()?
>
> Depends on the benefits I suppose :-) Oleg and Linus did recently add
> smp_mb__before_spinlock();
>
> > Although we blew it off as adding too many extensions to smp_mb(). But
> > it may be better than reimplementing something as complex as a lock.
>
> Locks should be as lightweight as possible and never implement anything
> heavier than the ACQUISITION / RELEASE barriers if at all possible. We
> should certainly not re-implement spinlocks just to get full barriers
> out of them, that's crazy.

An unlock followed by a lock needs to act like a full barrier, but there is
no requirement that a lock or unlock taken separately act like a full
barrier.

							Thanx, Paul

^ permalink raw reply	[flat|nested] 18+ messages in thread
* Re: [PATCH RFC v2 1/2] qspinlock: Introducing a 4-byte queue spinlock implementation
  2013-08-28  1:21     ` Paul E. McKenney
@ 2013-08-28  8:19       ` Peter Zijlstra
  2013-08-28 12:59         ` Steven Rostedt
  0 siblings, 1 reply; 18+ messages in thread
From: Peter Zijlstra @ 2013-08-28 8:19 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Steven Rostedt, Alexander Fyodorov, Waiman Long, linux-kernel,
      Chandramouleeswaran, Aswin, Norton, Scott J, Thomas Gleixner,
      Ingo Molnar

On Tue, Aug 27, 2013 at 06:21:29PM -0700, Paul E. McKenney wrote:
> On Tue, Aug 27, 2013 at 03:53:10PM +0200, Peter Zijlstra wrote:
> > On Tue, Aug 27, 2013 at 09:14:36AM -0400, Steven Rostedt wrote:
> >
> > > I just had this conversation with Paul McKenney. Should there be a
> > > smp_mb_after_spin_unlock()?
> >
> > Depends on the benefits I suppose :-) Oleg and Linus did recently add
> > smp_mb__before_spinlock();
> >
> > > Although we blew it off as adding too many extensions to smp_mb(). But
> > > it may be better than reimplementing something as complex as a lock.
> >
> > Locks should be as lightweight as possible and never implement anything
> > heavier than the ACQUISITION / RELEASE barriers if at all possible. We
> > should certainly not re-implement spinlocks just to get full barriers
> > out of them, that's crazy.
>
> An unlock followed by a lock needs to act like a full barrier, but there
> is no requirement that a lock or unlock taken separately act like a
> full barrier.

But that is already a property of the acquisition/release barrier.

^ permalink raw reply	[flat|nested] 18+ messages in thread
* Re: [PATCH RFC v2 1/2] qspinlock: Introducing a 4-byte queue spinlock implementation
  2013-08-28  8:19       ` Peter Zijlstra
@ 2013-08-28 12:59         ` Steven Rostedt
  2013-08-28 13:05           ` Peter Zijlstra
  0 siblings, 1 reply; 18+ messages in thread
From: Steven Rostedt @ 2013-08-28 12:59 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Paul E. McKenney, Alexander Fyodorov, Waiman Long, linux-kernel,
      Chandramouleeswaran, Aswin, Norton, Scott J, Thomas Gleixner,
      Ingo Molnar

On Wed, 28 Aug 2013 10:19:37 +0200
Peter Zijlstra <peterz@infradead.org> wrote:

> > An unlock followed by a lock needs to act like a full barrier, but there
> > is no requirement that a lock or unlock taken separately act like a
> > full barrier.
>
> But that is already a property of the acquisition/release barrier.

As I mentioned in my fixes for the -rt swait barrier patches I sent, spin
locks only prevent leaks out of the critical section. They do not prevent
leaks into the critical section, thus:

	A = 1
	spin_lock()
	spin_unlock()
	B = C

can turn into:

	(A = 1)
	spin_lock()
	load C
	store 1 into A
	spin_unlock()
	B = C

This shows that a spin_lock()/unlock() combo is not equivalent to a mb().
But as Paul has mentioned, if we had:

	A = 1
	spin_unlock()
	spin_lock()
	B = C

that would be equivalent to

	A = 1
	mb()
	B = C

as the unlock prevents leaks going past it, and the lock prevents leaks
going before it.

-- Steve

^ permalink raw reply	[flat|nested] 18+ messages in thread
* Re: [PATCH RFC v2 1/2] qspinlock: Introducing a 4-byte queue spinlock implementation
  2013-08-28 12:59         ` Steven Rostedt
@ 2013-08-28 13:05           ` Peter Zijlstra
  2013-08-28 13:15             ` Steven Rostedt
  0 siblings, 1 reply; 18+ messages in thread
From: Peter Zijlstra @ 2013-08-28 13:05 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Paul E. McKenney, Alexander Fyodorov, Waiman Long, linux-kernel,
      Chandramouleeswaran, Aswin, Norton, Scott J, Thomas Gleixner,
      Ingo Molnar

On Wed, Aug 28, 2013 at 08:59:57AM -0400, Steven Rostedt wrote:
> On Wed, 28 Aug 2013 10:19:37 +0200
> Peter Zijlstra <peterz@infradead.org> wrote:
>
> > > An unlock followed by a lock needs to act like a full barrier, but there
> > > is no requirement that a lock or unlock taken separately act like a
> > > full barrier.
> >
> > But that is already a property of the acquisition/release barrier.
>
> As I mentioned in my fixes for the -rt swait barrier patches I sent.

Not to me you didn't ;-)

> Spin locks only prevent leaks out of the critical section. It does not
> prevent leaks into the critical section, thus:

What's your point? You're just re-iterating the semantics in case anybody
forgot about them?

^ permalink raw reply	[flat|nested] 18+ messages in thread
* Re: [PATCH RFC v2 1/2] qspinlock: Introducing a 4-byte queue spinlock implementation
  2013-08-28 13:05           ` Peter Zijlstra
@ 2013-08-28 13:15             ` Steven Rostedt
  2013-08-28 13:37               ` Peter Zijlstra
  0 siblings, 1 reply; 18+ messages in thread
From: Steven Rostedt @ 2013-08-28 13:15 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Paul E. McKenney, Alexander Fyodorov, Waiman Long, linux-kernel,
      Chandramouleeswaran, Aswin, Norton, Scott J, Thomas Gleixner,
      Ingo Molnar

On Wed, 28 Aug 2013 15:05:29 +0200
Peter Zijlstra <peterz@infradead.org> wrote:

> On Wed, Aug 28, 2013 at 08:59:57AM -0400, Steven Rostedt wrote:
> > On Wed, 28 Aug 2013 10:19:37 +0200
> > Peter Zijlstra <peterz@infradead.org> wrote:
>
> > Spin locks only prevent leaks out of the critical section. It does not
> > prevent leaks into the critical section, thus:
>
> What's your point? You're just re-iterating the semantics in case
> anybody forgot about them?

I think we are all misunderstanding each other. It sounded like you didn't
want to reimplement a lock to remove memory barriers.

Are you for the smp_mb__after_spin_unlock()?

I'm getting confused by who is arguing what :-)

-- Steve

^ permalink raw reply	[flat|nested] 18+ messages in thread
* Re: [PATCH RFC v2 1/2] qspinlock: Introducing a 4-byte queue spinlock implementation
  2013-08-28 13:15             ` Steven Rostedt
@ 2013-08-28 13:37               ` Peter Zijlstra
  0 siblings, 0 replies; 18+ messages in thread
From: Peter Zijlstra @ 2013-08-28 13:37 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Paul E. McKenney, Alexander Fyodorov, Waiman Long, linux-kernel,
      Chandramouleeswaran, Aswin, Norton, Scott J, Thomas Gleixner,
      Ingo Molnar

On Wed, Aug 28, 2013 at 09:15:30AM -0400, Steven Rostedt wrote:

> Are you for the smp_mb__after_spin_unlock() ?

Sure, if the performance gains of having it outweigh the added complexity
of having it :-)

Nice non-answer methinks, no? Muwhaha.

To me having one more or less 'conditional' full mb primitive really
doesn't matter anymore. I'm presuming you're doing this for performance
gains on anything other than x86 though, as x86 has a non-atomic unlock and
thus always requires the full barrier thing here (yes yes I know ia32 still
exists but people should really really really kill that thing already).

> I'm getting confused by who is arguing what :-)

Hehe ;-)

^ permalink raw reply	[flat|nested] 18+ messages in thread
* Re: [PATCH RFC v2 1/2] qspinlock: Introducing a 4-byte queue spinlock implementation
  2013-08-27 12:09           ` Alexander Fyodorov
       [not found]             ` <20130827091436.3d5971a0@gandalf.local.home>
@ 2013-08-29 15:24             ` Waiman Long
  2013-08-29 17:03               ` Alexander Fyodorov
  1 sibling, 1 reply; 18+ messages in thread
From: Waiman Long @ 2013-08-29 15:24 UTC (permalink / raw)
  To: Alexander Fyodorov
  Cc: linux-kernel, Chandramouleeswaran, Aswin, Norton, Scott J,
      Peter Zijlstra, Steven Rostedt, Thomas Gleixner, Ingo Molnar

On 08/27/2013 08:09 AM, Alexander Fyodorov wrote:
>> I also thought that the x86 spinlock unlock path was an atomic add. It
>> only came to my realization recently that this is not the case. The
>> UNLOCK_LOCK_PREFIX will be mapped to "" except on some old 32-bit x86
>> processors.
>
> Hmm, I didn't know that. Looking through Google I found these rules for
> x86 memory ordering:
>
> * Loads are not reordered with other loads.
> * Stores are not reordered with other stores.
> * Stores are not reordered with older loads.
>
> So the x86 memory model is rather strict and a memory barrier is really
> not needed in the unlock path - xadd is a store and thus behaves like a
> memory barrier, and since only the lock's owner modifies "ticket.head",
> the "add" instruction need not be atomic.
>
> But this is true only for x86; other architectures have more relaxed
> memory ordering. Maybe we should allow arch code to redefine
> queue_spin_unlock()? And define a version without smp_mb() for x86?

What I have been thinking is to set a flag in an architecture-specific
header file to tell if the architecture needs a memory barrier. The generic
code will then do either an smp_mb() or a barrier() depending on the
presence or absence of the flag. I would prefer to do more in the generic
code, if possible.

Regards,
Longman

^ permalink raw reply	[flat|nested] 18+ messages in thread
* Re: [PATCH RFC v2 1/2] qspinlock: Introducing a 4-byte queue spinlock implementation
  2013-08-29 15:24             ` Waiman Long
@ 2013-08-29 17:03               ` Alexander Fyodorov
  2013-08-30  3:16                 ` Waiman Long
  0 siblings, 1 reply; 18+ messages in thread
From: Alexander Fyodorov @ 2013-08-29 17:03 UTC (permalink / raw)
  To: Waiman Long
  Cc: linux-kernel, Chandramouleeswaran, Aswin, Norton, Scott J,
      Peter Zijlstra, Steven Rostedt, Thomas Gleixner, Ingo Molnar

29.08.2013, 19:25, "Waiman Long" <waiman.long@hp.com>:
> What I have been thinking is to set a flag in an architecture-specific
> header file to tell if the architecture needs a memory barrier. The
> generic code will then do either an smp_mb() or a barrier() depending on
> the presence or absence of the flag. I would prefer to do more in the
> generic code, if possible.

If you use a flag then you'll have to check it manually. It is better to
add a new smp_mb variant, I suggest calling it smp_mb_before_store(), and
define it to barrier() on x86. But the same constraints as for
UNLOCK_LOCK_PREFIX should apply here, so it will be something like this:

arch/x86/include/asm/barrier.h:

+#if defined(CONFIG_X86_32) && \
+	(defined(CONFIG_X86_OOSTORE) || defined(CONFIG_X86_PPRO_FENCE))
+/*
+ * On PPro SMP or if we are using OOSTORE, we use a full memory barrier
+ * (PPro errata 66, 92)
+ */
+# define smp_mb_before_store()	smp_mb()
+#else
+# define smp_mb_before_store()	barrier()
+#endif

^ permalink raw reply	[flat|nested] 18+ messages in thread
* Re: [PATCH RFC v2 1/2] qspinlock: Introducing a 4-byte queue spinlock implementation
  2013-08-29 17:03               ` Alexander Fyodorov
@ 2013-08-30  3:16                 ` Waiman Long
  2013-08-30  8:15                   ` Alexander Fyodorov
  0 siblings, 1 reply; 18+ messages in thread
From: Waiman Long @ 2013-08-30 3:16 UTC (permalink / raw)
  To: Alexander Fyodorov
  Cc: linux-kernel, Chandramouleeswaran, Aswin, Norton, Scott J,
      Peter Zijlstra, Steven Rostedt, Thomas Gleixner, Ingo Molnar

On 08/29/2013 01:03 PM, Alexander Fyodorov wrote:
> 29.08.2013, 19:25, "Waiman Long" <waiman.long@hp.com>:
>> What I have been thinking is to set a flag in an architecture-specific
>> header file to tell if the architecture needs a memory barrier. The
>> generic code will then do either an smp_mb() or a barrier() depending on
>> the presence or absence of the flag. I would prefer to do more in the
>> generic code, if possible.
>
> If you use a flag then you'll have to check it manually. It is better to
> add a new smp_mb variant, I suggest calling it smp_mb_before_store(), and
> define it to barrier() on x86.

I am sorry that I was not clear in my previous mail. I meant a flag/macro
for compile-time checking rather than runtime checking.

Regards,
Longman

^ permalink raw reply	[flat|nested] 18+ messages in thread
* Re: [PATCH RFC v2 1/2] qspinlock: Introducing a 4-byte queue spinlock implementation 2013-08-30 3:16 ` Waiman Long @ 2013-08-30 8:15 ` Alexander Fyodorov 0 siblings, 0 replies; 18+ messages in thread From: Alexander Fyodorov @ 2013-08-30 8:15 UTC (permalink / raw) To: Waiman Long Cc: linux-kernel, Chandramouleeswaran, Aswin, Norton, Scott J, Peter Zijlstra, Steven Rostedt, Thomas Gleixner, Ingo Molnar 30.08.2013, 07:16, "Waiman Long" <waiman.long@hp.com>: > I am sorry that I was not clear in my previous mail. I mean a flag/macro > for compile time checking rather than doing runtime checking. A new smp_mb_xxx() variant is still better IMO, since it can be reused in other places if the need arises. ^ permalink raw reply [flat|nested] 18+ messages in thread
* [PATCH RFC v2 0/2] qspinlock: Introducing a 4-byte queue spinlock @ 2013-08-13 18:41 Waiman Long 2013-08-13 18:41 ` [PATCH RFC v2 1/2] qspinlock: Introducing a 4-byte queue spinlock implementation Waiman Long 0 siblings, 1 reply; 18+ messages in thread From: Waiman Long @ 2013-08-13 18:41 UTC (permalink / raw) To: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Arnd Bergmann Cc: Waiman Long, linux-arch, x86, linux-kernel, Peter Zijlstra, Steven Rostedt, Andrew Morton, Greg Kroah-Hartman, Matt Fleming, Michel Lespinasse, Andi Kleen, Rik van Riel, Paul E. McKenney, Linus Torvalds, Raghavendra K T, George Spelvin, Harvey Harrison, Chandramouleeswaran, Aswin, Norton, Scott J v1->v2: - Add some more comments to document what the code does. - Add a numerous CPU mode to support >= 16K CPUs - Add a configuration option to allow lock stealing which can further improve performance in many cases. - Enable wakeup of queue head CPU at unlock time for non-numerous CPU mode. This patch set introduces a queue-based spinlock implementation that can be used to replace the default ticket spinlock without increasing the size of the spinlock data structure. As a result, critical kernel data structures that embed spinlock won't increase in size and breaking data alignments. The queue spinlock has about the same performance as the ticket spinlock in uncontended case. Its performance is better with moderate to heavy contention. It may be slightly slower in light to moderate level of lock contention where the queue depth is around 1-2. This patch has the potential of improving the performance of all the workloads that have moderate to heavy spinlock contention. The queue spinlock is especially suitable for NUMA machines with at least 2 sockets, though noticeable performance benefit probably won't show up in machines with less than 4 sockets. There is also an experimental option to allow lock stealing thus breaking the FIFO guarantee of lock acquisition. 
However, performance usually goes up with an unfair lock. The purpose of this patch set is not to solve any particular spinlock contention problems. Those need to be solved by refactoring the code to make more efficient use of the lock or finer granularity ones. The main purpose is to make the lock contention problems more tolerable until someone can spend the time and effort to fix them. Signed-off-by: Waiman Long <Waiman.Long@hp.com> Waiman Long (2): qspinlock: Introducing a 4-byte queue spinlock implementation qspinlock x86: Enable x86 to use queue spinlock arch/x86/Kconfig | 3 + arch/x86/include/asm/spinlock.h | 2 + arch/x86/include/asm/spinlock_types.h | 4 + arch/x86/kernel/paravirt-spinlocks.c | 10 + include/asm-generic/qspinlock.h | 207 +++++++++++++ lib/Kconfig | 25 ++ lib/Makefile | 1 + lib/qspinlock.c | 522 +++++++++++++++++++++++++++++++++ 8 files changed, 774 insertions(+), 0 deletions(-) create mode 100644 include/asm-generic/qspinlock.h create mode 100644 lib/qspinlock.c ^ permalink raw reply [flat|nested] 18+ messages in thread
* [PATCH RFC v2 1/2] qspinlock: Introducing a 4-byte queue spinlock implementation 2013-08-13 18:41 [PATCH RFC v2 0/2] qspinlock: Introducing a 4-byte queue spinlock Waiman Long @ 2013-08-13 18:41 ` Waiman Long 0 siblings, 0 replies; 18+ messages in thread From: Waiman Long @ 2013-08-13 18:41 UTC (permalink / raw) To: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Arnd Bergmann Cc: Waiman Long, linux-arch, x86, linux-kernel, Peter Zijlstra, Steven Rostedt, Andrew Morton, Greg Kroah-Hartman, Matt Fleming, Michel Lespinasse, Andi Kleen, Rik van Riel, Paul E. McKenney, Linus Torvalds, Raghavendra K T, George Spelvin, Harvey Harrison, Chandramouleeswaran, Aswin, Norton, Scott J This patch introduces a new queue spinlock implementation that can serve as an alternative to the default ticket spinlock. Compared with the ticket spinlock, this queue spinlock should be almost as fair. It has about the same speed in the single-thread case and can be much faster in high-contention situations. Only in light to moderate contention, where the average queue depth is around 1-2, will this queue spinlock potentially be a bit slower due to the higher slowpath overhead. This queue spinlock is especially suited to NUMA machines with a large number of cores, as the chance of spinlock contention is much higher in those machines. The cost of contention is also higher because of slower inter-node memory traffic. The idea behind this spinlock implementation is the fact that spinlocks are acquired with preemption disabled. In other words, the process will not be migrated to another CPU while it is trying to get a spinlock. Ignoring interrupt handling, a CPU can only be contending for one spinlock at any one time. Of course, an interrupt handler can try to acquire one spinlock while the interrupted user process is in the process of getting another spinlock. 
By allocating a set of per-cpu queue nodes and using them to form a waiting queue, we can encode the queue node address into a much smaller 16-bit size. Together with the 1-byte lock bit, this queue spinlock implementation will only need 4 bytes to hold all the information that it needs. The default queue node address encoding is as follows: Bits 0-1 : queue node index in the per-cpu array (4 entries) Bits 2-15: cpu number + 1 (max cpus = 16383) If more CPUs are to be supported (numerous CPU mode), 24 bits of the 32-bit lock will be used to support at least (64K-1) CPUs. There is additional overhead which can potentially slow it down a little bit. In the extremely unlikely case that all the queue node entries are used up, the current code will fall back to busy spinning without waiting in a queue, with a warning message. This patch also has an experimental configuration option (QSPINLOCK_UNFAIR) to enable lock stealing. That will break the FIFO guarantee of the lock acquisition process, but it usually results in better performance when lock contention happens. The patch allows the optional replacement of the architecture specific implementation of the ticket spinlock by this generic version of the queue spinlock. Two new config parameters are introduced: 1. QSPINLOCK A selectable option that enables the building and replacement of the architecture specific spinlock by the queue spinlock. 2. ARCH_QSPINLOCK This has to be defined in arch/$(ARCH)/Kconfig to enable the QSPINLOCK option. This option, by itself, will not enable the queue spinlock feature. For single-thread performance (no contention), a 256K lock/unlock loop was run on a 2.93GHz Westmere x86-64 CPU. The following table shows the average time (in ns) for a single lock/unlock sequence (including the looping and timing overhead): Lock Type Time (ns) --------- --------- Ticket spinlock 12.2 Queue spinlock (Normal) 11.9 Queue spinlock (Numerous CPU) 7.9 So the normal queue spinlock has about the same performance as a ticket spinlock. 
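The default 16-bit encoding can be exercised with a small user-space sketch (the helper names below are illustrative only; the patch itself uses the SET_QCODE/GET_CPU_NR/GET_QN_IDX macros):

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative helpers mirroring the described 16-bit queue code:
 *   bits 0-1 : queue node index (4 per-cpu nodes)
 *   bits 2-15: cpu number + 1   (so a qcode of 0 means "no waiter") */
static uint16_t encode_qcode(unsigned int cpu, unsigned int qn_idx)
{
	return (uint16_t)(((cpu + 1) << 2) | (qn_idx & 3));
}

static unsigned int qcode_cpu(uint16_t qcode) { return (qcode >> 2) - 1; }
static unsigned int qcode_idx(uint16_t qcode) { return qcode & 3; }
```

Note that even cpu 0 with node index 0 encodes to a non-zero value, so 0 stays reserved to mean "no one is waiting".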
In numerous CPU mode, the queue spinlock is much faster because of less overhead in the unlock fast path. The AIM7 benchmark was run on a 8-socket 80-core DL980 with Westmere x86-64 CPUs and HT off to evaluate the performance impact of this patch on the 3.10.5 kernel. The following modified kernels were tested: 1) Queue spinlock (normal) 2) Queue spinlock (normal) with lock stealing 3) Queue spinlock (numerous CPU) The jobs per minute (JPM) results for the 1100-2000 user ranges are shown below: +------------+---------+---------+---------+---------+ | Kernel | 3.10.5 | 3.10.5 | 3.10.5 | 3.10.5 | | | | qlock 1 | qlock 2 | qlock 3 | +------------+---------+---------+---------+---------+ |alltests | 401450 | 402184 | 406803 | 400031 | |% Change | - | +0.2% | +1.3% | -0.4% | |custom | 306544 | 305525 | 310030 | 307530 | |% Change | - | -0.3% | +1.1% | +0.3% | |dbase | 897496 | 900656 | 901985 | 900234 | |% Change | - | +0.4% | +0.5% | +0.3% | |disk | 211852 | 269357 | 347108 | 274298 | |% Change | - | +27.1% | +63.8% | +29.5% | |five_sec | 145519 | 157018 | 159820 | 155627 | |% Change | - | +7.9% | +9.8% | +7.0% | |fserver | 454512 | 468861 | 466659 | 459497 | |% Change | - | +3.2% | +2.7% | +1.1% | |high_systime| 135079 | 138337 | 140992 | 140198 | |% Change | - | +2.4% | +4.4% | +3.8% | |new_dbase | 936614 | 937732 | 937403 | 936805 | |% Change | - | +0.1% | +0.1% | 0.0% | |new_fserver | 440808 | 449432 | 450547 | 440155 | |% Change | - | +2.0% | +2.2% | -0.2% | |shared | 371142 | 368907 | 369868 | 370862 | |% Change | - | -0.6% | -0.3% | -0.1% | |short | 1058806 | 3010190 | 6058205 | 3104796 | |% Change | - | +184.3% | +472.2% | +193.2% | +------------+---------+---------+---------+---------+ Due to variability of AIM7 results, a change of a few percentage points may not indicate a real change in performance. However, the general trend is that enabling lock stealing can improve performance significantly in workloads with a lot of lock contention. 
The more than doubling of performance in the short workload is particularly impressive. As for the normal and numerous CPU modes, one was slightly better in some workloads, but slightly worse in others. The trade-off between these two modes is as follows: 1) The normal mode enables unlock wake-up and hence has less lock cacheline contention. 2) The numerous CPU mode has less latency between the unlock operation of one CPU and the lock operation of the next one in line. Depending on which factor is more important, one will be slightly faster than the other. For the disk workload, the spinlock bottleneck is the standalone mb_cache_spinlock in fs/mbcache.c. For the short workload, the spinlock bottleneck is the d_lock in the dentry structure. The following table shows the %time used up by the locks as reported by the perf command at 1000 users for the disk workload & 1500 users for the short workload with HT off. --------------------------------+----------+----------+----------+ Configuration | ticket | qlock | qlock | | spinlock | fastpath | slowpath | --------------------------------+----------+----------+----------+ Disk w/o patch | 69.0% | - | - | Disk with patch | - | 0.31% | 73.2% | Disk with patch & lock stealing | - | 0.39% | 67.5% | Short w/o patch | 74.3% | - | - | Short with patch | - | 0.76% | 73.0% | Short with patch & lock stealing| - | 2.37% | 43.3% | --------------------------------+----------+----------+----------+ There are 3 observations here: 1. With lock stealing enabled, less time is spent in the lock slowpath, thus allowing the system to do more useful work. 2. The %CPU time actually increases in the disk workload with the patch without lock stealing. The fact that most of them are spinning on their own cacheline with less congestion can probably skew the number up. 3. Both the short and disk workloads spend about 70% of their time in the lock without lock stealing. However, the performance improvement is 2.8X vs 1.3X. 
This is probably due to the fact that d_lock is embedded next to the d_count to be updated whereas mb_cache_spinlock is in a standalone cacheline not shared by the data to be updated. Signed-off-by: Waiman Long <Waiman.Long@hp.com> --- include/asm-generic/qspinlock.h | 207 ++++++++++++++++ lib/Kconfig | 25 ++ lib/Makefile | 1 + lib/qspinlock.c | 522 +++++++++++++++++++++++++++++++++++++++ 4 files changed, 755 insertions(+), 0 deletions(-) create mode 100644 include/asm-generic/qspinlock.h create mode 100644 lib/qspinlock.c diff --git a/include/asm-generic/qspinlock.h b/include/asm-generic/qspinlock.h new file mode 100644 index 0000000..1e0403f --- /dev/null +++ b/include/asm-generic/qspinlock.h @@ -0,0 +1,207 @@ +/* + * Queue spinlock + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * (C) Copyright 2013 Hewlett-Packard Development Company, L.P. + * + * Authors: Waiman Long <waiman.long@hp.com> + */ +#ifndef __ASM_GENERIC_QSPINLOCK_H +#define __ASM_GENERIC_QSPINLOCK_H + +#include <linux/types.h> +#include <asm/cmpxchg.h> +#include <asm/barrier.h> +#include <asm/processor.h> +#include <asm/byteorder.h> + +#if !defined(__LITTLE_ENDIAN) && !defined(__BIG_ENDIAN) +#error "Missing either LITTLE_ENDIAN or BIG_ENDIAN definition." +#endif + +#ifdef CONFIG_PARAVIRT_SPINLOCKS +#define qspinlock arch_spinlock +#endif + +/* + * There is 2 slightly different implementations of queue spinlock. + * The faster one can support only upto (16K-1) CPUs, but the lock + * word has to be 16 bits. 
The slower one requires a 8-bit lock word + * and can currently support (64K-1). However, it is possible to support + * more CPUs than the current limit, if necessary. + */ +#if !defined(QSPINLOCK_MANYCPUS) && (CONFIG_NR_CPUS >= (16*1024-1)) +#define QSPINLOCK_MANYCPUS +#endif + +#ifdef QSPINLOCK_MANYCPUS +#define lock8 locked +#else +#define lock16 locked +#endif + +/* + * The queue spinlock data structure + * The lock word should always be the least significant portion of the + * 32-bit word. + */ +typedef struct qspinlock { + union { + u32 qlock; + struct { +#ifdef __LITTLE_ENDIAN + union { + u8 lock8; /* 8-bit lock word */ + u16 lock16; /* 16-bit lock word */ + }; + u16 qcode; /* Wait queue code */ +#else + u16 qcode; /* Wait queue code */ + union { + struct { + u8 dummy; /* Reserved */ + u8 lock8; /* 8-bit lock word */ + }; + u16 lock16; /* 16-bit lock word */ + }; +#endif + }; + }; +} arch_spinlock_t; + +/* Remove the temporary lock8 or lock16 macro */ +#ifdef lock8 +#undef lock8 +#endif +#ifdef lock16 +#undef lock16 +#endif + +#define QSPINLOCK_LOCKED 1 + +/* + * The CONFIG_QSPINLOCK_UNFAIR config parameter breaks the FIFO fairness + * guarantee of the queue spinlock and allows some lock stealing to happen. + * This slight unfairness may increase the overall throughput and performance + * of the system in some circumstances. An example will be a virtual guest + * with over-commited CPUs. If a CPU at the queue head is suspended, other + * newly arrived CPUs wanting to get the lock may be able to get it while + * the others in the queue have to wait until the queue head process returns. 
+ */ +#ifdef CONFIG_QSPINLOCK_UNFAIR +#define qlock_word locked +#else +#define qlock_word qlock +#endif + +/* + * External function declarations + */ +extern void queue_spin_lock_slowpath(struct qspinlock *lock); +#ifndef QSPINLOCK_MANYCPUS +extern void queue_spin_unlock_slowpath(struct qspinlock *lock, u16 qcode); +#endif + +/** + * queue_spin_is_locked - is the spinlock locked? + * @lock: Pointer to queue spinlock structure + * Return: 1 if it is locked, 0 otherwise + */ +static __always_inline int queue_spin_is_locked(struct qspinlock *lock) +{ + return ACCESS_ONCE(lock->locked); +} + +/** + * queue_spin_is_contended - check if the lock is contended + * @lock : Pointer to queue spinlock structure + * Return: 1 if lock contended, 0 otherwise + */ +static __always_inline int queue_spin_is_contended(struct qspinlock *lock) +{ + return ACCESS_ONCE(lock->qcode); +} +/** + * queue_spin_trylock - try to acquire the queue spinlock + * @lock : Pointer to queue spinlock structure + * Return: 1 if lock acquired, 0 if failed + */ +static __always_inline int queue_spin_trylock(struct qspinlock *lock) +{ + if (!ACCESS_ONCE(lock->qlock_word) && + (cmpxchg(&lock->qlock_word, 0, QSPINLOCK_LOCKED) == 0)) + return 1; + return 0; +} + +/** + * queue_spin_lock - acquire a queue spinlock + * @lock: Pointer to queue spinlock structure + */ +static __always_inline void queue_spin_lock(struct qspinlock *lock) +{ + /* + * To reduce memory access to only once for the cold cache case, + * a direct cmpxchg() is performed in the fastpath to optimize the + * uncontended case. The contended performance, however, may suffer + * a bit because of that. 
+ */ + if (likely(cmpxchg(&lock->qlock_word, 0, QSPINLOCK_LOCKED) == 0)) + return; + queue_spin_lock_slowpath(lock); +} + +/** + * queue_spin_unlock - release a queue spinlock + * @lock : Pointer to queue spinlock structure + */ +static __always_inline void queue_spin_unlock(struct qspinlock *lock) +{ + /* + * This unlock function may get inlined into the caller. The + * compiler barrier is needed to make sure that the lock will + * not be released before changes to the critical section are done. + * The ending smp_wmb() allows the queue head process to get the lock ASAP. + */ + barrier(); +#ifndef QSPINLOCK_MANYCPUS + { + u16 qcode = ACCESS_ONCE(lock->locked); + if (unlikely(qcode != QSPINLOCK_LOCKED)) + queue_spin_unlock_slowpath(lock, qcode); + } +#endif + ACCESS_ONCE(lock->locked) = 0; + smp_wmb(); +} + +#undef qlock_word + +/* + * Initializer + */ +#define __ARCH_SPIN_LOCK_UNLOCKED { { 0 } } + +#ifndef CONFIG_PARAVIRT_SPINLOCKS +/* + * Remapping spinlock architecture specific functions to the corresponding + * queue spinlock functions. + */ +#define arch_spin_is_locked(l) queue_spin_is_locked(l) +#define arch_spin_is_contended(l) queue_spin_is_contended(l) +#define arch_spin_lock(l) queue_spin_lock(l) +#define arch_spin_trylock(l) queue_spin_trylock(l) +#define arch_spin_unlock(l) queue_spin_unlock(l) +#define arch_spin_lock_flags(l, f) queue_spin_lock(l) +#endif + +#endif /* __ASM_GENERIC_QSPINLOCK_H */ diff --git a/lib/Kconfig b/lib/Kconfig index 71d9f81..d0306a0 100644 --- a/lib/Kconfig +++ b/lib/Kconfig @@ -410,6 +410,31 @@ config SIGNATURE Implementation is done using GnuPG MPI library # +# Generic queue spinlock +# +config QSPINLOCK + bool "Use queue spinlock" + depends on ARCH_QSPINLOCK + default n + help + This option enables an alternative spinlock implementation + that has the same spinlock structure size but is more + optimized for larger NUMA systems with a lot of CPU + cores. 
Specifically, waiting lock spinners are put to wait + in a queue on a local cacheline rather than all spinning + on the same lock cacheline. + +config QSPINLOCK_UNFAIR + bool "Allow lock stealing in queue spinlock (EXPERIMENTAL)" + depends on QSPINLOCK + default n + help + This option allows some lock stealing to happen and + breaks the FIFO guarantee on which process will get the + lock next. This kind of unfairness may help to improve + performance for some workloads. + +# # libfdt files, only selected if needed. # config LIBFDT diff --git a/lib/Makefile b/lib/Makefile index 7baccfd..a557f8c 100644 --- a/lib/Makefile +++ b/lib/Makefile @@ -187,3 +187,4 @@ quiet_cmd_build_OID_registry = GEN $@ clean-files += oid_registry_data.c obj-$(CONFIG_UCS2_STRING) += ucs2_string.o +obj-$(CONFIG_QSPINLOCK) += qspinlock.o diff --git a/lib/qspinlock.c b/lib/qspinlock.c new file mode 100644 index 0000000..2b65bec --- /dev/null +++ b/lib/qspinlock.c @@ -0,0 +1,522 @@ +/* + * Queue spinlock + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * (C) Copyright 2013 Hewlett-Packard Development Company, L.P. + * + * Authors: Waiman Long <waiman.long@hp.com> + */ +#include <linux/smp.h> +#include <linux/bug.h> +#include <linux/cpumask.h> +#include <linux/percpu.h> +#include <linux/hardirq.h> +#include <asm-generic/qspinlock.h> + +/* + * The basic principle of a queue-based spinlock can best be understood + * by studying a classic queue-based spinlock implementation called the + * MCS lock. 
The paper below provides a good description for this kind + * of lock. + * + * http://www.cise.ufl.edu/tr/DOC/REP-1992-71.pdf + * + * This queue spinlock implementation is based on the MCS lock with twists + * to make it fit the following constraints: + * 1. A max spinlock size of 4 bytes + * 2. Good fastpath performance + * 3. No change in the locking APIs + * + * The queue spinlock fastpath is as simple as it can get, all the heavy + * lifting is done in the lock slowpath. The main idea behind this queue + * spinlock implementation is to keep the spinlock size at 4 bytes while + * at the same time implement a queue structure to queue up the waiting + * lock spinners. + * + * Since preemption is disabled before getting the lock, a given CPU will + * only need to use one queue node structure in a non-interrupt context. + * A percpu queue node structure will be allocated for this purpose and the + * cpu number will be put into the queue spinlock structure to indicate the + * tail of the queue. + * + * To handle spinlock acquisition at interrupt context (softirq or hardirq), + * the queue node structure is actually an array for supporting nested spin + * locking operations in interrupt handlers. If all the entries in the + * array are used up, a warning message will be printed (as that shouldn't + * happen in normal circumstances) and the lock spinner will fall back to + * busy spinning instead of waiting in a queue. + * + * This queue spinlock implementation can be operated with the following 2 + * sets of orthogonal operation modes: + * 1) Numerous CPU vs. normal CPU mode + * 2) Fair vs. lock stealing mode + * + * Numerous CPU mode is on when NR_CPUS >= 16K, otherwise the code will be + * in normal CPU mode. The differences between these 2 modes are: + * 1) 24 bits of the 32-bit lock word can be used to encode the queue node + * address in numerous CPU mode whereas only 16 bits are used in normal + * mode. 
+ * 2) The normal mode allows unlock wakeup of the queue head, whereas in the + * numerous CPU mode, the next queue head is always woken up at the end + * of the lock operation. + * 3) The performance should be a little bit better in the normal CPU mode. + * + * The ability to do unlock wakeup does slow down the unlock fastpath a little + * bit. However, that eliminates the remaining CPU that is actively spinning + * on the lock cacheline hindering lock holder performance if it needs to + * access the same cacheline as the lock. The queue code of the next node + * is stored in the 16-bit lock word if one is available before the current + * CPU gets the lock. + * + * In fair mode, the queue spinlock is almost as fair as the ticket spinlock. + * There is a small window between the lock acquisition failure and the setup + * of the qcode in the lock where lock stealing can happen if the lock holder + * releases the lock during that period. Other than that, lock acquisition is + * in straight FIFO order. + * + * In lock stealing mode, lock stealing can happen whenever the lock holder + * releases the lock. This can improve performance in certain workloads, + * especially if the critical section is really short. It can also help in + * virtualized guests where the queue head CPU may be suspended, preventing + * others from getting the lock without lock stealing. + * + * The selection of fair or lock stealing mode is controlled by the setting + * of the QSPINLOCK_UNFAIR kernel config parameter. + */ + +/* +#ifndef CONFIG_DEBUG_SPINLOCK +#define CONFIG_DEBUG_SPINLOCK 1 +#endif + */ + +/* + * The queue node structure + * + * The used flag is used for synchronization between process and interrupt + * contexts of the same CPU. So it should be set first at initialization and + * cleared last in the cleanup code. 
+ */ +struct qnode { + u8 used; /* Used flag */ + u8 wait; /* Waiting flag */ +#ifndef QSPINLOCK_MANYCPUS + u16 qcode; /* Next queue node code */ +#endif + struct qnode *next; /* Next queue node addr */ +#ifdef CONFIG_DEBUG_SPINLOCK + u16 cpu_nr; /* CPU number */ + void *lock; /* Lock address */ +#endif +}; + +/* + * With 16K-1 CPUs or fewer, the 16-bit queue node code is divided into the + * following 2 fields: + * Bits 0-1 : queue node index (4 nodes) + * Bits 2-15: CPU number + 1 (16383 CPUs) + * + * A queue code of 0 indicates that no one is waiting on the lock. As the + * value 0 cannot then be used as a valid CPU number, we need to add 1 to it + * before putting it into the queue code. + * + * With more CPUs (QSPINLOCK_MANYCPUS mode - currently limited to 64K-1), + * the queue node code is just the CPU number + 1. The bits in the 32-bit + * lock are allocated as follows: + * Bits 0-7: lock word + * Bits 8-15: queue node index (Only first 2 bits used - 4 nodes) + * Bits 16-31: CPU number + 1 (65535 CPUs) + * + * The CPU number limit can be extended even further. The only constraints + * are: + * 1) Bits 0-7 cannot be used other than for locking. + * 2) Bits 16-31 cannot be 0 if at least one CPU is queuing. 
+ */ +#define MAX_QNODES 4 +#define MAX_CPUS (64*1024 - 1) +#ifdef QSPINLOCK_MANYCPUS +# define GET_QN_IDX(code) (((code) >> 8) & 0xff) +# define GET_CPU_NR(code) (((code) >> 16) - 1) +# define SET_QCODE(cpu, idx) ((((cpu) + 1) << 16) | ((idx) << 8) |\ + QSPINLOCK_LOCKED) +# define SET_NODE_QCODE(node, code) +# define QCODE_T u32 +# define QCODE qlock +#else +# define GET_QN_IDX(code) ((code) & 3) +# define GET_CPU_NR(code) (((code) >> 2) - 1) +# define SET_QCODE(cpu, idx) ((((cpu) + 1) << 2) | idx) +# define SET_NODE_QCODE(node, code) \ + do { ACCESS_ONCE((node)->qcode) = (code); } while (0) +# define QCODE_T u16 +# define QCODE qcode +#endif + +/* + * Per-CPU queue node structures + */ +static DEFINE_PER_CPU(struct qnode [MAX_QNODES], qnodes) ____cacheline_aligned + = { { 0 } }; + +/** + * xlate_qcode - translate the queue code into the queue node address + * @qcode: Queue code to be translated + * Return: The corresponding queue node address + */ +static inline struct qnode *xlate_qcode(QCODE_T qcode) +{ + u16 cpu_nr = GET_CPU_NR(qcode); + u16 qn_idx = GET_QN_IDX(qcode); + + return per_cpu_ptr(&qnodes[qn_idx], cpu_nr); +} + +/* + * Debugging macros & function + */ +#ifdef CONFIG_DEBUG_SPINLOCK +#define ASSERT(cond) BUG_ON(!(cond)) +#define SET_NODE(var, val) do { node->(var) = (val); } while (0) + +/** + * assert_nextnode - assert the next node is valid + * @next: Pointer to the next node + * @lock: Pointer to queue spinlock structure + * @cpu_nr: CPU number + */ +static noinline void +assert_nextnode(struct qnode *next, struct qspinlock *lock, u16 cpu_nr) +{ + ASSERT(next->lock == (void *)lock); + ASSERT(next->wait); + ASSERT(next->used); +} + + +/** + * assert_prevnode - assert the previous node is valid + * @prev: Pointer to the previous node + * @lock: Pointer to queue spinlock structure + * @cpu_nr: CPU number + */ +static noinline void +assert_prevnode(struct qnode *prev, struct qspinlock *lock, u16 cpu_nr) +{ + ASSERT(prev->cpu_nr != cpu_nr); + 
ASSERT(prev->lock == (void *)lock); + ASSERT(prev->used); + ASSERT(prev->next == NULL); +#ifndef QSPINLOCK_MANYCPUS + ASSERT(prev->qcode == 0); +#endif +} + +#else +#define ASSERT(cond) +#define SET_NODE(var, val) +#define assert_nextnode(next, lock, cpu_nr) +#define assert_prevnode(prev, lock, cpu_nr) +#endif + +/** + * unfair_trylock - try to acquire the lock ignoring the qcode + * @lock: Pointer to queue spinlock structure + * Return: 1 if lock acquired, 0 if failed + */ +static __always_inline int unfair_trylock(struct qspinlock *lock) +{ + if (!ACCESS_ONCE(lock->locked) && + (cmpxchg(&lock->locked, 0, QSPINLOCK_LOCKED) == 0)) + return 1; + return 0; +} + +/** + * init_node - initialize the queue node + * @node: Pointer to queue node structure + * @lock: Pointer to queue spinlock structure + * @cpu_nr: CPU number + * + * The used flag must be set before other fields. + */ +static inline void +init_node(struct qnode *node, struct qspinlock *lock, u16 cpu_nr) +{ +#ifdef QSPINLOCK_MANYCPUS + BUILD_BUG_ON(sizeof(lock->locked) != 1); +#else + BUILD_BUG_ON(sizeof(lock->locked) != 2); +#endif + ASSERT(!node->used); + node->used = true; + barrier(); + + ASSERT(!node->lock); + ASSERT(!node->next); + ASSERT(!node->wait); +#ifndef QSPINLOCK_MANYCPUS + ASSERT(!node->qcode); +#endif + node->wait = true; + node->next = NULL; + SET_NODE_QCODE(node, 0); + SET_NODE(cpu_nr, cpu_nr); + SET_NODE(lock, (void *)lock); +} + +/** + * cleanup_node - Clean up the queue node + * @node: Pointer to queue node structure + * @cpu_nr: CPU number + * + * The used flag must be the last one to be cleared. 
+ */
+static inline void cleanup_node(struct qnode *node, u16 cpu_nr)
+{
+	node->next = NULL;
+	node->wait = false;
+	SET_NODE_QCODE(node, 0);
+	SET_NODE(lock, NULL);
+	ASSERT(cpu_nr == smp_processor_id());
+	barrier();
+	node->used = false;
+}
+
+/**
+ * queue_spin_lock_slowpath - acquire the queue spinlock
+ * @lock: Pointer to queue spinlock structure
+ */
+void queue_spin_lock_slowpath(struct qspinlock *lock)
+{
+	unsigned int cpu_nr, qn_idx;
+	struct qnode *node, *next = NULL;
+	QCODE_T prev_qcode, my_qcode;
+	u16 lockcode = QSPINLOCK_LOCKED;
+
+	/*
+	 * Current code cannot support more than 64K-1 CPUs
+	 */
+	BUILD_BUG_ON(CONFIG_NR_CPUS > MAX_CPUS);
+
+	/*
+	 * Get the queue node
+	 */
+	cpu_nr = smp_processor_id();
+	node   = this_cpu_ptr(&qnodes[0]);
+	qn_idx = 0;
+
+	if (unlikely(node->used)) {
+		/*
+		 * This node has been used, try to find an empty queue
+		 * node entry.
+		 */
+		for (qn_idx = 1; qn_idx < MAX_QNODES; qn_idx++)
+			if (!node[qn_idx].used)
+				break;
+		if (unlikely(qn_idx == MAX_QNODES)) {
+			/*
+			 * This shouldn't happen, print a warning message
+			 * & busy spin on the lock.
+			 */
+			printk_sched(
+			  "qspinlock: queue node table exhausted at cpu %d!\n",
+			  cpu_nr);
+			while (!unfair_trylock(lock))
+				cpu_relax();
+			return;
+		}
+		/* Adjust node pointer */
+		node += qn_idx;
+	}
+
+	/*
+	 * Set up the new cpu code to be exchanged
+	 */
+	my_qcode = SET_QCODE(cpu_nr, qn_idx);
+
+	/*
+	 * The lock may be available at this point, try again before waiting
+	 * in a queue.
+	 */
+	if (queue_spin_trylock(lock))
+		return;
+
+	/*
+	 * Initialize the queue node
+	 */
+	init_node(node, lock, cpu_nr);
+
+	/*
+	 * Exchange current copy of the queue node code
+	 */
+	prev_qcode = xchg(&lock->QCODE, my_qcode);
+#ifdef QSPINLOCK_MANYCPUS
+	/*
+	 * In large NR_CPUS mode, it is possible that we may accidentally
+	 * steal the lock. If this is the case, we need to either release it
+	 * if not the head of the queue or get the lock and be done with it.
+	 */
+	if (unlikely(!(prev_qcode & QSPINLOCK_LOCKED))) {
+		if (prev_qcode == 0) {
+			/*
+			 * Got the lock since it is at the head of the queue.
+			 * Now try to atomically clear the queue code.
+			 */
+			if (cmpxchg(&lock->QCODE, my_qcode, QSPINLOCK_LOCKED)
+			   == my_qcode)
+				goto release_node;
+			/*
+			 * The cmpxchg fails only if one or more processes
+			 * are added to the queue. In this case, we need to
+			 * notify the next one to be the head of the queue.
+			 */
+			goto notify_next;
+		}
+		/*
+		 * Accidentally stole the lock, release the lock and
+		 * let the queue head get it.
+		 */
+		ACCESS_ONCE(lock->locked) = 0;
+		smp_wmb();
+	} else
+		prev_qcode &= ~QSPINLOCK_LOCKED;	/* Clear the lock bit */
+	my_qcode &= ~QSPINLOCK_LOCKED;
+	ASSERT((my_qcode & 0xff) == 0);
+	ASSERT((prev_qcode & 0xff) == 0);
+#endif
+	if (prev_qcode) {
+		/*
+		 * Not at the queue head, get the address of the previous node
+		 * and set up the "next" & "qcode" fields of the previous node.
+		 */
+		struct qnode *prev = xlate_qcode(prev_qcode);
+
+		assert_prevnode(prev, lock, cpu_nr);
+		/*
+		 * The queue code should be set before the next pointer or
+		 * the queue code assertion in init_node() may fail.
+		 */
+		SET_NODE_QCODE(prev, my_qcode);
+		ACCESS_ONCE(prev->next) = node;
+		smp_wmb();
+		/*
+		 * Wait until the waiting flag is off
+		 */
+		while (ACCESS_ONCE(node->wait))
+			cpu_relax();
+	}
+
+	/*
+	 * At the head of the wait queue now
+	 */
+	while (true) {
+		struct qspinlock old, new;
+
+		if (unlikely(!next)) {
+			/*
+			 * Get the next node & clean up current node, if
+			 * available.
+			 */
+			next = ACCESS_ONCE(node->next);
+			if (next) {
+#ifndef QSPINLOCK_MANYCPUS
+				u16 qcode = ACCESS_ONCE(node->qcode);
+
+				while (unlikely(!qcode)) {
+					cpu_relax();
+					qcode = ACCESS_ONCE(node->qcode);
+				}
+				lockcode = qcode;
+#endif
+				assert_nextnode(next, lock, cpu_nr);
+				cleanup_node(node, cpu_nr);
+				node = NULL;
+			}
+		}
+		old.qlock = ACCESS_ONCE(lock->qlock);
+		if (old.locked)
+			;	/* Lock not available yet */
+		else if (old.QCODE != my_qcode) {
+			/*
+			 * Just get the lock with other waiting processes
+			 */
+			if (cmpxchg(&lock->locked, 0, lockcode) == 0) {
+				/*
+				 * If lockcode is not QSPINLOCK_LOCKED, node
+				 * has been freed and we can return without
+				 * doing anything.
+				 */
+				if (lockcode != QSPINLOCK_LOCKED)
+					return;
+				break;	/* May need to set up the next node */
+			}
+		} else {
+			/*
+			 * Get the lock & clear the queue code simultaneously
+			 */
+			new.qlock  = 0;
+			new.locked = QSPINLOCK_LOCKED;
+			if (cmpxchg(&lock->qlock, old.qlock, new.qlock)
+			   == old.qlock) {
+				ASSERT(!next);
+				goto release_node;
+			}
+		}
+		cpu_relax();
+	}
+
+#ifdef QSPINLOCK_MANYCPUS
+notify_next:
+#endif
+	/*
+	 * If the next pointer is not set, we need to wait for it and notify
+	 * the next one in line to do busy spinning.
+	 */
+	if (unlikely(!next)) {
+		/*
+		 * Wait until the next one in the queue sets up the next field
+		 */
+		while (!(next = ACCESS_ONCE(node->next)))
+			cpu_relax();
+		assert_nextnode(next, lock, cpu_nr);
+	}
+	/*
+	 * The next one in the queue is now at the head
+	 */
+	ACCESS_ONCE(next->wait) = false;
+	smp_wmb();
+
+release_node:
+	if (node)
+		cleanup_node(node, cpu_nr);
+}
+EXPORT_SYMBOL(queue_spin_lock_slowpath);
+
+#ifndef QSPINLOCK_MANYCPUS
+/**
+ * queue_spin_unlock_slowpath - notify the next one in the queue to get lock
+ * @lock : Pointer to queue spinlock structure
+ * @qcode: Queue code of the next node to be woken up
+ */
+void queue_spin_unlock_slowpath(struct qspinlock *lock, u16 qcode)
+{
+	struct qnode *next = xlate_qcode(qcode);
+
+	assert_nextnode(next, lock, smp_processor_id());
+	ACCESS_ONCE(next->wait) = false;
+	/*
+	 * An smp_wmb() call will be issued at the end of the fastpath, so
+	 * we don't need one here.
+	 */
+}
+EXPORT_SYMBOL(queue_spin_unlock_slowpath);
+#endif
-- 
1.7.1

^ permalink raw reply related	[flat|nested] 18+ messages in thread
end of thread, other threads:[~2013-08-30 8:16 UTC | newest]
Thread overview: 18+ messages
[not found] <15321377012704@web8h.yandex.ru>
2013-08-21 3:01 ` [PATCH RFC v2 1/2] qspinlock: Introducing a 4-byte queue spinlock implementation Waiman Long
2013-08-21 15:51 ` Alexander Fyodorov
2013-08-22 1:04 ` Waiman Long
2013-08-22 13:28 ` Alexander Fyodorov
2013-08-26 20:14 ` Waiman Long
2013-08-27 12:09 ` Alexander Fyodorov
[not found] ` <20130827091436.3d5971a0@gandalf.local.home>
2013-08-27 13:53 ` Peter Zijlstra
2013-08-28 1:21 ` Paul E. McKenney
2013-08-28 8:19 ` Peter Zijlstra
2013-08-28 12:59 ` Steven Rostedt
2013-08-28 13:05 ` Peter Zijlstra
2013-08-28 13:15 ` Steven Rostedt
2013-08-28 13:37 ` Peter Zijlstra
2013-08-29 15:24 ` Waiman Long
2013-08-29 17:03 ` Alexander Fyodorov
2013-08-30 3:16 ` Waiman Long
2013-08-30 8:15 ` Alexander Fyodorov
2013-08-13 18:41 [PATCH RFC v2 0/2] qspinlock: Introducing a 4-byte queue spinlock Waiman Long
2013-08-13 18:41 ` [PATCH RFC v2 1/2] qspinlock: Introducing a 4-byte queue spinlock implementation Waiman Long