Date: Fri, 08 Nov 2013 17:36:12 -0500
From: Waiman Long
To: paulmck@linux.vnet.ibm.com
CC: Thomas Gleixner, Ingo Molnar, "H. Peter Anvin", Arnd Bergmann,
    linux-arch@vger.kernel.org, x86@kernel.org, linux-kernel@vger.kernel.org,
    Peter Zijlstra, Steven Rostedt, Andrew Morton, Michel Lespinasse,
    Andi Kleen, Rik van Riel, Linus Torvalds, Raghavendra K T,
    George Spelvin, Tim Chen, Scott J Norton
Subject: Re: [PATCH v5 1/4] qrwlock: A queue read/write lock implementation
Message-ID: <527D675C.2040008@hp.com>
References: <1383585440-4391-1-git-send-email-Waiman.Long@hp.com>
 <1383585440-4391-2-git-send-email-Waiman.Long@hp.com>
 <20131108211123.GH18245@linux.vnet.ibm.com>
In-Reply-To: <20131108211123.GH18245@linux.vnet.ibm.com>

On 11/08/2013 04:11 PM, Paul E. McKenney wrote:
> On Mon, Nov 04, 2013 at 12:17:17PM -0500, Waiman Long wrote:
>> Kernel      JPM     %Change from (1)
>> ------      ---     ----------------
>>   1       148265          -
>>   2       238715        +61%
>>   3       242048        +63%
>>   4       234881        +58%
>>
>> The use of unfair qrwlock provides a small boost of 2%, while using
>> fair qrwlock leads to a 3% decrease in performance. However, looking
>> at the perf profiles, we can clearly see that other bottlenecks were
>> constraining the performance improvement.
>>
>> Perf profile of kernel (2):
>>
>>   18.20%  reaim  [kernel.kallsyms]  [k] __write_lock_failed
>>    9.36%  reaim  [kernel.kallsyms]  [k] _raw_spin_lock_irqsave
>>    2.91%  reaim  [kernel.kallsyms]  [k] mspin_lock
>>    2.73%  reaim  [kernel.kallsyms]  [k] anon_vma_interval_tree_insert
>>    2.23%  ls     [kernel.kallsyms]  [k] _raw_spin_lock_irqsave
>>    1.29%  reaim  [kernel.kallsyms]  [k] __read_lock_failed
>>    1.21%  true   [kernel.kallsyms]  [k] _raw_spin_lock_irqsave
>>    1.14%  reaim  [kernel.kallsyms]  [k] zap_pte_range
>>    1.13%  reaim  [kernel.kallsyms]  [k] _raw_spin_lock
>>    1.04%  reaim  [kernel.kallsyms]  [k] mutex_spin_on_owner
>>
>> Perf profile of kernel (3):
>>
>>   10.57%  reaim  [kernel.kallsyms]  [k] _raw_spin_lock_irqsave
>>    7.98%  reaim  [kernel.kallsyms]  [k] queue_write_lock_slowpath
>>    5.83%  reaim  [kernel.kallsyms]  [k] mspin_lock
>>    2.86%  ls     [kernel.kallsyms]  [k] _raw_spin_lock_irqsave
>>    2.71%  reaim  [kernel.kallsyms]  [k] anon_vma_interval_tree_insert
>>    1.52%  true   [kernel.kallsyms]  [k] _raw_spin_lock_irqsave
>>    1.51%  reaim  [kernel.kallsyms]  [k] queue_read_lock_slowpath
>>    1.35%  reaim  [kernel.kallsyms]  [k] mutex_spin_on_owner
>>    1.12%  reaim  [kernel.kallsyms]  [k] zap_pte_range
>>    1.06%  reaim  [kernel.kallsyms]  [k] perf_event_aux_ctx
>>    1.01%  reaim  [kernel.kallsyms]  [k] perf_event_aux
>
> But wouldn't kernel (4) be the one that was the most highly constrained?
>
> (That said, yes, I get that _raw_spin_lock_irqsave() is some lock that
> is unrelated to the qrwlock.)
I think the performance data is a bit off, as it was collected with a
previous version that had a minor bug in it. I will rerun the test to get
new data.

>> +/**
>> + * queue_write_can_lock - would write_trylock() succeed?
>> + * @lock: Pointer to queue rwlock structure
>> + */
>> +static inline int queue_write_can_lock(struct qrwlock *lock)
>> +{
>> +	union qrwcnts rwcnts;
>> +
>> +	rwcnts.rw = ACCESS_ONCE(lock->cnts.rw);
>> +	return !rwcnts.writer && !rwcnts.readers;
>> +}
>> +
>> +/**
>> + * queue_read_trylock - try to acquire read lock of a queue rwlock
>> + * @lock : Pointer to queue rwlock structure
>> + * Return: 1 if lock acquired, 0 if failed
>> + */
>> +static inline int queue_read_trylock(struct qrwlock *lock)
>> +{
>> +	union qrwcnts cnts;
>> +	u8 wmask;
>> +
>> +	cnts.rw = ACCESS_ONCE(lock->cnts.rw);
>> +	wmask = cnts.fair ? QW_MASK_FAIR : QW_MASK_UNFAIR;
>> +	if (likely(!(cnts.writer & wmask))) {
>> +		cnts.rw = xadd(&lock->cnts.rw, QRW_READER_BIAS);
>
> On an unfair lock, this can momentarily make queue_read_can_lock() give
> a false positive.  Not sure that this is a problem -- after all, the
> return value from queue_read_can_lock() is immediately obsolete anyway.

Yes, this is an issue. However, I don't think this is a big deal, as you
said. Using cmpxchg may avoid this issue, but then there will be a fair
chance of false collision among readers. So it is probably something we
may have to live with.

>> +/**
>> + * queue_write_unlock - release write lock of a queue rwlock
>> + * @lock : Pointer to queue rwlock structure
>> + */
>> +static inline void queue_write_unlock(struct qrwlock *lock)
>> +{
>> +	/*
>> +	 * Make sure that none of the critical section will be leaked out.
>> +	 */
>> +	smp_mb__before_clear_bit();
>> +	ACCESS_ONCE(lock->cnts.writer) = 0;
>> +	smp_mb__after_clear_bit();
>
> How about the new smp_store_release() for this write?  Looks to me that
> smp_mb__before_clear_bit() and smp_mb__after_clear_bit() work by accident,
> if they in fact do work for all architectures.

Yes, I am thinking about updating the patch to use smp_store_release()
once it is in. I don't want to create such a dependency for this patch
yet. Anyway, clearing a bit looks the same as clearing a byte to me.

>> +/*
>> + * Compared with regular rwlock, the queue rwlock has the following
>> + * advantages:
>> + * 1. It is more deterministic for the fair variant. Even though there is
>> + *    a slight chance of stealing the lock if come at the right moment, the
>> + *    granting of the lock is mostly in FIFO order. Even the default unfair
>> + *    variant is fairer at least among the writers.
>> + * 2. It is faster in high contention situation.
>
> Sometimes, anyway!  (Referring to your performance results on top of
> Ingo's patch.)

Will adjust the wording by adding "usually".

>> +/**
>> + * wait_in_queue - Add to queue and wait until it is at the head
>> + * @lock: Pointer to queue rwlock structure
>> + * @node: Node pointer to be added to the queue
>> + *
>> + * The use of smp_wmb() is to make sure that the other CPUs see the change
>> + * ASAP.
>> + */
>> +static __always_inline void
>> +wait_in_queue(struct qrwlock *lock, struct qrwnode *node)
>> +{
>> +	struct qrwnode *prev;
>> +
>> +	node->next = NULL;
>> +	node->wait = true;
>> +	prev = xchg(&lock->waitq, node);
>> +	if (prev) {
>> +		prev->next = node;
>> +		smp_wmb();
>
> This smp_wmb() desperately needs a comment.  Presumably it is ordering
> the above "prev->next = node" with some later write, but what write?
>
> Oh...  I see the header comment above.
>
> Actually, memory barriers don't necessarily make things visible sooner.
> They are instead used for ordering.  Or did you actually measure a
> performance increase with this?  (Seems -highly- unlikely given smp_wmb()'s
> definition on x86...)

I had some incorrect assumptions about memory barriers. Anyway, this issue
will be gone once I use the MCS lock/unlock code.
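Roughly, I would expect the enqueue side to end up looking like the generic
MCS code, along the lines of the untested sketch below (it reuses the
qrwnode/waitq names from this patch; the final version may simply call the
common MCS lock/unlock helpers instead):

static __always_inline void
wait_in_queue(struct qrwlock *lock, struct qrwnode *node)
{
	struct qrwnode *prev;

	node->next = NULL;
	node->wait = true;

	/*
	 * The xchg() implies a full memory barrier, so the node is fully
	 * initialized before it becomes reachable via lock->waitq.
	 */
	prev = xchg(&lock->waitq, node);
	if (prev) {
		/*
		 * Link behind the predecessor; it will clear node->wait
		 * when we reach the head of the queue. No smp_wmb() is
		 * needed here - we only spin until we are signaled.
		 */
		ACCESS_ONCE(prev->next) = node;
		while (ACCESS_ONCE(node->wait))
			cpu_relax();
	}
}

The ordering needed for the hand-off is already provided by the xchg() and
by the fact that each waiter spins only on its own node, so the extra
smp_wmb() simply goes away.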
>> +		/*
>> +		 * Wait until the waiting flag is off
>> +		 */
>> +		while (ACCESS_ONCE(node->wait))
>> +			cpu_relax();
>> +	}
>> +}
>> +
>> +/**
>> + * signal_next - Signal the next one in queue to be at the head
>> + * @lock: Pointer to queue rwlock structure
>> + * @node: Node pointer to the current head of queue
>> + */
>> +static __always_inline void
>> +signal_next(struct qrwlock *lock, struct qrwnode *node)
>> +{
>> +	struct qrwnode *next;
>> +
>> +	/*
>> +	 * Try to notify the next node first without disturbing the cacheline
>> +	 * of the lock. If that fails, check to see if it is the last node
>> +	 * and so should clear the wait queue.
>> +	 */
>> +	next = ACCESS_ONCE(node->next);
>> +	if (likely(next))
>> +		goto notify_next;
>> +
>> +	/*
>> +	 * Clear the wait queue if it is the last node
>> +	 */
>> +	if ((ACCESS_ONCE(lock->waitq) == node) &&
>> +	    (cmpxchg(&lock->waitq, node, NULL) == node))
>> +		return;
>> +	/*
>> +	 * Wait until the next one in queue set up the next field
>> +	 */
>> +	while (likely(!(next = ACCESS_ONCE(node->next))))
>> +		cpu_relax();
>> +	/*
>> +	 * The next one in queue is now at the head
>> +	 */
>> +notify_next:
>> +	barrier();
>> +	ACCESS_ONCE(next->wait) = false;
>> +	smp_wmb();
>
> Because smp_wmb() does not order reads, reads from the critical section
> could leak out of the critical section.  A full memory barrier (smp_mb())
> seems necessary to avoid this.
>
> Yes, you do have full memory barriers implicit in various atomic operations,
> but it appears to be possible to avoid them all in some situations.

Yes, you are right.

>> +/**
>> + * queue_write_3step_lock - acquire write lock in 3 steps
>> + * @lock : Pointer to queue rwlock structure
>> + * Return: 1 if lock acquired, 0 otherwise
>> + *
>> + * Step 1 - Try to acquire the lock directly if no reader is present
>> + * Step 2 - Set the waiting flag to notify readers that a writer is waiting
>> + * Step 3 - When the readers field goes to 0, set the locked flag
>> + *
>> + * When not in fair mode, the readers actually ignore the second step.
>> + * However, this is still necessary to force other writers to fall in line.
>> + */
>> +static __always_inline int queue_write_3step_lock(struct qrwlock *lock)
>> +{
>> +	union qrwcnts old, new;
>> +
>> +	old.rw = ACCESS_ONCE(lock->cnts.rw);
>> +
>> +	/* Step 1 */
>> +	if (!old.writer && !old.readers) {
>> +		new.rw = old.rw;
>> +		new.writer = QW_LOCKED;
>> +		if (likely(cmpxchg(&lock->cnts.rw, old.rw, new.rw) == old.rw))
>> +			return 1;
>> +	}
>> +
>> +	/* Step 2 */
>> +	if (old.writer || (cmpxchg(&lock->cnts.writer, 0, QW_WAITING) != 0))
>> +		return 0;
>> +
>> +	/* Step 3 */
>> +	while (true) {
>> +		cpu_relax();
>> +		old.rw = ACCESS_ONCE(lock->cnts.rw);
>
> Suppose that there now is a writer, but no readers...
>
>> +		if (!old.readers) {
>> +			new.rw = old.rw;
>> +			new.writer = QW_LOCKED;
>> +			if (likely(cmpxchg(&lock->cnts.rw, old.rw, new.rw)
>> +				== old.rw))
>
> ... can't this mistakenly hand out the lock to a second writer?
>
> Ah, the trick is that we are at the head of the queue, so the only writer
> we can possibly contend with is a prior holder of the lock.  Once that
> writer leaves, no other writer can appear.  And the QW_WAITING bit
> prevents new writers from immediately grabbing the lock.

Yes, that is the point. The current code has two queues for writers - a
one-position queue in the waiting bit and the MCS locking queue that is
used by both readers and writers.
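To make that concrete, the lock word looks roughly like this (a sketch
reconstructed from the fields used in the quoted code, little-endian layout
shown; the actual header in the patch also handles big-endian ordering and
defines the QW_* values):

union qrwcnts {
	struct {
		u8  writer;	/* 0, QW_WAITING or QW_LOCKED */
		u8  fair;	/* set when the lock is the fair variant */
		u16 readers;	/* number of active readers */
	};
	u32 rw;			/* whole word, for cmpxchg()/xadd() */
};

The writer byte acts as the one-position writer queue: only the writer at
the head of the MCS queue competes for it, first with QW_WAITING and then
with QW_LOCKED. Once QW_WAITING is set, a newly arriving writer's fast path
fails and it has to join the MCS queue behind the waiting writer.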
>> +				return 1;
>> +		}
>> +	}
>> +	/* Should never reach here */
>> +	return 0;
>> +}
>> +
>> +/**
>> + * queue_write_lock_slowpath - acquire write lock of a queue rwlock
>> + * @lock : Pointer to queue rwlock structure
>> + */
>> +void queue_write_lock_slowpath(struct qrwlock *lock)
>> +{
>> +	struct qrwnode node;
>> +
>> +	/*
>> +	 * Put the writer into the wait queue
>> +	 */
>> +	wait_in_queue(lock, &node);
>> +
>> +	/*
>> +	 * At the head of the wait queue now, call queue_write_3step_lock()
>> +	 * to acquire the lock until it is done.
>> +	 */
>> +	while (!queue_write_3step_lock(lock))
>> +		cpu_relax();
>
> If we get here, queue_write_3step_lock() just executed a successful
> cmpxchg(), which implies a full memory barrier.  This prevents the
> critical section from leaking out, good!
>
>> +	signal_next(lock, &node);
>> +}
>> +EXPORT_SYMBOL(queue_write_lock_slowpath);
>> --
>> 1.7.1

Thanks for the review.

-Longman