From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
To: Waiman Long <waiman.long@hp.com>
Cc: Thomas Gleixner <tglx@linutronix.de>,
Ingo Molnar <mingo@redhat.com>, "H. Peter Anvin" <hpa@zytor.com>,
Arnd Bergmann <arnd@arndb.de>,
linux-arch@vger.kernel.org, x86@kernel.org,
linux-kernel@vger.kernel.org,
Peter Zijlstra <peterz@infradead.org>,
Steven Rostedt <rostedt@goodmis.org>,
Andrew Morton <akpm@linux-foundation.org>,
Michel Lespinasse <walken@google.com>,
Andi Kleen <andi@firstfloor.org>, Rik van Riel <riel@redhat.com>,
Linus Torvalds <torvalds@linux-foundation.org>,
Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>,
George Spelvin <linux@horizon.com>,
Tim Chen <tim.c.chen@linux.intel.com>,
aswin@hp.com, Scott J Norton <scott.norton@hp.com>
Subject: Re: [PATCH v5 1/4] qrwlock: A queue read/write lock implementation
Date: Fri, 8 Nov 2013 15:51:08 -0800 [thread overview]
Message-ID: <20131108235108.GA22696@linux.vnet.ibm.com> (raw)
In-Reply-To: <527D675C.2040008@hp.com>
On Fri, Nov 08, 2013 at 05:36:12PM -0500, Waiman Long wrote:
> On 11/08/2013 04:11 PM, Paul E. McKenney wrote:
> >On Mon, Nov 04, 2013 at 12:17:17PM -0500, Waiman Long wrote:
> >>Kernel JPM %Change from (1)
> >>------ --- ----------------
> >> 1 148265 -
> >> 2 238715 +61%
> >> 3 242048 +63%
> >> 4 234881 +58%
> >>
> >>The use of unfair qrwlock provides a small boost of 2%, while using
> >>fair qrwlock leads to 3% decrease of performance. However, looking
> >>at the perf profiles, we can clearly see that other bottlenecks were
> >>constraining the performance improvement.
> >>
> >>Perf profile of kernel (2):
> >>
> >> 18.20% reaim [kernel.kallsyms] [k] __write_lock_failed
> >> 9.36% reaim [kernel.kallsyms] [k] _raw_spin_lock_irqsave
> >> 2.91% reaim [kernel.kallsyms] [k] mspin_lock
> >> 2.73% reaim [kernel.kallsyms] [k] anon_vma_interval_tree_insert
> >> 2.23% ls [kernel.kallsyms] [k] _raw_spin_lock_irqsave
> >> 1.29% reaim [kernel.kallsyms] [k] __read_lock_failed
> >> 1.21% true [kernel.kallsyms] [k] _raw_spin_lock_irqsave
> >> 1.14% reaim [kernel.kallsyms] [k] zap_pte_range
> >> 1.13% reaim [kernel.kallsyms] [k] _raw_spin_lock
> >> 1.04% reaim [kernel.kallsyms] [k] mutex_spin_on_owner
> >>
> >>Perf profile of kernel (3):
> >>
> >> 10.57% reaim [kernel.kallsyms] [k] _raw_spin_lock_irqsave
> >> 7.98% reaim [kernel.kallsyms] [k] queue_write_lock_slowpath
> >> 5.83% reaim [kernel.kallsyms] [k] mspin_lock
> >> 2.86% ls [kernel.kallsyms] [k] _raw_spin_lock_irqsave
> >> 2.71% reaim [kernel.kallsyms] [k] anon_vma_interval_tree_insert
> >> 1.52% true [kernel.kallsyms] [k] _raw_spin_lock_irqsave
> >> 1.51% reaim [kernel.kallsyms] [k] queue_read_lock_slowpath
> >> 1.35% reaim [kernel.kallsyms] [k] mutex_spin_on_owner
> >> 1.12% reaim [kernel.kallsyms] [k] zap_pte_range
> >> 1.06% reaim [kernel.kallsyms] [k] perf_event_aux_ctx
> >> 1.01% reaim [kernel.kallsyms] [k] perf_event_aux
> >But wouldn't kernel (4) be the one that was the most highly constrained?
> >
> >(That said, yes, I get that _raw_spin_lock_irqsave() is some lock that
> >is unrelated to the qrwlock.)
>
> I think the performance data is a bit off as it was collected with a
> previous version that has a minor bug in it. I will rerun the test
> to get the new data.
>
> >
> >+/**
> >+ * queue_write_can_lock- would write_trylock() succeed?
> >+ * @lock: Pointer to queue rwlock structure
> >+ */
> >+static inline int queue_write_can_lock(struct qrwlock *lock)
> >+{
> >+ union qrwcnts rwcnts;
> >+
> >+ rwcnts.rw = ACCESS_ONCE(lock->cnts.rw);
> >+ return !rwcnts.writer&& !rwcnts.readers;
> >+}
> >+
> >+/**
> >+ * queue_read_trylock - try to acquire read lock of a queue rwlock
> >+ * @lock : Pointer to queue rwlock structure
> >+ * Return: 1 if lock acquired, 0 if failed
> >+ */
> >+static inline int queue_read_trylock(struct qrwlock *lock)
> >+{
> >+ union qrwcnts cnts;
> >+ u8 wmask;
> >+
> >+ cnts.rw = ACCESS_ONCE(lock->cnts.rw);
> >+ wmask = cnts.fair ? QW_MASK_FAIR : QW_MASK_UNFAIR;
> >+ if (likely(!(cnts.writer& wmask))) {
> >+ cnts.rw = xadd(&lock->cnts.rw, QRW_READER_BIAS);
> >On an unfair lock, this can momentarily make queue_read_can_lock() give
> >a false positive. Not sure that this is a problem -- after all, the
> >return value from queue_read_can_lock() is immediately obsolete anyway.
>
> Yes, this is an issue. However, I don't this is a big deal as you
> said. Using cmpxchg may avoid this issue, but then their will be a
> fair chance of false collision among readers. So it is probably
> something we may have to live with.
>
> >>+/**
> >>+ * queue_write_unlock - release write lock of a queue rwlock
> >>+ * @lock : Pointer to queue rwlock structure
> >>+ */
> >>+static inline void queue_write_unlock(struct qrwlock *lock)
> >>+{
> >>+ /*
> >>+ * Make sure that none of the critical section will be leaked out.
> >>+ */
> >>+ smp_mb__before_clear_bit();
> >>+ ACCESS_ONCE(lock->cnts.writer) = 0;
> >>+ smp_mb__after_clear_bit();
> >How about the new smp_store_release() for this write? Looks to me that
> >smp_mb__before_clear_bit() and smp_mb__after_clear_bit() work by accident,
> >if they in fact do work for all architectures.
>
> Yes, I am thinking about updating the patch to use
> smp_store_release() once it is in. I don't want to create such a
> dependency for this patch yet. Anyway, clearing a bit looks the same
> as clearing a byte to me.
>
> >>+/*
> >>+ * Compared with regular rwlock, the queue rwlock has has the following
> >>+ * advantages:
> >>+ * 1. It is more deterministic for the fair variant. Even though there is
> >>+ * a slight chance of stealing the lock if come at the right moment, the
> >>+ * granting of the lock is mostly in FIFO order. Even the default unfair
> >>+ * variant is fairer at least among the writers.
> >>+ * 2. It is faster in high contention situation.
> >Sometimes, anyway! (Referring to your performance results on top of
> >Ingo's patch.)
>
> Will adjust the wording by adding "usually".
>
> >
> >+/**
> >+ * wait_in_queue - Add to queue and wait until it is at the head
> >+ * @lock: Pointer to queue rwlock structure
> >+ * @node: Node pointer to be added to the queue
> >+ *
> >+ * The use of smp_wmb() is to make sure that the other CPUs see the change
> >+ * ASAP.
> >+ */
> >+static __always_inline void
> >+wait_in_queue(struct qrwlock *lock, struct qrwnode *node)
> >+{
> >+ struct qrwnode *prev;
> >+
> >+ node->next = NULL;
> >+ node->wait = true;
> >+ prev = xchg(&lock->waitq, node);
> >+ if (prev) {
> >+ prev->next = node;
> >+ smp_wmb();
> >This smp_wmb() desperately needs a comment. Presumably it is ordering
> >the above "prev->next = node" with some later write, but what write?
> >
> >Oh... I see the header comment above.
> >
> >Actually, memory barriers don't necessarily make things visible sooner.
> >They are instead used for ordering. Or did you actually measure a
> >performance increase with this? (Seems -highly- unlikely given smp_wmb()'s
> >definition on x86...)
>
> I have some incorrect assumptions about memory barrier. Anyway, this
> issue will be gone once I use the MCS lock/unlock code.
Here is a presentation that has some diagrams that might help:
http://www.rdrop.com/users/paulmck/scalability/paper/Scaling.2013.10.25c.pdf
So, for example, if X and Y are both initially zero:
CPU 0 CPU 1
ACCESS_ONCE(X) = 1; r1 = ACCESS_ONCE(Y);
smp_wmb(); smp_rmb();
ACCESS_ONCE(Y) = 1; r2 = ACCESS_ONCE(X);
Then the two memory barriers enforce a conditional ordering. The
condition is whether or not CPU 0's store to Y is seen by CPU 1's
load from Y. If it is, then the pair of memory barriers ensure that
CPU 1's load from X sees the result of CPU 0's store to X. In other
words, BUG_ON(r1 == 1 && r2 == 0) will never fire.
In general, if a memory access after memory barrier A happens before
a memory access before memory barrier B, then the two memory barriers
will ensure that applicable accesses before memory barrier A happen
before applicable accesses after memory barrier B.
Does that help?
Thanx, Paul
> >>+ /*
> >>+ * Wait until the waiting flag is off
> >>+ */
> >>+ while (ACCESS_ONCE(node->wait))
> >>+ cpu_relax();
> >>+ }
> >>+}
> >>+
> >>+/**
> >>+ * signal_next - Signal the next one in queue to be at the head
> >>+ * @lock: Pointer to queue rwlock structure
> >>+ * @node: Node pointer to the current head of queue
> >>+ */
> >>+static __always_inline void
> >>+signal_next(struct qrwlock *lock, struct qrwnode *node)
> >>+{
> >>+ struct qrwnode *next;
> >>+
> >>+ /*
> >>+ * Try to notify the next node first without disturbing the cacheline
> >>+ * of the lock. If that fails, check to see if it is the last node
> >>+ * and so should clear the wait queue.
> >>+ */
> >>+ next = ACCESS_ONCE(node->next);
> >>+ if (likely(next))
> >>+ goto notify_next;
> >>+
> >>+ /*
> >>+ * Clear the wait queue if it is the last node
> >>+ */
> >>+ if ((ACCESS_ONCE(lock->waitq) == node)&&
> >>+ (cmpxchg(&lock->waitq, node, NULL) == node))
> >>+ return;
> >>+ /*
> >>+ * Wait until the next one in queue set up the next field
> >>+ */
> >>+ while (likely(!(next = ACCESS_ONCE(node->next))))
> >>+ cpu_relax();
> >>+ /*
> >>+ * The next one in queue is now at the head
> >>+ */
> >>+notify_next:
> >>+ barrier();
> >>+ ACCESS_ONCE(next->wait) = false;
> >>+ smp_wmb();
> >Because smp_wmb() does not order reads, reads from the critical section
> >could leak out of the critical section. A full memory barrier (smp_mb())
> >seems necessary to avoid this.
> >
> >Yes, you do have full memory barriers implicit in various atomic operations,
> >but it appears to be possible to avoid them all in some situations.
>
> Yes, you are right.
>
> >>+/**
> >>+ * queue_write_3step_lock - acquire write lock in 3 steps
> >>+ * @lock : Pointer to queue rwlock structure
> >>+ * Return: 1 if lock acquired, 0 otherwise
> >>+ *
> >>+ * Step 1 - Try to acquire the lock directly if no reader is present
> >>+ * Step 2 - Set the waiting flag to notify readers that a writer is waiting
> >>+ * Step 3 - When the readers field goes to 0, set the locked flag
> >>+ *
> >>+ * When not in fair mode, the readers actually ignore the second step.
> >>+ * However, this is still necessary to force other writers to fall in line.
> >>+ */
> >>+static __always_inline int queue_write_3step_lock(struct qrwlock *lock)
> >>+{
> >>+ union qrwcnts old, new;
> >>+
> >>+ old.rw = ACCESS_ONCE(lock->cnts.rw);
> >>+
> >>+ /* Step 1 */
> >>+ if (!old.writer& !old.readers) {
> >>+ new.rw = old.rw;
> >>+ new.writer = QW_LOCKED;
> >>+ if (likely(cmpxchg(&lock->cnts.rw, old.rw, new.rw) == old.rw))
> >>+ return 1;
> >>+ }
> >>+
> >>+ /* Step 2 */
> >>+ if (old.writer || (cmpxchg(&lock->cnts.writer, 0, QW_WAITING) != 0))
> >>+ return 0;
> >>+
> >>+ /* Step 3 */
> >>+ while (true) {
> >>+ cpu_relax();
> >>+ old.rw = ACCESS_ONCE(lock->cnts.rw);
> >Suppose that there now is a writer, but no readers...
> >
> >>+ if (!old.readers) {
> >>+ new.rw = old.rw;
> >>+ new.writer = QW_LOCKED;
> >>+ if (likely(cmpxchg(&lock->cnts.rw, old.rw, new.rw)
> >>+ == old.rw))
> >... can't this mistakenly hand out the lock to a second writer?
> >
> >Ah, the trick is that we are at the head of the queue, so the only writer
> >we can possibly contend with is a prior holder of the lock. Once that
> >writer leaves, no other writer but can appear. And the QW_WAITING bit
> >prevents new writers from immediately grabbing the lock.
>
> Yes, that is the point. The current has 2 queue for writers - a one
> position queue in the waiting bit and the MCS locking queue that are
> used by both readers and writers.
>
> >>+ return 1;
> >>+ }
> >>+ }
> >>+ /* Should never reach here */
> >>+ return 0;
> >>+}
> >>+
> >>+/**
> >>+ * queue_write_lock_slowpath - acquire write lock of a queue rwlock
> >>+ * @lock : Pointer to queue rwlock structure
> >>+ */
> >>+void queue_write_lock_slowpath(struct qrwlock *lock)
> >>+{
> >>+ struct qrwnode node;
> >>+
> >>+ /*
> >>+ * Put the writer into the wait queue
> >>+ */
> >>+ wait_in_queue(lock,&node);
> >>+
> >>+ /*
> >>+ * At the head of the wait queue now, call queue_write_3step_lock()
> >>+ * to acquire the lock until it is done.
> >>+ */
> >>+ while (!queue_write_3step_lock(lock))
> >>+ cpu_relax();
> >If we get here, queue_write_3step_lock() just executed a successful
> >cmpxchg(), which implies a full memory barrier. This prevents the
> >critical section from leaking out, good!
> >
> >>+ signal_next(lock,&node);
> >>+}
> >>+EXPORT_SYMBOL(queue_write_lock_slowpath);
> >>--
> >>1.7.1
> >>
> >--
> >To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> >the body of a message to majordomo@vger.kernel.org
> >More majordomo info at http://vger.kernel.org/majordomo-info.html
> >Please read the FAQ at http://www.tux.org/lkml/
>
> Thank for the review.
>
> -Longman
>
next prev parent reply other threads:[~2013-11-08 23:52 UTC|newest]
Thread overview: 15+ messages / expand[flat|nested] mbox.gz Atom feed top
2013-11-04 17:17 [PATCH v5 0/4] qrwlock: Introducing a queue read/write lock implementation Waiman Long
2013-11-04 17:17 ` [PATCH v5 1/4] qrwlock: A " Waiman Long
2013-11-08 21:11 ` Paul E. McKenney
2013-11-08 22:36 ` Waiman Long
2013-11-08 22:36 ` Waiman Long
2013-11-08 23:51 ` Paul E. McKenney [this message]
2013-11-09 3:05 ` Waiman Long
2013-11-04 17:17 ` [PATCH v5 2/4] qrwlock x86: Enable x86 to use queue read/write lock Waiman Long
2013-11-04 17:17 ` [PATCH v5 3/4] qrwlock: Enable fair " Waiman Long
2013-11-04 17:17 ` [PATCH v5 4/4] qrwlock: Use the mcs_spinlock helper functions for MCS queuing Waiman Long
2013-11-08 21:21 ` Paul E. McKenney
2013-11-08 22:42 ` Waiman Long
2013-11-08 22:42 ` Waiman Long
2013-11-09 1:17 ` Tim Chen
2013-11-09 3:07 ` Paul E. McKenney
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20131108235108.GA22696@linux.vnet.ibm.com \
--to=paulmck@linux.vnet.ibm.com \
--cc=akpm@linux-foundation.org \
--cc=andi@firstfloor.org \
--cc=arnd@arndb.de \
--cc=aswin@hp.com \
--cc=hpa@zytor.com \
--cc=linux-arch@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linux@horizon.com \
--cc=mingo@redhat.com \
--cc=peterz@infradead.org \
--cc=raghavendra.kt@linux.vnet.ibm.com \
--cc=riel@redhat.com \
--cc=rostedt@goodmis.org \
--cc=scott.norton@hp.com \
--cc=tglx@linutronix.de \
--cc=tim.c.chen@linux.intel.com \
--cc=torvalds@linux-foundation.org \
--cc=waiman.long@hp.com \
--cc=walken@google.com \
--cc=x86@kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.