public inbox for linux-kernel@vger.kernel.org
From: Waiman Long <longman@redhat.com>
To: Peter Zijlstra <peterz@infradead.org>,
	Ingo Molnar <mingo@redhat.com>, Will Deacon <will.deacon@arm.com>,
	Catalin Marinas <catalin.marinas@arm.com>
Cc: linux-arm-kernel@lists.infradead.org,
	linux-kernel@vger.kernel.org, Waiman Long <longman@redhat.com>
Subject: [PATCH v2] locking/osq: Use optimized spinning loop for arm64
Date: Sun, 12 Jan 2020 18:58:54 -0500
Message-ID: <20200112235854.32089-1-longman@redhat.com>

Arm64 has an optimized spin-wait loop (the smp_cond_load_*() family,
e.g. atomic_cond_read_acquire() in the spinlock code) that can boost the
performance of sibling threads by putting the spinning CPU into a
shallow sleep state (WFE). The CPU is woken up only when the monitored
variable changes or an external event, such as an interrupt, happens.
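
The semantics the patch relies on can be sketched in user space with C11
atomics: spin until a condition on the loaded value (or on other state)
holds, then return the last value loaded. This is a hypothetical model,
not kernel code. The names cond_load_relaxed() and osq_exit_cond() and
the callback signature are assumptions (the kernel macro takes an
expression using VAL instead). The loop below busy-polls like the generic
fallback; arm64 instead arms the exclusive monitor on the word and waits
in WFE between loads.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Condition callback: gets the freshly loaded value ("VAL" in the kernel
 * macro) plus caller context. Hypothetical signature for illustration. */
typedef bool (*cond_fn)(int val, void *ctx);

/* Spin until cond(val, ctx) is true and return the last value loaded.
 * Generic polling model: an arm64 kernel would sleep in WFE between
 * loads rather than re-reading the word in a tight loop. */
static int cond_load_relaxed(atomic_int *ptr, cond_fn cond, void *ctx)
{
	assert(ptr != NULL && cond != NULL);
	for (;;) {
		int val = atomic_load_explicit(ptr, memory_order_relaxed);
		if (cond(val, ctx))
			return val;
		/* cpu_relax() (or WFE on arm64) would go here. */
	}
}

/* Exit condition modeled on the OSQ loop: the lock was handed over, or a
 * hypothetical reschedule-request flag passed in ctx is set. */
static bool osq_exit_cond(int val, void *ctx)
{
	atomic_bool *need_resched = ctx;

	return val != 0 ||
	       atomic_load_explicit(need_resched, memory_order_relaxed);
}
```

Because the function returns the final value of the word rather than the
truth value of the whole condition, a caller can test the result directly:
a nonzero return means the lock was handed over, while a zero return means
the loop exited for another reason (here, the reschedule flag) and the
caller must take its unqueue path.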

OSQ has a more complicated spinning loop. Besides the lock value, it
also checks need_resched() and vcpu_is_preempted(). The need_resched()
check is not a problem, because the flag is set in conjunction with an
interrupt to the spinning CPU (the tick, for example). The interrupt
wakes the CPU, and the spinning loop detects the flag right after the
interrupt returns.

The vcpu_is_preempted() check, however, is a problem, as a change in the
preemption state of the previous node does not wake the spinning CPU
from its sleep state. On arm64, vcpu_is_preempted() is not defined by
the architecture, so it is a no-op that always returns false. To guard
against a future arm64 implementation of vcpu_is_preempted(), code is
added that causes a build error if vcpu_is_preempted becomes defined on
arm64 without corresponding changes to the OSQ spinning code.
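
The wake-up argument above can be illustrated with a toy event model
(hypothetical, not kernel code; all names are made up). A WFE-style
waiter re-evaluates its exit condition only when the monitored word is
written or an interrupt arrives, so a change in the previous node's
preemption state goes unnoticed until something else wakes the CPU. A
polling loop notices it on the next iteration. Other wake-up sources are
ignored here for simplicity.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical events a spinning waiter may observe. */
enum ev_kind {
	EV_WRITE_LOCKED,   /* another CPU writes the monitored word */
	EV_PREV_PREEMPTED, /* the previous node's vCPU gets preempted */
	EV_INTERRUPT,      /* e.g. the tick; value != 0 sets need_resched */
};

struct ev {
	enum ev_kind kind;
	int value; /* new lock value, or need_resched flag for EV_INTERRUPT */
};

struct waiter_state {
	int locked;
	bool need_resched;
	bool prev_preempted;
};

/* Same shape as the OSQ exit condition. */
static bool exit_cond(const struct waiter_state *s)
{
	return s->locked || s->need_resched || s->prev_preempted;
}

static void apply(struct waiter_state *s, const struct ev *e)
{
	switch (e->kind) {
	case EV_WRITE_LOCKED:   s->locked = e->value; break;
	case EV_PREV_PREEMPTED: s->prev_preempted = true; break;
	case EV_INTERRUPT:      s->need_resched = s->need_resched || e->value; break;
	}
}

/* Return the 1-based index of the event after which the waiter notices its
 * exit condition, or -1 if it never does. With wfe == true, the condition
 * is re-evaluated only on events that end a WFE sleep (a write to the
 * monitored word or an interrupt); otherwise after every event, like a
 * polling loop. Assumes the condition is false initially. */
static int notice_index(struct waiter_state s, const struct ev *evs, int n,
			bool wfe)
{
	assert(n >= 0 && !exit_cond(&s));
	for (int i = 0; i < n; i++) {
		apply(&s, &evs[i]);
		bool wakes = !wfe || evs[i].kind != EV_PREV_PREEMPTED;
		if (wakes && exit_cond(&s))
			return i + 1;
	}
	return -1;
}
```

For the sequence "previous vCPU preempted, then a tick", a polling waiter
notices after the first event while the WFE waiter only notices after the
tick. A need_resched set by an interrupt is noticed immediately in both
models, which matches the reasoning in the commit message.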

On a 2-socket 56-core 224-thread arm64 system, a kernel mutex locking
microbenchmark was run for 10s with and without the patch. The
performance numbers before the patch were:

Running locktest with mutex [runtime = 10s, load = 1]
Threads = 224, Min/Mean/Max = 316/123,143/2,121,269
Threads = 224, Total Rate = 2,757 kop/s; Percpu Rate = 12 kop/s

After the patch, the numbers were:

Running locktest with mutex [runtime = 10s, load = 1]
Threads = 224, Min/Mean/Max = 334/147,836/1,304,787
Threads = 224, Total Rate = 3,311 kop/s; Percpu Rate = 15 kop/s

So the total locking rate improved by about 20%, from 2,757 kop/s to
3,311 kop/s.
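
The figures can be cross-checked with a little arithmetic. This sketch
assumes (it is not stated in the benchmark output) that Min/Mean/Max are
per-thread operation counts over the whole run, so that
mean * threads / runtime reproduces the Total Rate. Under that
assumption both runs agree with the reported totals to within about 0.1%.

```c
#include <assert.h>

/* Assumption: Min/Mean/Max are per-thread operation counts over the run,
 * so total rate in kop/s = mean * threads / runtime / 1000. */
static double total_kops(double mean_ops, int threads, double runtime_s)
{
	assert(threads > 0 && runtime_s > 0);
	return mean_ops * threads / runtime_s / 1000.0;
}

/* Relative improvement of 'after' over 'before' (0.20 means 20%). */
static double improvement(double before, double after)
{
	assert(before > 0);
	return after / before - 1.0;
}
```

For example, total_kops(123143, 224, 10) is about 2,758 and
improvement(2757, 3311) is about 0.201.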

Signed-off-by: Waiman Long <longman@redhat.com>
---
 arch/arm64/include/asm/barrier.h | 10 ++++++++++
 kernel/locking/osq_lock.c        | 25 ++++++++++++-------------
 2 files changed, 22 insertions(+), 13 deletions(-)

diff --git a/arch/arm64/include/asm/barrier.h b/arch/arm64/include/asm/barrier.h
index 7d9cc5ec4971..8eb5f1239885 100644
--- a/arch/arm64/include/asm/barrier.h
+++ b/arch/arm64/include/asm/barrier.h
@@ -152,6 +152,16 @@ do {									\
 	VAL;								\
 })
 
+/*
+ * In osq_lock(), smp_cond_load_relaxed() is called with a condition
+ * that includes vcpu_is_preempted(). For arm64, vcpu_is_preempted is not
+ * currently defined. So it is a no-op. If vcpu_is_preempted is defined in
+ * the future, smp_cond_load_relaxed() will not respond to changes in the
+ * preempt state in a timely manner. So code changes will have to be made
+ * to address this deficiency.
+ */
+#define vcpu_is_preempted_not_used
+
 #define smp_cond_load_acquire(ptr, cond_expr)				\
 ({									\
 	typeof(ptr) __PTR = (ptr);					\
diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
index 6ef600aa0f47..69ec5161c3cc 100644
--- a/kernel/locking/osq_lock.c
+++ b/kernel/locking/osq_lock.c
@@ -13,6 +13,14 @@
  */
 static DEFINE_PER_CPU_SHARED_ALIGNED(struct optimistic_spin_node, osq_node);
 
+/*
+ * The optimized smp_cond_load_relaxed() spin loop should not be used with
+ * vcpu_is_preempted defined.
+ */
+#if defined(vcpu_is_preempted) && defined(vcpu_is_preempted_not_used)
+#error "vcpu_is_preempted() inside smp_cond_load_relaxed() may not work!"
+#endif
+
 /*
  * We use the value 0 to represent "no CPU", thus the encoded value
  * will be the CPU number incremented by 1.
@@ -134,20 +142,11 @@ bool osq_lock(struct optimistic_spin_queue *lock)
 	 * cmpxchg in an attempt to undo our queueing.
 	 */
 
-	while (!READ_ONCE(node->locked)) {
-		/*
-		 * If we need to reschedule bail... so we can block.
-		 * Use vcpu_is_preempted() to avoid waiting for a preempted
-		 * lock holder:
-		 */
-		if (need_resched() || vcpu_is_preempted(node_cpu(node->prev)))
-			goto unqueue;
-
-		cpu_relax();
-	}
-	return true;
+	if (smp_cond_load_relaxed(&node->locked, VAL || need_resched() ||
+				  vcpu_is_preempted(node_cpu(node->prev))))
+		return true;
 
-unqueue:
+	/* unqueue */
 	/*
 	 * Step - A  -- stabilize @prev
 	 *
-- 
2.18.1
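
The build-time guard depends on a preprocessor convention: an
architecture that implements vcpu_is_preempted() also defines a macro of
the same name, and generic code (include/linux/sched.h) supplies a
"return false" fallback only under #ifndef vcpu_is_preempted. A
user-space sketch of that convention combined with the check added to
osq_lock.c follows; prev_node_preempted() is a hypothetical helper, not
kernel code.

```c
#include <assert.h>
#include <stdbool.h>

/* Marker from the arch header: this arch's smp_cond_load_relaxed() sleeps
 * (WFE) and would not notice vcpu_is_preempted() changes promptly. */
#define vcpu_is_preempted_not_used

/*
 * An arch that implements the hook is expected to write
 *   #define vcpu_is_preempted vcpu_is_preempted
 * next to its definition; doing so here would trip the #error below.
 */

#ifndef vcpu_is_preempted
/* Generic fallback when no arch hook exists: never report preemption. */
static inline bool vcpu_is_preempted(int cpu)
{
	(void)cpu;
	return false;
}
#endif

/* Same check as the patch adds to kernel/locking/osq_lock.c. */
#if defined(vcpu_is_preempted) && defined(vcpu_is_preempted_not_used)
#error "vcpu_is_preempted() inside smp_cond_load_relaxed() may not work!"
#endif

/* Hypothetical helper mirroring vcpu_is_preempted(node_cpu(node->prev)). */
static bool prev_node_preempted(int prev_cpu)
{
	assert(prev_cpu >= 0);
	return vcpu_is_preempted(prev_cpu);
}
```

As written the file compiles and the fallback always returns false.
Adding the "#define vcpu_is_preempted vcpu_is_preempted" line (with a
real implementation) turns the silent behavior change into a build
failure, which is the intent of the patch.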



Thread overview: 4+ messages
2020-01-12 23:58 Waiman Long [this message]
2020-01-13  8:32 ` [PATCH v2] locking/osq: Use optimized spinning loop for arm64 yezengruan
2020-01-13 11:57 ` Will Deacon
2020-01-13 13:43   ` Waiman Long
