* [RFC PATCH 0/5] x86,smp: make ticket spinlock proportional backoff w/ auto tuning
From: Rik van Riel @ 2013-01-03 5:15 UTC (permalink / raw)
To: linux-kernel
Cc: aquini, walken, eric.dumazet, lwoodman, jeremy, Jan Beulich,
Thomas Gleixner, knoel
Many spinlocks are embedded in data structures; having many CPUs
pounce on the cache line the lock is in will slow down the lock
holder, and can cause system performance to fall off a cliff.
The paper "Non-scalable locks are dangerous" is a good reference:
http://pdos.csail.mit.edu/papers/linux:lock.pdf
In the Linux kernel, spinlocks are optimized for the case of
there not being contention. After all, if there is contention,
the data structure can be improved to reduce or eliminate
lock contention.
Likewise, the spinlock API should remain simple, and the
common case of the lock not being contended should remain
as fast as ever.
However, since spinlock contention should be fairly uncommon,
we can add functionality into the spinlock slow path that keeps
system performance from falling off a cliff when there is lock
contention.
Proportional delay in ticket locks means scaling the time between
checks of the ticket by a delay factor multiplied by the number of
CPUs ahead of us in the queue for this lock. Checking the lock
less often allows the lock holder to continue running, resulting
in better throughput and preventing performance from dropping
off a cliff.
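As a rough userspace sketch of the idea (C11 atomics and the x86 pause
instruction; illustrative only, not the kernel implementation in the
patches below, and the factor of 50 is just a placeholder that the
auto-tuning patch replaces):

#include <stdatomic.h>
#include <stdint.h>

#define BACKOFF_FACTOR	50	/* placeholder; see patches 2 and 3 */

struct ticketlock {
	atomic_ushort head;	/* ticket currently being served */
	atomic_ushort tail;	/* next ticket to hand out */
};

static inline void cpu_relax(void)
{
	__builtin_ia32_pause();	/* x86 only */
}

static void ticket_lock(struct ticketlock *lock)
{
	uint16_t ticket = atomic_fetch_add(&lock->tail, 1);

	for (;;) {
		uint16_t head = atomic_load(&lock->head);

		if (head == ticket)
			return;		/* our turn */

		/*
		 * Poll less often the more waiters are queued ahead
		 * of us, so the cache line holding the lock stays
		 * with the lock holder.
		 */
		uint16_t waiters_ahead = ticket - head - 1;
		unsigned int loops = BACKOFF_FACTOR * waiters_ahead + 1;

		while (loops--)
			cpu_relax();
	}
}

static void ticket_unlock(struct ticketlock *lock)
{
	atomic_fetch_add(&lock->head, 1);
}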
The test case has a number of threads locking and unlocking a
semaphore. With just one thread, everything sits in the CPU
cache and throughput is around 2.6 million operations per
second, with a 5-10% variation.
Once a second thread gets involved, data structures bounce
from CPU to CPU, and performance deteriorates to about 1.25
million operations per second, with a 5-10% variation.
However, as more and more threads get added to the mix,
performance with the vanilla kernel continues to deteriorate.
Once I hit 24 threads, on a 24 CPU, 4 node test system,
performance is down to about 290k operations/second.
With a proportional backoff delay added to the spinlock
code, performance with 24 threads goes up to about 400k
operations/second with a 50x delay, and about 900k operations/second
with a 250x delay. However, with a 250x delay, performance with
2-5 threads is worse than with a 50x delay.
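For reference, here is a sketch of that kind of test: N threads doing
semop() lock/unlock pairs on a single SysV semaphore for ten seconds
and reporting operations per second. This is a hypothetical
reconstruction, not the actual test program used for the numbers above:

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/ipc.h>
#include <sys/sem.h>
#include <unistd.h>

#define MAX_THREADS 64
union semun { int val; };

static int semid;
static volatile int stop;
static long counts[MAX_THREADS];

static void *worker(void *arg)
{
	struct sembuf down = { 0, -1, 0 }, up = { 0, +1, 0 };
	long i = (long)arg, n = 0;

	while (!stop) {
		semop(semid, &down, 1);		/* "lock" the semaphore */
		semop(semid, &up, 1);		/* "unlock" it again */
		n++;
	}
	counts[i] = n;
	return NULL;
}

int main(int argc, char **argv)
{
	int threads = (argc > 1) ? atoi(argv[1]) : 1;
	union semun arg = { .val = 1 };
	pthread_t tid[MAX_THREADS];
	long total = 0;

	if (threads < 1 || threads > MAX_THREADS)
		return 1;

	semid = semget(IPC_PRIVATE, 1, 0600);
	semctl(semid, 0, SETVAL, arg);

	for (long i = 0; i < threads; i++)
		pthread_create(&tid[i], NULL, worker, (void *)i);
	sleep(10);
	stop = 1;
	for (int i = 0; i < threads; i++) {
		pthread_join(tid[i], NULL);
		total += counts[i];
	}
	printf("%d threads: %ld ops/second\n", threads, total / 10);
	semctl(semid, 0, IPC_RMID);
	return 0;
}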
Making the code auto-tune the delay factor results in a system
that performs well with both light and heavy lock contention,
and should also protect against the (likely) case of the fixed
delay factor being wrong for other hardware.
The attached graph shows the performance of the multi threaded
semaphore lock/unlock test case, with 1-24 threads, on the
vanilla kernel, with 10x, 50x, and 250x proportional delay,
as well as the v1 patch series with autotuning for 2x and 2.7x
spinning before the lock is obtained, and with the v2 series.
The v2 series integrates several ideas from Michel Lespinasse
and Eric Dumazet, which should result in better throughput and
nicer behaviour in situations with contention on multiple locks.
Please let me know if you manage to break this code in any way,
so I can fix it...
--
All rights reversed.
[-- Attachment #2: spinlock-backoff-v2.png --]
[-- Type: image/png, Size: 21964 bytes --]
* [RFC PATCH 1/5] x86,smp: move waiting on contended ticket lock out of line
From: Rik van Riel @ 2013-01-03 5:18 UTC (permalink / raw)
To: linux-kernel
Cc: aquini, walken, eric.dumazet, lwoodman, jeremy, Jan Beulich,
Thomas Gleixner, knoel
Moving the wait loop for congested locks to its own function allows
us to add things to that wait loop, without growing the size of the
kernel text appreciably.
Signed-off-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
Reviewed-by: Michel Lespinasse <walken@google.com>
Reviewed-by: Rafael Aquini <aquini@redhat.com>
---
v2: clean up the code a little, after Michel's suggestion
arch/x86/include/asm/spinlock.h | 11 +++++------
arch/x86/kernel/smp.c | 14 ++++++++++++++
2 files changed, 19 insertions(+), 6 deletions(-)
diff --git a/arch/x86/include/asm/spinlock.h b/arch/x86/include/asm/spinlock.h
index 33692ea..dc492f6 100644
--- a/arch/x86/include/asm/spinlock.h
+++ b/arch/x86/include/asm/spinlock.h
@@ -34,6 +34,8 @@
# define UNLOCK_LOCK_PREFIX
#endif
+extern void ticket_spin_lock_wait(arch_spinlock_t *, struct __raw_tickets);
+
/*
* Ticket locks are conceptually two parts, one indicating the current head of
* the queue, and the other indicating the current tail. The lock is acquired
@@ -53,12 +55,9 @@ static __always_inline void __ticket_spin_lock(arch_spinlock_t *lock)
inc = xadd(&lock->tickets, inc);
- for (;;) {
- if (inc.head == inc.tail)
- break;
- cpu_relax();
- inc.head = ACCESS_ONCE(lock->tickets.head);
- }
+ if (inc.head != inc.tail)
+ ticket_spin_lock_wait(lock, inc);
+
barrier(); /* make sure nothing creeps before the lock is taken */
}
diff --git a/arch/x86/kernel/smp.c b/arch/x86/kernel/smp.c
index 48d2b7d..20da354 100644
--- a/arch/x86/kernel/smp.c
+++ b/arch/x86/kernel/smp.c
@@ -113,6 +113,20 @@ static atomic_t stopping_cpu = ATOMIC_INIT(-1);
static bool smp_no_nmi_ipi = false;
/*
+ * Wait on a congested ticket spinlock.
+ */
+void ticket_spin_lock_wait(arch_spinlock_t *lock, struct __raw_tickets inc)
+{
+ for (;;) {
+ cpu_relax();
+ inc.head = ACCESS_ONCE(lock->tickets.head);
+
+ if (inc.head == inc.tail)
+ break;
+ }
+}
+
+/*
* this function sends a 'reschedule' IPI to another CPU.
* it goes straight through and wastes no time serializing
* anything. Worst case is that we lose a reschedule ...
* [RFC PATCH 2/5] x86,smp: proportional backoff for ticket spinlocks
From: Rik van Riel @ 2013-01-03 5:22 UTC (permalink / raw)
To: linux-kernel
Cc: aquini, walken, eric.dumazet, lwoodman, jeremy, Jan Beulich,
Thomas Gleixner, knoel
Simple fixed value proportional backoff for ticket spinlocks.
By pounding on the cacheline with the spin lock less often,
bus traffic is reduced. In cases of a data structure with
embedded spinlock, the lock holder has a better chance of
making progress.
If we are next in line behind the current holder of the
lock, we do a fast spin, so as not to waste any time when
the lock is released.
The number 50 is likely to be wrong for many setups, and
this patch is mostly to illustrate the concept of proportional
backoff. The next patch automatically tunes the delay value.
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Michel Lespinasse <walken@google.com>
---
arch/x86/kernel/smp.c | 23 ++++++++++++++++++++---
1 files changed, 20 insertions(+), 3 deletions(-)
diff --git a/arch/x86/kernel/smp.c b/arch/x86/kernel/smp.c
index 20da354..9c56fe3 100644
--- a/arch/x86/kernel/smp.c
+++ b/arch/x86/kernel/smp.c
@@ -117,11 +117,28 @@ static bool smp_no_nmi_ipi = false;
*/
void ticket_spin_lock_wait(arch_spinlock_t *lock, struct __raw_tickets inc)
{
+ __ticket_t head = inc.head, ticket = inc.tail;
+ __ticket_t waiters_ahead;
+ unsigned loops;
+
for (;;) {
- cpu_relax();
- inc.head = ACCESS_ONCE(lock->tickets.head);
+ waiters_ahead = ticket - head - 1;
+ /*
+ * We are next after the current lock holder. Check often
+ * to avoid wasting time when the lock is released.
+ */
+ if (!waiters_ahead) {
+ do {
+ cpu_relax();
+ } while (ACCESS_ONCE(lock->tickets.head) != ticket);
+ break;
+ }
+ loops = 50 * waiters_ahead;
+ while (loops--)
+ cpu_relax();
- if (inc.head == inc.tail)
+ head = ACCESS_ONCE(lock->tickets.head);
+ if (head == ticket)
break;
}
}
* [RFC PATCH 3/5] x86,smp: auto tune spinlock backoff delay factor
From: Rik van Riel @ 2013-01-03 5:23 UTC (permalink / raw)
To: linux-kernel
Cc: aquini, walken, eric.dumazet, lwoodman, jeremy, Jan Beulich,
Thomas Gleixner, knoel
Many spinlocks are embedded in data structures; having many CPUs
pounce on the cache line the lock is in will slow down the lock
holder, and can cause system performance to fall off a cliff.
The paper "Non-scalable locks are dangerous" is a good reference:
http://pdos.csail.mit.edu/papers/linux:lock.pdf
In the Linux kernel, spinlocks are optimized for the case of
there not being contention. After all, if there is contention,
the data structure can be improved to reduce or eliminate
lock contention.
Likewise, the spinlock API should remain simple, and the
common case of the lock not being contended should remain
as fast as ever.
However, since spinlock contention should be fairly uncommon,
we can add functionality into the spinlock slow path that keeps
system performance from falling off a cliff when there is lock
contention.
Proportional delay in ticket locks means scaling the time between
checks of the ticket by a delay factor multiplied by the number of
CPUs ahead of us in the queue for this lock. Checking the lock
less often allows the lock holder to continue running, resulting
in better throughput and preventing performance from dropping
off a cliff.
Proportional spinlock delay with a high delay factor works well
when there is lots of contention on a lock. Likewise, a smaller
delay factor works well when a lock is lightly contended.
Making the code auto-tune the delay factor results in a system
that performs well with both light and heavy lock contention.
Signed-off-by: Rik van Riel <riel@redhat.com>
---
arch/x86/kernel/smp.c | 41 ++++++++++++++++++++++++++++++++++++++---
1 files changed, 38 insertions(+), 3 deletions(-)
diff --git a/arch/x86/kernel/smp.c b/arch/x86/kernel/smp.c
index aa743e9..f1ec7f7 100644
--- a/arch/x86/kernel/smp.c
+++ b/arch/x86/kernel/smp.c
@@ -113,12 +113,31 @@ static atomic_t stopping_cpu = ATOMIC_INIT(-1);
static bool smp_no_nmi_ipi = false;
/*
- * Wait on a congested ticket spinlock.
+ * Wait on a congested ticket spinlock. Many spinlocks are embedded in
+ * data structures; having many CPUs pounce on the cache line with the
+ * spinlock simultaneously can slow down the lock holder, and the system
+ * as a whole.
+ *
+ * To prevent total performance collapse in case of bad spinlock contention,
+ * perform proportional backoff. The per-cpu value of delay is automatically
+ * tuned to limit the number of times spinning CPUs poll the lock before
+ * obtaining it. This limits the amount of cross-CPU traffic required to obtain
+ * a spinlock, and keeps system performance from dropping off a cliff.
+ *
+ * There is a tradeoff. If we poll too often, the whole system is slowed
+ * down. If we sleep too long, the lock will go unused for a period of
+ * time. The solution is to go for a fast spin if we are at the head of
+ * the queue, to slowly increase the delay if we sleep for too short a
+ * time, and to decrease the delay if we slept for too long.
*/
+#define MIN_SPINLOCK_DELAY 1
+#define MAX_SPINLOCK_DELAY 16000
+DEFINE_PER_CPU(int, spinlock_delay) = { MIN_SPINLOCK_DELAY };
void ticket_spin_lock_wait(arch_spinlock_t *lock, struct __raw_tickets inc)
{
__ticket_t head = inc.head, ticket = inc.tail;
__ticket_t waiters_ahead;
+ int delay = __this_cpu_read(spinlock_delay);
unsigned loops;
for (;;) {
@@ -133,14 +152,30 @@ void ticket_spin_lock_wait(arch_spinlock_t *lock, struct __raw_tickets inc)
} while (ACCESS_ONCE(lock->tickets.head) != ticket);
break;
}
- loops = 50 * waiters_ahead;
+
+ /*
+ * The lock is still busy; slowly increase the delay. If we
+ * end up sleeping too long, the code below will reduce the
+ * delay. Ideally we acquire the lock in the tight loop above.
+ */
+ if (!(head % 7) && delay < MAX_SPINLOCK_DELAY)
+ delay++;
+
+ loops = delay * waiters_ahead;
while (loops--)
cpu_relax();
head = ACCESS_ONCE(lock->tickets.head);
- if (head == ticket)
+ if (head == ticket) {
+ /*
+ * We overslept and have no idea how long the lock
+ * went idle. Reduce the delay as a precaution.
+ */
+ delay -= delay/32 + 1;
break;
+ }
}
+ __this_cpu_write(spinlock_delay, delay);
}
/*
* [RFC PATCH 4/5] x86,smp: keep spinlock delay values per hashed spinlock address
From: Rik van Riel @ 2013-01-03 5:24 UTC (permalink / raw)
To: linux-kernel
Cc: aquini, walken, eric.dumazet, lwoodman, jeremy, Jan Beulich,
Thomas Gleixner, knoel
From: Eric Dumazet <eric.dumazet@gmail.com>
Eric Dumazet found a regression with the spinlock backoff code,
in workloads where multiple spinlocks were contended, each having
a different wait time.
This patch has multiple delay values per cpu, indexed on a hash
of the lock address, to avoid that problem.
Eric Dumazet wrote:
I did some tests with your patches with the following configuration:
tc qdisc add dev eth0 root htb r2q 1000 default 3
(to force a contention on qdisc lock, even with a multi queue net
device)
and 24 concurrent "netperf -t UDP_STREAM -H other_machine -- -m 128"
Machine : 2 Intel(R) Xeon(R) CPU X5660 @ 2.80GHz
(24 threads), and a fast NIC (10Gbps)
Resulting in a 13 % regression (676 Mbits -> 595 Mbits)
In this workload we have at least two contended spinlocks, with
different delays (the spinlocks are not held for the same duration).
It clearly defeats your assumption that a single per-cpu delay is OK:
some cpus keep spinning long after the lock has been released.
We might try to use a hash on lock address, and an array of 16 different
delays so that different spinlocks have a chance of not sharing the same
delay.
With the following patch, I get 982 Mbits/s with the same benchmark, an
increase of 45% instead of a 13% regression.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Rik van Riel <riel@redhat.com>
---
arch/x86/kernel/smp.c | 22 +++++++++++++++++++---
1 files changed, 19 insertions(+), 3 deletions(-)
diff --git a/arch/x86/kernel/smp.c b/arch/x86/kernel/smp.c
index 6065291..29360c4 100644
--- a/arch/x86/kernel/smp.c
+++ b/arch/x86/kernel/smp.c
@@ -23,6 +23,7 @@
#include <linux/interrupt.h>
#include <linux/cpu.h>
#include <linux/gfp.h>
+#include <linux/hash.h>
#include <asm/mtrr.h>
#include <asm/tlbflush.h>
@@ -135,12 +136,26 @@ static bool smp_no_nmi_ipi = false;
*/
#define MIN_SPINLOCK_DELAY 1
#define MAX_SPINLOCK_DELAY 16000
-DEFINE_PER_CPU(int, spinlock_delay) = { MIN_SPINLOCK_DELAY };
+#define DELAY_HASH_SHIFT 6
+struct delay_entry {
+ u32 hash;
+ u32 delay;
+};
+static DEFINE_PER_CPU(struct delay_entry [1 << DELAY_HASH_SHIFT], spinlock_delay) = {
+ [0 ... (1 << DELAY_HASH_SHIFT) - 1] = {
+ .hash = 0,
+ .delay = MIN_SPINLOCK_DELAY,
+ },
+};
+
void ticket_spin_lock_wait(arch_spinlock_t *lock, struct __raw_tickets inc)
{
__ticket_t head = inc.head, ticket = inc.tail;
__ticket_t waiters_ahead;
- int delay = __this_cpu_read(spinlock_delay);
+ u32 hash = hash32_ptr(lock);
+ u32 slot = hash_32(hash, DELAY_HASH_SHIFT);
+ struct delay_entry *ent = &__get_cpu_var(spinlock_delay[slot]);
+ u32 delay = (ent->hash == hash) ? ent->delay : MIN_SPINLOCK_DELAY;
unsigned loops;
for (;;) {
@@ -178,7 +193,8 @@ void ticket_spin_lock_wait(arch_spinlock_t *lock, struct __raw_tickets inc)
break;
}
}
- __this_cpu_write(spinlock_delay, delay);
+ ent->hash = hash;
+ ent->delay = delay;
}
/*
* [RFC PATCH 5/5] x86,smp: add debugging code to track spinlock delay value
From: Rik van Riel @ 2013-01-03 5:25 UTC (permalink / raw)
To: Rik van Riel
Cc: linux-kernel, aquini, walken, eric.dumazet, lwoodman, jeremy,
Jan Beulich, Thomas Gleixner, knoel
From: Eric Dumazet <eric.dumazet@gmail.com>
This code prints out the maximum spinlock delay value and the
backtrace that pushed it that far.
On systems with serial consoles, the act of printing can cause
the spinlock delay value to explode. It can still be useful as
a debugging tool, but is probably too verbose to merge upstream
in this form.
Not-signed-off-by: Rik van Riel <riel@redhat.com>
Not-signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
arch/x86/kernel/smp.c | 8 ++++++++
1 files changed, 8 insertions(+), 0 deletions(-)
diff --git a/arch/x86/kernel/smp.c b/arch/x86/kernel/smp.c
index 29360c4..a4401ed 100644
--- a/arch/x86/kernel/smp.c
+++ b/arch/x86/kernel/smp.c
@@ -148,6 +148,8 @@ static DEFINE_PER_CPU(struct delay_entry [1 << DELAY_HASH_SHIFT], spinlock_delay
},
};
+static DEFINE_PER_CPU(u16, maxdelay);
+
void ticket_spin_lock_wait(arch_spinlock_t *lock, struct __raw_tickets inc)
{
__ticket_t head = inc.head, ticket = inc.tail;
@@ -195,6 +197,12 @@ void ticket_spin_lock_wait(arch_spinlock_t *lock, struct __raw_tickets inc)
}
ent->hash = hash;
ent->delay = delay;
+
+ if (__this_cpu_read(maxdelay) < delay) {
+ pr_err("cpu %d lock %p delay %d\n", smp_processor_id(), lock, delay);
+ __this_cpu_write(maxdelay, delay);
+ WARN_ON(1);
+ }
}
/*
* Re: [RFC PATCH 0/5] x86,smp: make ticket spinlock proportional backoff w/ auto tuning
From: Michel Lespinasse @ 2013-01-03 10:46 UTC (permalink / raw)
To: Rik van Riel
Cc: linux-kernel, aquini, eric.dumazet, lwoodman, jeremy, Jan Beulich,
Thomas Gleixner, knoel
On Wed, Jan 2, 2013 at 9:15 PM, Rik van Riel <riel@redhat.com> wrote:
> The v2 series integrates several ideas from Michel Lespinasse
> and Eric Dumazet, which should result in better throughput and
> nicer behaviour in situations with contention on multiple locks.
>
> Please let me know if you manage to break this code in any way,
> so I can fix it...
I'm seeing some very weird things on my 3 test machines. Looks like
the spinlock delay sometimes gets crazy, at which point spinlock
performance becomes unbearable and the network driver freaks out.
# ./spinlock_load_test
1 spinlocks: 24159990
2 spinlocks: 12900657
3 spinlocks: 11547771
4 spinlocks: 9113
6 spinlocks: 259
8 spinlocks: 310
12 spinlocks: 283
16 spinlocks:
<seems to be stuck here; meanwhile my serial console fills up with
network driver cries>
Well, I take it as an incentive to pay special attention to this code review :)
--
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.
* Re: [RFC PATCH 1/5] x86,smp: move waiting on contended ticket lock out of line
From: Michel Lespinasse @ 2013-01-03 10:47 UTC (permalink / raw)
To: Rik van Riel
Cc: linux-kernel, aquini, eric.dumazet, lwoodman, jeremy, Jan Beulich,
Thomas Gleixner, knoel
On Wed, Jan 2, 2013 at 9:18 PM, Rik van Riel <riel@redhat.com> wrote:
> Moving the wait loop for congested locks to its own function allows
> us to add things to that wait loop, without growing the size of the
> kernel text appreciably.
Looks good :)
Still-reviewed-by: Michel Lespinasse <walken@google.com>
--
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.
* Re: [RFC PATCH 0/5] x86,smp: make ticket spinlock proportional backoff w/ auto tuning
From: Raghavendra KT @ 2013-01-03 11:29 UTC (permalink / raw)
To: Rik van Riel
Cc: linux-kernel, aquini, walken, eric.dumazet, lwoodman, jeremy,
Jan Beulich, Thomas Gleixner, knoel, Raghavendra KT
[CCing my ibm id]
On Thu, Jan 3, 2013 at 10:45 AM, Rik van Riel <riel@redhat.com> wrote:
> Many spinlocks are embedded in data structures; having many CPUs
> pounce on the cache line the lock is in will slow down the lock
> holder, and can cause system performance to fall off a cliff.
>
> The paper "Non-scalable locks are dangerous" is a good reference:
>
> http://pdos.csail.mit.edu/papers/linux:lock.pdf
>
> In the Linux kernel, spinlocks are optimized for the case of
> there not being contention. After all, if there is contention,
> the data structure can be improved to reduce or eliminate
> lock contention.
>
> Likewise, the spinlock API should remain simple, and the
> common case of the lock not being contended should remain
> as fast as ever.
>
> However, since spinlock contention should be fairly uncommon,
> we can add functionality into the spinlock slow path that keeps
> system performance from falling off a cliff when there is lock
> contention.
>
> Proportional delay in ticket locks means scaling the time between
> checks of the ticket by a delay factor multiplied by the number of
> CPUs ahead of us in the queue for this lock. Checking the lock
> less often allows the lock holder to continue running, resulting
> in better throughput and preventing performance from dropping
> off a cliff.
>
> The test case has a number of threads locking and unlocking a
> semaphore. With just one thread, everything sits in the CPU
> cache and throughput is around 2.6 million operations per
> second, with a 5-10% variation.
>
> Once a second thread gets involved, data structures bounce
> from CPU to CPU, and performance deteriorates to about 1.25
> million operations per second, with a 5-10% variation.
>
> However, as more and more threads get added to the mix,
> performance with the vanilla kernel continues to deteriorate.
> Once I hit 24 threads, on a 24 CPU, 4 node test system,
> performance is down to about 290k operations/second.
>
> With a proportional backoff delay added to the spinlock
> code, performance with 24 threads goes up to about 400k
> operations/second with a 50x delay, and about 900k operations/second
> with a 250x delay. However, with a 250x delay, performance with
> 2-5 threads is worse than with a 50x delay.
>
> Making the code auto-tune the delay factor results in a system
> that performs well with both light and heavy lock contention,
> and should also protect against the (likely) case of the fixed
> delay factor being wrong for other hardware.
>
> The attached graph shows the performance of the multi threaded
> semaphore lock/unlock test case, with 1-24 threads, on the
> vanilla kernel, with 10x, 50x, and 250x proportional delay,
> as well as the v1 patch series with autotuning for 2x and 2.7x
> spinning before the lock is obtained, and with the v2 series.
>
> The v2 series integrates several ideas from Michel Lespinasse
> and Eric Dumazet, which should result in better throughput and
> nicer behaviour in situations with contention on multiple locks.
>
> Please let me know if you manage to break this code in any way,
> so I can fix it...
>
Hi Rik,
The whole series looks very interesting. Thanks for posting the spinlock
series. I am also curious how the series affects virtualization cases.
(I was about to reply to V1 with Eric's changes, but got delayed by vacation.)
I am planning to try V2 on bare metal and in guests and come back with results.
On a related note,
While experimenting with PV spinlocks, I had tried a "weighted spinlock".
The rationale behind the patch is that in a virtualized environment the
use case is exactly the opposite: if the difference between head and tail
is large, the probability of getting the lock soon is very low, so we spin
for only a short time when the difference is high, and if we still have
not got the lock once the loop is over, we halt (PV spinlock) / yield to
a better candidate.
Here is the patch for reference that I tried on top of PV spinlocks.
summary of patch :
looping is proportional to
2 * SPIN_THRESHOLD
---------------------------------
( head - tail - 1)
---8<---
diff --git a/arch/x86/include/asm/spinlock.h b/arch/x86/include/asm/spinlock.h
index e6881fd..2f637ce 100644
--- a/arch/x86/include/asm/spinlock.h
+++ b/arch/x86/include/asm/spinlock.h
@@ -53,6 +53,18 @@ static inline void __ticket_enter_slowpath(arch_spinlock_t *lock)
set_bit(0, (volatile unsigned long *)&lock->tickets.tail);
}
+static inline unsigned get_spin_threshold(int diff)
+{
+ unsigned count = SPIN_THRESHOLD;
+
+ /* handle wrap around */
+ if (unlikely(diff < 0))
+ diff += TICKETLOCK_MAX_VAL;
+
+ count = count >> ((diff - 1) >> 1);
+ return count;
+}
+
#else /* !CONFIG_PARAVIRT_SPINLOCKS */
static __always_inline void __ticket_lock_spinning(arch_spinlock_t *lock,
__ticket_t ticket)
@@ -62,6 +74,10 @@ static inline void __ticket_unlock_kick(arch_spinlock_t *lock,
__ticket_t ticket)
{
}
+static inline unsigned get_spin_threshold(int diff)
+{
+ return SPIN_THRESHOLD;
+}
#endif /* CONFIG_PARAVIRT_SPINLOCKS */
@@ -88,8 +104,8 @@ static __always_inline void arch_spin_lock(arch_spinlock_t *lock)
inc.tail &= ~TICKET_SLOWPATH_FLAG;
for (;;) {
- unsigned count = SPIN_THRESHOLD;
-
+ unsigned count = get_spin_threshold(inc.tail -
+ ACCESS_ONCE(lock->tickets.head));
do {
if (ACCESS_ONCE(lock->tickets.head) == inc.tail)
goto out;
diff --git a/arch/x86/include/asm/spinlock_types.h
b/arch/x86/include/asm/spinlock_types.h
index 4f1bea1..22e38a5 100644
--- a/arch/x86/include/asm/spinlock_types.h
+++ b/arch/x86/include/asm/spinlock_types.h
@@ -14,9 +14,11 @@
#if (CONFIG_NR_CPUS < (256 / __TICKET_LOCK_INC))
typedef u8 __ticket_t;
typedef u16 __ticketpair_t;
+#define TICKETLOCK_MAX_VAL (1 << 8)
#else
typedef u16 __ticket_t;
typedef u32 __ticketpair_t;
+#define TICKETLOCK_MAX_VAL (1 << 16)
#endif
#define TICKET_LOCK_INC ((__ticket_t)__TICKET_LOCK_INC)
* Re: [RFC PATCH 2/5] x86,smp: proportional backoff for ticket spinlocks
From: Raghavendra KT @ 2013-01-03 11:35 UTC (permalink / raw)
To: Rik van Riel
Cc: linux-kernel, aquini, walken, eric.dumazet, lwoodman, jeremy,
Jan Beulich, Thomas Gleixner, knoel, Raghavendra KT
[Ccing IBM id]
On Thu, Jan 3, 2013 at 10:52 AM, Rik van Riel <riel@redhat.com> wrote:
> Simple fixed value proportional backoff for ticket spinlocks.
> By pounding on the cacheline with the spin lock less often,
> bus traffic is reduced. In cases of a data structure with
> embedded spinlock, the lock holder has a better chance of
> making progress.
>
> If we are next in line behind the current holder of the
> lock, we do a fast spin, so as not to waste any time when
> the lock is released.
>
> The number 50 is likely to be wrong for many setups, and
> this patch is mostly to illustrate the concept of proportional
> backoff. The next patch automatically tunes the delay value.
>
> Signed-off-by: Rik van Riel <riel@redhat.com>
> Signed-off-by: Michel Lespinasse <walken@google.com>
> ---
> arch/x86/kernel/smp.c | 23 ++++++++++++++++++++---
> 1 files changed, 20 insertions(+), 3 deletions(-)
>
> diff --git a/arch/x86/kernel/smp.c b/arch/x86/kernel/smp.c
> index 20da354..9c56fe3 100644
> --- a/arch/x86/kernel/smp.c
> +++ b/arch/x86/kernel/smp.c
> @@ -117,11 +117,28 @@ static bool smp_no_nmi_ipi = false;
> */
> void ticket_spin_lock_wait(arch_spinlock_t *lock, struct __raw_tickets inc)
> {
> + __ticket_t head = inc.head, ticket = inc.tail;
> + __ticket_t waiters_ahead;
> + unsigned loops;
> +
> for (;;) {
> - cpu_relax();
> - inc.head = ACCESS_ONCE(lock->tickets.head);
> + waiters_ahead = ticket - head - 1;
^^^^^^^^^^^^^^
Just wondering,
Does wraparound affects this?
* Re: [RFC PATCH 2/5] x86,smp: proportional backoff for ticket spinlocks
From: Michel Lespinasse @ 2013-01-03 11:42 UTC (permalink / raw)
To: Raghavendra KT
Cc: Rik van Riel, linux-kernel, aquini, eric.dumazet, lwoodman,
jeremy, Jan Beulich, Thomas Gleixner, knoel, Raghavendra KT
On Thu, Jan 3, 2013 at 3:35 AM, Raghavendra KT
<raghavendra.kt.linux@gmail.com> wrote:
> [Ccing IBM id]
> On Thu, Jan 3, 2013 at 10:52 AM, Rik van Riel <riel@redhat.com> wrote:
>> Simple fixed value proportional backoff for ticket spinlocks.
>> By pounding on the cacheline with the spin lock less often,
>> bus traffic is reduced. In cases of a data structure with
>> embedded spinlock, the lock holder has a better chance of
>> making progress.
>>
>> If we are next in line behind the current holder of the
>> lock, we do a fast spin, so as not to waste any time when
>> the lock is released.
>>
>> The number 50 is likely to be wrong for many setups, and
>> this patch is mostly to illustrate the concept of proportional
>> backoff. The next patch automatically tunes the delay value.
>>
>> Signed-off-by: Rik van Riel <riel@redhat.com>
>> Signed-off-by: Michel Lespinasse <walken@google.com>
>> ---
>> arch/x86/kernel/smp.c | 23 ++++++++++++++++++++---
>> 1 files changed, 20 insertions(+), 3 deletions(-)
>>
>> diff --git a/arch/x86/kernel/smp.c b/arch/x86/kernel/smp.c
>> index 20da354..9c56fe3 100644
>> --- a/arch/x86/kernel/smp.c
>> +++ b/arch/x86/kernel/smp.c
>> @@ -117,11 +117,28 @@ static bool smp_no_nmi_ipi = false;
>> */
>> void ticket_spin_lock_wait(arch_spinlock_t *lock, struct __raw_tickets inc)
>> {
>> + __ticket_t head = inc.head, ticket = inc.tail;
>> + __ticket_t waiters_ahead;
>> + unsigned loops;
>> +
>> for (;;) {
>> - cpu_relax();
>> - inc.head = ACCESS_ONCE(lock->tickets.head);
>> + waiters_ahead = ticket - head - 1;
> ^^^^^^^^^^^^^^
> Just wondering,
> Does wraparound affects this?
The result gets stored in waiters_ahead, which is unsigned and has the
same bit size as ticket and head, so this takes care of the
wraparound issue.
In other words, you may have to add 1<<8 or 1<<16 if the integer
difference is negative; but you get that for free by just computing
the difference as an 8 or 16 bit unsigned value.
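A tiny standalone illustration of that point (userspace sketch, not
kernel code):

#include <stdint.h>
#include <stdio.h>

int main(void)
{
	/* The ticket counter has wrapped: tail restarted at 1, head is 254. */
	uint8_t head = 254, tail = 1;

	/*
	 * Computed in the same 8-bit unsigned type, the difference wraps
	 * modulo 256 and still yields the number of waiters ahead of us.
	 */
	uint8_t waiters_ahead = tail - head - 1;

	printf("%u\n", (unsigned int)waiters_ahead);	/* prints 2 */
	return 0;
}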
--
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.
* Re: [RFC PATCH 3/5] x86,smp: auto tune spinlock backoff delay factor
From: Michel Lespinasse @ 2013-01-03 12:31 UTC (permalink / raw)
To: Rik van Riel
Cc: linux-kernel, aquini, eric.dumazet, lwoodman, jeremy, Jan Beulich,
Thomas Gleixner, knoel
On Wed, Jan 2, 2013 at 9:23 PM, Rik van Riel <riel@redhat.com> wrote:
> Proportional spinlock delay with a high delay factor works well
> when there is lots of contention on a lock. Likewise, a smaller
> delay factor works well when a lock is lightly contended.
>
> Making the code auto-tune the delay factor results in a system
> that performs well with both light and heavy lock contention.
I don't quite like that part of the explanation - Ideally I would like
to explain that there are huge benefits in having the delay be at
least equivalent to the time taken by a no-load spin/release section,
and that beyond that there are small additional benefits to using a
longer delay when the spinlock hold times are larger, and an
explanation of why linear increase / exponential delay works nicely
here.
I'll see if I can make a more concrete proposal and still keep it
short enough :)
> +#define MIN_SPINLOCK_DELAY 1
> +#define MAX_SPINLOCK_DELAY 16000
> +DEFINE_PER_CPU(int, spinlock_delay) = { MIN_SPINLOCK_DELAY };
unsigned would seem more natural here, though it's only a tiny detail
> void ticket_spin_lock_wait(arch_spinlock_t *lock, struct __raw_tickets inc)
> {
> __ticket_t head = inc.head, ticket = inc.tail;
> __ticket_t waiters_ahead;
> + int delay = __this_cpu_read(spinlock_delay);
I like that you used __this_cpu_read() - your v1 version wasn't as nice.
> +
> + /*
> + * The lock is still busy; slowly increase the delay. If we
> + * end up sleeping too long, the code below will reduce the
> + * delay. Ideally we acquire the lock in the tight loop above.
> + */
> + if (!(head % 7) && delay < MAX_SPINLOCK_DELAY)
> + delay++;
> +
> + loops = delay * waiters_ahead;
I don't like the head % 7 thing. I think using fixed point arithmetic
would be nicer:
if (delay < MAX_SPINLOCK_DELAY)
delay += 256/7; /* Or whatever constant we choose */
loops = (delay * waiter_ahead) >> 8;
Also, we should probably skip the delay increment on the first loop
iteration - after all, we haven't waited yet, so we can't say that the
delay was too short.
> - if (head == ticket)
> + if (head == ticket) {
> + /*
> + * We overslept and have no idea how long the lock
> + * went idle. Reduce the delay as a precaution.
> + */
> + delay -= delay/32 + 1;
There is a possibility of integer underflow here. It seems that the
window to hit it would be quite small, but it can happen (at the very
least, imagine getting an interrupt at an inopportune time which would
make it look like you overslept). I think this may be what was causing
the stability issues I noticed.
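For illustration, one way to avoid that underflow would be to clamp the
delay at the minimum when backing it off -- just a sketch of the idea
against the hunk quoted above, not a tested patch:

	head = ACCESS_ONCE(lock->tickets.head);
	if (head == ticket) {
		/*
		 * We overslept and have no idea how long the lock went
		 * idle.  Back the delay off, but never let it drop below
		 * the minimum, so a stray interrupt cannot underflow it.
		 */
		delay -= delay / 32 + 1;
		if (delay < MIN_SPINLOCK_DELAY)
			delay = MIN_SPINLOCK_DELAY;
		break;
	}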
I'll try to come up with a custom version of this patch. I do feel
that we're going in a good direction in general, though.
--
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.
* Re: [RFC PATCH 4/5] x86,smp: keep spinlock delay values per hashed spinlock address
From: Michel Lespinasse @ 2013-01-03 12:48 UTC (permalink / raw)
To: Rik van Riel
Cc: linux-kernel, aquini, eric.dumazet, lwoodman, jeremy, Jan Beulich,
Thomas Gleixner, knoel
On Wed, Jan 2, 2013 at 9:24 PM, Rik van Riel <riel@redhat.com> wrote:
> From: Eric Dumazet <eric.dumazet@gmail.com>
>
> Eric Dumazet found a regression with the spinlock backoff code,
> in workloads where multiple spinlocks were contended, each having
> a different wait time.
I think you should really clarify that the regression was observed
with version 1 of your proposal. At that time,
1- the autotune code tended to use too long delays for long held locks, and
2- there was no exponential backoff, which meant that sharing stats
between a long held and a short held spinlock could really hurt
throughput on the short held spinlock
I believe that with autotune v2, this really shouldn't be a problem and
stats sharing should result in just using a delay that's appropriate
for the shorter of the two lock hold times - which is not the optimal
value, but is actually close enough performance wise and, most
importantly, should not cause any regression when compared to current
mainline.
(it's important to point that out because otherwise, you might trick
Linus into thinking your patches are risky, which I think they
shouldn't be after you implemented exponential backoff)
> void ticket_spin_lock_wait(arch_spinlock_t *lock, struct __raw_tickets inc)
> {
> __ticket_t head = inc.head, ticket = inc.tail;
> __ticket_t waiters_ahead;
> - int delay = __this_cpu_read(spinlock_delay);
> + u32 hash = hash32_ptr(lock);
> + u32 slot = hash_32(hash, DELAY_HASH_SHIFT);
> + struct delay_entry *ent = &__get_cpu_var(spinlock_delay[slot]);
> + u32 delay = (ent->hash == hash) ? ent->delay : MIN_SPINLOCK_DELAY;
IMO we want to avoid MIN_SPINLOCK_DELAY here. The exponential backoff
autotune should make us resilient to collisions (if they happen we'll
just end up with something very close to the min of the delays that
would have been appropriate for either locks), so it should be better
to just let collisions happen rather than force the use of
MIN_SPINLOCK_DELAY.
--
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.
* Re: [RFC PATCH 4/5] x86,smp: keep spinlock delay values per hashed spinlock address
From: Eric Dumazet @ 2013-01-03 13:05 UTC (permalink / raw)
To: Michel Lespinasse
Cc: Rik van Riel, linux-kernel, aquini, lwoodman, jeremy, Jan Beulich,
Thomas Gleixner, knoel
On Thu, 2013-01-03 at 04:48 -0800, Michel Lespinasse wrote:
> On Wed, Jan 2, 2013 at 9:24 PM, Rik van Riel <riel@redhat.com> wrote:
> > From: Eric Dumazet <eric.dumazet@gmail.com>
> >
> > Eric Dumazet found a regression with the spinlock backoff code,
> > in workloads where multiple spinlocks were contended, each having
> > a different wait time.
>
> I think you should really clarify that the regression was observed
> with version 1 of your proposal. At that time,
>
> 1- the autotune code tended to use too long delays for long held locks, and
>
> 2- there was no exponential backoff, which meant that sharing stats
> between a long held and a short held spinlock could really hurt
> throughput on the short held spinlock
>
>
> I believe that with autotune v2, this really shouldnt be a problem and
> stats sharing should result in just using a delay that's appropriate
> for the shorter of the two lock hold times - which is not the optimal
> value, but is actually close enough performance wise and, most
> importantly, should not cause any regression when compared to current
> mainline.
>
> (it's important to point that out because otherwise, you might trick
> Linus into thinking your patches are risky, which I think they
> shouldn't be after you implemented exponential backoff)
>
> > void ticket_spin_lock_wait(arch_spinlock_t *lock, struct __raw_tickets inc)
> > {
> > __ticket_t head = inc.head, ticket = inc.tail;
> > __ticket_t waiters_ahead;
> > - int delay = __this_cpu_read(spinlock_delay);
> > + u32 hash = hash32_ptr(lock);
> > + u32 slot = hash_32(hash, DELAY_HASH_SHIFT);
> > + struct delay_entry *ent = &__get_cpu_var(spinlock_delay[slot]);
> > + u32 delay = (ent->hash == hash) ? ent->delay : MIN_SPINLOCK_DELAY;
>
> IMO we want to avoid MIN_SPINLOCK_DELAY here. The exponential backoff
> autotune should make us resilient to collisions (if they happen we'll
> just end up with something very close to the min of the delays that
> would have been appropriate for either locks), so it should be better
> to just let collisions happen rather than force the use of
> MIN_SPINLOCK_DELAY.
>
Exponential backoff won't help. I tried this idea last week and found
that it is better to detect a hash collision and safely use
MIN_SPINLOCK_DELAY in that case.
It is better not to overestimate the delay and spin much longer than
needed.
On a hash collision, we don't know the contention history of this
lock at all, unless we store the EWMA delay inside the lock.
(On x86 and NR_CPUS <= 256, we have a 16 bit hole in the spinlock that
we could use for this)
* Re: [RFC PATCH 3/5] x86,smp: auto tune spinlock backoff delay factor
From: Rik van Riel @ 2013-01-03 17:17 UTC (permalink / raw)
To: Michel Lespinasse
Cc: linux-kernel, aquini, eric.dumazet, lwoodman, jeremy, Jan Beulich,
Thomas Gleixner, knoel
On 01/03/2013 07:31 AM, Michel Lespinasse wrote:
> I'll see if I can make a more concrete proposal and still keep it
> short enough :)
Looking forward to that. I have thought about it some more,
and am still not sure about a better description for the
changelog...
>> +#define MIN_SPINLOCK_DELAY 1
>> +#define MAX_SPINLOCK_DELAY 16000
>> +DEFINE_PER_CPU(int, spinlock_delay) = { MIN_SPINLOCK_DELAY };
>
> unsigned would seem more natural here, though it's only a tiny detail
I might as well make that change while addressing the issues
you found :)
>> +
>> + /*
>> + * The lock is still busy; slowly increase the delay. If we
>> + * end up sleeping too long, the code below will reduce the
>> + * delay. Ideally we acquire the lock in the tight loop above.
>> + */
>> + if (!(head % 7) && delay < MAX_SPINLOCK_DELAY)
>> + delay++;
>> +
>> + loops = delay * waiters_ahead;
>
> I don't like the head % 7 thing. I think using fixed point arithmetic
> would be nicer:
>
> if (delay < MAX_SPINLOCK_DELAY)
> delay += 256/7; /* Or whatever constant we choose */
>
> loops = (delay * waiter_ahead) >> 8;
I'll do that. That could get completely rid of any artifacts
caused by incrementing sometimes, and not other times.
> Also, we should probably skip the delay increment on the first loop
> iteration - after all, we haven't waited yet, so we can't say that the
> delay was too short.
Good point. I will do that.
>> - if (head == ticket)
>> + if (head == ticket) {
>> + /*
>> + * We overslept and have no idea how long the lock
>> + * went idle. Reduce the delay as a precaution.
>> + */
>> + delay -= delay/32 + 1;
>
> There is a possibility of integer underflow here.
Fixed in my local code base now.
I will build a kernel with the things you pointed out fixed,
and will give it a spin this afternoon.
Expect new patches soonish :)
* Re: [RFC PATCH 2/5] x86,smp: proportional backoff for ticket spinlocks
From: Raghavendra K T @ 2013-01-03 18:19 UTC (permalink / raw)
To: Michel Lespinasse
Cc: Raghavendra KT, Rik van Riel, linux-kernel, aquini, eric.dumazet,
lwoodman, jeremy, Jan Beulich, Thomas Gleixner, knoel
On 01/03/2013 05:12 PM, Michel Lespinasse wrote:
> On Thu, Jan 3, 2013 at 3:35 AM, Raghavendra KT
> <raghavendra.kt.linux@gmail.com> wrote:
>> [Ccing IBM id]
>> On Thu, Jan 3, 2013 at 10:52 AM, Rik van Riel <riel@redhat.com> wrote:
>>> Simple fixed value proportional backoff for ticket spinlocks.
>>> By pounding on the cacheline with the spin lock less often,
>>> bus traffic is reduced. In cases of a data structure with
>>> embedded spinlock, the lock holder has a better chance of
>>> making progress.
>>>
>>> If we are next in line behind the current holder of the
>>> lock, we do a fast spin, so as not to waste any time when
>>> the lock is released.
>>>
>>> The number 50 is likely to be wrong for many setups, and
>>> this patch is mostly to illustrate the concept of proportional
>>> backoff. The next patch automatically tunes the delay value.
>>>
>>> Signed-off-by: Rik van Riel <riel@redhat.com>
>>> Signed-off-by: Michel Lespinasse <walken@google.com>
>>> ---
>>> arch/x86/kernel/smp.c | 23 ++++++++++++++++++++---
>>> 1 files changed, 20 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/arch/x86/kernel/smp.c b/arch/x86/kernel/smp.c
>>> index 20da354..9c56fe3 100644
>>> --- a/arch/x86/kernel/smp.c
>>> +++ b/arch/x86/kernel/smp.c
>>> @@ -117,11 +117,28 @@ static bool smp_no_nmi_ipi = false;
>>> */
>>> void ticket_spin_lock_wait(arch_spinlock_t *lock, struct __raw_tickets inc)
>>> {
>>> + __ticket_t head = inc.head, ticket = inc.tail;
>>> + __ticket_t waiters_ahead;
>>> + unsigned loops;
>>> +
>>> for (;;) {
>>> - cpu_relax();
>>> - inc.head = ACCESS_ONCE(lock->tickets.head);
>>> + waiters_ahead = ticket - head - 1;
>> ^^^^^^^^^^^^^^
>> Just wondering,
>> Does wraparound affects this?
>
> The result gets stored in waiters_ahead, which is unsigned and has
> same bit size as ticket and head. So, this takes care of the
> wraparound issue.
>
> In other words, you may have to add 1<<8 or 1<<16 if the integer
> difference was negative; but you get that for free by just computing
> the difference as a 8 or 16 bit unsigned value.
>
Michel,
Sorry for the noise and for missing the simple math :), and thanks for
the explanation.
* Re: [RFC PATCH 3/5] x86,smp: auto tune spinlock backoff delay factor
From: Rik van Riel @ 2013-01-05 0:45 UTC (permalink / raw)
To: Michel Lespinasse
Cc: linux-kernel, aquini, eric.dumazet, lwoodman, jeremy, Jan Beulich,
Thomas Gleixner, knoel
On 01/03/2013 12:17 PM, Rik van Riel wrote:
>>> + if (!(head % 7) && delay < MAX_SPINLOCK_DELAY)
>>> + delay++;
>>> +
>>> + loops = delay * waiters_ahead;
>>
>> I don't like the head % 7 thing. I think using fixed point arithmetic
>> would be nicer:
>>
>> if (delay < MAX_SPINLOCK_DELAY)
>> delay += 256/7; /* Or whatever constant we choose */
>>
>> loops = (delay * waiter_ahead) >> 8;
>
> I'll do that. That could get completely rid of any artifacts
> caused by incrementing sometimes, and not other times.
>
>> Also, we should probably skip the delay increment on the first loop
>> iteration - after all, we haven't waited yet, so we can't say that the
>> delay was too short.
>
> Good point. I will do that.
> I will build a kernel with the things you pointed out fixed,
> and will give it a spin this afternoon.
>
> Expect new patches soonish :)
After implementing all the ideas you came up with, which made
perfect sense to me, the code performs significantly worse
than before.
*sigh*
New patches will be coming ... later.
--
All rights reversed