* [PATCH 0/5] x86,smp: make ticket spinlock proportional backoff w/ auto tuning
@ 2013-01-08 22:26 Rik van Riel
2013-01-08 22:30 ` [PATCH 3/5] x86,smp: auto tune spinlock backoff delay factor Rik van Riel
` (7 more replies)
0 siblings, 8 replies; 27+ messages in thread
From: Rik van Riel @ 2013-01-08 22:26 UTC (permalink / raw)
To: linux-kernel
Cc: aquini, walken, eric.dumazet, lwoodman, jeremy, Jan Beulich,
knoel, chegu_vinod, raghavendra.kt, mingo
[-- Attachment #1: Type: text/plain, Size: 3290 bytes --]
Many spinlocks are embedded in data structures; having many CPUs
pounce on the cache line the lock is in will slow down the lock
holder, and can cause system performance to fall off a cliff.
The paper "Non-scalable locks are dangerous" is a good reference:
http://pdos.csail.mit.edu/papers/linux:lock.pdf
In the Linux kernel, spinlocks are optimized for the case of
there not being contention. After all, if there is contention,
the data structure can be improved to reduce or eliminate
lock contention.
Likewise, the spinlock API should remain simple, and the
common case of the lock not being contended should remain
as fast as ever.
However, since spinlock contention should be fairly uncommon,
we can add functionality into the spinlock slow path that keeps
system performance from falling off a cliff when there is lock
contention.
Proportional delay in ticket locks means scaling the time between
checks of the ticket by a delay factor multiplied by the number of
CPUs ahead of us in the queue for this lock. Checking the lock
less often allows the lock holder to continue running, resulting
in better throughput and preventing performance from dropping
off a cliff.
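In rough, simplified C, the core idea looks like the sketch below; the
real implementation is in patches 2 and 3 later in this series, and
cpu_relax() and the head/ticket values stand in for the actual kernel
helper and arch_spinlock_t fields:

	/*
	 * Proportional backoff sketch: the further back in the ticket queue
	 * we are, the longer we wait between polls of the lock cacheline.
	 * When we are next in line, this degenerates into a fast poll.
	 */
	static void ticket_wait(volatile unsigned int *lock_head,
				unsigned int ticket, unsigned int delay_factor)
	{
		unsigned int head;

		while ((head = *lock_head) != ticket) {
			unsigned int loops = delay_factor * (ticket - head - 1);

			while (loops--)
				cpu_relax();	/* PAUSE; lets the lock holder run */
		}
	}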
The test case has a number of threads locking and unlocking a
semaphore. With just one thread, everything sits in the CPU
cache and throughput is around 2.6 million operations per
second, with a 5-10% variation.
Once a second thread gets involved, data structures bounce
from CPU to CPU, and performance deteriorates to about 1.25
million operations per second, with a 5-10% variation.
However, as more and more threads get added to the mix,
performance with the vanilla kernel continues to deteriorate.
Once I hit 24 threads, on a 24 CPU, 4 node test system,
performance is down to about 290k operations/second.
With a proportional backoff delay added to the spinlock
code, performance with 24 threads goes up to about 400k
operations/second with a 50x delay, and about 900k operations/second
with a 250x delay. However, with a 250x delay, performance with
2-5 threads is worse than with a 50x delay.
Making the code auto-tune the delay factor results in a system
that performs well with both light and heavy lock contention,
and should also protect against the (likely) case of the fixed
delay factor being wrong for other hardware.
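The tuning rule itself is small. Condensed from patch 3 below (leaving
out the fixed-point bookkeeping context and the per-lock hashing that
patch 4 adds), it boils down to:

	/* Each pass through the wait loop: aggressively lengthen the delay. */
	if (delay < MAX_SPINLOCK_DELAY)
		delay += DELAY_FIXED_1 / 7;

	/* Our turn already when we poll again: we may have overslept, decay fast. */
	if (delay >= 2 * DELAY_FIXED_1)
		delay -= max(delay / 32, DELAY_FIXED_1);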
The attached graph shows the performance of the multi threaded
semaphore lock/unlock test case, with 1-24 threads, on the
vanilla kernel, with 10x, 50x, and 250x proportional delay,
as well as the v1 patch series with autotuning for 2x and 2.7x
spinning before the lock is obtained, and with the v2 series.
The v2 series integrates several ideas from Michel Lespinasse
and Eric Dumazet, which should result in better throughput and
nicer behaviour in situations with contention on multiple locks.
For the v3 series, I tried out all the ideas suggested by
Michel. They made perfect sense, but in the end it turned
out they did not work as well as the simple, aggressive
"try to make the delay longer" policy I have now. Several
small bug fixes and cleanups have been integrated.
Performance is within the margin of error of v2, so the graph
has not been updated.
Please let me know if you manage to break this code in any way,
so I can fix it...
--
All rights reversed.
[-- Attachment #2: spinlock-backoff-v2.png --]
[-- Type: image/png, Size: 21964 bytes --]
* [PATCH 3/5] x86,smp: auto tune spinlock backoff delay factor
2013-01-08 22:26 [PATCH 0/5] x86,smp: make ticket spinlock proportional backoff w/ auto tuning Rik van Riel
@ 2013-01-08 22:30 ` Rik van Riel
2013-01-10 3:13 ` Rafael Aquini
2013-01-10 12:49 ` Michel Lespinasse
2013-01-08 22:31 ` [PATCH 4/5] x86,smp: keep spinlock delay values per hashed spinlock address Rik van Riel
` (6 subsequent siblings)
7 siblings, 2 replies; 27+ messages in thread
From: Rik van Riel @ 2013-01-08 22:30 UTC (permalink / raw)
To: linux-kernel
Cc: aquini, walken, eric.dumazet, lwoodman, jeremy, Jan Beulich,
knoel, chegu_vinod, raghavendra.kt, mingo
Many spinlocks are embedded in data structures; having many CPUs
pounce on the cache line the lock is in will slow down the lock
holder, and can cause system performance to fall off a cliff.
The paper "Non-scalable locks are dangerous" is a good reference:
http://pdos.csail.mit.edu/papers/linux:lock.pdf
In the Linux kernel, spinlocks are optimized for the case of
there not being contention. After all, if there is contention,
the data structure can be improved to reduce or eliminate
lock contention.
Likewise, the spinlock API should remain simple, and the
common case of the lock not being contended should remain
as fast as ever.
However, since spinlock contention should be fairly uncommon,
we can add functionality into the spinlock slow path that keeps
system performance from falling off a cliff when there is lock
contention.
Proportional delay in ticket locks means scaling the time between
checks of the ticket by a delay factor multiplied by the number of
CPUs ahead of us in the queue for this lock. Checking the lock
less often allows the lock holder to continue running, resulting
in better throughput and preventing performance from dropping
off a cliff.
Proportional spinlock delay with a high delay factor works well
when there is lots of contention on a lock. Likewise, a smaller
delay factor works well when a lock is lightly contended.
Making the code auto-tune the delay factor results in a system
that performs well with both light and heavy lock contention.
Signed-off-by: Rik van Riel <riel@redhat.com>
---
v3: use fixed-point math for the delay calculations, suggested by Michel Lespinasse
arch/x86/kernel/smp.c | 43 +++++++++++++++++++++++++++++++++++++++----
1 files changed, 39 insertions(+), 4 deletions(-)
diff --git a/arch/x86/kernel/smp.c b/arch/x86/kernel/smp.c
index aa743e9..05f828b 100644
--- a/arch/x86/kernel/smp.c
+++ b/arch/x86/kernel/smp.c
@@ -113,13 +113,34 @@ static atomic_t stopping_cpu = ATOMIC_INIT(-1);
static bool smp_no_nmi_ipi = false;
/*
- * Wait on a congested ticket spinlock.
+ * Wait on a congested ticket spinlock. Many spinlocks are embedded in
+ * data structures; having many CPUs pounce on the cache line with the
+ * spinlock simultaneously can slow down the lock holder, and the system
+ * as a whole.
+ *
+ * To prevent total performance collapse in case of bad spinlock contention,
+ * perform proportional backoff. The per-cpu value of delay is automatically
+ * tuned to limit the number of times spinning CPUs poll the lock before
+ * obtaining it. This limits the amount of cross-CPU traffic required to obtain
+ * a spinlock, and keeps system performance from dropping off a cliff.
+ *
+ * There is a tradeoff. If we poll too often, the whole system is slowed
+ * down. If we sleep too long, the lock will go unused for a period of
+ * time. The solution is to go for a fast spin if we are at the head of
+ * the queue, to slowly increase the delay if we sleep for too short a
+ * time, and to decrease the delay if we slept for too long.
*/
+#define DELAY_SHIFT 8
+#define DELAY_FIXED_1 (1<<DELAY_SHIFT)
+#define MIN_SPINLOCK_DELAY (1 * DELAY_FIXED_1)
+#define MAX_SPINLOCK_DELAY (16000 * DELAY_FIXED_1)
+DEFINE_PER_CPU(unsigned, spinlock_delay) = { MIN_SPINLOCK_DELAY };
void ticket_spin_lock_wait(arch_spinlock_t *lock, struct __raw_tickets inc)
{
__ticket_t head = inc.head, ticket = inc.tail;
__ticket_t waiters_ahead;
- unsigned loops;
+ unsigned delay = __this_cpu_read(spinlock_delay);
+ unsigned loops = 1;
for (;;) {
waiters_ahead = ticket - head - 1;
@@ -133,14 +154,28 @@ void ticket_spin_lock_wait(arch_spinlock_t *lock, struct __raw_tickets inc)
} while (ACCESS_ONCE(lock->tickets.head) != ticket);
break;
}
- loops = 50 * waiters_ahead;
+
+ /* Aggressively increase delay, to minimize lock accesses. */
+ if (delay < MAX_SPINLOCK_DELAY)
+ delay += DELAY_FIXED_1 / 7;
+
+ loops = (delay * waiters_ahead) >> DELAY_SHIFT;
while (loops--)
cpu_relax();
head = ACCESS_ONCE(lock->tickets.head);
- if (head == ticket)
+ if (head == ticket) {
+ /*
+ * We overslept, and do not know by how much.
+ * Exponentially decay the value of delay,
+ * to get it back to a good value quickly.
+ */
+ if (delay >= 2 * DELAY_FIXED_1)
+ delay -= max(delay/32, DELAY_FIXED_1);
break;
+ }
}
+ __this_cpu_write(spinlock_delay, delay);
}
/*
* [PATCH 4/5] x86,smp: keep spinlock delay values per hashed spinlock address
2013-01-08 22:26 [PATCH 0/5] x86,smp: make ticket spinlock proportional backoff w/ auto tuning Rik van Riel
2013-01-08 22:30 ` [PATCH 3/5] x86,smp: auto tune spinlock backoff delay factor Rik van Riel
@ 2013-01-08 22:31 ` Rik van Riel
2013-01-10 3:14 ` Rafael Aquini
2013-01-10 13:01 ` Michel Lespinasse
2013-01-08 22:32 ` [DEBUG PATCH 5/5] x86,smp: add debugging code to track spinlock delay value Rik van Riel
` (5 subsequent siblings)
7 siblings, 2 replies; 27+ messages in thread
From: Rik van Riel @ 2013-01-08 22:31 UTC (permalink / raw)
To: linux-kernel
Cc: aquini, walken, eric.dumazet, lwoodman, jeremy, Jan Beulich,
knoel, chegu_vinod, raghavendra.kt, mingo
From: Eric Dumazet <eric.dumazet@gmail.com>
Eric Dumazet found a regression with the first version of the spinlock
backoff code, in a workload where multiple spinlocks were contended,
each having a different wait time.
This patch has multiple delay values per cpu, indexed on a hash
of the lock address, to avoid that problem.
Eric Dumazet wrote:
I did some tests with your patches with following configuration :
tc qdisc add dev eth0 root htb r2q 1000 default 3
(to force a contention on qdisc lock, even with a multi queue net
device)
and 24 concurrent "netperf -t UDP_STREAM -H other_machine -- -m 128"
Machine : 2 Intel(R) Xeon(R) CPU X5660 @ 2.80GHz
(24 threads), and a fast NIC (10Gbps)
Resulting in a 13 % regression (676 Mbits -> 595 Mbits)
In this workload we have at least two contended spinlocks, with
different delays. (spinlocks are not held for the same duration)
It clearly defeats your assumption of a single per cpu delay being OK :
Some cpus are spinning too long while the lock was released.
We might try to use a hash on lock address, and an array of 16 different
delays so that different spinlocks have a chance of not sharing the same
delay.
With following patch, I get 982 Mbits/s with same bench, so an increase
of 45 % instead of a 13 % regression.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Rik van Riel <riel@redhat.com>
---
arch/x86/kernel/smp.c | 22 +++++++++++++++++++---
1 files changed, 19 insertions(+), 3 deletions(-)
diff --git a/arch/x86/kernel/smp.c b/arch/x86/kernel/smp.c
index 05f828b..1877890 100644
--- a/arch/x86/kernel/smp.c
+++ b/arch/x86/kernel/smp.c
@@ -23,6 +23,7 @@
#include <linux/interrupt.h>
#include <linux/cpu.h>
#include <linux/gfp.h>
+#include <linux/hash.h>
#include <asm/mtrr.h>
#include <asm/tlbflush.h>
@@ -134,12 +135,26 @@ static bool smp_no_nmi_ipi = false;
#define DELAY_FIXED_1 (1<<DELAY_SHIFT)
#define MIN_SPINLOCK_DELAY (1 * DELAY_FIXED_1)
#define MAX_SPINLOCK_DELAY (16000 * DELAY_FIXED_1)
-DEFINE_PER_CPU(unsigned, spinlock_delay) = { MIN_SPINLOCK_DELAY };
+#define DELAY_HASH_SHIFT 6
+struct delay_entry {
+ u32 hash;
+ u32 delay;
+};
+static DEFINE_PER_CPU(struct delay_entry [1 << DELAY_HASH_SHIFT], spinlock_delay) = {
+ [0 ... (1 << DELAY_HASH_SHIFT) - 1] = {
+ .hash = 0,
+ .delay = MIN_SPINLOCK_DELAY,
+ },
+};
+
void ticket_spin_lock_wait(arch_spinlock_t *lock, struct __raw_tickets inc)
{
__ticket_t head = inc.head, ticket = inc.tail;
__ticket_t waiters_ahead;
- unsigned delay = __this_cpu_read(spinlock_delay);
+ u32 hash = hash32_ptr(lock);
+ u32 slot = hash_32(hash, DELAY_HASH_SHIFT);
+ struct delay_entry *ent = &__get_cpu_var(spinlock_delay[slot]);
+ u32 delay = (ent->hash == hash) ? ent->delay : MIN_SPINLOCK_DELAY;
unsigned loops = 1;
for (;;) {
@@ -175,7 +190,8 @@ void ticket_spin_lock_wait(arch_spinlock_t *lock, struct __raw_tickets inc)
break;
}
}
- __this_cpu_write(spinlock_delay, delay);
+ ent->hash = hash;
+ ent->delay = delay;
}
/*
* [DEBUG PATCH 5/5] x86,smp: add debugging code to track spinlock delay value
2013-01-08 22:26 [PATCH 0/5] x86,smp: make ticket spinlock proportional backoff w/ auto tuning Rik van Riel
2013-01-08 22:30 ` [PATCH 3/5] x86,smp: auto tune spinlock backoff delay factor Rik van Riel
2013-01-08 22:31 ` [PATCH 4/5] x86,smp: keep spinlock delay values per hashed spinlock address Rik van Riel
@ 2013-01-08 22:32 ` Rik van Riel
2013-01-08 22:32 ` [PATCH 2/5] x86,smp: proportional backoff for ticket spinlocks Rik van Riel
` (4 subsequent siblings)
7 siblings, 0 replies; 27+ messages in thread
From: Rik van Riel @ 2013-01-08 22:32 UTC (permalink / raw)
To: linux-kernel
Cc: aquini, walken, eric.dumazet, lwoodman, jeremy, Jan Beulich,
knoel, chegu_vinod, raghavendra.kt, mingo
From: Eric Dumazet <eric.dumazet@gmail.com>
This code prints out the maximum spinlock delay value and the
backtrace that pushed it that far.
On systems with serial consoles, the act of printing can cause
the spinlock delay value to explode. It can be useful as a
debugging tool, but is too verbose to merge upstream in this form.
Not-signed-off-by: Rik van Riel <riel@redhat.com>
Not-signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
arch/x86/kernel/smp.c | 8 ++++++++
1 files changed, 8 insertions(+), 0 deletions(-)
diff --git a/arch/x86/kernel/smp.c b/arch/x86/kernel/smp.c
index 1877890..d80aee7 100644
--- a/arch/x86/kernel/smp.c
+++ b/arch/x86/kernel/smp.c
@@ -147,6 +147,8 @@ static DEFINE_PER_CPU(struct delay_entry [1 << DELAY_HASH_SHIFT], spinlock_delay
},
};
+static DEFINE_PER_CPU(u32, maxdelay);
+
void ticket_spin_lock_wait(arch_spinlock_t *lock, struct __raw_tickets inc)
{
__ticket_t head = inc.head, ticket = inc.tail;
@@ -192,6 +194,12 @@ void ticket_spin_lock_wait(arch_spinlock_t *lock, struct __raw_tickets inc)
}
ent->hash = hash;
ent->delay = delay;
+
+ if (__this_cpu_read(maxdelay) * 4 < delay * 3) {
+ pr_err("cpu %d lock %p delay %d\n", smp_processor_id(), lock, delay>>DELAY_SHIFT);
+ __this_cpu_write(maxdelay, delay);
+ WARN_ON(1);
+ }
}
/*
* [PATCH 2/5] x86,smp: proportional backoff for ticket spinlocks
2013-01-08 22:26 [PATCH 0/5] x86,smp: make ticket spinlock proportional backoff w/ auto tuning Rik van Riel
` (2 preceding siblings ...)
2013-01-08 22:32 ` [DEBUG PATCH 5/5] x86,smp: add debugging code to track spinlock delay value Rik van Riel
@ 2013-01-08 22:32 ` Rik van Riel
2013-01-08 22:50 ` Eric Dumazet
2013-01-10 2:30 ` Rafael Aquini
2013-01-08 22:32 ` [PATCH 1/5] x86,smp: move waiting on contended ticket lock out of line Rik van Riel
` (3 subsequent siblings)
7 siblings, 2 replies; 27+ messages in thread
From: Rik van Riel @ 2013-01-08 22:32 UTC (permalink / raw)
To: linux-kernel
Cc: aquini, walken, eric.dumazet, lwoodman, jeremy, Jan Beulich,
knoel, chegu_vinod, raghavendra.kt, mingo
Subject: x86,smp: proportional backoff for ticket spinlocks
Simple fixed value proportional backoff for ticket spinlocks.
By pounding on the cacheline with the spin lock less often,
bus traffic is reduced. In cases of a data structure with
embedded spinlock, the lock holder has a better chance of
making progress.
If we are next in line behind the current holder of the
lock, we do a fast spin, so as not to waste any time when
the lock is released.
The number 50 is likely to be wrong for many setups, and
this patch is mostly to illustrate the concept of proportional
backoff. The next patch automatically tunes the delay value.
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Michel Lespinasse <walken@google.com>
---
arch/x86/kernel/smp.c | 23 ++++++++++++++++++++---
1 files changed, 20 insertions(+), 3 deletions(-)
diff --git a/arch/x86/kernel/smp.c b/arch/x86/kernel/smp.c
index 20da354..aa743e9 100644
--- a/arch/x86/kernel/smp.c
+++ b/arch/x86/kernel/smp.c
@@ -117,11 +117,28 @@ static bool smp_no_nmi_ipi = false;
*/
void ticket_spin_lock_wait(arch_spinlock_t *lock, struct __raw_tickets inc)
{
+ __ticket_t head = inc.head, ticket = inc.tail;
+ __ticket_t waiters_ahead;
+ unsigned loops;
+
for (;;) {
- cpu_relax();
- inc.head = ACCESS_ONCE(lock->tickets.head);
+ waiters_ahead = ticket - head - 1;
+ /*
+ * We are next after the current lock holder. Check often
+ * to avoid wasting time when the lock is released.
+ */
+ if (!waiters_ahead) {
+ do {
+ cpu_relax();
+ } while (ACCESS_ONCE(lock->tickets.head) != ticket);
+ break;
+ }
+ loops = 50 * waiters_ahead;
+ while (loops--)
+ cpu_relax();
- if (inc.head == inc.tail)
+ head = ACCESS_ONCE(lock->tickets.head);
+ if (head == ticket)
break;
}
}
* [PATCH 1/5] x86,smp: move waiting on contended ticket lock out of line
2013-01-08 22:26 [PATCH 0/5] x86,smp: make ticket spinlock proportional backoff w/ auto tuning Rik van Riel
` (3 preceding siblings ...)
2013-01-08 22:32 ` [PATCH 2/5] x86,smp: proportional backoff for ticket spinlocks Rik van Riel
@ 2013-01-08 22:32 ` Rik van Riel
2013-01-08 22:43 ` Eric Dumazet
2013-01-10 17:38 ` Raghavendra K T
2013-01-09 12:50 ` [PATCH 0/5] x86,smp: make ticket spinlock proportional backoff w/ auto tuning Raghavendra K T
` (2 subsequent siblings)
7 siblings, 2 replies; 27+ messages in thread
From: Rik van Riel @ 2013-01-08 22:32 UTC (permalink / raw)
To: linux-kernel
Cc: aquini, walken, eric.dumazet, lwoodman, jeremy, Jan Beulich,
knoel, chegu_vinod, raghavendra.kt, mingo
Moving the wait loop for congested locks to its own function allows
us to add things to that wait loop, without growing the size of the
kernel text appreciably.
Signed-off-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
Reviewed-by: Michel Lespinasse <walken@google.com>
Reviewed-by: Rafael Aquini <aquini@redhat.com>
---
v2: clean up the code a little, after Michel's suggestion
arch/x86/include/asm/spinlock.h | 11 +++++------
arch/x86/kernel/smp.c | 14 ++++++++++++++
2 files changed, 19 insertions(+), 6 deletions(-)
diff --git a/arch/x86/include/asm/spinlock.h b/arch/x86/include/asm/spinlock.h
index 33692ea..dc492f6 100644
--- a/arch/x86/include/asm/spinlock.h
+++ b/arch/x86/include/asm/spinlock.h
@@ -34,6 +34,8 @@
# define UNLOCK_LOCK_PREFIX
#endif
+extern void ticket_spin_lock_wait(arch_spinlock_t *, struct __raw_tickets);
+
/*
* Ticket locks are conceptually two parts, one indicating the current head of
* the queue, and the other indicating the current tail. The lock is acquired
@@ -53,12 +55,9 @@ static __always_inline void __ticket_spin_lock(arch_spinlock_t *lock)
inc = xadd(&lock->tickets, inc);
- for (;;) {
- if (inc.head == inc.tail)
- break;
- cpu_relax();
- inc.head = ACCESS_ONCE(lock->tickets.head);
- }
+ if (inc.head != inc.tail)
+ ticket_spin_lock_wait(lock, inc);
+
barrier(); /* make sure nothing creeps before the lock is taken */
}
diff --git a/arch/x86/kernel/smp.c b/arch/x86/kernel/smp.c
index 48d2b7d..20da354 100644
--- a/arch/x86/kernel/smp.c
+++ b/arch/x86/kernel/smp.c
@@ -113,6 +113,20 @@ static atomic_t stopping_cpu = ATOMIC_INIT(-1);
static bool smp_no_nmi_ipi = false;
/*
+ * Wait on a congested ticket spinlock.
+ */
+void ticket_spin_lock_wait(arch_spinlock_t *lock, struct __raw_tickets inc)
+{
+ for (;;) {
+ cpu_relax();
+ inc.head = ACCESS_ONCE(lock->tickets.head);
+
+ if (inc.head == inc.tail)
+ break;
+ }
+}
+
+/*
* this function sends a 'reschedule' IPI to another CPU.
* it goes straight through and wastes no time serializing
* anything. Worst case is that we lose a reschedule ...
* Re: [PATCH 1/5] x86,smp: move waiting on contended ticket lock out of line
2013-01-08 22:32 ` [PATCH 1/5] x86,smp: move waiting on contended ticket lock out of line Rik van Riel
@ 2013-01-08 22:43 ` Eric Dumazet
2013-01-10 17:38 ` Raghavendra K T
1 sibling, 0 replies; 27+ messages in thread
From: Eric Dumazet @ 2013-01-08 22:43 UTC (permalink / raw)
To: Rik van Riel
Cc: linux-kernel, aquini, walken, lwoodman, jeremy, Jan Beulich,
knoel, chegu_vinod, raghavendra.kt, mingo
On Tue, 2013-01-08 at 17:32 -0500, Rik van Riel wrote:
> Moving the wait loop for congested loops to its own function allows
> us to add things to that wait loop, without growing the size of the
> kernel text appreciably.
>
> Signed-off-by: Rik van Riel <riel@redhat.com>
> Reviewed-by: Steven Rostedt <rostedt@goodmiss.org>
> Reviewed-by: Michel Lespinasse <walken@google.com>
> Reviewed-by: Rafael Aquini <aquini@redhat.com>
> ---
> v2: clean up the code a little, after Michel's suggestion
Reviewed-by: Eric Dumazet <edumazet@google.com>
* Re: [PATCH 2/5] x86,smp: proportional backoff for ticket spinlocks
2013-01-08 22:32 ` [PATCH 2/5] x86,smp: proportional backoff for ticket spinlocks Rik van Riel
@ 2013-01-08 22:50 ` Eric Dumazet
2013-01-08 22:54 ` Rik van Riel
2013-01-10 2:30 ` Rafael Aquini
1 sibling, 1 reply; 27+ messages in thread
From: Eric Dumazet @ 2013-01-08 22:50 UTC (permalink / raw)
To: Rik van Riel
Cc: linux-kernel, aquini, walken, lwoodman, jeremy, Jan Beulich,
knoel, chegu_vinod, raghavendra.kt, mingo
On Tue, 2013-01-08 at 17:32 -0500, Rik van Riel wrote:
> Subject: x86,smp: proportional backoff for ticket spinlocks
>
> Simple fixed value proportional backoff for ticket spinlocks.
> By pounding on the cacheline with the spin lock less often,
> bus traffic is reduced. In cases of a data structure with
> embedded spinlock, the lock holder has a better chance of
> making progress.
>
> If we are next in line behind the current holder of the
> lock, we do a fast spin, so as not to waste any time when
> the lock is released.
>
> The number 50 is likely to be wrong for many setups, and
> this patch is mostly to illustrate the concept of proportional
> backup. The next patch automatically tunes the delay value.
>
> Signed-off-by: Rik van Riel <riel@redhat.com>
> Signed-off-by: Michel Lespinasse <walken@google.com>
> ---
> arch/x86/kernel/smp.c | 23 ++++++++++++++++++++---
> 1 files changed, 20 insertions(+), 3 deletions(-)
>
> diff --git a/arch/x86/kernel/smp.c b/arch/x86/kernel/smp.c
> index 20da354..aa743e9 100644
> --- a/arch/x86/kernel/smp.c
> +++ b/arch/x86/kernel/smp.c
> @@ -117,11 +117,28 @@ static bool smp_no_nmi_ipi = false;
> */
> void ticket_spin_lock_wait(arch_spinlock_t *lock, struct __raw_tickets inc)
> {
> + __ticket_t head = inc.head, ticket = inc.tail;
> + __ticket_t waiters_ahead;
> + unsigned loops;
> +
> for (;;) {
> - cpu_relax();
> - inc.head = ACCESS_ONCE(lock->tickets.head);
> + waiters_ahead = ticket - head - 1;
> + /*
> + * We are next after the current lock holder. Check often
> + * to avoid wasting time when the lock is released.
> + */
> + if (!waiters_ahead) {
> + do {
> + cpu_relax();
> + } while (ACCESS_ONCE(lock->tickets.head) != ticket);
> + break;
> + }
> + loops = 50 * waiters_ahead;
> + while (loops--)
> + cpu_relax();
>
> - if (inc.head == inc.tail)
> + head = ACCESS_ONCE(lock->tickets.head);
> + if (head == ticket)
> break;
> }
> }
>
Reviewed-by: Eric Dumazet <edumazet@google.com>
In my tests, I used the following formula :
loops = 50 * ((ticket - head) - 1/2);
or :
delta = ticket - head;
loops = delay * delta - (delay >> 1);
(And I didn't use the special case:
if (!waiters_ahead) {
do {
cpu_relax();
} while (ACCESS_ONCE(lock->tickets.head) != ticket);
break;
}
Because it means this won't help machines with 2 CPUs.
(or more generally if there _is_ contention, but with
one lock holder and one lock waiter)
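To make the difference concrete, with a fixed factor of 50 the two
approaches give roughly the following loop counts for small queue
depths (delta = ticket - head, so delta == 1 means we are next):

	/* Rik's version: special-case the head waiter, otherwise linear. */
	loops = 50 * waiters_ahead;	/* delta=1: fast spin, delta=2: 50, delta=3: 100 */

	/* Eric's variant: no special case, round down by half a slot. */
	loops = 50 * delta - (50 >> 1);	/* delta=1: 25, delta=2: 75, delta=3: 125 */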
* Re: [PATCH 2/5] x86,smp: proportional backoff for ticket spinlocks
2013-01-08 22:50 ` Eric Dumazet
@ 2013-01-08 22:54 ` Rik van Riel
0 siblings, 0 replies; 27+ messages in thread
From: Rik van Riel @ 2013-01-08 22:54 UTC (permalink / raw)
To: Eric Dumazet
Cc: linux-kernel, aquini, walken, lwoodman, jeremy, Jan Beulich,
knoel, chegu_vinod, raghavendra.kt, mingo
On 01/08/2013 05:50 PM, Eric Dumazet wrote:
> On Tue, 2013-01-08 at 17:32 -0500, Rik van Riel wrote:
>> Subject: x86,smp: proportional backoff for ticket spinlocks
>>
>> Simple fixed value proportional backoff for ticket spinlocks.
>> By pounding on the cacheline with the spin lock less often,
>> bus traffic is reduced. In cases of a data structure with
>> embedded spinlock, the lock holder has a better chance of
>> making progress.
>>
>> If we are next in line behind the current holder of the
>> lock, we do a fast spin, so as not to waste any time when
>> the lock is released.
>>
>> The number 50 is likely to be wrong for many setups, and
>> this patch is mostly to illustrate the concept of proportional
>> backup. The next patch automatically tunes the delay value.
>>
>> Signed-off-by: Rik van Riel <riel@redhat.com>
>> Signed-off-by: Michel Lespinasse <walken@google.com>
>> ---
>> arch/x86/kernel/smp.c | 23 ++++++++++++++++++++---
>> 1 files changed, 20 insertions(+), 3 deletions(-)
>>
>> diff --git a/arch/x86/kernel/smp.c b/arch/x86/kernel/smp.c
>> index 20da354..aa743e9 100644
>> --- a/arch/x86/kernel/smp.c
>> +++ b/arch/x86/kernel/smp.c
>> @@ -117,11 +117,28 @@ static bool smp_no_nmi_ipi = false;
>> */
>> void ticket_spin_lock_wait(arch_spinlock_t *lock, struct __raw_tickets inc)
>> {
>> + __ticket_t head = inc.head, ticket = inc.tail;
>> + __ticket_t waiters_ahead;
>> + unsigned loops;
>> +
>> for (;;) {
>> - cpu_relax();
>> - inc.head = ACCESS_ONCE(lock->tickets.head);
>> + waiters_ahead = ticket - head - 1;
>> + /*
>> + * We are next after the current lock holder. Check often
>> + * to avoid wasting time when the lock is released.
>> + */
>> + if (!waiters_ahead) {
>> + do {
>> + cpu_relax();
>> + } while (ACCESS_ONCE(lock->tickets.head) != ticket);
>> + break;
>> + }
>> + loops = 50 * waiters_ahead;
>> + while (loops--)
>> + cpu_relax();
>>
>> - if (inc.head == inc.tail)
>> + head = ACCESS_ONCE(lock->tickets.head);
>> + if (head == ticket)
>> break;
>> }
>> }
>>
>
> Reviewed-by: Eric Dumazet <edumazet@google.com>
>
> In my tests, I used the following formula :
>
> loops = 50 * ((ticket - head) - 1/2);
>
> or :
>
> delta = ticket - head;
> loops = delay * delta - (delay >> 1);
I suppose that rounding down the delta might result
in more stable results, due to undersleeping less
often.
> (And I didnt use the special :
>
> if (!waiters_ahead) {
> do {
> cpu_relax();
> } while (ACCESS_ONCE(lock->tickets.head) != ticket);
> break;
> }
>
> Because it means this wont help machines with 2 cpus.
>
> (or more generally if there _is_ contention, but with
> one lock holder and one lock waiter)
Machines with 2 CPUs should not need help, because the
cpu_relax() alone gives enough of a pause that the lock
holder can make progress.
It may be interesting to try out your rounding-down of
delta, to see if that makes things better.
--
All rights reversed
* Re: [PATCH 0/5] x86,smp: make ticket spinlock proportional backoff w/ auto tuning
2013-01-08 22:26 [PATCH 0/5] x86,smp: make ticket spinlock proportional backoff w/ auto tuning Rik van Riel
` (4 preceding siblings ...)
2013-01-08 22:32 ` [PATCH 1/5] x86,smp: move waiting on contended ticket lock out of line Rik van Riel
@ 2013-01-09 12:50 ` Raghavendra K T
2013-01-10 2:27 ` Rafael Aquini
2013-01-10 15:19 ` Mike Galbraith
2013-01-10 22:24 ` Chegu Vinod
7 siblings, 1 reply; 27+ messages in thread
From: Raghavendra K T @ 2013-01-09 12:50 UTC (permalink / raw)
To: Rik van Riel
Cc: linux-kernel, aquini, walken, eric.dumazet, lwoodman, jeremy,
Jan Beulich, knoel, chegu_vinod, mingo
On 01/09/2013 03:56 AM, Rik van Riel wrote:
> Many spinlocks are embedded in data structures; having many CPUs
> pounce on the cache line the lock is in will slow down the lock
> holder, and can cause system performance to fall off a cliff.
>
> The paper "Non-scalable locks are dangerous" is a good reference:
>
> http://pdos.csail.mit.edu/papers/linux:lock.pdf
>
> In the Linux kernel, spinlocks are optimized for the case of
> there not being contention. After all, if there is contention,
> the data structure can be improved to reduce or eliminate
> lock contention.
>
> Likewise, the spinlock API should remain simple, and the
> common case of the lock not being contended should remain
> as fast as ever.
>
> However, since spinlock contention should be fairly uncommon,
> we can add functionality into the spinlock slow path that keeps
> system performance from falling off a cliff when there is lock
> contention.
>
> Proportional delay in ticket locks is delaying the time between
> checking the ticket based on a delay factor, and the number of
> CPUs ahead of us in the queue for this lock. Checking the lock
> less often allows the lock holder to continue running, resulting
> in better throughput and preventing performance from dropping
> off a cliff.
>
> The test case has a number of threads locking and unlocking a
> semaphore. With just one thread, everything sits in the CPU
> cache and throughput is around 2.6 million operations per
> second, with a 5-10% variation.
>
> Once a second thread gets involved, data structures bounce
> from CPU to CPU, and performance deteriorates to about 1.25
> million operations per second, with a 5-10% variation.
>
> However, as more and more threads get added to the mix,
> performance with the vanilla kernel continues to deteriorate.
> Once I hit 24 threads, on a 24 CPU, 4 node test system,
> performance is down to about 290k operations/second.
>
> With a proportional backoff delay added to the spinlock
> code, performance with 24 threads goes up to about 400k
> operations/second with a 50x delay, and about 900k operations/second
> with a 250x delay. However, with a 250x delay, performance with
> 2-5 threads is worse than with a 50x delay.
>
> Making the code auto-tune the delay factor results in a system
> that performs well with both light and heavy lock contention,
> and should also protect against the (likely) case of the fixed
> delay factor being wrong for other hardware.
>
> The attached graph shows the performance of the multi threaded
> semaphore lock/unlock test case, with 1-24 threads, on the
> vanilla kernel, with 10x, 50x, and 250x proportional delay,
> as well as the v1 patch series with autotuning for 2x and 2.7x
> spinning before the lock is obtained, and with the v2 series.
>
> The v2 series integrates several ideas from Michel Lespinasse
> and Eric Dumazet, which should result in better throughput and
> nicer behaviour in situations with contention on multiple locks.
>
> For the v3 series, I tried out all the ideas suggested by
> Michel. They made perfect sense, but in the end it turned
> out they did not work as well as the simple, aggressive
> "try to make the delay longer" policy I have now. Several
> small bug fixes and cleanups have been integrated.
>
> Performance is within the margin of error of v2, so the graph
> has not been update.
>
> Please let me know if you manage to break this code in any way,
> so I can fix it...
>
The patch series no longer shows the weird behaviour caused by the
underflow (pointed out by Michel) and looks fine.
I ran kernbench on 32 core (mx3850) machine with 3.8-rc2 base.
x base_3.8rc2
+ rik_backoff
N Min Max Median Avg Stddev
x 8 222.977 231.16 227.735 227.388 3.1512986
+ 8 218.75 232.347 229.1035 228.25425 4.2730225
No difference proven at 95.0% confidence
The run did not show much difference. But I believe a spinlock stress
test would have shown the benefit.
I'll start running benchmarks on KVM guests now and come back with a report.
* Re: [PATCH 0/5] x86,smp: make ticket spinlock proportional backoff w/ auto tuning
2013-01-09 12:50 ` [PATCH 0/5] x86,smp: make ticket spinlock proportional backoff w/ auto tuning Raghavendra K T
@ 2013-01-10 2:27 ` Rafael Aquini
2013-01-10 17:36 ` Raghavendra K T
0 siblings, 1 reply; 27+ messages in thread
From: Rafael Aquini @ 2013-01-10 2:27 UTC (permalink / raw)
To: Raghavendra K T
Cc: Rik van Riel, linux-kernel, walken, eric.dumazet, lwoodman,
jeremy, Jan Beulich, knoel, chegu_vinod, mingo
On Wed, Jan 09, 2013 at 06:20:35PM +0530, Raghavendra K T wrote:
> I ran kernbench on 32 core (mx3850) machine with 3.8-rc2 base.
> x base_3.8rc2
> + rik_backoff
> N Min Max Median Avg Stddev
> x 8 222.977 231.16 227.735 227.388 3.1512986
> + 8 218.75 232.347 229.1035 228.25425 4.2730225
> No difference proven at 95.0% confidence
I got similar results on smaller systems (1 socket, dual-cores and quad-cores)
when running Rik's latest series: no big difference for better or for worse,
but I also think Rik's work is meant to address bigger systems with more cores
contending for any given spinlock.
* Re: [PATCH 2/5] x86,smp: proportional backoff for ticket spinlocks
2013-01-08 22:32 ` [PATCH 2/5] x86,smp: proportional backoff for ticket spinlocks Rik van Riel
2013-01-08 22:50 ` Eric Dumazet
@ 2013-01-10 2:30 ` Rafael Aquini
1 sibling, 0 replies; 27+ messages in thread
From: Rafael Aquini @ 2013-01-10 2:30 UTC (permalink / raw)
To: Rik van Riel
Cc: linux-kernel, walken, eric.dumazet, lwoodman, jeremy, Jan Beulich,
knoel, chegu_vinod, raghavendra.kt, mingo
On Tue, Jan 08, 2013 at 05:32:41PM -0500, Rik van Riel wrote:
> Subject: x86,smp: proportional backoff for ticket spinlocks
>
> Simple fixed value proportional backoff for ticket spinlocks.
> By pounding on the cacheline with the spin lock less often,
> bus traffic is reduced. In cases of a data structure with
> embedded spinlock, the lock holder has a better chance of
> making progress.
>
> If we are next in line behind the current holder of the
> lock, we do a fast spin, so as not to waste any time when
> the lock is released.
>
> The number 50 is likely to be wrong for many setups, and
> this patch is mostly to illustrate the concept of proportional
> backup. The next patch automatically tunes the delay value.
>
> Signed-off-by: Rik van Riel <riel@redhat.com>
> Signed-off-by: Michel Lespinasse <walken@google.com>
> ---
Acked-by: Rafael Aquini <aquini@redhat.com>
> arch/x86/kernel/smp.c | 23 ++++++++++++++++++++---
> 1 files changed, 20 insertions(+), 3 deletions(-)
>
> diff --git a/arch/x86/kernel/smp.c b/arch/x86/kernel/smp.c
> index 20da354..aa743e9 100644
> --- a/arch/x86/kernel/smp.c
> +++ b/arch/x86/kernel/smp.c
> @@ -117,11 +117,28 @@ static bool smp_no_nmi_ipi = false;
> */
> void ticket_spin_lock_wait(arch_spinlock_t *lock, struct __raw_tickets inc)
> {
> + __ticket_t head = inc.head, ticket = inc.tail;
> + __ticket_t waiters_ahead;
> + unsigned loops;
> +
> for (;;) {
> - cpu_relax();
> - inc.head = ACCESS_ONCE(lock->tickets.head);
> + waiters_ahead = ticket - head - 1;
> + /*
> + * We are next after the current lock holder. Check often
> + * to avoid wasting time when the lock is released.
> + */
> + if (!waiters_ahead) {
> + do {
> + cpu_relax();
> + } while (ACCESS_ONCE(lock->tickets.head) != ticket);
> + break;
> + }
> + loops = 50 * waiters_ahead;
> + while (loops--)
> + cpu_relax();
>
> - if (inc.head == inc.tail)
> + head = ACCESS_ONCE(lock->tickets.head);
> + if (head == ticket)
> break;
> }
> }
>
* Re: [PATCH 3/5] x86,smp: auto tune spinlock backoff delay factor
2013-01-08 22:30 ` [PATCH 3/5] x86,smp: auto tune spinlock backoff delay factor Rik van Riel
@ 2013-01-10 3:13 ` Rafael Aquini
2013-01-10 12:49 ` Michel Lespinasse
1 sibling, 0 replies; 27+ messages in thread
From: Rafael Aquini @ 2013-01-10 3:13 UTC (permalink / raw)
To: Rik van Riel
Cc: linux-kernel, walken, eric.dumazet, lwoodman, jeremy, Jan Beulich,
knoel, chegu_vinod, raghavendra.kt, mingo
On Tue, Jan 08, 2013 at 05:30:29PM -0500, Rik van Riel wrote:
> Many spinlocks are embedded in data structures; having many CPUs
> pounce on the cache line the lock is in will slow down the lock
> holder, and can cause system performance to fall off a cliff.
>
> The paper "Non-scalable locks are dangerous" is a good reference:
>
> http://pdos.csail.mit.edu/papers/linux:lock.pdf
>
> In the Linux kernel, spinlocks are optimized for the case of
> there not being contention. After all, if there is contention,
> the data structure can be improved to reduce or eliminate
> lock contention.
>
> Likewise, the spinlock API should remain simple, and the
> common case of the lock not being contended should remain
> as fast as ever.
>
> However, since spinlock contention should be fairly uncommon,
> we can add functionality into the spinlock slow path that keeps
> system performance from falling off a cliff when there is lock
> contention.
>
> Proportional delay in ticket locks is delaying the time between
> checking the ticket based on a delay factor, and the number of
> CPUs ahead of us in the queue for this lock. Checking the lock
> less often allows the lock holder to continue running, resulting
> in better throughput and preventing performance from dropping
> off a cliff.
>
> Proportional spinlock delay with a high delay factor works well
> when there is lots contention on a lock. Likewise, a smaller
> delay factor works well when a lock is lightly contended.
>
> Making the code auto-tune the delay factor results in a system
> that performs well with both light and heavy lock contention.
>
> Signed-off-by: Rik van Riel <riel@redhat.com>
> ---
> v3: use fixed-point math for the delay calculations, suggested by Michel Lespinasse
>
Acked-by: Rafael Aquini <aquini@redhat.com>
> arch/x86/kernel/smp.c | 43 +++++++++++++++++++++++++++++++++++++++----
> 1 files changed, 39 insertions(+), 4 deletions(-)
>
> diff --git a/arch/x86/kernel/smp.c b/arch/x86/kernel/smp.c
> index aa743e9..05f828b 100644
> --- a/arch/x86/kernel/smp.c
> +++ b/arch/x86/kernel/smp.c
> @@ -113,13 +113,34 @@ static atomic_t stopping_cpu = ATOMIC_INIT(-1);
> static bool smp_no_nmi_ipi = false;
>
> /*
> - * Wait on a congested ticket spinlock.
> + * Wait on a congested ticket spinlock. Many spinlocks are embedded in
> + * data structures; having many CPUs pounce on the cache line with the
> + * spinlock simultaneously can slow down the lock holder, and the system
> + * as a whole.
> + *
> + * To prevent total performance collapse in case of bad spinlock contention,
> + * perform proportional backoff. The per-cpu value of delay is automatically
> + * tuned to limit the number of times spinning CPUs poll the lock before
> + * obtaining it. This limits the amount of cross-CPU traffic required to obtain
> + * a spinlock, and keeps system performance from dropping off a cliff.
> + *
> + * There is a tradeoff. If we poll too often, the whole system is slowed
> + * down. If we sleep too long, the lock will go unused for a period of
> + * time. The solution is to go for a fast spin if we are at the head of
> + * the queue, to slowly increase the delay if we sleep for too short a
> + * time, and to decrease the delay if we slept for too long.
> */
> +#define DELAY_SHIFT 8
> +#define DELAY_FIXED_1 (1<<DELAY_SHIFT)
> +#define MIN_SPINLOCK_DELAY (1 * DELAY_FIXED_1)
> +#define MAX_SPINLOCK_DELAY (16000 * DELAY_FIXED_1)
> +DEFINE_PER_CPU(unsigned, spinlock_delay) = { MIN_SPINLOCK_DELAY };
> void ticket_spin_lock_wait(arch_spinlock_t *lock, struct __raw_tickets inc)
> {
> __ticket_t head = inc.head, ticket = inc.tail;
> __ticket_t waiters_ahead;
> - unsigned loops;
> + unsigned delay = __this_cpu_read(spinlock_delay);
> + unsigned loops = 1;
>
> for (;;) {
> waiters_ahead = ticket - head - 1;
> @@ -133,14 +154,28 @@ void ticket_spin_lock_wait(arch_spinlock_t *lock, struct __raw_tickets inc)
> } while (ACCESS_ONCE(lock->tickets.head) != ticket);
> break;
> }
> - loops = 50 * waiters_ahead;
> +
> + /* Aggressively increase delay, to minimize lock accesses. */
> + if (delay < MAX_SPINLOCK_DELAY)
> + delay += DELAY_FIXED_1 / 7;
> +
> + loops = (delay * waiters_ahead) >> DELAY_SHIFT;
> while (loops--)
> cpu_relax();
>
> head = ACCESS_ONCE(lock->tickets.head);
> - if (head == ticket)
> + if (head == ticket) {
> + /*
> + * We overslept, and do not know by how.
> + * Exponentially decay the value of delay,
> + * to get it back to a good value quickly.
> + */
> + if (delay >= 2 * DELAY_FIXED_1)
> + delay -= max(delay/32, DELAY_FIXED_1);
> break;
> + }
> }
> + __this_cpu_write(spinlock_delay, delay);
> }
>
> /*
>
* Re: [PATCH 4/5] x86,smp: keep spinlock delay values per hashed spinlock address
2013-01-08 22:31 ` [PATCH 4/5] x86,smp: keep spinlock delay values per hashed spinlock address Rik van Riel
@ 2013-01-10 3:14 ` Rafael Aquini
2013-01-10 13:01 ` Michel Lespinasse
1 sibling, 0 replies; 27+ messages in thread
From: Rafael Aquini @ 2013-01-10 3:14 UTC (permalink / raw)
To: Rik van Riel
Cc: linux-kernel, walken, eric.dumazet, lwoodman, jeremy, Jan Beulich,
knoel, chegu_vinod, raghavendra.kt, mingo
On Tue, Jan 08, 2013 at 05:31:19PM -0500, Rik van Riel wrote:
> From: Eric Dumazet <eric.dumazet@gmail.com>
>
> Eric Dumazet found a regression with the first version of the spinlock
> backoff code, in a workload where multiple spinlocks were contended,
> each having a different wait time.
>
> This patch has multiple delay values per cpu, indexed on a hash
> of the lock address, to avoid that problem.
>
> Eric Dumazet wrote:
>
> I did some tests with your patches with following configuration :
>
> tc qdisc add dev eth0 root htb r2q 1000 default 3
> (to force a contention on qdisc lock, even with a multi queue net
> device)
>
> and 24 concurrent "netperf -t UDP_STREAM -H other_machine -- -m 128"
>
> Machine : 2 Intel(R) Xeon(R) CPU X5660 @ 2.80GHz
> (24 threads), and a fast NIC (10Gbps)
>
> Resulting in a 13 % regression (676 Mbits -> 595 Mbits)
>
> In this workload we have at least two contended spinlocks, with
> different delays. (spinlocks are not held for the same duration)
>
> It clearly defeats your assumption of a single per cpu delay being OK :
> Some cpus are spinning too long while the lock was released.
>
> We might try to use a hash on lock address, and an array of 16 different
> delays so that different spinlocks have a chance of not sharing the same
> delay.
>
> With following patch, I get 982 Mbits/s with same bench, so an increase
> of 45 % instead of a 13 % regression.
>
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
> Signed-off-by: Rik van Riel <riel@redhat.com>
> ---
Acked-by: Rafael Aquini <aquini@redhat.com>
> arch/x86/kernel/smp.c | 22 +++++++++++++++++++---
> 1 files changed, 19 insertions(+), 3 deletions(-)
>
> diff --git a/arch/x86/kernel/smp.c b/arch/x86/kernel/smp.c
> index 05f828b..1877890 100644
> --- a/arch/x86/kernel/smp.c
> +++ b/arch/x86/kernel/smp.c
> @@ -23,6 +23,7 @@
> #include <linux/interrupt.h>
> #include <linux/cpu.h>
> #include <linux/gfp.h>
> +#include <linux/hash.h>
>
> #include <asm/mtrr.h>
> #include <asm/tlbflush.h>
> @@ -134,12 +135,26 @@ static bool smp_no_nmi_ipi = false;
> #define DELAY_FIXED_1 (1<<DELAY_SHIFT)
> #define MIN_SPINLOCK_DELAY (1 * DELAY_FIXED_1)
> #define MAX_SPINLOCK_DELAY (16000 * DELAY_FIXED_1)
> -DEFINE_PER_CPU(unsigned, spinlock_delay) = { MIN_SPINLOCK_DELAY };
> +#define DELAY_HASH_SHIFT 6
> +struct delay_entry {
> + u32 hash;
> + u32 delay;
> +};
> +static DEFINE_PER_CPU(struct delay_entry [1 << DELAY_HASH_SHIFT], spinlock_delay) = {
> + [0 ... (1 << DELAY_HASH_SHIFT) - 1] = {
> + .hash = 0,
> + .delay = MIN_SPINLOCK_DELAY,
> + },
> +};
> +
> void ticket_spin_lock_wait(arch_spinlock_t *lock, struct __raw_tickets inc)
> {
> __ticket_t head = inc.head, ticket = inc.tail;
> __ticket_t waiters_ahead;
> - unsigned delay = __this_cpu_read(spinlock_delay);
> + u32 hash = hash32_ptr(lock);
> + u32 slot = hash_32(hash, DELAY_HASH_SHIFT);
> + struct delay_entry *ent = &__get_cpu_var(spinlock_delay[slot]);
> + u32 delay = (ent->hash == hash) ? ent->delay : MIN_SPINLOCK_DELAY;
> unsigned loops = 1;
>
> for (;;) {
> @@ -175,7 +190,8 @@ void ticket_spin_lock_wait(arch_spinlock_t *lock, struct __raw_tickets inc)
> break;
> }
> }
> - __this_cpu_write(spinlock_delay, delay);
> + ent->hash = hash;
> + ent->delay = delay;
> }
>
> /*
>
* Re: [PATCH 3/5] x86,smp: auto tune spinlock backoff delay factor
2013-01-08 22:30 ` [PATCH 3/5] x86,smp: auto tune spinlock backoff delay factor Rik van Riel
2013-01-10 3:13 ` Rafael Aquini
@ 2013-01-10 12:49 ` Michel Lespinasse
1 sibling, 0 replies; 27+ messages in thread
From: Michel Lespinasse @ 2013-01-10 12:49 UTC (permalink / raw)
To: Rik van Riel
Cc: linux-kernel, aquini, eric.dumazet, lwoodman, jeremy, Jan Beulich,
knoel, chegu_vinod, raghavendra.kt, mingo
On Tue, Jan 8, 2013 at 2:30 PM, Rik van Riel <riel@redhat.com> wrote:
> v3: use fixed-point math for the delay calculations, suggested by Michel Lespinasse
>
> - if (head == ticket)
> + if (head == ticket) {
> + /*
> + * We overslept, and do not know by how.
> + * Exponentially decay the value of delay,
> + * to get it back to a good value quickly.
> + */
> + if (delay >= 2 * DELAY_FIXED_1)
> + delay -= max(delay/32, DELAY_FIXED_1);
> break;
> + }
I think delay -= (delay - MIN_SPINLOCK_DELAY) / 32 would work too, and
avoid the conditions here.
It won't make a difference in practice because delay > 32 *
DELAY_FIXED_1 on all machines I tested (even for the shortest possible
do-nothing spinlocked regions)
Also - I think we could make MIN_SPINLOCK_DELAY a tunable. There is a
limit to how fast the lock cacheline can bounce from one waiter to the
next, and we know that this limit isn't DELAY_FIXED_1. Using
DELAY_FIXED_1 is OK when we don't know of a better lower bound for a
given system, but I think it'd be nice if one could set the correct
value if they know it.
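A minimal sketch of what such a tunable could look like, assuming
MIN_SPINLOCK_DELAY were turned into a variable (not part of the posted
series, purely illustrative):

	#include <linux/moduleparam.h>

	/* Illustrative only: expose the lower bound for the per-cpu delay,
	 * settable on the kernel command line or via sysfs. */
	static unsigned int min_spinlock_delay = 1 * DELAY_FIXED_1;
	module_param(min_spinlock_delay, uint, 0644);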
Looking at the bigger picture:
The tests I've done haven't shown any regressions over baseline. Your
v1 proposal had a tendency to oversleep but this has been fixed in v2
and v3. The only risk I can foresee, and I think this hasn't been
tested enough, would be that the algorithm might oversleep when it
encounters a mix of short and long held spinlocks. I think the proper
behavior when we encounter this is that we want to be conservative and
use the shortest encountered spinlock as our delay in that case. There
is a strong theoretical case that additive increase/exponential decay
should achieve that, but I'd like to see it tested in practice.
Trying to lay down the argument why additive increase / exponential
decay should work here:
- When we oversleep, we spin (waiters_ahead * delay / DELAY_FIXED_1)
iterations (with waiters_ahead < NR_CPUS) and update new_delay = delay
* (31 / 32)
- If we keep oversleeping, delay will converge towards 0 so that the
sum of all overslept delay values will be original_delay *
sum(i=0...infinity, (31/32)^i) which is 32 * original_delay. So, the
total number of oversleeping iterations is bounded by original_delay *
(NR_CPUS * 32 / DELAY_FIXED_1)
- Each increase of delay by DELAY_FIXED_1 / 7 can thus be seen as
having a potential worst-case future cost of (NR_CPUS * 32 / 7)
oversleeping iterations.
I like that this last value is at least bounded, but it's still high
enough that I think actual behavior in the face of varying hold
durations should be investigated.
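For example, on a 32-CPU machine that bound works out to roughly
32 * 32 / 7, i.e. about 146 oversleeping iterations per DELAY_FIXED_1/7
increment, under the assumptions above.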
As for test results:
running 32 instances of netperf -t UDP_STREAM -H <server> -- -m 128 on
a 32 hypercores machine (2 sandybridge sockets) and htb root qdisc:
I am seeing ~650 to ~680 Mbps total with baseline 3.7, and ~860 to
~890 Mbps total with patches 1-3 applied.
(patch 4 makes the results a bit more consistent, in the ~880 to ~890
Mbps range)
Reviewed-by: Michel Lespinasse <walken@google.com>
--
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.
* Re: [PATCH 4/5] x86,smp: keep spinlock delay values per hashed spinlock address
2013-01-08 22:31 ` [PATCH 4/5] x86,smp: keep spinlock delay values per hashed spinlock address Rik van Riel
2013-01-10 3:14 ` Rafael Aquini
@ 2013-01-10 13:01 ` Michel Lespinasse
2013-01-10 13:05 ` Rik van Riel
1 sibling, 1 reply; 27+ messages in thread
From: Michel Lespinasse @ 2013-01-10 13:01 UTC (permalink / raw)
To: Rik van Riel
Cc: linux-kernel, aquini, eric.dumazet, lwoodman, jeremy, Jan Beulich,
knoel, chegu_vinod, raghavendra.kt, mingo
On Tue, Jan 8, 2013 at 2:31 PM, Rik van Riel <riel@redhat.com> wrote:
> From: Eric Dumazet <eric.dumazet@gmail.com>
>
> Eric Dumazet found a regression with the first version of the spinlock
> backoff code, in a workload where multiple spinlocks were contended,
> each having a different wait time.
>
> This patch has multiple delay values per cpu, indexed on a hash
> of the lock address, to avoid that problem.
>
> Eric Dumazet wrote:
>
> I did some tests with your patches with following configuration :
>
> tc qdisc add dev eth0 root htb r2q 1000 default 3
> (to force a contention on qdisc lock, even with a multi queue net
> device)
>
> and 24 concurrent "netperf -t UDP_STREAM -H other_machine -- -m 128"
>
> Machine : 2 Intel(R) Xeon(R) CPU X5660 @ 2.80GHz
> (24 threads), and a fast NIC (10Gbps)
>
> Resulting in a 13 % regression (676 Mbits -> 595 Mbits)
>
> In this workload we have at least two contended spinlocks, with
> different delays. (spinlocks are not held for the same duration)
>
> It clearly defeats your assumption of a single per cpu delay being OK :
> Some cpus are spinning too long while the lock was released.
>
> We might try to use a hash on lock address, and an array of 16 different
> delays so that different spinlocks have a chance of not sharing the same
> delay.
>
> With following patch, I get 982 Mbits/s with same bench, so an increase
> of 45 % instead of a 13 % regression.
Note that these results were with your v1 proposal. With v3 proposal,
on a slightly different machine (2 socket sandybridge) with a similar
NIC, I am not seeing the regression when not using the hash table. I
think this is because v3 got more conservative about mixed spinlock
hold times, and converges towards the shortest of the hold times in
that case.
> - unsigned delay = __this_cpu_read(spinlock_delay);
> + u32 hash = hash32_ptr(lock);
> + u32 slot = hash_32(hash, DELAY_HASH_SHIFT);
> + struct delay_entry *ent = &__get_cpu_var(spinlock_delay[slot]);
> + u32 delay = (ent->hash == hash) ? ent->delay : MIN_SPINLOCK_DELAY;
I think we should actually always use ent->delay here. My reasoning is
that I think the base algorithm (from the previous patch) should have
robustness against mixed spinlock hold times. We can't rely on
detecting hash collisions to protect us against varying hold times,
because this case could happen even with a single spinlock. So we need
to make sure the base algorithm is robust and converges towards using
the shorter of the spinlock hold times; if we have that then forcing a
reset to MIN_SPINLOCK_DELAY only hurts since it reduces the delay
below what is otherwise necessary.
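Concretely, the initialisation would become something like (untested,
just the shape of the suggestion):

	u32 delay = ent->delay;	/* trust whatever the slot last converged to */

instead of resetting to MIN_SPINLOCK_DELAY on a hash mismatch.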
Other than that:
Reviewed-by: Michel Lespinasse <walken@google.com>
--
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.
* Re: [PATCH 4/5] x86,smp: keep spinlock delay values per hashed spinlock address
2013-01-10 13:01 ` Michel Lespinasse
@ 2013-01-10 13:05 ` Rik van Riel
2013-01-10 13:15 ` Michel Lespinasse
0 siblings, 1 reply; 27+ messages in thread
From: Rik van Riel @ 2013-01-10 13:05 UTC (permalink / raw)
To: Michel Lespinasse
Cc: linux-kernel, aquini, eric.dumazet, lwoodman, jeremy, Jan Beulich,
knoel, chegu_vinod, raghavendra.kt, mingo
On 01/10/2013 08:01 AM, Michel Lespinasse wrote:
> On Tue, Jan 8, 2013 at 2:31 PM, Rik van Riel <riel@redhat.com> wrote:
>> From: Eric Dumazet <eric.dumazet@gmail.com>
>>
>> Eric Dumazet found a regression with the first version of the spinlock
>> backoff code, in a workload where multiple spinlocks were contended,
>> each having a different wait time.
>>
>> This patch has multiple delay values per cpu, indexed on a hash
>> of the lock address, to avoid that problem.
>>
>> Eric Dumazet wrote:
>>
>> I did some tests with your patches with following configuration :
>>
>> tc qdisc add dev eth0 root htb r2q 1000 default 3
>> (to force a contention on qdisc lock, even with a multi queue net
>> device)
>>
>> and 24 concurrent "netperf -t UDP_STREAM -H other_machine -- -m 128"
>>
>> Machine : 2 Intel(R) Xeon(R) CPU X5660 @ 2.80GHz
>> (24 threads), and a fast NIC (10Gbps)
>>
>> Resulting in a 13 % regression (676 Mbits -> 595 Mbits)
>>
>> In this workload we have at least two contended spinlocks, with
>> different delays. (spinlocks are not held for the same duration)
>>
>> It clearly defeats your assumption of a single per cpu delay being OK :
>> Some cpus are spinning too long while the lock was released.
>>
>> We might try to use a hash on lock address, and an array of 16 different
>> delays so that different spinlocks have a chance of not sharing the same
>> delay.
>>
>> With following patch, I get 982 Mbits/s with same bench, so an increase
>> of 45 % instead of a 13 % regression.
>
> Note that these results were with your v1 proposal. With v3 proposal,
> on a slightly different machine (2 socket sandybridge) with a similar
> NIC, I am not seeing the regression when not using the hash table. I
> think this is because v3 got more conservative about mixed spinlock
> hold times, and converges towards the shortest of the hold times in
> that case.
Eric,
with just patches 1-3, can you still reproduce the
regression on your system?
In other words, could we get away with dropping the
complexity of patch 4, or do we still need it?
--
All rights reversed
* Re: [PATCH 4/5] x86,smp: keep spinlock delay values per hashed spinlock address
2013-01-10 13:05 ` Rik van Riel
@ 2013-01-10 13:15 ` Michel Lespinasse
0 siblings, 0 replies; 27+ messages in thread
From: Michel Lespinasse @ 2013-01-10 13:15 UTC (permalink / raw)
To: Rik van Riel
Cc: linux-kernel, aquini, eric.dumazet, lwoodman, jeremy, Jan Beulich,
knoel, chegu_vinod, raghavendra.kt, mingo
On Thu, Jan 10, 2013 at 5:05 AM, Rik van Riel <riel@redhat.com> wrote:
> Eric,
>
> with just patches 1-3, can you still reproduce the
> regression on your system?
>
> In other words, could we get away with dropping the
> complexity of patch 4, or do we still need it?
To be clear, I must say that I'm not opposing patch 4 per se. I think
we should not rely on it to avoid regressions, as patch 3 needs to be
robust enough to do that on its own. However, it may very well be that
having different constants for each lock (or for each hash bucket as a
proxy) helps - if lock B has a consistently longer hold time than lock
A, having them in separate hash buckets will allow us to use optimal
tunings for both, but if they collide or if we don't have a hash
table, we'll use a delay that is close to A's value for both, which is
safe (shouldn't introduce regressions) but not optimal.
In other words, I really don't want us to depend on the hash table for
robustness but I think it's fine to have it for extra performance (as
it's actually very short)
--
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.
* Re: [PATCH 0/5] x86,smp: make ticket spinlock proportional backoff w/ auto tuning
2013-01-08 22:26 [PATCH 0/5] x86,smp: make ticket spinlock proportional backoff w/ auto tuning Rik van Riel
` (5 preceding siblings ...)
2013-01-09 12:50 ` [PATCH 0/5] x86,smp: make ticket spinlock proportional backoff w/ auto tuning Raghavendra K T
@ 2013-01-10 15:19 ` Mike Galbraith
2013-01-10 15:31 ` Rik van Riel
2013-01-10 22:24 ` Chegu Vinod
7 siblings, 1 reply; 27+ messages in thread
From: Mike Galbraith @ 2013-01-10 15:19 UTC (permalink / raw)
To: Rik van Riel
Cc: linux-kernel, aquini, walken, eric.dumazet, lwoodman, jeremy,
Jan Beulich, knoel, chegu_vinod, raghavendra.kt, mingo
On Tue, 2013-01-08 at 17:26 -0500, Rik van Riel wrote:
> Please let me know if you manage to break this code in any way,
> so I can fix it...
I didn't break it, but did let it play with rq->lock contention. Using
cyclictest -Smp99 -i 100 -d 0, with 3 rt tasks for pull_rt_task() to
pull around appears to have been a ~dead heat.
3.6.11 3.6.11-spinlock
PerfTop: 78852 irqs/sec kernel:96.4% exact: 0.0% [1000Hz cycles], (all, 80 CPUs)
-------------------------------------------------------------------------------------------------
samples pcnt function samples pcnt function
_______ _____ ___________________________ _______ _____ ___________________________
468341.00 52.0% cpupri_set 471786.00 52.0% cpupri_set
110259.00 12.2% _raw_spin_lock 88963.00 9.8% ticket_spin_lock_wait
78863.00 8.8% native_write_msr_safe 77109.00 8.5% native_write_msr_safe
42882.00 4.8% __schedule 48858.00 5.4% native_write_cr0
40930.00 4.5% native_write_cr0 47038.00 5.2% __schedule
13718.00 1.5% finish_task_switch 24775.00 2.7% _raw_spin_lock
13188.00 1.5% plist_del 13117.00 1.4% plist_del
13078.00 1.5% _raw_spin_lock_irqsave 12372.00 1.4% ttwu_do_wakeup
12083.00 1.3% ttwu_do_wakeup 11553.00 1.3% _raw_spin_lock_irqsave
8359.00 0.9% pull_rt_task 8186.00 0.9% pull_rt_task
6979.00 0.8% apic_timer_interrupt 7989.00 0.9% finish_task_switch
4623.00 0.5% __enqueue_rt_entity 6430.00 0.7% apic_timer_interrupt
3961.00 0.4% resched_task 4721.00 0.5% resched_task
3942.00 0.4% __switch_to 4109.00 0.5% __switch_to
3128.00 0.3% _raw_spin_trylock 2917.00 0.3% rcu_idle_exit_common
3081.00 0.3% __tick_nohz_idle_enter 2897.00 0.3% __local_bh_enable
2561.00 0.3% update_curr_rt 2873.00 0.3% _raw_spin_trylock
2385.00 0.3% _raw_spin_lock_irq 2674.00 0.3% __enqueue_rt_entity
2190.00 0.2% __local_bh_enable 2434.00 0.3% update_curr_rt
1904.00 0.2% rcu_idle_exit_common 2161.00 0.2% hrtimer_interrupt
1870.00 0.2% clockevents_program_event 2106.00 0.2% ktime_get_update_offsets
1828.00 0.2% hrtimer_interrupt 1766.00 0.2% tick_nohz_idle_exit
1741.00 0.2% do_nanosleep 1608.00 0.2% __tick_nohz_idle_enter
1681.00 0.2% sys_clock_nanosleep 1437.00 0.2% do_nanosleep
1639.00 0.2% pick_next_task_rt 1428.00 0.2% hrtimer_init
1630.00 0.2% pick_next_task_stop 1320.00 0.1% sched_clock_idle_sleep_event
1535.00 0.2% _raw_spin_unlock_irqrestore 1290.00 0.1% sys_clock_nanosleep
* Re: [PATCH 0/5] x86,smp: make ticket spinlock proportional backoff w/ auto tuning
2013-01-10 15:19 ` Mike Galbraith
@ 2013-01-10 15:31 ` Rik van Riel
2013-01-10 19:30 ` Mike Galbraith
0 siblings, 1 reply; 27+ messages in thread
From: Rik van Riel @ 2013-01-10 15:31 UTC (permalink / raw)
To: Mike Galbraith
Cc: linux-kernel, aquini, walken, eric.dumazet, lwoodman, jeremy,
Jan Beulich, knoel, chegu_vinod, raghavendra.kt, mingo
On 01/10/2013 10:19 AM, Mike Galbraith wrote:
> On Tue, 2013-01-08 at 17:26 -0500, Rik van Riel wrote:
>
>> Please let me know if you manage to break this code in any way,
>> so I can fix it...
>
> I didn't break it, but did let it play with rq->lock contention. Using
> cyclictest -Smp99 -i 100 -d 0, with 3 rt tasks for pull_rt_task() to
> pull around appears to have been a ~dead heat.
Good to hear that the code seems to be robust. It seems to
help prevent performance degradation in some workloads, and
nobody seems to have found regressions yet.
Thank you for testing.
* Re: [PATCH 0/5] x86,smp: make ticket spinlock proportional backoff w/ auto tuning
2013-01-10 2:27 ` Rafael Aquini
@ 2013-01-10 17:36 ` Raghavendra K T
2013-01-11 20:11 ` Rik van Riel
0 siblings, 1 reply; 27+ messages in thread
From: Raghavendra K T @ 2013-01-10 17:36 UTC (permalink / raw)
To: Rafael Aquini, Rik van Riel
Cc: Raghavendra K T, linux-kernel, walken, eric.dumazet, lwoodman,
jeremy, Jan Beulich, knoel, chegu_vinod, mingo
* Rafael Aquini <aquini@redhat.com> [2013-01-10 00:27:23]:
> On Wed, Jan 09, 2013 at 06:20:35PM +0530, Raghavendra K T wrote:
> > I ran kernbench on 32 core (mx3850) machine with 3.8-rc2 base.
> > x base_3.8rc2
> > + rik_backoff
> > N Min Max Median Avg Stddev
> > x 8 222.977 231.16 227.735 227.388 3.1512986
> > + 8 218.75 232.347 229.1035 228.25425 4.2730225
> > No difference proven at 95.0% confidence
>
> I got similar results on smaller systems (1 socket, dual-cores and quad-cores)
> when running Rik's latest series, no big difference for good nor for worse,
> but I also think Rik's work is meant to address bigger systems with more cores
> contending for any given spinlock.
I was able to do the test on the same 32-core machine with
4 guests (8GB RAM, 32 vcpu).
Here are the results:
base = 3.8-rc2
patched = base + Rik V3 backoff series [patch 1-4]
+-----------+-----------+-----------+------------+-----------+
              kernbench (sec, lower is better)
+-----------+-----------+-----------+------------+-----------+
      base       stdev     patched       stdev     %improve
+-----------+-----------+-----------+------------+-----------+
   44.3000      2.0404     46.7928      1.7518     -5.62709
   94.8262      5.1444    102.4737      7.8406     -8.06475
  156.0540     14.5797    167.6888      9.7110     -7.45562
  202.3225     15.8906    213.1435     17.1778     -5.34839
+-----------+-----------+-----------+------------+-----------+

+-----------+-----------+-----------+------------+-----------+
              sysbench (sec, lower is better)
+-----------+-----------+-----------+------------+-----------+
      base       stdev     patched       stdev     %improve
+-----------+-----------+-----------+------------+-----------+
   16.8512      0.4164     17.7301      0.3761     -5.21565
   13.0411      0.4115     12.9380      0.1554      0.79058
   18.4573      0.2123     18.4662      0.2005     -0.04822
   24.2021      0.1713     24.3690      0.3270     -0.68961
+-----------+-----------+-----------+------------+-----------+

+-----------+-----------+-----------+------------+-----------+
              ebizzy (records/sec, higher is better)
+-----------+-----------+-----------+------------+-----------+
      base       stdev     patched       stdev     %improve
+-----------+-----------+-----------+------------+-----------+
 2494.4000     27.5447   2400.6000     83.4255     -3.76042
 2636.6000    302.9658   2757.5000    147.5137      4.58545
 2236.8333    239.6531   2131.6667    156.1534     -4.70158
 1768.8750    142.5437   1901.3750    295.2147      7.49064
+-----------+-----------+-----------+------------+-----------+

+-----------+-----------+-----------+------------+-----------+
              dbench (throughput in MB/sec, higher is better)
+-----------+-----------+-----------+------------+-----------+
      base       stdev     patched       stdev     %improve
+-----------+-----------+-----------+------------+-----------+
10076.9180   2410.9655   5870.7460   4297.4532      xxxxxxx
 2152.5220     88.2853   1517.8270     61.9742    -29.48611
 1334.9608     34.3247   1078.4275     38.2288    -19.21654
  946.6355     32.0426    753.0757     25.5302    -20.44713
+-----------+-----------+-----------+------------+-----------+
Please note that I have reported the dbench 1x result as xxxxxxx, since I
observed very high variance in that result.
* Re: [PATCH 1/5] x86,smp: move waiting on contended ticket lock out of line
2013-01-08 22:32 ` [PATCH 1/5] x86,smp: move waiting on contended ticket lock out of line Rik van Riel
2013-01-08 22:43 ` Eric Dumazet
@ 2013-01-10 17:38 ` Raghavendra K T
1 sibling, 0 replies; 27+ messages in thread
From: Raghavendra K T @ 2013-01-10 17:38 UTC (permalink / raw)
To: Rik van Riel
Cc: linux-kernel, aquini, walken, eric.dumazet, lwoodman, jeremy,
Jan Beulich, knoel, chegu_vinod, raghavendra.kt, mingo
* Rik van Riel <riel@redhat.com> [2013-01-08 17:32:52]:
> Moving the wait loop for contended locks to its own function allows
> us to add things to that wait loop, without growing the size of the
> kernel text appreciably.
>
> Signed-off-by: Rik van Riel <riel@redhat.com>
> Reviewed-by: Steven Rostedt <rostedt@goodmiss.org>
> Reviewed-by: Michel Lespinasse <walken@google.com>
> Reviewed-by: Rafael Aquini <aquini@redhat.com>
Reviewed-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
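For readers following the thread, the shape of the change described in
the quoted patch is roughly the sketch below. This is a simplified
illustration, not the actual patch: the types follow the x86 ticket
lock code of that era, and the proportional delay added by the later
patches is omitted.

/* arch/x86/include/asm/spinlock.h: the fast path stays inline */
extern void ticket_spin_lock_wait(arch_spinlock_t *lock,
				  struct __raw_tickets inc);

static __always_inline void __ticket_spin_lock(arch_spinlock_t *lock)
{
	register struct __raw_tickets inc = { .tail = 1 };

	inc = xadd(&lock->tickets, inc);
	if (inc.head != inc.tail)
		/* contended: wait out of line, where the loop can grow */
		ticket_spin_lock_wait(lock, inc);

	barrier();	/* keep the critical section from leaking upward */
}

/* arch/x86/kernel/smp.c: the out-of-line slow path */
void ticket_spin_lock_wait(arch_spinlock_t *lock, struct __raw_tickets inc)
{
	for (;;) {
		cpu_relax();
		inc.head = ACCESS_ONCE(lock->tickets.head);
		if (inc.head == inc.tail)
			break;
	}
}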
* Re: [PATCH 0/5] x86,smp: make ticket spinlock proportional backoff w/ auto tuning
2013-01-10 15:31 ` Rik van Riel
@ 2013-01-10 19:30 ` Mike Galbraith
2013-01-24 13:28 ` Ingo Molnar
0 siblings, 1 reply; 27+ messages in thread
From: Mike Galbraith @ 2013-01-10 19:30 UTC (permalink / raw)
To: Rik van Riel
Cc: linux-kernel, aquini, walken, eric.dumazet, lwoodman, jeremy,
Jan Beulich, knoel, chegu_vinod, raghavendra.kt, mingo
On Thu, 2013-01-10 at 10:31 -0500, Rik van Riel wrote:
> On 01/10/2013 10:19 AM, Mike Galbraith wrote:
> > On Tue, 2013-01-08 at 17:26 -0500, Rik van Riel wrote:
> >
> >> Please let me know if you manage to break this code in any way,
> >> so I can fix it...
> >
> > I didn't break it, but did let it play with rq->lock contention. Using
> > cyclictest -Smp99 -i 100 -d 0, with 3 rt tasks for pull_rt_task() to
> > pull around appears to have been a ~dead heat.
>
> Good to hear that the code seems to be robust. It seems to
> help prevent performance degradation in some workloads, and
> nobody seems to have found regressions yet.
I had hoped for a bit of a positive result, but a wash isn't surprising
given the profile. I tried tbench too; I didn't expect to see anything at
all there, and that is what I got, so both results are positive in that
respect.
-Mike
* Re: [PATCH 0/5] x86,smp: make ticket spinlock proportional backoff w/ auto tuning
2013-01-08 22:26 [PATCH 0/5] x86,smp: make ticket spinlock proportional backoff w/ auto tuning Rik van Riel
` (6 preceding siblings ...)
2013-01-10 15:19 ` Mike Galbraith
@ 2013-01-10 22:24 ` Chegu Vinod
7 siblings, 0 replies; 27+ messages in thread
From: Chegu Vinod @ 2013-01-10 22:24 UTC (permalink / raw)
To: Rik van Riel
Cc: linux-kernel, aquini, walken, eric.dumazet, lwoodman, jeremy,
Jan Beulich, knoel, raghavendra.kt, mingo
[-- Attachment #1: Type: text/plain, Size: 976 bytes --]
On 1/8/2013 2:26 PM, Rik van Riel wrote:
<...>
> Performance is within the margin of error of v2, so the graph
> has not been updated.
>
> Please let me know if you manage to break this code in any way,
> so I can fix it...
>
Attached below is some preliminary data with one of the AIM7 micro-benchmark
workloads (i.e. high_systime). This is a kernel-intensive workload which
does tons of forks/execs etc. and stresses quite a few of the same set
of spinlocks and semaphores.
I observed a drop in performance as we go to 40-way and 80-way. I wonder
whether the backoff keeps increasing to such an extent that it actually
starts to hurt, given the nature of this workload? Also, in the 80-way case
I observed quite a bit of variation from run to run...
I also ran it inside a single KVM guest. There were some performance dips,
but interestingly I didn't observe the same level of drop (compared to the
drop in the native case) as the guest size was scaled up to 40 vcpus or
80 vcpus.
FYI
Vinod
[-- Attachment #2: aim7_rik --]
[-- Type: text/plain, Size: 2333 bytes --]
---
Platform : 8 socket (80 Core) Westmere with 1TB RAM.
Workload: AIM7-highsystime microbenchmark - 2000 users & 100 jobs per user.
Values reported are Jobs Per Minute (higher is better). The values
are the average of 3 runs.
1) Native run:
--------------
Config 1: 3.7 kernel
Config 2: 3.7 + Rik's 1-4 patches
------------------------------------------------------------
20way 40way 80way
------------------------------------------------------------
Config 1 ~179K ~159K ~146K
------------------------------------------------------------
Config 2 ~180K ~134K ~21K-43K <- high variation!
------------------------------------------------------------
(Note: used numactl to restrict the workload to
2 sockets (20-way) and 4 sockets (40-way))
------
2) KVM run :
------------
Single guest of different sizes (no overcommit, NUMA enabled in the guest).
Note: This kernel-intensive micro benchmark exposes the PLE handler issue,
especially for large guests. Since Raghu's PLE changes are not yet upstream,
I have just run with the current PLE handler and then with PLE disabled
(ple_gap=0).
Config 1 : Host & Guest at 3.7
Config 2 : Host & Guest are at 3.7 + Rik's 1-4 patches
--------------------------------------------------------------------------
20vcpu/128G 40vcpu/256G 80vcpu/512G
(on 2 sockets) (on 4 sockets) (on 8 sockets)
--------------------------------------------------------------------------
Config 1 ~144K ~39K ~10K
--------------------------------------------------------------------------
Config 2 ~143K ~37.5K ~11K
--------------------------------------------------------------------------
Config 3 : Host & Guest at 3.7 AND ple_gap=0
Config 4 : Host & Guest are at 3.7 + Rik's 1-4 patches AND ple_gap=0
--------------------------------------------------------------------------
Config 3 ~154K ~131K ~116K
--------------------------------------------------------------------------
Config 4 ~151K ~130K ~115K
--------------------------------------------------------------------------
(Note: used numactl to restrict qemu to
2 sockets (20-way) and 4 sockets (40-way))
* Re: [PATCH 0/5] x86,smp: make ticket spinlock proportional backoff w/ auto tuning
2013-01-10 17:36 ` Raghavendra K T
@ 2013-01-11 20:11 ` Rik van Riel
2013-01-13 18:07 ` Raghavendra K T
0 siblings, 1 reply; 27+ messages in thread
From: Rik van Riel @ 2013-01-11 20:11 UTC (permalink / raw)
To: Raghavendra K T
Cc: Rafael Aquini, linux-kernel, walken, eric.dumazet, lwoodman,
jeremy, Jan Beulich, knoel, chegu_vinod, mingo
On 01/10/2013 12:36 PM, Raghavendra K T wrote:
> * Rafael Aquini <aquini@redhat.com> [2013-01-10 00:27:23]:
>
>> On Wed, Jan 09, 2013 at 06:20:35PM +0530, Raghavendra K T wrote:
>>> I ran kernbench on 32 core (mx3850) machine with 3.8-rc2 base.
>>> x base_3.8rc2
>>> + rik_backoff
>>> N Min Max Median Avg Stddev
>>> x 8 222.977 231.16 227.735 227.388 3.1512986
>>> + 8 218.75 232.347 229.1035 228.25425 4.2730225
>>> No difference proven at 95.0% confidence
>>
>> I got similar results on smaller systems (1 socket, dual-cores and quad-cores)
>> when running Rik's latest series, no big difference for good nor for worse,
>> but I also think Rik's work is meant to address bigger systems with more cores
>> contending for any given spinlock.
>
> I was able to do the test on same 32 core machine with
> 4 guests (8GB RAM, 32 vcpu).
> Here are the results
>
> base = 3.8-rc2
> patched = base + Rik V3 backoff series [patch 1-4]
I believe I understand why this is happening.
Modern Intel and AMD CPUs have features called Pause Loop Exiting (PLE)
and Pause Filter (PF), respectively. These features are used to trap to
the host when the guest is spinning on a spinlock.
This allows the host to run something else, by having the spinner
temporarily yield the CPU. Effectively, this already causes the KVM code
to do a limited amount of spinlock backoff in the host.
Adding more backoff code in the guest can lead to wild delays in
acquiring locks, and generally bad performance.
I suspect that when running in a virtual machine, we should limit
the delay factor to something much smaller, since the host will take
care of most of the backoff for us.
Maybe a maximum delay value of ~10 would do the trick for KVM
guests.
We should be able to get this right by placing the value for the
maximum delay in a __read_mostly section and setting it to something
small from an init function when we detect we are running in a
virtual machine.
Let me cook up, and test, a patch that does that...
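A minimal sketch of that idea could look like the code below; the names,
the constants and the initcall level are illustrative assumptions, not
the actual patch:

#include <linux/cache.h>
#include <linux/init.h>
#include <asm/cpufeature.h>

#define MAX_SPINLOCK_DELAY		16000	/* assumed bare-metal cap */
#define MAX_SPINLOCK_DELAY_GUEST	10	/* much smaller cap for guests */

static int __read_mostly max_spinlock_delay = MAX_SPINLOCK_DELAY;

static int __init spinlock_delay_init(void)
{
	/*
	 * X86_FEATURE_HYPERVISOR is set when running as a guest; in that
	 * case PLE/PF in the host already provides most of the backoff,
	 * so clamp the guest-side delay to a small value.
	 */
	if (boot_cpu_has(X86_FEATURE_HYPERVISOR))
		max_spinlock_delay = MAX_SPINLOCK_DELAY_GUEST;
	return 0;
}
early_initcall(spinlock_delay_init);

The auto-tuning code would then never let a per-cpu delay value grow
beyond max_spinlock_delay.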
--
All rights reversed
* Re: [PATCH 0/5] x86,smp: make ticket spinlock proportional backoff w/ auto tuning
2013-01-11 20:11 ` Rik van Riel
@ 2013-01-13 18:07 ` Raghavendra K T
0 siblings, 0 replies; 27+ messages in thread
From: Raghavendra K T @ 2013-01-13 18:07 UTC (permalink / raw)
To: Rik van Riel
Cc: Rafael Aquini, linux-kernel, walken, eric.dumazet, lwoodman,
jeremy, Jan Beulich, knoel, chegu_vinod, mingo
On 01/12/2013 01:41 AM, Rik van Riel wrote:
> On 01/10/2013 12:36 PM, Raghavendra K T wrote:
>> * Rafael Aquini <aquini@redhat.com> [2013-01-10 00:27:23]:
>>
>>> On Wed, Jan 09, 2013 at 06:20:35PM +0530, Raghavendra K T wrote:
>>>> I ran kernbench on 32 core (mx3850) machine with 3.8-rc2 base.
>>>> x base_3.8rc2
>>>> + rik_backoff
>>>>     N      Min       Max      Median        Avg      Stddev
>>>> x   8  222.977   231.16     227.735    227.388    3.1512986
>>>> +   8  218.75    232.347    229.1035   228.25425  4.2730225
>>>> No difference proven at 95.0% confidence
>>>
>>> I got similar results on smaller systems (1 socket, dual-cores and
>>> quad-cores)
>>> when running Rik's latest series, no big difference for good nor for
>>> worse,
>>> but I also think Rik's work is meant to address bigger systems with
>>> more cores
>>> contending for any given spinlock.
>>
>> I was able to do the test on same 32 core machine with
>> 4 guests (8GB RAM, 32 vcpu).
>> Here are the results
>>
>> base = 3.8-rc2
>> patched = base + Rik V3 backoff series [patch 1-4]
>
> I believe I understand why this is happening.
>
> Modern Intel and AMD CPUs have features called Pause Loop Exiting (PLE)
> and Pause Filter (PF), respectively. These features are used to trap to
> the host when the guest is spinning on a spinlock.
>
> This allows the host to run something else, by having the spinner
> temporarily yield the CPU. Effectively, this already causes the KVM code
> to do a limited amount of spinlock backoff in the host.
>
> Adding more backoff code in the guest can lead to wild delays in
> acquiring locks, and generally bad performance.
Yes, I agree with you.
> I suspect that when running in a virtual machine, we should limit
> the delay factor to something much smaller, since the host will take
> care of most of the backoff for us.
>
Even for the non-PLE case, I believe it would be difficult to tune the
delay, because of VCPU scheduling and lock holder preemption (LHP).
> Maybe a maximum delay value of ~10 would do the trick for KVM
> guests.
>
> We should be able to get this right by placing the value for the
> maximum delay in a __read_mostly section and setting it to something
> small from an init function when we detect we are running in a
> virtual machine.
>
> Let me cook up, and test, a patch that does that...
Sure. I am awaiting the patches and happy to test them.
I also tried a few things on my own, and also checked how it behaves
without patch 4. Nothing helped.
* Re: [PATCH 0/5] x86,smp: make ticket spinlock proportional backoff w/ auto tuning
2013-01-10 19:30 ` Mike Galbraith
@ 2013-01-24 13:28 ` Ingo Molnar
0 siblings, 0 replies; 27+ messages in thread
From: Ingo Molnar @ 2013-01-24 13:28 UTC (permalink / raw)
To: Mike Galbraith
Cc: Rik van Riel, linux-kernel, aquini, walken, eric.dumazet,
lwoodman, jeremy, Jan Beulich, knoel, chegu_vinod, raghavendra.kt,
mingo
* Mike Galbraith <bitbucket@online.de> wrote:
> On Thu, 2013-01-10 at 10:31 -0500, Rik van Riel wrote:
> > On 01/10/2013 10:19 AM, Mike Galbraith wrote:
> > > On Tue, 2013-01-08 at 17:26 -0500, Rik van Riel wrote:
> > >
> > >> Please let me know if you manage to break this code in any way,
> > >> so I can fix it...
> > >
> > > I didn't break it, but did let it play with rq->lock contention. Using
> > > cyclictest -Smp99 -i 100 -d 0, with 3 rt tasks for pull_rt_task() to
> > > pull around appears to have been a ~dead heat.
> >
> > Good to hear that the code seems to be robust. It seems to
> > help prevent performance degradation in some workloads, and
> > nobody seems to have found regressions yet.
>
> I had hoped for a bit of a positive result, but a wash isn't surprising
> given the profile. I tried tbench too; I didn't expect to see anything
> at all there, and that is what I got, so both results are positive in
> that respect.
Ok, that's good.
Rik, mind re-sending the latest series with all the acks and
Reviewed-by's added?
Thanks,
Ingo
Thread overview: 27+ messages (end of thread: 2013-01-24 13:28 UTC)
2013-01-08 22:26 [PATCH 0/5] x86,smp: make ticket spinlock proportional backoff w/ auto tuning Rik van Riel
2013-01-08 22:30 ` [PATCH 3/5] x86,smp: auto tune spinlock backoff delay factor Rik van Riel
2013-01-10 3:13 ` Rafael Aquini
2013-01-10 12:49 ` Michel Lespinasse
2013-01-08 22:31 ` [PATCH 4/5] x86,smp: keep spinlock delay values per hashed spinlock address Rik van Riel
2013-01-10 3:14 ` Rafael Aquini
2013-01-10 13:01 ` Michel Lespinasse
2013-01-10 13:05 ` Rik van Riel
2013-01-10 13:15 ` Michel Lespinasse
2013-01-08 22:32 ` [DEBUG PATCH 5/5] x86,smp: add debugging code to track spinlock delay value Rik van Riel
2013-01-08 22:32 ` [PATCH 2/5] x86,smp: proportional backoff for ticket spinlocks Rik van Riel
2013-01-08 22:50 ` Eric Dumazet
2013-01-08 22:54 ` Rik van Riel
2013-01-10 2:30 ` Rafael Aquini
2013-01-08 22:32 ` [PATCH 1/5] x86,smp: move waiting on contended ticket lock out of line Rik van Riel
2013-01-08 22:43 ` Eric Dumazet
2013-01-10 17:38 ` Raghavendra K T
2013-01-09 12:50 ` [PATCH 0/5] x86,smp: make ticket spinlock proportional backoff w/ auto tuning Raghavendra K T
2013-01-10 2:27 ` Rafael Aquini
2013-01-10 17:36 ` Raghavendra K T
2013-01-11 20:11 ` Rik van Riel
2013-01-13 18:07 ` Raghavendra K T
2013-01-10 15:19 ` Mike Galbraith
2013-01-10 15:31 ` Rik van Riel
2013-01-10 19:30 ` Mike Galbraith
2013-01-24 13:28 ` Ingo Molnar
2013-01-10 22:24 ` Chegu Vinod