* [RFC PATCH 0/3] x86,smp: make ticket spinlock proportional backoff w/ auto tuning
@ 2012-12-21 23:49 Rik van Riel
2012-12-21 23:50 ` [RFC PATCH 1/3] x86,smp: move waiting on contended lock out of line Rik van Riel
` (3 more replies)
0 siblings, 4 replies; 53+ messages in thread
From: Rik van Riel @ 2012-12-21 23:49 UTC (permalink / raw)
To: linux-kernel
Cc: aquini, walken, lwoodman, jeremy, Jan Beulich, Thomas Gleixner
Many spinlocks are embedded in data structures; having many CPUs
pounce on the cache line the lock is in will slow down the lock
holder, and can cause system performance to fall off a cliff.
The paper "Non-scalable locks are dangerous" is a good reference:
http://pdos.csail.mit.edu/papers/linux:lock.pdf
In the Linux kernel, spinlocks are optimized for the case of
there not being contention. After all, if there is contention,
the data structure can be improved to reduce or eliminate
lock contention.
Likewise, the spinlock API should remain simple, and the
common case of the lock not being contended should remain
as fast as ever.
However, since spinlock contention should be fairly uncommon,
we can add functionality into the spinlock slow path that keeps
system performance from falling off a cliff when there is lock
contention.
Proportional delay in ticket locks means scaling the time between
checks of the ticket by a delay factor times the number of
CPUs ahead of us in the queue for this lock. Checking the lock
less often allows the lock holder to continue running, resulting
in better throughput and preventing performance from dropping
off a cliff.
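For illustration, the contended slow path ends up looking roughly
like the sketch below (simplified; the real code, including the
auto-tuning of the delay factor, lives in patches 2 and 3):

void ticket_spin_lock_wait(arch_spinlock_t *lock, struct __raw_tickets inc)
{
	int delay = 50;	/* fixed here; patch 3 auto-tunes this value */

	do {
		/* Spin longer when more CPUs are queued ahead of us. */
		int loops = delay * (__ticket_t)(inc.tail - inc.head);

		while (loops--)
			cpu_relax();

		/* Re-read the lock word only after the backoff period. */
		inc.head = ACCESS_ONCE(lock->tickets.head);
	} while (inc.head != inc.tail);
}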
The test case has a number of threads locking and unlocking a
semaphore. With just one thread, everything sits in the CPU
cache and throughput is around 2.6 million operations per
second, with a 5-10% variation.
Once a second thread gets involved, data structures bounce
from CPU to CPU, and performance deteriorates to about 1.25
million operations per second, with a 5-10% variation.
However, as more and more threads get added to the mix,
performance with the vanilla kernel continues to deteriorate.
Once I hit 24 threads, on a 24 CPU, 4 node test system,
performance is down to about 290k operations/second.
With a proportional backoff delay added to the spinlock
code, performance with 24 threads goes up to about 400k
operations/second with a 50x delay, and about 900k operations/second
with a 250x delay. However, with a 250x delay, performance with
2-5 threads is worse than with a 50x delay.
Making the code auto-tune the delay factor results in a system
that performs well with both light and heavy lock contention,
and should also protect against the (likely) case of the fixed
delay factor being wrong for other hardware.
The attached graph shows the performance of the multi threaded
semaphore lock/unlock test case, with 1-24 threads, on the
vanilla kernel, with 10x, 50x, and 250x proportional delay,
and with auto-tuning proportional delay, tuned to aim for either 2 or
2.7 spins before the lock is obtained.
Please let me know if you manage to break this code in any way,
so I can fix it...
[-- Attachment #2: spinlock-backoff.png --]
[-- Type: image/png, Size: 20083 bytes --]
^ permalink raw reply [flat|nested] 53+ messages in thread
* [RFC PATCH 1/3] x86,smp: move waiting on contended lock out of line
2012-12-21 23:49 [RFC PATCH 0/3] x86,smp: make ticket spinlock proportional backoff w/ auto tuning Rik van Riel
@ 2012-12-21 23:50 ` Rik van Riel
2012-12-22 3:05 ` Steven Rostedt
` (2 more replies)
2012-12-21 23:51 ` [RFC PATCH 2/3] x86,smp: proportional backoff for ticket spinlocks Rik van Riel
` (2 subsequent siblings)
3 siblings, 3 replies; 53+ messages in thread
From: Rik van Riel @ 2012-12-21 23:50 UTC (permalink / raw)
To: linux-kernel
Cc: aquini, walken, lwoodman, jeremy, Jan Beulich, Thomas Gleixner
Subject: x86,smp: move waiting on contended ticket lock out of line
Moving the wait loop for congested ticket locks to its own function allows
us to add things to that wait loop, without growing the size of the
kernel text appreciably.
Signed-off-by: Rik van Riel <riel@redhat.com>
---
arch/x86/include/asm/spinlock.h | 13 +++++++------
arch/x86/kernel/smp.c | 14 ++++++++++++++
2 files changed, 21 insertions(+), 6 deletions(-)
diff --git a/arch/x86/include/asm/spinlock.h b/arch/x86/include/asm/spinlock.h
index 33692ea..2a45eb0 100644
--- a/arch/x86/include/asm/spinlock.h
+++ b/arch/x86/include/asm/spinlock.h
@@ -34,6 +34,8 @@
# define UNLOCK_LOCK_PREFIX
#endif
+extern void ticket_spin_lock_wait(arch_spinlock_t *, struct __raw_tickets);
+
/*
* Ticket locks are conceptually two parts, one indicating the current head of
* the queue, and the other indicating the current tail. The lock is acquired
@@ -53,12 +55,11 @@ static __always_inline void __ticket_spin_lock(arch_spinlock_t *lock)
inc = xadd(&lock->tickets, inc);
- for (;;) {
- if (inc.head == inc.tail)
- break;
- cpu_relax();
- inc.head = ACCESS_ONCE(lock->tickets.head);
- }
+ if (inc.head == inc.tail)
+ goto out;
+
+ ticket_spin_lock_wait(lock, inc);
+ out:
barrier(); /* make sure nothing creeps before the lock is taken */
}
diff --git a/arch/x86/kernel/smp.c b/arch/x86/kernel/smp.c
index 48d2b7d..20da354 100644
--- a/arch/x86/kernel/smp.c
+++ b/arch/x86/kernel/smp.c
@@ -113,6 +113,20 @@ static atomic_t stopping_cpu = ATOMIC_INIT(-1);
static bool smp_no_nmi_ipi = false;
/*
+ * Wait on a congested ticket spinlock.
+ */
+void ticket_spin_lock_wait(arch_spinlock_t *lock, struct __raw_tickets inc)
+{
+ for (;;) {
+ cpu_relax();
+ inc.head = ACCESS_ONCE(lock->tickets.head);
+
+ if (inc.head == inc.tail)
+ break;
+ }
+}
+
+/*
* this function sends a 'reschedule' IPI to another CPU.
* it goes straight through and wastes no time serializing
* anything. Worst case is that we lose a reschedule ...
^ permalink raw reply related [flat|nested] 53+ messages in thread
* [RFC PATCH 2/3] x86,smp: proportional backoff for ticket spinlocks
2012-12-21 23:49 [RFC PATCH 0/3] x86,smp: make ticket spinlock proportional backoff w/ auto tuning Rik van Riel
2012-12-21 23:50 ` [RFC PATCH 1/3] x86,smp: move waiting on contended lock out of line Rik van Riel
@ 2012-12-21 23:51 ` Rik van Riel
2012-12-22 3:07 ` Steven Rostedt
` (2 more replies)
2012-12-21 23:51 ` [RFC PATCH 3/3] x86,smp: auto tune spinlock backoff delay factor Rik van Riel
2013-01-02 0:06 ` ticket spinlock proportional backoff experiments Michel Lespinasse
3 siblings, 3 replies; 53+ messages in thread
From: Rik van Riel @ 2012-12-21 23:51 UTC (permalink / raw)
To: linux-kernel
Cc: aquini, walken, lwoodman, jeremy, Jan Beulich, Thomas Gleixner
Subject: x86,smp: proportional backoff for ticket spinlocks
Simple fixed value proportional backoff for ticket spinlocks.
By pounding on the cacheline with the spin lock less often,
bus traffic is reduced. In cases of a data structure with
embedded spinlock, the lock holder has a better chance of
making progress.
Signed-off-by: Rik van Riel <riel@redhat.com>
---
arch/x86/kernel/smp.c | 6 ++++--
1 files changed, 4 insertions(+), 2 deletions(-)
diff --git a/arch/x86/kernel/smp.c b/arch/x86/kernel/smp.c
index 20da354..4e44840 100644
--- a/arch/x86/kernel/smp.c
+++ b/arch/x86/kernel/smp.c
@@ -118,9 +118,11 @@ static bool smp_no_nmi_ipi = false;
void ticket_spin_lock_wait(arch_spinlock_t *lock, struct __raw_tickets inc)
{
for (;;) {
- cpu_relax();
- inc.head = ACCESS_ONCE(lock->tickets.head);
+ int loops = 50 * (__ticket_t)(inc.tail - inc.head);
+ while (loops--)
+ cpu_relax();
+ inc.head = ACCESS_ONCE(lock->tickets.head);
if (inc.head == inc.tail)
break;
}
^ permalink raw reply related [flat|nested] 53+ messages in thread
* [RFC PATCH 3/3] x86,smp: auto tune spinlock backoff delay factor
2012-12-21 23:49 [RFC PATCH 0/3] x86,smp: make ticket spinlock proportional backoff w/ auto tuning Rik van Riel
2012-12-21 23:50 ` [RFC PATCH 1/3] x86,smp: move waiting on contended lock out of line Rik van Riel
2012-12-21 23:51 ` [RFC PATCH 2/3] x86,smp: proportional backoff for ticket spinlocks Rik van Riel
@ 2012-12-21 23:51 ` Rik van Riel
2012-12-21 23:56 ` [RFC PATCH 3/3 -v2] " Rik van Riel
` (2 more replies)
2013-01-02 0:06 ` ticket spinlock proportional backoff experiments Michel Lespinasse
3 siblings, 3 replies; 53+ messages in thread
From: Rik van Riel @ 2012-12-21 23:51 UTC (permalink / raw)
To: linux-kernel
Cc: aquini, walken, lwoodman, jeremy, Jan Beulich, Thomas Gleixner
Subject: x86,smp: auto tune spinlock backoff delay factor
Many spinlocks are embedded in data structures; having many CPUs
pounce on the cache line the lock is in will slow down the lock
holder, and can cause system performance to fall off a cliff.
The paper "Non-scalable locks are dangerous" is a good reference:
http://pdos.csail.mit.edu/papers/linux:lock.pdf
In the Linux kernel, spinlocks are optimized for the case of
there not being contention. After all, if there is contention,
the data structure can be improved to reduce or eliminate
lock contention.
Likewise, the spinlock API should remain simple, and the
common case of the lock not being contended should remain
as fast as ever.
However, since spinlock contention should be fairly uncommon,
we can add functionality into the spinlock slow path that keeps
system performance from falling off a cliff when there is lock
contention.
Proportional delay in ticket locks means scaling the time between
checks of the ticket by a delay factor times the number of
CPUs ahead of us in the queue for this lock. Checking the lock
less often allows the lock holder to continue running, resulting
in better throughput and preventing performance from dropping
off a cliff.
Proportional spinlock delay with a high delay factor works well
when there is a lot of contention on a lock. Likewise, a smaller
delay factor works well when a lock is lightly contended.
Making the code auto-tune the delay factor results in a system
that performs well with both light and heavy lock contention.
Signed-off-by: Rik van Riel <riel@redhat.com>
---
arch/x86/kernel/smp.c | 49 ++++++++++++++++++++++++++++++++++++++++++++++---
1 files changed, 46 insertions(+), 3 deletions(-)
diff --git a/arch/x86/kernel/smp.c b/arch/x86/kernel/smp.c
index 1c1eb02..8786476 100644
--- a/arch/x86/kernel/smp.c
+++ b/arch/x86/kernel/smp.c
@@ -113,19 +113,62 @@ static atomic_t stopping_cpu = ATOMIC_INIT(-1);
static bool smp_no_nmi_ipi = false;
/*
- * Wait on a congested ticket spinlock.
+ * Wait on a congested ticket spinlock. Many spinlocks are embedded in
+ * data structures; having many CPUs pounce on the cache line with the
+ * spinlock simultaneously can slow down the lock holder, and the system
+ * as a whole.
+ *
+ * To prevent total performance collapse in case of bad spinlock contention,
+ * perform proportional backoff. The per-cpu value of delay is automatically
+ * tuned to limit the number of times spinning CPUs poll the lock before
+ * obtaining it. This limits the amount of cross-CPU traffic required to obtain
+ * a spinlock, and keeps system performance from dropping off a cliff.
+ *
+ * There is a tradeoff. If we poll too often, the whole system is slowed
+ * down. If we sleep too long, the lock will go unused for a period of
+ * time. Adjusting "delay" to poll, on average, 2.7 times before the
+ * lock is obtained seems to result in low bus traffic. The combination
+ * of aiming for a non-integer amount of average polls, and scaling the
+ * sleep period proportionally to how many CPUs are ahead of us in the
+ * queue for this ticket lock seems to reduce the amount of time spent
+ * "oversleeping" the release of the lock.
*/
+#define MIN_SPINLOCK_DELAY 1
+#define MAX_SPINLOCK_DELAY 1000
+DEFINE_PER_CPU(int, spinlock_delay) = { MIN_SPINLOCK_DELAY };
void ticket_spin_lock_wait(arch_spinlock_t *lock, struct __raw_tickets inc)
{
+ /*
+ * Use the raw per-cpu pointer; preemption is disabled in the
+ * spinlock code. This avoids put_cpu_var once we have the lock.
+ */
+ int *delay_ptr = &per_cpu(spinlock_delay, smp_processor_id());
+ int delay = *delay_ptr;
+
for (;;) {
- int loops = 50 * (__ticket_t)(inc.tail - inc.head);
+ int loops = delay * (__ticket_t)(inc.tail - inc.head);
while (loops--)
cpu_relax();
inc.head = ACCESS_ONCE(lock->tickets.head);
- if (inc.head == inc.tail)
+ if (inc.head == inc.tail) {
+ /* Decrease the delay, since we may have overslept. */
+ if (delay > MIN_SPINLOCK_DELAY)
+ delay--;
break;
+ }
+
+ /*
+ * The lock is still busy, the delay was not long enough.
+ * Going through here 2.7 times will, on average, cancel
+ * out the decrement above. Using a non-integer number
+ * gets rid of performance artifacts and reduces oversleeping.
+ */
+ if (delay < MAX_SPINLOCK_DELAY &&
+ (!(inc.head & 3) == 0 || (inc.head & 7) == 1))
+ delay++;
}
+ *delay_ptr = delay;
}
/*
^ permalink raw reply related [flat|nested] 53+ messages in thread
* [RFC PATCH 3/3 -v2] x86,smp: auto tune spinlock backoff delay factor
2012-12-21 23:51 ` [RFC PATCH 3/3] x86,smp: auto tune spinlock backoff delay factor Rik van Riel
@ 2012-12-21 23:56 ` Rik van Riel
2012-12-22 0:18 ` Eric Dumazet
` (3 more replies)
2012-12-22 0:47 ` [RFC PATCH 3/3] " David Daney
2012-12-22 5:42 ` Michel Lespinasse
2 siblings, 4 replies; 53+ messages in thread
From: Rik van Riel @ 2012-12-21 23:56 UTC (permalink / raw)
To: linux-kernel
Cc: aquini, walken, lwoodman, jeremy, Jan Beulich, Thomas Gleixner
Argh, the first one had a typo in it that did not influence
performance with fewer threads running, but that made things
worse with more than a dozen threads...
Please let me know if you can break these patches.
---8<---
Subject: x86,smp: auto tune spinlock backoff delay factor
Many spinlocks are embedded in data structures; having many CPUs
pounce on the cache line the lock is in will slow down the lock
holder, and can cause system performance to fall off a cliff.
The paper "Non-scalable locks are dangerous" is a good reference:
http://pdos.csail.mit.edu/papers/linux:lock.pdf
In the Linux kernel, spinlocks are optimized for the case of
there not being contention. After all, if there is contention,
the data structure can be improved to reduce or eliminate
lock contention.
Likewise, the spinlock API should remain simple, and the
common case of the lock not being contended should remain
as fast as ever.
However, since spinlock contention should be fairly uncommon,
we can add functionality into the spinlock slow path that keeps
system performance from falling off a cliff when there is lock
contention.
Proportional delay in ticket locks means scaling the time between
checks of the ticket by a delay factor times the number of
CPUs ahead of us in the queue for this lock. Checking the lock
less often allows the lock holder to continue running, resulting
in better throughput and preventing performance from dropping
off a cliff.
Proportional spinlock delay with a high delay factor works well
when there is a lot of contention on a lock. Likewise, a smaller
delay factor works well when a lock is lightly contended.
Making the code auto-tune the delay factor results in a system
that performs well with both light and heavy lock contention.
Signed-off-by: Rik van Riel <riel@redhat.com>
---
arch/x86/kernel/smp.c | 49 ++++++++++++++++++++++++++++++++++++++++++++++---
1 files changed, 46 insertions(+), 3 deletions(-)
diff --git a/arch/x86/kernel/smp.c b/arch/x86/kernel/smp.c
index 4e44840..e44c56f 100644
--- a/arch/x86/kernel/smp.c
+++ b/arch/x86/kernel/smp.c
@@ -113,19 +113,62 @@ static atomic_t stopping_cpu = ATOMIC_INIT(-1);
static bool smp_no_nmi_ipi = false;
/*
- * Wait on a congested ticket spinlock.
+ * Wait on a congested ticket spinlock. Many spinlocks are embedded in
+ * data structures; having many CPUs pounce on the cache line with the
+ * spinlock simultaneously can slow down the lock holder, and the system
+ * as a whole.
+ *
+ * To prevent total performance collapse in case of bad spinlock contention,
+ * perform proportional backoff. The per-cpu value of delay is automatically
+ * tuned to limit the number of times spinning CPUs poll the lock before
+ * obtaining it. This limits the amount of cross-CPU traffic required to obtain
+ * a spinlock, and keeps system performance from dropping off a cliff.
+ *
+ * There is a tradeoff. If we poll too often, the whole system is slowed
+ * down. If we sleep too long, the lock will go unused for a period of
+ * time. Adjusting "delay" to poll, on average, 2.7 times before the
+ * lock is obtained seems to result in low bus traffic. The combination
+ * of aiming for a non-integer amount of average polls, and scaling the
+ * sleep period proportionally to how many CPUs are ahead of us in the
+ * queue for this ticket lock seems to reduce the amount of time spent
+ * "oversleeping" the release of the lock.
*/
+#define MIN_SPINLOCK_DELAY 1
+#define MAX_SPINLOCK_DELAY 1000
+DEFINE_PER_CPU(int, spinlock_delay) = { MIN_SPINLOCK_DELAY };
void ticket_spin_lock_wait(arch_spinlock_t *lock, struct __raw_tickets inc)
{
+ /*
+ * Use the raw per-cpu pointer; preemption is disabled in the
+ * spinlock code. This avoids put_cpu_var once we have the lock.
+ */
+ int *delay_ptr = &per_cpu(spinlock_delay, smp_processor_id());
+ int delay = *delay_ptr;
+
for (;;) {
- int loops = 50 * (__ticket_t)(inc.tail - inc.head);
+ int loops = delay * (__ticket_t)(inc.tail - inc.head);
while (loops--)
cpu_relax();
inc.head = ACCESS_ONCE(lock->tickets.head);
- if (inc.head == inc.tail)
+ if (inc.head == inc.tail) {
+ /* Decrease the delay, since we may have overslept. */
+ if (delay > MIN_SPINLOCK_DELAY)
+ delay--;
break;
+ }
+
+ /*
+ * The lock is still busy, the delay was not long enough.
+ * Going through here 2.7 times will, on average, cancel
+ * out the decrement above. Using a non-integer number
+ * gets rid of performance artifacts and reduces oversleeping.
+ */
+ if (delay < MAX_SPINLOCK_DELAY &&
+ ((inc.head & 3) == 0 || (inc.head & 7) == 1))
+ delay++;
}
+ *delay_ptr = delay;
}
/*
^ permalink raw reply related [flat|nested] 53+ messages in thread
* Re: [RFC PATCH 3/3 -v2] x86,smp: auto tune spinlock backoff delay factor
2012-12-21 23:56 ` [RFC PATCH 3/3 -v2] " Rik van Riel
@ 2012-12-22 0:18 ` Eric Dumazet
2012-12-22 2:43 ` Rik van Riel
2012-12-22 0:48 ` Eric Dumazet
` (2 subsequent siblings)
3 siblings, 1 reply; 53+ messages in thread
From: Eric Dumazet @ 2012-12-22 0:18 UTC (permalink / raw)
To: Rik van Riel
Cc: linux-kernel, aquini, walken, lwoodman, jeremy, Jan Beulich,
Thomas Gleixner
On Fri, 2012-12-21 at 18:56 -0500, Rik van Riel wrote:
> Argh, the first one had a typo in it that did not influence
> performance with fewer threads running, but that made things
> worse with more than a dozen threads...
> +
> + /*
> + * The lock is still busy, the delay was not long enough.
> + * Going through here 2.7 times will, on average, cancel
> + * out the decrement above. Using a non-integer number
> + * gets rid of performance artifacts and reduces oversleeping.
> + */
> + if (delay < MAX_SPINLOCK_DELAY &&
> + ((inc.head & 3) == 0 || (inc.head & 7) == 1))
> + delay++;
((inc.head & 3) == 0 || (inc.head & 7) == 1)) seems a strange condition
to me...
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [RFC PATCH 3/3] x86,smp: auto tune spinlock backoff delay factor
2012-12-21 23:51 ` [RFC PATCH 3/3] x86,smp: auto tune spinlock backoff delay factor Rik van Riel
2012-12-21 23:56 ` [RFC PATCH 3/3 -v2] " Rik van Riel
@ 2012-12-22 0:47 ` David Daney
2012-12-22 2:51 ` Rik van Riel
2012-12-22 5:42 ` Michel Lespinasse
2 siblings, 1 reply; 53+ messages in thread
From: David Daney @ 2012-12-22 0:47 UTC (permalink / raw)
To: Rik van Riel
Cc: linux-kernel, aquini, walken, lwoodman, jeremy, Jan Beulich,
Thomas Gleixner
On 12/21/2012 03:51 PM, Rik van Riel wrote:
> Subject: x86,smp: auto tune spinlock backoff delay factor
Nice work,
A couple of comments...
>
> Many spinlocks are embedded in data structures; having many CPUs
> pounce on the cache line the lock is in will slow down the lock
> holder, and can cause system performance to fall off a cliff.
>
> The paper "Non-scalable locks are dangerous" is a good reference:
>
> http://pdos.csail.mit.edu/papers/linux:lock.pdf
>
> In the Linux kernel, spinlocks are optimized for the case of
> there not being contention. After all, if there is contention,
> the data structure can be improved to reduce or eliminate
> lock contention.
>
> Likewise, the spinlock API should remain simple, and the
> common case of the lock not being contended should remain
> as fast as ever.
>
> However, since spinlock contention should be fairly uncommon,
> we can add functionality into the spinlock slow path that keeps
> system performance from falling off a cliff when there is lock
> contention.
>
> Proportional delay in ticket locks means scaling the time between
> checks of the ticket by a delay factor times the number of
> CPUs ahead of us in the queue for this lock. Checking the lock
> less often allows the lock holder to continue running, resulting
> in better throughput and preventing performance from dropping
> off a cliff.
>
> Proportional spinlock delay with a high delay factor works well
> when there is a lot of contention on a lock. Likewise, a smaller
> delay factor works well when a lock is lightly contended.
>
> Making the code auto-tune the delay factor results in a system
> that performs well with both light and heavy lock contention.
>
> Signed-off-by: Rik van Riel <riel@redhat.com>
> ---
> arch/x86/kernel/smp.c | 49 ++++++++++++++++++++++++++++++++++++++++++++++---
> 1 files changed, 46 insertions(+), 3 deletions(-)
>
> diff --git a/arch/x86/kernel/smp.c b/arch/x86/kernel/smp.c
> index 1c1eb02..8786476 100644
> --- a/arch/x86/kernel/smp.c
> +++ b/arch/x86/kernel/smp.c
> @@ -113,19 +113,62 @@ static atomic_t stopping_cpu = ATOMIC_INIT(-1);
> static bool smp_no_nmi_ipi = false;
>
> /*
> - * Wait on a congested ticket spinlock.
> + * Wait on a congested ticket spinlock. Many spinlocks are embedded in
> + * data structures; having many CPUs pounce on the cache line with the
> + * spinlock simultaneously can slow down the lock holder, and the system
> + * as a whole.
> + *
> + * To prevent total performance collapse in case of bad spinlock contention,
> + * perform proportional backoff. The per-cpu value of delay is automatically
> + * tuned to limit the number of times spinning CPUs poll the lock before
> + * obtaining it. This limits the amount of cross-CPU traffic required to obtain
> + * a spinlock, and keeps system performance from dropping off a cliff.
> + *
> + * There is a tradeoff. If we poll too often, the whole system is slowed
> + * down. If we sleep too long, the lock will go unused for a period of
> + * time. Adjusting "delay" to poll, on average, 2.7 times before the
> + * lock is obtained seems to result in low bus traffic. The combination
> + * of aiming for a non-integer amount of average polls, and scaling the
> + * sleep period proportionally to how many CPUs are ahead of us in the
> + * queue for this ticket lock seems to reduce the amount of time spent
> + * "oversleeping" the release of the lock.
> */
> +#define MIN_SPINLOCK_DELAY 1
> +#define MAX_SPINLOCK_DELAY 1000
> +DEFINE_PER_CPU(int, spinlock_delay) = { MIN_SPINLOCK_DELAY };
This gives the same delay for all locks in the system, but the amount of
work done under each lock is different. So, for any given lock, the
delay is not optimal.
This is an untested idea that came to me after looking at this:
o Assume that for any given lock, the optimal delay is the same for all
CPUs in the system.
o Store a per-lock delay value in arch_spinlock_t.
o Once a CPU owns the lock it can update the delay as you do for the
per_cpu version. Tuning the delay on fewer of the locking operations
reduces bus traffic, but makes it converge more slowly.
o Bonus points if you can update the delay as part of the releasing store.
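To make the idea concrete, here is a very rough, completely untested
sketch of what the per-lock variant might look like (it grows
arch_spinlock_t, and the field name, the clamping, and reusing the
tuning condition from your patch are just my guesses, not something
I have tried):

typedef struct arch_spinlock {
	union {
		__ticketpair_t head_tail;
		struct __raw_tickets {
			__ticket_t head, tail;
		} tickets;
	};
	u16 delay;	/* per-lock tuned backoff factor */
} arch_spinlock_t;

void ticket_spin_lock_wait(arch_spinlock_t *lock, struct __raw_tickets inc)
{
	/* A stale read is fine; the value only steers the backoff. */
	int delay = max_t(int, ACCESS_ONCE(lock->delay), MIN_SPINLOCK_DELAY);

	for (;;) {
		int loops = delay * (__ticket_t)(inc.tail - inc.head);

		while (loops--)
			cpu_relax();

		inc.head = ACCESS_ONCE(lock->tickets.head);
		if (inc.head == inc.tail) {
			/* We may have overslept; back the delay off a bit. */
			if (delay > MIN_SPINLOCK_DELAY)
				delay--;
			break;
		}
		if (delay < MAX_SPINLOCK_DELAY &&
		    ((inc.head & 3) == 0 || (inc.head & 7) == 1))
			delay++;
	}
	/* We now own the lock, so a plain store is enough to publish it. */
	ACCESS_ONCE(lock->delay) = delay;
}

Whether tuning only from the eventual lock winner converges quickly
enough would need the same kind of benchmarking as the per-cpu version.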
> void ticket_spin_lock_wait(arch_spinlock_t *lock, struct __raw_tickets inc)
> {
> + /*
> + * Use the raw per-cpu pointer; preemption is disabled in the
> + * spinlock code. This avoids put_cpu_var once we have the lock.
> + */
> + int *delay_ptr = &per_cpu(spinlock_delay, smp_processor_id());
> + int delay = *delay_ptr;
> +
> for (;;) {
> - int loops = 50 * (__ticket_t)(inc.tail - inc.head);
> + int loops = delay * (__ticket_t)(inc.tail - inc.head);
> while (loops--)
> cpu_relax();
>
> inc.head = ACCESS_ONCE(lock->tickets.head);
> - if (inc.head == inc.tail)
> + if (inc.head == inc.tail) {
> + /* Decrease the delay, since we may have overslept. */
> + if (delay > MIN_SPINLOCK_DELAY)
> + delay--;
> break;
> + }
> +
> + /*
> + * The lock is still busy, the delay was not long enough.
> + * Going through here 2.7 times will, on average, cancel
> + * out the decrement above. Using a non-integer number
> + * gets rid of performance artifacts and reduces oversleeping.
> + */
> + if (delay < MAX_SPINLOCK_DELAY &&
> + (!(inc.head & 3) == 0 || (inc.head & 7) == 1))
> + delay++;
> }
> + *delay_ptr = delay;
> }
>
> /*
>
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [RFC PATCH 3/3 -v2] x86,smp: auto tune spinlock backoff delay factor
2012-12-21 23:56 ` [RFC PATCH 3/3 -v2] " Rik van Riel
2012-12-22 0:18 ` Eric Dumazet
@ 2012-12-22 0:48 ` Eric Dumazet
2012-12-22 2:57 ` Rik van Riel
2012-12-22 3:29 ` Eric Dumazet
2012-12-22 3:33 ` Steven Rostedt
3 siblings, 1 reply; 53+ messages in thread
From: Eric Dumazet @ 2012-12-22 0:48 UTC (permalink / raw)
To: Rik van Riel
Cc: linux-kernel, aquini, walken, lwoodman, jeremy, Jan Beulich,
Thomas Gleixner
On Fri, 2012-12-21 at 18:56 -0500, Rik van Riel wrote:
> Argh, the first one had a typo in it that did not influence
> performance with fewer threads running, but that made things
> worse with more than a dozen threads...
>
> Please let me know if you can break these patches.
> ---8<---
> Subject: x86,smp: auto tune spinlock backoff delay factor
> +#define MIN_SPINLOCK_DELAY 1
> +#define MAX_SPINLOCK_DELAY 1000
> +DEFINE_PER_CPU(int, spinlock_delay) = { MIN_SPINLOCK_DELAY };
Using a single spinlock_delay per cpu assumes there is a single
contended spinlock on the machine, or that contended
spinlocks protect the same critical section.
Given that we probably know where the contended spinlocks are, couldn't
we use a real scalable implementation for them?
A known contended one is the Qdisc lock in the network layer. We added a
second lock (busylock) on a separate cache line to lower the pressure a
bit, but a scalable lock would be much better...
I guess there are patent issues...
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [RFC PATCH 3/3 -v2] x86,smp: auto tune spinlock backoff delay factor
2012-12-22 0:18 ` Eric Dumazet
@ 2012-12-22 2:43 ` Rik van Riel
0 siblings, 0 replies; 53+ messages in thread
From: Rik van Riel @ 2012-12-22 2:43 UTC (permalink / raw)
To: Eric Dumazet
Cc: linux-kernel, aquini, walken, lwoodman, jeremy, Jan Beulich,
Thomas Gleixner
On 12/21/2012 07:18 PM, Eric Dumazet wrote:
> On Fri, 2012-12-21 at 18:56 -0500, Rik van Riel wrote:
>> Argh, the first one had a typo in it that did not influence
>> performance with fewer threads running, but that made things
>> worse with more than a dozen threads...
>
>> +
>> + /*
>> + * The lock is still busy, the delay was not long enough.
>> + * Going through here 2.7 times will, on average, cancel
>> + * out the decrement above. Using a non-integer number
>> + * gets rid of performance artifacts and reduces oversleeping.
>> + */
>> + if (delay < MAX_SPINLOCK_DELAY &&
>> + ((inc.head & 3) == 0 || (inc.head & 7) == 1))
>> + delay++;
>
> ((inc.head & 3) == 0 || (inc.head & 7) == 1)) seems a strange condition
> to me...
It is. It turned out that doing the increment
every 4 times (just the first check) resulted
in odd performance artifacts when running with
4, 8, 12 or 16 CPUs.
Moving to the above got rid of the performance
artifact.
The combined condition fires for 3 out of every 8 values of
inc.head (1/4 + 1/8), so the delay gets incremented, on average,
once per 8/3, or about 2.7, passes through the loop.
It also results in aiming for a sleep period
that is not an exact multiple of the lock
acquiring period, which results in less
"oversleeping", and measurably better
performance.
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [RFC PATCH 3/3] x86,smp: auto tune spinlock backoff delay factor
2012-12-22 0:47 ` [RFC PATCH 3/3] " David Daney
@ 2012-12-22 2:51 ` Rik van Riel
2012-12-22 3:49 ` Steven Rostedt
0 siblings, 1 reply; 53+ messages in thread
From: Rik van Riel @ 2012-12-22 2:51 UTC (permalink / raw)
To: David Daney
Cc: linux-kernel, aquini, walken, lwoodman, jeremy, Jan Beulich,
Thomas Gleixner
On 12/21/2012 07:47 PM, David Daney wrote:
>> +#define MIN_SPINLOCK_DELAY 1
>> +#define MAX_SPINLOCK_DELAY 1000
>> +DEFINE_PER_CPU(int, spinlock_delay) = { MIN_SPINLOCK_DELAY };
>
>
> This gives the same delay for all locks in the system, but the amount of
> work done under each lock is different. So, for any given lock, the
> delay is not optimal.
>
> This is an untested idea that came to me after looking at this:
>
> o Assume that for any given lock, the optimal delay is the same for all
> CPUs in the system.
>
> o Store a per-lock delay value in arch_spinlock_t.
>
> o Once a CPU owns the lock it can update the delay as you do for the
> per_cpu version. Tuning the delay on fewer of the locking operations
> reduces bus traffic, but makes it converge more slowly.
>
> o Bonus points if you can update the delay as part of the releasing store.
It would absolutely have to be part of the same load and
store cycle, otherwise we would increase bus traffic and
defeat the purpose.
However, since spinlock contention should not be the
usual state, and all a scalable lock does is make sure
that N+1 CPUs do not perform worse than N CPUs, using
scalable locks is a stop-gap measure.
I believe a stop-gap measure should be kept as simple as
we can. I am willing to consider moving to a per-lock
delay factor if we can figure out an easy way to do it,
but I would like to avoid too much extra complexity...
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [RFC PATCH 3/3 -v2] x86,smp: auto tune spinlock backoff delay factor
2012-12-22 0:48 ` Eric Dumazet
@ 2012-12-22 2:57 ` Rik van Riel
0 siblings, 0 replies; 53+ messages in thread
From: Rik van Riel @ 2012-12-22 2:57 UTC (permalink / raw)
To: Eric Dumazet
Cc: linux-kernel, aquini, walken, lwoodman, jeremy, Jan Beulich,
Thomas Gleixner
On 12/21/2012 07:48 PM, Eric Dumazet wrote:
> On Fri, 2012-12-21 at 18:56 -0500, Rik van Riel wrote:
>> Argh, the first one had a typo in it that did not influence
>> performance with fewer threads running, but that made things
>> worse with more than a dozen threads...
>>
>> Please let me know if you can break these patches.
>> ---8<---
>> Subject: x86,smp: auto tune spinlock backoff delay factor
>
>> +#define MIN_SPINLOCK_DELAY 1
>> +#define MAX_SPINLOCK_DELAY 1000
>> +DEFINE_PER_CPU(int, spinlock_delay) = { MIN_SPINLOCK_DELAY };
>
> Using a single spinlock_delay per cpu assumes there is a single
> contended spinlock on the machine, or that contended
> spinlocks protect the same critical section.
The goal is to reduce bus traffic, and keep total
system performance from falling through the floor.
If we have one lock that takes N cycles to acquire,
and a second contended lock that takes N*2 cycles
to acquire, checking the first lock fewer times
before acquisition, and the second lock more times,
should still result in similar average system
throughput.
I suspect this approach should work well if we have
multiple contended locks in the system.
> Given that we probably know where the contended spinlocks are, couldn't
> we use a real scalable implementation for them?
The scalable locks tend to have a slightly more
complex locking API, resulting in a slightly
higher overhead in the non-contended (normal)
case. That means we cannot use them everywhere.
Also, scalable locks merely make sure that N+1
CPUs perform the same as N CPUs when there is
lock contention. They do not cause the system
to actually scale.
For actual scalability, the data structure would
need to be changed, so locking requirements are
better.
> A known contended one is the Qdisc lock in the network layer. We added a
> second lock (busylock) on a separate cache line to lower the pressure a
> bit, but a scalable lock would be much better...
My locking patches are meant for dealing with the
offenders we do not know about, to make sure that
system performance does not fall off a cliff when
we run into a surprise.
Known scalability bugs we can fix.
Unknown ones should not cause somebody's system
to fail.
> I guess there are patent issues...
At least one of the scalable lock implementations has been
known since 1991, so there should not be any patent issues
with that one.
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [RFC PATCH 1/3] x86,smp: move waiting on contended lock out of line
2012-12-21 23:50 ` [RFC PATCH 1/3] x86,smp: move waiting on contended lock out of line Rik van Riel
@ 2012-12-22 3:05 ` Steven Rostedt
2012-12-22 4:40 ` Michel Lespinasse
2012-12-23 22:52 ` Rafael Aquini
2 siblings, 0 replies; 53+ messages in thread
From: Steven Rostedt @ 2012-12-22 3:05 UTC (permalink / raw)
To: Rik van Riel
Cc: linux-kernel, aquini, walken, lwoodman, jeremy, Jan Beulich,
Thomas Gleixner
On Fri, Dec 21, 2012 at 06:50:38PM -0500, Rik van Riel wrote:
> Subject: x86,smp: move waiting on contended ticket lock out of line
>
> Moving the wait loop for congested ticket locks to its own function allows
> us to add things to that wait loop, without growing the size of the
> kernel text appreciably.
>
> Signed-off-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
-- Steve
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [RFC PATCH 2/3] x86,smp: proportional backoff for ticket spinlocks
2012-12-21 23:51 ` [RFC PATCH 2/3] x86,smp: proportional backoff for ticket spinlocks Rik van Riel
@ 2012-12-22 3:07 ` Steven Rostedt
2012-12-22 3:14 ` Steven Rostedt
2012-12-22 4:44 ` Michel Lespinasse
2012-12-23 22:55 ` Rafael Aquini
2 siblings, 1 reply; 53+ messages in thread
From: Steven Rostedt @ 2012-12-22 3:07 UTC (permalink / raw)
To: Rik van Riel
Cc: linux-kernel, aquini, walken, lwoodman, jeremy, Jan Beulich,
Thomas Gleixner
On Fri, Dec 21, 2012 at 06:51:15PM -0500, Rik van Riel wrote:
> Subject: x86,smp: proportional backoff for ticket spinlocks
>
> Simple fixed value proportional backoff for ticket spinlocks.
> By pounding on the cacheline with the spin lock less often,
> bus traffic is reduced. In cases of a data structure with
> embedded spinlock, the lock holder has a better chance of
> making progress.
>
> Signed-off-by: Rik van Riel <riel@redhat.com>
> ---
> arch/x86/kernel/smp.c | 6 ++++--
> 1 files changed, 4 insertions(+), 2 deletions(-)
>
> diff --git a/arch/x86/kernel/smp.c b/arch/x86/kernel/smp.c
> index 20da354..4e44840 100644
> --- a/arch/x86/kernel/smp.c
> +++ b/arch/x86/kernel/smp.c
> @@ -118,9 +118,11 @@ static bool smp_no_nmi_ipi = false;
> void ticket_spin_lock_wait(arch_spinlock_t *lock, struct __raw_tickets inc)
> {
> for (;;) {
> - cpu_relax();
> - inc.head = ACCESS_ONCE(lock->tickets.head);
> + int loops = 50 * (__ticket_t)(inc.tail - inc.head);
> + while (loops--)
> + cpu_relax();
-ENOCOMMENT
Please add a comment above to explain what it's doing. Don't expect
people to check change logs. Also, explain why you picked 50.
-- Steve
>
> + inc.head = ACCESS_ONCE(lock->tickets.head);
> if (inc.head == inc.tail)
> break;
> }
>
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [RFC PATCH 2/3] x86,smp: proportional backoff for ticket spinlocks
2012-12-22 3:07 ` Steven Rostedt
@ 2012-12-22 3:14 ` Steven Rostedt
2012-12-22 3:47 ` Rik van Riel
0 siblings, 1 reply; 53+ messages in thread
From: Steven Rostedt @ 2012-12-22 3:14 UTC (permalink / raw)
To: Rik van Riel
Cc: linux-kernel, aquini, walken, lwoodman, jeremy, Jan Beulich,
Thomas Gleixner
On Fri, Dec 21, 2012 at 10:07:56PM -0500, Steven Rostedt wrote:
> > diff --git a/arch/x86/kernel/smp.c b/arch/x86/kernel/smp.c
> > index 20da354..4e44840 100644
> > --- a/arch/x86/kernel/smp.c
> > +++ b/arch/x86/kernel/smp.c
> > @@ -118,9 +118,11 @@ static bool smp_no_nmi_ipi = false;
> > void ticket_spin_lock_wait(arch_spinlock_t *lock, struct __raw_tickets inc)
> > {
> > for (;;) {
> > - cpu_relax();
> > - inc.head = ACCESS_ONCE(lock->tickets.head);
> > + int loops = 50 * (__ticket_t)(inc.tail - inc.head);
> > + while (loops--)
> > + cpu_relax();
>
> -ENOCOMMENT
>
> Please add a comment above to explain what it's doing. Don't expect
> people to check change logs. Also, explain why you picked 50.
>
OK, I replied here before reading patch 3 (still reviewing it). Why have
this patch at all? Just to test if you broke something between this and
patch 3? Or perhaps patch 3 may not get accepted? In that case, you
would still need a comment.
Either explicitly state that this patch is just a stepping stone for
patch 3, and will either be accepted or rejected along with patch 3. Or
keep it as a stand alone patch and add comments as such. Or just get rid
of it all together.
Thanks,
-- Steve
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [RFC PATCH 3/3 -v2] x86,smp: auto tune spinlock backoff delay factor
2012-12-21 23:56 ` [RFC PATCH 3/3 -v2] " Rik van Riel
2012-12-22 0:18 ` Eric Dumazet
2012-12-22 0:48 ` Eric Dumazet
@ 2012-12-22 3:29 ` Eric Dumazet
2012-12-22 3:44 ` Rik van Riel
2012-12-22 3:33 ` Steven Rostedt
3 siblings, 1 reply; 53+ messages in thread
From: Eric Dumazet @ 2012-12-22 3:29 UTC (permalink / raw)
To: Rik van Riel
Cc: linux-kernel, aquini, walken, lwoodman, jeremy, Jan Beulich,
Thomas Gleixner
On Fri, 2012-12-21 at 18:56 -0500, Rik van Riel wrote:
> + int *delay_ptr = &per_cpu(spinlock_delay, smp_processor_id());
> + int delay = *delay_ptr;
int delay = __this_cpu_read(spinlock_delay);
> }
> + *delay_ptr = delay;
__this_cpu_write(spinlock_delay, delay);
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [RFC PATCH 3/3 -v2] x86,smp: auto tune spinlock backoff delay factor
2012-12-21 23:56 ` [RFC PATCH 3/3 -v2] " Rik van Riel
` (2 preceding siblings ...)
2012-12-22 3:29 ` Eric Dumazet
@ 2012-12-22 3:33 ` Steven Rostedt
2012-12-22 3:50 ` Rik van Riel
3 siblings, 1 reply; 53+ messages in thread
From: Steven Rostedt @ 2012-12-22 3:33 UTC (permalink / raw)
To: Rik van Riel
Cc: linux-kernel, aquini, walken, lwoodman, jeremy, Jan Beulich,
Thomas Gleixner
On Fri, Dec 21, 2012 at 06:56:13PM -0500, Rik van Riel wrote:
> diff --git a/arch/x86/kernel/smp.c b/arch/x86/kernel/smp.c
> index 4e44840..e44c56f 100644
> --- a/arch/x86/kernel/smp.c
> +++ b/arch/x86/kernel/smp.c
> @@ -113,19 +113,62 @@ static atomic_t stopping_cpu = ATOMIC_INIT(-1);
> static bool smp_no_nmi_ipi = false;
>
> /*
> - * Wait on a congested ticket spinlock.
> + * Wait on a congested ticket spinlock. Many spinlocks are embedded in
> + * data structures; having many CPUs pounce on the cache line with the
> + * spinlock simultaneously can slow down the lock holder, and the system
> + * as a whole.
> + *
> + * To prevent total performance collapse in case of bad spinlock contention,
> + * perform proportional backoff. The per-cpu value of delay is automatically
> + * tuned to limit the number of times spinning CPUs poll the lock before
> + * obtaining it. This limits the amount of cross-CPU traffic required to obtain
> + * a spinlock, and keeps system performance from dropping off a cliff.
> + *
> + * There is a tradeoff. If we poll too often, the whole system is slowed
> + * down. If we sleep too long, the lock will go unused for a period of
> + * time. Adjusting "delay" to poll, on average, 2.7 times before the
> + * lock is obtained seems to result in low bus traffic. The combination
> + * of aiming for a non-integer amount of average polls, and scaling the
> + * sleep period proportionally to how many CPUs are ahead of us in the
> + * queue for this ticket lock seems to reduce the amount of time spent
> + * "oversleeping" the release of the lock.
> */
> +#define MIN_SPINLOCK_DELAY 1
> +#define MAX_SPINLOCK_DELAY 1000
> +DEFINE_PER_CPU(int, spinlock_delay) = { MIN_SPINLOCK_DELAY };
> void ticket_spin_lock_wait(arch_spinlock_t *lock, struct __raw_tickets inc)
> {
> + /*
> + * Use the raw per-cpu pointer; preemption is disabled in the
> + * spinlock code. This avoids put_cpu_var once we have the lock.
> + */
> + int *delay_ptr = &per_cpu(spinlock_delay, smp_processor_id());
> + int delay = *delay_ptr;
I'm confused by the above comment. Why not just:
int delay = this_cpu_read(spinlock_delay);
?
> +
> for (;;) {
> - int loops = 50 * (__ticket_t)(inc.tail - inc.head);
> + int loops = delay * (__ticket_t)(inc.tail - inc.head);
> while (loops--)
> cpu_relax();
>
> inc.head = ACCESS_ONCE(lock->tickets.head);
> - if (inc.head == inc.tail)
> + if (inc.head == inc.tail) {
> + /* Decrease the delay, since we may have overslept. */
> + if (delay > MIN_SPINLOCK_DELAY)
> + delay--;
> break;
> + }
> +
> + /*
> + * The lock is still busy, the delay was not long enough.
> + * Going through here 2.7 times will, on average, cancel
> + * out the decrement above. Using a non-integer number
> + * gets rid of performance artifacts and reduces oversleeping.
> + */
> + if (delay < MAX_SPINLOCK_DELAY &&
> + ((inc.head & 3) == 0 || (inc.head & 7) == 1))
> + delay++;
> }
> + *delay_ptr = delay;
this_cpu_write(spinlock_delay, delay);
Too bad you posted this just before break. I currently have access to a
40 core box, and I would have loved to test this. But right now I have
it testing other things, and hopefully I'll still have access to it
after the break.
-- Steve
> }
>
> /*
>
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [RFC PATCH 3/3 -v2] x86,smp: auto tune spinlock backoff delay factor
2012-12-22 3:29 ` Eric Dumazet
@ 2012-12-22 3:44 ` Rik van Riel
0 siblings, 0 replies; 53+ messages in thread
From: Rik van Riel @ 2012-12-22 3:44 UTC (permalink / raw)
To: Eric Dumazet
Cc: linux-kernel, aquini, walken, lwoodman, jeremy, Jan Beulich,
Thomas Gleixner
On 12/21/2012 10:29 PM, Eric Dumazet wrote:
> On Fri, 2012-12-21 at 18:56 -0500, Rik van Riel wrote:
>> + int *delay_ptr = &per_cpu(spinlock_delay, smp_processor_id());
>> + int delay = *delay_ptr;
>
> int delay = __this_cpu_read(spinlock_delay);
>
>> }
>> + *delay_ptr = delay;
>
> __this_cpu_write(spinlock_delay, delay);
Thanks for that cleanup. I have applied it to my code.
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [RFC PATCH 2/3] x86,smp: proportional backoff for ticket spinlocks
2012-12-22 3:14 ` Steven Rostedt
@ 2012-12-22 3:47 ` Rik van Riel
0 siblings, 0 replies; 53+ messages in thread
From: Rik van Riel @ 2012-12-22 3:47 UTC (permalink / raw)
To: Steven Rostedt
Cc: linux-kernel, aquini, walken, lwoodman, jeremy, Jan Beulich,
Thomas Gleixner
On 12/21/2012 10:14 PM, Steven Rostedt wrote:
> OK, I replied here before reading patch 3 (still reviewing it). Why have
> this patch at all? Just to test if you broke something between this and
> patch 3? Or perhaps patch 3 may not get accepted? In that case, you
> would still need a comment.
>
> Either explicitly state that this patch is just a stepping stone for
> patch 3, and will either be accepted or rejected along with patch 3. Or
> keep it as a stand alone patch and add comments as such. Or just get rid
> of it all together.
I will document this patch better, explaining that it is
a stepping stone, that the number 50 is likely to be
wrong for many systems, and that the next patch fixes
things, using this text in the changelog:
The number 50 is likely to be wrong for many setups, and
this patch is mostly to illustrate the concept of proportional
backoff. The next patch automatically tunes the delay value.
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [RFC PATCH 3/3] x86,smp: auto tune spinlock backoff delay factor
2012-12-22 2:51 ` Rik van Riel
@ 2012-12-22 3:49 ` Steven Rostedt
2012-12-22 3:58 ` Rik van Riel
0 siblings, 1 reply; 53+ messages in thread
From: Steven Rostedt @ 2012-12-22 3:49 UTC (permalink / raw)
To: Rik van Riel
Cc: David Daney, linux-kernel, aquini, walken, lwoodman, jeremy,
Jan Beulich, Thomas Gleixner
On Fri, Dec 21, 2012 at 09:51:35PM -0500, Rik van Riel wrote:
> On 12/21/2012 07:47 PM, David Daney wrote:
>
> >>+#define MIN_SPINLOCK_DELAY 1
> >>+#define MAX_SPINLOCK_DELAY 1000
> >>+DEFINE_PER_CPU(int, spinlock_delay) = { MIN_SPINLOCK_DELAY };
> >
> >
> >This gives the same delay for all locks in the system, but the amount of
> >work done under each lock is different. So, for any given lock, the
> >delay is not optimal.
> >
> >This is an untested idea that came to me after looking at this:
> >
> >o Assume that for any given lock, the optimal delay is the same for all
> >CPUs in the system.
> >
> >o Store a per-lock delay value in arch_spinlock_t.
This can bloat the data structures. I would like to avoid that.
> >
> >o Once a CPU owns the lock it can update the delay as you do for the
> >per_cpu version. Tuning the delay on fewer of the locking operations
> >reduces bus traffic, but makes it converge more slowly.
> >
> >o Bonus points if you can update the delay as part of the releasing store.
>
> It would absolutely have to be part of the same load and
> store cycle, otherwise we would increase bus traffic and
> defeat the purpose.
>
> However, since spinlock contention should not be the
> usual state, and all a scalable lock does is make sure
> that N+1 CPUs do not perform worse than N CPUs, using
> scalable locks is a stop-gap measure.
>
> I believe a stop-gap measure should be kept as simple as
> we can. I am willing to consider moving to a per-lock
> delay factor if we can figure out an easy way to do it,
> but I would like to avoid too much extra complexity...
Rik,
I like your solution. It's rather simple and simple solutions tend to
end up being the closest to optimal. The more complex a solution gets,
the more it starts chasing fireflies.
Locks that are not likely to be contended will most likely not hit this
code, as it only gets triggered when contended. Now, really the wait
will be the size of the critical section. If quick locks gets hit a lot,
the auto delay will shink. If long critical sections start getting
contention, the auto delay will grow. But the general delay should
become a balance in the system that should be ideal. Kind of like the
NUMA balancing ;-)
Anyway, I'd like to see this code tested, and more benchmarks run
against it.
Thanks,
-- Steve
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [RFC PATCH 3/3 -v2] x86,smp: auto tune spinlock backoff delay factor
2012-12-22 3:33 ` Steven Rostedt
@ 2012-12-22 3:50 ` Rik van Riel
2012-12-26 19:10 ` Eric Dumazet
0 siblings, 1 reply; 53+ messages in thread
From: Rik van Riel @ 2012-12-22 3:50 UTC (permalink / raw)
To: Steven Rostedt
Cc: linux-kernel, aquini, walken, lwoodman, jeremy, Jan Beulich,
Thomas Gleixner
On 12/21/2012 10:33 PM, Steven Rostedt wrote:
> On Fri, Dec 21, 2012 at 06:56:13PM -0500, Rik van Riel wrote:
>> diff --git a/arch/x86/kernel/smp.c b/arch/x86/kernel/smp.c
>> index 4e44840..e44c56f 100644
>> --- a/arch/x86/kernel/smp.c
>> +++ b/arch/x86/kernel/smp.c
>> @@ -113,19 +113,62 @@ static atomic_t stopping_cpu = ATOMIC_INIT(-1);
>> static bool smp_no_nmi_ipi = false;
>>
>> /*
>> - * Wait on a congested ticket spinlock.
>> + * Wait on a congested ticket spinlock. Many spinlocks are embedded in
>> + * data structures; having many CPUs pounce on the cache line with the
>> + * spinlock simultaneously can slow down the lock holder, and the system
>> + * as a whole.
>> + *
>> + * To prevent total performance collapse in case of bad spinlock contention,
>> + * perform proportional backoff. The per-cpu value of delay is automatically
>> + * tuned to limit the number of times spinning CPUs poll the lock before
>> + * obtaining it. This limits the amount of cross-CPU traffic required to obtain
>> + * a spinlock, and keeps system performance from dropping off a cliff.
>> + *
>> + * There is a tradeoff. If we poll too often, the whole system is slowed
>> + * down. If we sleep too long, the lock will go unused for a period of
>> + * time. Adjusting "delay" to poll, on average, 2.7 times before the
>> + * lock is obtained seems to result in low bus traffic. The combination
>> + * of aiming for a non-integer amount of average polls, and scaling the
>> + * sleep period proportionally to how many CPUs are ahead of us in the
>> + * queue for this ticket lock seems to reduce the amount of time spent
>> + * "oversleeping" the release of the lock.
>> */
>> +#define MIN_SPINLOCK_DELAY 1
>> +#define MAX_SPINLOCK_DELAY 1000
>> +DEFINE_PER_CPU(int, spinlock_delay) = { MIN_SPINLOCK_DELAY };
>> void ticket_spin_lock_wait(arch_spinlock_t *lock, struct __raw_tickets inc)
>> {
>> + /*
>> + * Use the raw per-cpu pointer; preemption is disabled in the
>> + * spinlock code. This avoids put_cpu_var once we have the lock.
>> + */
>> + int *delay_ptr = &per_cpu(spinlock_delay, smp_processor_id());
>> + int delay = *delay_ptr;
>
> I'm confused by the above comment. Why not just:
>
> int delay = this_cpu_read(spinlock_delay);
> ?
Eric Dumazet pointed out the same thing. My code now
uses __this_cpu_read and __this_cpu_write.
> Too bad you posted this just before break. I currently have access to a
> 40 core box, and I would have loved to test this. But right now I have
> it testing other things, and hopefully I'll still have access to it
> after the break.
I will try to run this test on a really large SMP system
in the lab during the break.
Ideally, the auto-tuning will keep the delay value large
enough that performance will stay flat even when there are
100 CPUs contending over the same lock.
Maybe it turns out that the maximum allowed delay value
needs to be larger. Only one way to find out...
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [RFC PATCH 3/3] x86,smp: auto tune spinlock backoff delay factor
2012-12-22 3:49 ` Steven Rostedt
@ 2012-12-22 3:58 ` Rik van Riel
2012-12-23 23:08 ` Rafael Aquini
0 siblings, 1 reply; 53+ messages in thread
From: Rik van Riel @ 2012-12-22 3:58 UTC (permalink / raw)
To: Steven Rostedt
Cc: David Daney, linux-kernel, aquini, walken, lwoodman, jeremy,
Jan Beulich, Thomas Gleixner
On 12/21/2012 10:49 PM, Steven Rostedt wrote:
> On Fri, Dec 21, 2012 at 09:51:35PM -0500, Rik van Riel wrote:
>> However, since spinlock contention should not be the
>> usual state, and all a scalable lock does is make sure
>> that N+1 CPUs do not perform worse than N CPUs, using
>> scalable locks is a stop-gap measure.
>>
>> I believe a stop-gap measure should be kept as simple as
>> we can. I am willing to consider moving to a per-lock
>> delay factor if we can figure out an easy way to do it,
>> but I would like to avoid too much extra complexity...
>
> Rik,
>
> I like your solution. It's rather simple and simple solutions tend to
> end up being the closest to optimal. The more complex a solution gets,
> the more it starts chasing fireflies.
> Anyway, I'd like to see this code tested, and more benchmarks run
> against it.
Absolutely. I would love to see if this code actually
causes regressions anywhere.
It is simple enough that I suspect it will not, but there
really is only one way to find out.
The more people test this with different workloads on
different SMP systems, the better.
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [RFC PATCH 1/3] x86,smp: move waiting on contended lock out of line
2012-12-21 23:50 ` [RFC PATCH 1/3] x86,smp: move waiting on contended lock out of line Rik van Riel
2012-12-22 3:05 ` Steven Rostedt
@ 2012-12-22 4:40 ` Michel Lespinasse
2012-12-22 4:48 ` Rik van Riel
2012-12-23 22:52 ` Rafael Aquini
2 siblings, 1 reply; 53+ messages in thread
From: Michel Lespinasse @ 2012-12-22 4:40 UTC (permalink / raw)
To: Rik van Riel
Cc: linux-kernel, aquini, lwoodman, jeremy, Jan Beulich,
Thomas Gleixner
On Fri, Dec 21, 2012 at 3:50 PM, Rik van Riel <riel@redhat.com> wrote:
> diff --git a/arch/x86/include/asm/spinlock.h b/arch/x86/include/asm/spinlock.h
> index 33692ea..2a45eb0 100644
> --- a/arch/x86/include/asm/spinlock.h
> +++ b/arch/x86/include/asm/spinlock.h
> @@ -34,6 +34,8 @@
> # define UNLOCK_LOCK_PREFIX
> #endif
>
> +extern void ticket_spin_lock_wait(arch_spinlock_t *, struct __raw_tickets);
> +
> /*
> * Ticket locks are conceptually two parts, one indicating the current head of
> * the queue, and the other indicating the current tail. The lock is acquired
> @@ -53,12 +55,11 @@ static __always_inline void __ticket_spin_lock(arch_spinlock_t *lock)
>
> inc = xadd(&lock->tickets, inc);
>
> - for (;;) {
> - if (inc.head == inc.tail)
> - break;
> - cpu_relax();
> - inc.head = ACCESS_ONCE(lock->tickets.head);
> - }
> + if (inc.head == inc.tail)
> + goto out;
> +
> + ticket_spin_lock_wait(lock, inc);
> + out:
why not just:
if (inc.head != inc.tail)
ticket_spin_lock_wait(lock, inc)
> barrier(); /* make sure nothing creeps before the lock is taken */
> }
>
> diff --git a/arch/x86/kernel/smp.c b/arch/x86/kernel/smp.c
> index 48d2b7d..20da354 100644
> --- a/arch/x86/kernel/smp.c
> +++ b/arch/x86/kernel/smp.c
> @@ -113,6 +113,20 @@ static atomic_t stopping_cpu = ATOMIC_INIT(-1);
> static bool smp_no_nmi_ipi = false;
>
> /*
> + * Wait on a congested ticket spinlock.
> + */
> +void ticket_spin_lock_wait(arch_spinlock_t *lock, struct __raw_tickets inc)
> +{
> + for (;;) {
> + cpu_relax();
> + inc.head = ACCESS_ONCE(lock->tickets.head);
> +
> + if (inc.head == inc.tail)
> + break;
> + }
Why not just:
do {
cpu_relax()
inc.head = ...
} while (inc.head != inc.tail);
Other than that, no problems with the principle of it.
--
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [RFC PATCH 2/3] x86,smp: proportional backoff for ticket spinlocks
2012-12-21 23:51 ` [RFC PATCH 2/3] x86,smp: proportional backoff for ticket spinlocks Rik van Riel
2012-12-22 3:07 ` Steven Rostedt
@ 2012-12-22 4:44 ` Michel Lespinasse
2012-12-23 22:55 ` Rafael Aquini
2 siblings, 0 replies; 53+ messages in thread
From: Michel Lespinasse @ 2012-12-22 4:44 UTC (permalink / raw)
To: Rik van Riel
Cc: linux-kernel, aquini, lwoodman, jeremy, Jan Beulich,
Thomas Gleixner
On Fri, Dec 21, 2012 at 3:51 PM, Rik van Riel <riel@redhat.com> wrote:
> Subject: x86,smp: proportional backoff for ticket spinlocks
>
> Simple fixed value proportional backoff for ticket spinlocks.
> By pounding on the cacheline with the spin lock less often,
> bus traffic is reduced. In cases of a data structure with
> embedded spinlock, the lock holder has a better chance of
> making progress.
>
> Signed-off-by: Rik van Riel <riel@redhat.com>
Looks fine to me other than the arbitrary-ness of 50
Reviewed-by: Michel Lespinasse <walken@google.com>
--
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [RFC PATCH 1/3] x86,smp: move waiting on contended lock out of line
2012-12-22 4:40 ` Michel Lespinasse
@ 2012-12-22 4:48 ` Rik van Riel
0 siblings, 0 replies; 53+ messages in thread
From: Rik van Riel @ 2012-12-22 4:48 UTC (permalink / raw)
To: Michel Lespinasse
Cc: linux-kernel, aquini, lwoodman, jeremy, Jan Beulich,
Thomas Gleixner
On 12/21/2012 11:40 PM, Michel Lespinasse wrote:
> On Fri, Dec 21, 2012 at 3:50 PM, Rik van Riel <riel@redhat.com> wrote:
>> @@ -53,12 +55,11 @@ static __always_inline void __ticket_spin_lock(arch_spinlock_t *lock)
>>
>> inc = xadd(&lock->tickets, inc);
>> + if (inc.head == inc.tail)
>> + goto out;
>> +
>> + ticket_spin_lock_wait(lock, inc);
>> + out:
>
> why not just:
>
> if (inc.head != inc.tail)
> ticket_spin_lock_wait(lock, inc)
That makes the code nicer, thank you. Applied.
>> +++ b/arch/x86/kernel/smp.c
>> @@ -113,6 +113,20 @@ static atomic_t stopping_cpu = ATOMIC_INIT(-1);
>> static bool smp_no_nmi_ipi = false;
>>
>> /*
>> + * Wait on a congested ticket spinlock.
>> + */
>> +void ticket_spin_lock_wait(arch_spinlock_t *lock, struct __raw_tickets inc)
>> +{
>> + for (;;) {
>> + cpu_relax();
>> + inc.head = ACCESS_ONCE(lock->tickets.head);
>> +
>> + if (inc.head == inc.tail)
>> + break;
>> + }
>
> Why not just:
>
> do {
> cpu_relax()
> inc.head = ...
> } while (inc.head != inc.tail);
>
>
> Other than that, no problems with the principle of it.
In patch #3 I do something else inside the head == tail
conditional block, so this one is best left alone.
Thank you for the comments.
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [RFC PATCH 3/3] x86,smp: auto tune spinlock backoff delay factor
2012-12-21 23:51 ` [RFC PATCH 3/3] x86,smp: auto tune spinlock backoff delay factor Rik van Riel
2012-12-21 23:56 ` [RFC PATCH 3/3 -v2] " Rik van Riel
2012-12-22 0:47 ` [RFC PATCH 3/3] " David Daney
@ 2012-12-22 5:42 ` Michel Lespinasse
2012-12-22 14:32 ` Rik van Riel
2 siblings, 1 reply; 53+ messages in thread
From: Michel Lespinasse @ 2012-12-22 5:42 UTC (permalink / raw)
To: Rik van Riel
Cc: linux-kernel, aquini, lwoodman, jeremy, Jan Beulich,
Thomas Gleixner
On Fri, Dec 21, 2012 at 3:51 PM, Rik van Riel <riel@redhat.com> wrote:
> Subject: x86,smp: auto tune spinlock backoff delay factor
>
> Many spinlocks are embedded in data structures; having many CPUs
> pounce on the cache line the lock is in will slow down the lock
> holder, and can cause system performance to fall off a cliff.
>
> The paper "Non-scalable locks are dangerous" is a good reference:
>
> http://pdos.csail.mit.edu/papers/linux:lock.pdf
>
> In the Linux kernel, spinlocks are optimized for the case of
> there not being contention. After all, if there is contention,
> the data structure can be improved to reduce or eliminate
> lock contention.
>
> Likewise, the spinlock API should remain simple, and the
> common case of the lock not being contended should remain
> as fast as ever.
>
> However, since spinlock contention should be fairly uncommon,
> we can add functionality into the spinlock slow path that keeps
> system performance from falling off a cliff when there is lock
> contention.
>
> Proportional delay in ticket locks is delaying the time between
> checking the ticket based on a delay factor, and the number of
> CPUs ahead of us in the queue for this lock. Checking the lock
> less often allows the lock holder to continue running, resulting
> in better throughput and preventing performance from dropping
> off a cliff.
>
> Proportional spinlock delay with a high delay factor works well
> when there is lots of contention on a lock. Likewise, a smaller
> delay factor works well when a lock is lightly contended.
>
> Making the code auto-tune the delay factor results in a system
> that performs well with both light and heavy lock contention.
>
> Signed-off-by: Rik van Riel <riel@redhat.com>
So I like the idea a lot, and I had never seen the auto-tuning as you
propose it. Your implementation looks simple enough and doesn't slow
the uncontended case, so props for that.
However, I have a few concerns about the behavior of this, which I
think deserve more experimentation (I may try helping with it after
new years).
One thing you mentioned in 0/3 is that the best value varies depending
on the number of CPUs contending. This is somewhat surprising to me; I
would have guessed/hoped that the (inc.tail - inc.head) multiplicative
factor would account for that already. I wonder if we can somehow
adjust the code so that a same constant would work no matter how many
threads are contending for the lock (note that one single read to the
spinlock word gives us both the current number of waiters and our
position among them).
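(To illustrate that last point -- this is just a sketch assuming the
arch_spinlock_t / __raw_tickets layout used by the quoted patches, not
code from the series -- a single snapshot of the ticket pair gives a
waiter both numbers:

	struct __raw_tickets t = ACCESS_ONCE(lock->tickets);
	unsigned waiters  = (__ticket_t)(t.tail - t.head);	/* CPUs holding or queued */
	unsigned position = (__ticket_t)(my_ticket - t.head);	/* 0 == we hold the lock */

where my_ticket is a hypothetical local holding the tail value this CPU
got back from the xadd().)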
The other factor that might change your auto-tuned value is the amount
of time that each thread holds the spinlock. I wonder what would
happen if the constant was tuned for spinlocks that have a low hold
time, and then used on spinlocks that might have a higher hold time.
Obviously this would result in accessing the spinlock word more often
than necessary, but it shouldn't be very bad since the accesses
wouldn't be any more frequent than in the low hold time case, where
throughput is good. So maybe this would work acceptably well.
What I'm getting at is that I would be more confident that the
autotune algorithm will work well in all cases if the value only
depended on the system parameters such as CPU type and frequency,
rather than per-spinlock parameters such as number of waiters and hold
time.
I feel this review is too high-level to be really helpful, so I'll
stop until I can find time to experiment :)
--
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [RFC PATCH 3/3] x86,smp: auto tune spinlock backoff delay factor
2012-12-22 5:42 ` Michel Lespinasse
@ 2012-12-22 14:32 ` Rik van Riel
0 siblings, 0 replies; 53+ messages in thread
From: Rik van Riel @ 2012-12-22 14:32 UTC (permalink / raw)
To: Michel Lespinasse
Cc: linux-kernel, aquini, lwoodman, jeremy, Jan Beulich,
Thomas Gleixner
On 12/22/2012 12:42 AM, Michel Lespinasse wrote:
> However, I have a few concerns about the behavior of this, which I
> think deserve more experimentation (I may try helping with it after
> new years).
More experimentation is always good. Different hardware
will probably behave differently, etc...
> One thing you mentioned in 0/3 is that the best value varies depending
> on the number of CPUs contending. This is somewhat surprising to me; I
> would have guessed/hoped that the (inc.tail - inc.head) multiplicative
> factor would account for that already.
I had hoped the same, but testing things out while
developing the code showed it not to be so.
A larger constant scales to a larger number of CPUs,
while a smaller constant works faster when there are
few CPUs contending on the lock.
The graph attached to patch 0/3 shows this effect.
> What I'm getting at is that I would be more confident that the
> autotune algorithm will work well in all cases if the value only
> depended on the system parameters such as CPU type and frequency,
> rather than per-spinlock parameters such as number of waiters and hold
> time.
The autotune algorithm adjusts the delay factor so
we access the spinlock, on average, 2.7 times before
we acquire it (the 3.7th time).
This provides scalability by keeping the number of
shared cache line touches constant for each lock
acquisition.
Going for a fixed value, either at compile time or
at boot time, cannot provide such a guarantee.
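(To sketch the arithmetic, as an illustration of the idea rather than
the exact posted code: the delay is decremented once per acquisition
and incremented on a fraction p of the failed re-checks, so it settles
where

	re-checks per acquisition * p = 1

With the head-based test in the -v2 patch incrementing on roughly 3 of
every 8 re-checks, the equilibrium is about 8/3 = 2.7 re-checks -- and
therefore a roughly constant number of shared cache line touches -- per
contended acquisition, independent of the number of waiters.)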
> I feel this review is too high-level to be really helpful, so I'll
> stop until I can find time to experiment :)
I am looking forward to test results.
If anyone manages to break the code, I will fix it...
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [RFC PATCH 1/3] x86,smp: move waiting on contended lock out of line
2012-12-21 23:50 ` [RFC PATCH 1/3] x86,smp: move waiting on contended lock out of line Rik van Riel
2012-12-22 3:05 ` Steven Rostedt
2012-12-22 4:40 ` Michel Lespinasse
@ 2012-12-23 22:52 ` Rafael Aquini
2 siblings, 0 replies; 53+ messages in thread
From: Rafael Aquini @ 2012-12-23 22:52 UTC (permalink / raw)
To: Rik van Riel
Cc: linux-kernel, walken, lwoodman, jeremy, Jan Beulich,
Thomas Gleixner
On Fri, Dec 21, 2012 at 06:50:38PM -0500, Rik van Riel wrote:
> Subject: x86,smp: move waiting on contended ticket lock out of line
>
> Moving the wait loop for congested locks to its own function allows
> us to add things to that wait loop, without growing the size of the
> kernel text appreciably.
>
> Signed-off-by: Rik van Riel <riel@redhat.com>
> ---
Reviewed-by: Rafael Aquini <aquini@redhat.com>
> arch/x86/include/asm/spinlock.h | 13 +++++++------
> arch/x86/kernel/smp.c | 14 ++++++++++++++
> 2 files changed, 21 insertions(+), 6 deletions(-)
>
> diff --git a/arch/x86/include/asm/spinlock.h b/arch/x86/include/asm/spinlock.h
> index 33692ea..2a45eb0 100644
> --- a/arch/x86/include/asm/spinlock.h
> +++ b/arch/x86/include/asm/spinlock.h
> @@ -34,6 +34,8 @@
> # define UNLOCK_LOCK_PREFIX
> #endif
>
> +extern void ticket_spin_lock_wait(arch_spinlock_t *, struct __raw_tickets);
> +
> /*
> * Ticket locks are conceptually two parts, one indicating the current head of
> * the queue, and the other indicating the current tail. The lock is acquired
> @@ -53,12 +55,11 @@ static __always_inline void __ticket_spin_lock(arch_spinlock_t *lock)
>
> inc = xadd(&lock->tickets, inc);
>
> - for (;;) {
> - if (inc.head == inc.tail)
> - break;
> - cpu_relax();
> - inc.head = ACCESS_ONCE(lock->tickets.head);
> - }
> + if (inc.head == inc.tail)
> + goto out;
> +
> + ticket_spin_lock_wait(lock, inc);
> + out:
> barrier(); /* make sure nothing creeps before the lock is taken */
> }
>
> diff --git a/arch/x86/kernel/smp.c b/arch/x86/kernel/smp.c
> index 48d2b7d..20da354 100644
> --- a/arch/x86/kernel/smp.c
> +++ b/arch/x86/kernel/smp.c
> @@ -113,6 +113,20 @@ static atomic_t stopping_cpu = ATOMIC_INIT(-1);
> static bool smp_no_nmi_ipi = false;
>
> /*
> + * Wait on a congested ticket spinlock.
> + */
> +void ticket_spin_lock_wait(arch_spinlock_t *lock, struct __raw_tickets inc)
> +{
> + for (;;) {
> + cpu_relax();
> + inc.head = ACCESS_ONCE(lock->tickets.head);
> +
> + if (inc.head == inc.tail)
> + break;
> + }
> +}
> +
> +/*
> * this function sends a 'reschedule' IPI to another CPU.
> * it goes straight through and wastes no time serializing
> * anything. Worst case is that we lose a reschedule ...
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [RFC PATCH 2/3] x86,smp: proportional backoff for ticket spinlocks
2012-12-21 23:51 ` [RFC PATCH 2/3] x86,smp: proportional backoff for ticket spinlocks Rik van Riel
2012-12-22 3:07 ` Steven Rostedt
2012-12-22 4:44 ` Michel Lespinasse
@ 2012-12-23 22:55 ` Rafael Aquini
2 siblings, 0 replies; 53+ messages in thread
From: Rafael Aquini @ 2012-12-23 22:55 UTC (permalink / raw)
To: Rik van Riel
Cc: linux-kernel, walken, lwoodman, jeremy, Jan Beulich,
Thomas Gleixner
On Fri, Dec 21, 2012 at 06:51:15PM -0500, Rik van Riel wrote:
> Subject: x86,smp: proportional backoff for ticket spinlocks
>
> Simple fixed value proportional backoff for ticket spinlocks.
> By pounding on the cacheline with the spin lock less often,
> bus traffic is reduced. In cases of a data structure with
> embedded spinlock, the lock holder has a better chance of
> making progress.
>
> Signed-off-by: Rik van Riel <riel@redhat.com>
> ---
Reviewed-by: Rafael Aquini <aquini@redhat.com>
> arch/x86/kernel/smp.c | 6 ++++--
> 1 files changed, 4 insertions(+), 2 deletions(-)
>
> diff --git a/arch/x86/kernel/smp.c b/arch/x86/kernel/smp.c
> index 20da354..4e44840 100644
> --- a/arch/x86/kernel/smp.c
> +++ b/arch/x86/kernel/smp.c
> @@ -118,9 +118,11 @@ static bool smp_no_nmi_ipi = false;
> void ticket_spin_lock_wait(arch_spinlock_t *lock, struct __raw_tickets inc)
> {
> for (;;) {
> - cpu_relax();
> - inc.head = ACCESS_ONCE(lock->tickets.head);
> + int loops = 50 * (__ticket_t)(inc.tail - inc.head);
> + while (loops--)
> + cpu_relax();
>
> + inc.head = ACCESS_ONCE(lock->tickets.head);
> if (inc.head == inc.tail)
> break;
> }
>
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [RFC PATCH 3/3] x86,smp: auto tune spinlock backoff delay factor
2012-12-22 3:58 ` Rik van Riel
@ 2012-12-23 23:08 ` Rafael Aquini
0 siblings, 0 replies; 53+ messages in thread
From: Rafael Aquini @ 2012-12-23 23:08 UTC (permalink / raw)
To: Rik van Riel
Cc: Steven Rostedt, David Daney, linux-kernel, walken, lwoodman,
jeremy, Jan Beulich, Thomas Gleixner
On Fri, Dec 21, 2012 at 10:58:48PM -0500, Rik van Riel wrote:
> On 12/21/2012 10:49 PM, Steven Rostedt wrote:
> >On Fri, Dec 21, 2012 at 09:51:35PM -0500, Rik van Riel wrote:
>
> >>However, since spinlock contention should not be the
> >>usual state, and all a scalable lock does is make sure
> >>that N+1 CPUs does not perform worse than N CPUs, using
> >>scalable locks is a stop-gap measure.
> >>
> >>I believe a stop-gap measure should be kept as simple as
> >>we can. I am willing to consider moving to a per-lock
> >>delay factor if we can figure out an easy way to do it,
> >>but I would like to avoid too much extra complexity...
> >
> >Rik,
> >
> >I like your solution. It's rather simple and simple solutions tend to
> >end up being the closest to optimal. The more complex a solution gets,
> >the more it starts chasing fireflies.
>
> >Anyway, I'd like to see this code tested, and more benchmarks run
> >against it.
>
> Absolutely. I would love to see if this code actually
> causes regressions anywhere.
>
> It is simple enough that I suspect it will not, but there
> really is only one way to find out.
>
> The more people test this with different workloads on
> different SMP systems, the better.
>
Great work Rik,
I have a couple of small SMP systems I'll start testing with your patches.
I might also be able to test this work after New Year's Eve on a big SMP
box that seems to be facing a severe lock starvation issue due to the bus
saturation your work is aiming to reduce.
Cheers!
-- Rafael
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [RFC PATCH 3/3 -v2] x86,smp: auto tune spinlock backoff delay factor
2012-12-22 3:50 ` Rik van Riel
@ 2012-12-26 19:10 ` Eric Dumazet
2012-12-26 19:27 ` Eric Dumazet
` (2 more replies)
0 siblings, 3 replies; 53+ messages in thread
From: Eric Dumazet @ 2012-12-26 19:10 UTC (permalink / raw)
To: Rik van Riel
Cc: Steven Rostedt, linux-kernel, aquini, walken, lwoodman, jeremy,
Jan Beulich, Thomas Gleixner, Tom Herbert
On Fri, 2012-12-21 at 22:50 -0500, Rik van Riel wrote:
> I will try to run this test on a really large SMP system
> in the lab during the break.
>
> Ideally, the auto-tuning will keep the delay value large
> enough that performance will stay flat even when there are
> 100 CPUs contending over the same lock.
>
> Maybe it turns out that the maximum allowed delay value
> needs to be larger. Only one way to find out...
>
Hi Rik
I did some tests with your patches with following configuration :
tc qdisc add dev eth0 root htb r2q 1000 default 3
(to force a contention on qdisc lock, even with a multi queue net
device)
and 24 concurrent "netperf -t UDP_STREAM -H other_machine -- -m 128"
Machine : 2 Intel(R) Xeon(R) CPU X5660 @ 2.80GHz
(24 threads), and a fast NIC (10Gbps)
Resulting in a 13 % regression (676 Mbits -> 595 Mbits)
In this workload we have at least two contended spinlocks, with
different delays. (spinlocks are not held for the same duration)
It clearly defeats your assumption of a single per-cpu delay being OK:
some CPUs keep backing off long after the lock has been released.
We might try to use a hash on lock address, and an array of 16 different
delays so that different spinlocks have a chance of not sharing the same
delay.
With following patch, I get 982 Mbits/s with same bench, so an increase
of 45 % instead of a 13 % regression.
diff --git a/arch/x86/kernel/smp.c b/arch/x86/kernel/smp.c
index 48d2b7d..59f98f6 100644
--- a/arch/x86/kernel/smp.c
+++ b/arch/x86/kernel/smp.c
@@ -23,6 +23,7 @@
#include <linux/interrupt.h>
#include <linux/cpu.h>
#include <linux/gfp.h>
+#include <linux/hash.h>
#include <asm/mtrr.h>
#include <asm/tlbflush.h>
@@ -113,6 +114,55 @@ static atomic_t stopping_cpu = ATOMIC_INIT(-1);
static bool smp_no_nmi_ipi = false;
/*
+ * Wait on a congested ticket spinlock.
+ */
+#define MIN_SPINLOCK_DELAY 1
+#define MAX_SPINLOCK_DELAY 1000
+#define DELAY_HASH_SHIFT 4
+DEFINE_PER_CPU(int [1 << DELAY_HASH_SHIFT], spinlock_delay) = {
+ MIN_SPINLOCK_DELAY, MIN_SPINLOCK_DELAY,
+ MIN_SPINLOCK_DELAY, MIN_SPINLOCK_DELAY,
+ MIN_SPINLOCK_DELAY, MIN_SPINLOCK_DELAY,
+ MIN_SPINLOCK_DELAY, MIN_SPINLOCK_DELAY,
+ MIN_SPINLOCK_DELAY, MIN_SPINLOCK_DELAY,
+ MIN_SPINLOCK_DELAY, MIN_SPINLOCK_DELAY,
+ MIN_SPINLOCK_DELAY, MIN_SPINLOCK_DELAY,
+ MIN_SPINLOCK_DELAY, MIN_SPINLOCK_DELAY,
+};
+void ticket_spin_lock_wait(arch_spinlock_t *lock, struct __raw_tickets inc)
+{
+ unsigned int slot = hash_32((u32)(unsigned long)lock, DELAY_HASH_SHIFT);
+ int delay = __this_cpu_read(spinlock_delay[slot]);
+
+ for (;;) {
+ int loops = delay * (__ticket_t)(inc.tail - inc.head);
+
+ while (loops--)
+ cpu_relax();
+
+ inc.head = ACCESS_ONCE(lock->tickets.head);
+
+ if (inc.head == inc.tail) {
+ /* Decrease the delay, since we may have overslept. */
+ if (delay > MIN_SPINLOCK_DELAY)
+ delay--;
+ break;
+ }
+
+ /*
+ * The lock is still busy, the delay was not long enough.
+ * Going through here 2.7 times will, on average, cancel
+ * out the decrement above. Using a non-integer number
+ * gets rid of performance artifacts and reduces oversleeping.
+ */
+ if (delay < MAX_SPINLOCK_DELAY &&
+ (!(inc.head & 3) == 0 || (inc.head & 7) == 1))
+ delay++;
+ }
+ __this_cpu_write(spinlock_delay[slot], delay);
+}
+
+/*
* this function sends a 'reschedule' IPI to another CPU.
* it goes straight through and wastes no time serializing
* anything. Worst case is that we lose a reschedule ...
^ permalink raw reply related [flat|nested] 53+ messages in thread
* Re: [RFC PATCH 3/3 -v2] x86,smp: auto tune spinlock backoff delay factor
2012-12-26 19:10 ` Eric Dumazet
@ 2012-12-26 19:27 ` Eric Dumazet
2012-12-26 19:51 ` Rik van Riel
2012-12-29 10:27 ` Michel Lespinasse
2 siblings, 0 replies; 53+ messages in thread
From: Eric Dumazet @ 2012-12-26 19:27 UTC (permalink / raw)
To: Rik van Riel
Cc: Steven Rostedt, linux-kernel, aquini, walken, lwoodman, jeremy,
Jan Beulich, Thomas Gleixner, Tom Herbert
On Wed, 2012-12-26 at 11:10 -0800, Eric Dumazet wrote:
> +#define DELAY_HASH_SHIFT 4
> +DEFINE_PER_CPU(int [1 << DELAY_HASH_SHIFT], spinlock_delay) = {
> + MIN_SPINLOCK_DELAY, MIN_SPINLOCK_DELAY,
> + MIN_SPINLOCK_DELAY, MIN_SPINLOCK_DELAY,
> + MIN_SPINLOCK_DELAY, MIN_SPINLOCK_DELAY,
> + MIN_SPINLOCK_DELAY, MIN_SPINLOCK_DELAY,
> + MIN_SPINLOCK_DELAY, MIN_SPINLOCK_DELAY,
> + MIN_SPINLOCK_DELAY, MIN_SPINLOCK_DELAY,
> + MIN_SPINLOCK_DELAY, MIN_SPINLOCK_DELAY,
> + MIN_SPINLOCK_DELAY, MIN_SPINLOCK_DELAY,
> +};
This can use the following initializer, by the way:
DEFINE_PER_CPU(int [1 << DELAY_HASH_SHIFT], spinlock_delay) = {
[0 ... (1 << DELAY_HASH_SHIFT) - 1] = MIN_SPINLOCK_DELAY,
};
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [RFC PATCH 3/3 -v2] x86,smp: auto tune spinlock backoff delay factor
2012-12-26 19:10 ` Eric Dumazet
2012-12-26 19:27 ` Eric Dumazet
@ 2012-12-26 19:51 ` Rik van Riel
2012-12-27 6:07 ` Michel Lespinasse
2012-12-29 10:27 ` Michel Lespinasse
2 siblings, 1 reply; 53+ messages in thread
From: Rik van Riel @ 2012-12-26 19:51 UTC (permalink / raw)
To: Eric Dumazet
Cc: Steven Rostedt, linux-kernel, aquini, walken, lwoodman, jeremy,
Jan Beulich, Thomas Gleixner, Tom Herbert
On 12/26/2012 02:10 PM, Eric Dumazet wrote:
> I did some tests with your patches with following configuration :
>
> tc qdisc add dev eth0 root htb r2q 1000 default 3
> (to force a contention on qdisc lock, even with a multi queue net
> device)
>
> and 24 concurrent "netperf -t UDP_STREAM -H other_machine -- -m 128"
>
> Machine : 2 Intel(R) Xeon(R) CPU X5660 @ 2.80GHz
> (24 threads), and a fast NIC (10Gbps)
>
> Resulting in a 13 % regression (676 Mbits -> 595 Mbits)
>
> In this workload we have at least two contended spinlocks, with
> different delays. (spinlocks are not held for the same duration)
>
> It clearly defeats your assumption of a single per-cpu delay being OK:
> some CPUs keep backing off long after the lock has been released.
Thank you for breaking my patches.
I had been thinking about ways to deal with multiple
spinlocks, and hoping there would not be a serious
issue with systems contending on multiple locks.
> We might try to use a hash on lock address, and an array of 16 different
> delays so that different spinlocks have a chance of not sharing the same
> delay.
>
> With following patch, I get 982 Mbits/s with same bench, so an increase
> of 45 % instead of a 13 % regression.
Thank you even more for fixing my patches :)
That is a huge win!
Could I have your Signed-off-by: line, so I can merge
your hashed spinlock slots in?
I will probably keep it as a separate patch 4/4, with
your report and performance numbers in it, to preserve
the reason why we keep multiple hashed values, etc...
There is enough stuff in this code that will be
indistinguishable from magic if we do not document it
properly...
--
All rights reversed
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [RFC PATCH 3/3 -v2] x86,smp: auto tune spinlock backoff delay factor
2012-12-26 19:51 ` Rik van Riel
@ 2012-12-27 6:07 ` Michel Lespinasse
2012-12-27 14:27 ` Eric Dumazet
0 siblings, 1 reply; 53+ messages in thread
From: Michel Lespinasse @ 2012-12-27 6:07 UTC (permalink / raw)
To: Rik van Riel
Cc: Eric Dumazet, Steven Rostedt, linux-kernel, aquini, lwoodman,
jeremy, Jan Beulich, Thomas Gleixner, Tom Herbert
On Wed, Dec 26, 2012 at 11:51 AM, Rik van Riel <riel@redhat.com> wrote:
> On 12/26/2012 02:10 PM, Eric Dumazet wrote:
>> We might try to use a hash on lock address, and an array of 16 different
>> delays so that different spinlocks have a chance of not sharing the same
>> delay.
>>
>> With following patch, I get 982 Mbits/s with same bench, so an increase
>> of 45 % instead of a 13 % regression.
Awesome :)
> I will probably keep it as a separate patch 4/4, with
> your report and performance numbers in it, to preserve
> the reason why we keep multiple hashed values, etc...
>
> There is enough stuff in this code that will be
> indistinguishable from magic if we do not document it
> properly...
If we go with per-spinlock tunings, I feel we'll most likely want to
add an associative cache in order to avoid the 1/16 chance (~6%) of
getting 595Mbit/s instead of 982Mbit/s when there is a hash collision.
I would still prefer if we could make up something that didn't require
per-spinlock tunings, but it's not clear if that'll work. At least we
now know of a simple enough workload to figure it out :)
--
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [RFC PATCH 3/3 -v2] x86,smp: auto tune spinlock backoff delay factor
2012-12-27 6:07 ` Michel Lespinasse
@ 2012-12-27 14:27 ` Eric Dumazet
2012-12-27 14:35 ` Rik van Riel
0 siblings, 1 reply; 53+ messages in thread
From: Eric Dumazet @ 2012-12-27 14:27 UTC (permalink / raw)
To: Michel Lespinasse
Cc: Rik van Riel, Steven Rostedt, linux-kernel, aquini, lwoodman,
jeremy, Jan Beulich, Thomas Gleixner, Tom Herbert
On Wed, 2012-12-26 at 22:07 -0800, Michel Lespinasse wrote:
> If we go with per-spinlock tunings, I feel we'll most likely want to
> add an associative cache in order to avoid the 1/16 chance (~6%) of
> getting 595Mbit/s instead of 982Mbit/s when there is a hash collision.
>
> I would still prefer if we could make up something that didn't require
> per-spinlock tunings, but it's not clear if that'll work. At least we
> now know of a simple enough workload to figure it out :)
Even with a per spinlock tuning, we can find workloads where holding
time depends on the context.
For example, complex qdisc hierarchy typically use different times on
enqueue and dequeue operations.
So the hash sounds good to me, because the hash key could mix both lock
address and caller IP ( __builtin_return_address(1) in
ticket_spin_lock_wait())
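A rough sketch of what I mean, reusing the per-cpu spinlock_delay[]
array and DELAY_HASH_SHIFT from the patch I posted earlier (nothing new
implied here beyond mixing in the return address):

	u32 slot = hash_32((u32)(unsigned long)lock ^
			   (u32)(unsigned long)__builtin_return_address(1),
			   DELAY_HASH_SHIFT);
	int delay = __this_cpu_read(spinlock_delay[slot]);

__builtin_return_address(1) can be unreliable without frame pointers,
so treat this purely as an illustration of the key, not a tested patch.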
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [RFC PATCH 3/3 -v2] x86,smp: auto tune spinlock backoff delay factor
2012-12-27 14:27 ` Eric Dumazet
@ 2012-12-27 14:35 ` Rik van Riel
2012-12-27 18:41 ` Jan Beulich
2012-12-27 18:49 ` Eric Dumazet
0 siblings, 2 replies; 53+ messages in thread
From: Rik van Riel @ 2012-12-27 14:35 UTC (permalink / raw)
To: Eric Dumazet
Cc: Michel Lespinasse, Steven Rostedt, linux-kernel, aquini, lwoodman,
jeremy, Jan Beulich, Thomas Gleixner, Tom Herbert
On 12/27/2012 09:27 AM, Eric Dumazet wrote:
> On Wed, 2012-12-26 at 22:07 -0800, Michel Lespinasse wrote:
>
>> If we go with per-spinlock tunings, I feel we'll most likely want to
>> add an associative cache in order to avoid the 1/16 chance (~6%) of
>> getting 595Mbit/s instead of 982Mbit/s when there is a hash collision.
>>
>> I would still prefer if we could make up something that didn't require
>> per-spinlock tunings, but it's not clear if that'll work. At least we
>> now know of a simple enough workload to figure it out :)
>
> Even with a per spinlock tuning, we can find workloads where holding
> time depends on the context.
>
> For example, complex qdisc hierarchy typically use different times on
> enqueue and dequeue operations.
>
> So the hash sounds good to me, because the hash key could mix both lock
> address and caller IP ( __builtin_return_address(1) in
> ticket_spin_lock_wait())
The lock acquisition time depends on the holder of the lock,
and what the CPUs ahead of us in line will do with the lock,
not on the caller IP of the spinner.
Therefore, I am not convinced that hashing on the caller IP
will add much, if anything, except increasing the chance
that we end up not backing off when we should...
IMHO it would be good to try keeping this solution as simple
as we can get away with.
--
All rights reversed
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [RFC PATCH 3/3 -v2] x86,smp: auto tune spinlock backoff delay factor
2012-12-27 14:35 ` Rik van Riel
@ 2012-12-27 18:41 ` Jan Beulich
2012-12-27 19:09 ` Rik van Riel
2012-12-27 18:49 ` Eric Dumazet
1 sibling, 1 reply; 53+ messages in thread
From: Jan Beulich @ 2012-12-27 18:41 UTC (permalink / raw)
To: eric.dumazet, riel
Cc: rostedt, therbert, walken, jeremy, tglx, aquini, lwoodman,
linux-kernel
>>> Rik van Riel <riel@redhat.com> 12/27/12 4:01 PM >>>
>On 12/27/2012 09:27 AM, Eric Dumazet wrote:
>> So the hash sounds good to me, because the hash key could mix both lock
>> address and caller IP ( __builtin_return_address(1) in
>> ticket_spin_lock_wait())
>
>The lock acquisition time depends on the holder of the lock,
>and what the CPUs ahead of us in line will do with the lock,
>not on the caller IP of the spinner.
The lock holder could supply its __builtin_return_address(0) for use
in eventual hashing.
Also, with all of this - did you evaluate the alternative of using
monitor/mwait instead?
Jan
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [RFC PATCH 3/3 -v2] x86,smp: auto tune spinlock backoff delay factor
2012-12-27 14:35 ` Rik van Riel
2012-12-27 18:41 ` Jan Beulich
@ 2012-12-27 18:49 ` Eric Dumazet
2012-12-27 19:31 ` Rik van Riel
1 sibling, 1 reply; 53+ messages in thread
From: Eric Dumazet @ 2012-12-27 18:49 UTC (permalink / raw)
To: Rik van Riel
Cc: Michel Lespinasse, Steven Rostedt, linux-kernel, aquini, lwoodman,
jeremy, Jan Beulich, Thomas Gleixner, Tom Herbert
On Thu, 2012-12-27 at 09:35 -0500, Rik van Riel wrote:
>
> The lock acquisition time depends on the holder of the lock,
> and what the CPUs ahead of us in line will do with the lock,
> not on the caller IP of the spinner.
>
That would be true only for general cases.
In network land, we do have spinlock acquisition time depending on the
context.
A garbage collector usually runs for longer time than the regular fast
path.
But even without gc, it's pretty often that we have consumers/producers
that don't have the same amount of work to perform per lock/unlock section.
The socket lock, for example, might be held for very short sections in
process context (lock_sock() / release_sock()), but for longer sections
from softirq context. Of course, severe lock contention on a socket
seems unlikely in real workloads.
> Therefore, I am not convinced that hashing on the caller IP
> will add much, if anything, except increasing the chance
> that we end up not backing off when we should...
>
> IMHO it would be good to try keeping this solution as simple
> as we can get away with.
>
unsigned long hash = (unsigned long)lock ^
(unsigned long)__builtin_return_address(1);
seems simple enough to me, but I get your point.
I also recorded the max 'delay' value reached on my machine to check how
good MAX_SPINLOCK_DELAY value was :
[ 89.628265] cpu 16 delay 3710
[ 89.631230] cpu 6 delay 2930
[ 89.634120] cpu 15 delay 3186
[ 89.637092] cpu 18 delay 3789
[ 89.640071] cpu 22 delay 4012
[ 89.643080] cpu 11 delay 3389
[ 89.646057] cpu 21 delay 3123
[ 89.649035] cpu 9 delay 3295
[ 89.651931] cpu 3 delay 3063
[ 89.654811] cpu 14 delay 3335
Although it makes no performance difference to use a bigger/smaller one.
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [RFC PATCH 3/3 -v2] x86,smp: auto tune spinlock backoff delay factor
2012-12-27 18:41 ` Jan Beulich
@ 2012-12-27 19:09 ` Rik van Riel
2013-01-03 9:05 ` Jan Beulich
0 siblings, 1 reply; 53+ messages in thread
From: Rik van Riel @ 2012-12-27 19:09 UTC (permalink / raw)
To: Jan Beulich
Cc: eric.dumazet, rostedt, therbert, walken, jeremy, tglx, aquini,
lwoodman, linux-kernel
On 12/27/2012 01:41 PM, Jan Beulich wrote:
>>>> Rik van Riel <riel@redhat.com> 12/27/12 4:01 PM >>>
>> On 12/27/2012 09:27 AM, Eric Dumazet wrote:
>>> So the hash sounds good to me, because the hash key could mix both lock
>>> address and caller IP ( __builtin_return_address(1) in
>>> ticket_spin_lock_wait())
>>
>> The lock acquisition time depends on the holder of the lock,
>> and what the CPUs ahead of us in line will do with the lock,
>> not on the caller IP of the spinner.
>
> The lock holder could supply its __builtin_return_address(0) for use
> in eventual hashing.
>
> Also, with all of this - did you evaluate the alternative of using
> monitor/mwait instead?
How much bus traffic do monitor/mwait cause behind the scenes?
--
All rights reversed
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [RFC PATCH 3/3 -v2] x86,smp: auto tune spinlock backoff delay factor
2012-12-27 18:49 ` Eric Dumazet
@ 2012-12-27 19:31 ` Rik van Riel
2012-12-29 0:42 ` Eric Dumazet
0 siblings, 1 reply; 53+ messages in thread
From: Rik van Riel @ 2012-12-27 19:31 UTC (permalink / raw)
To: Eric Dumazet
Cc: Michel Lespinasse, Steven Rostedt, linux-kernel, aquini, lwoodman,
jeremy, Jan Beulich, Thomas Gleixner, Tom Herbert
On 12/27/2012 01:49 PM, Eric Dumazet wrote:
> On Thu, 2012-12-27 at 09:35 -0500, Rik van Riel wrote:
>
>>
>> The lock acquisition time depends on the holder of the lock,
>> and what the CPUs ahead of us in line will do with the lock,
>> not on the caller IP of the spinner.
>
> That would be true only for general cases.
>
> In network land, we do have spinlock acquisition time depending on the
> context.
>
> A garbage collector usually runs for longer time than the regular fast
> path.
Won't the garbage collector running hold up the lock
acquisition time of the OTHER acquirers?
> But even without gc, it's pretty often that we have consumers/producers
> that don't have the same amount of work to perform per lock/unlock section.
>
> The socket lock, for example, might be held for very short sections in
> process context (lock_sock() / release_sock()), but for longer sections
> from softirq context. Of course, severe lock contention on a socket
> seems unlikely in real workloads.
If one actor holds the lock for longer than the
others, surely it would be the others that suffer
in lock acquisition time?
>> Therefore, I am not convinced that hashing on the caller IP
>> will add much, if anything, except increasing the chance
>> that we end up not backing off when we should...
>>
>> IMHO it would be good to try keeping this solution as simple
>> as we can get away with.
>>
>
> unsigned long hash = (unsigned long)lock ^
> (unsigned long)__builtin_return_address(1);
>
> seems simple enough to me, but I get your point.
>
> I also recorded the max 'delay' value reached on my machine to check how
> good MAX_SPINLOCK_DELAY value was :
>
> [ 89.628265] cpu 16 delay 3710
> [ 89.631230] cpu 6 delay 2930
> [ 89.634120] cpu 15 delay 3186
> [ 89.637092] cpu 18 delay 3789
> [ 89.640071] cpu 22 delay 4012
> [ 89.643080] cpu 11 delay 3389
> [ 89.646057] cpu 21 delay 3123
> [ 89.649035] cpu 9 delay 3295
> [ 89.651931] cpu 3 delay 3063
> [ 89.654811] cpu 14 delay 3335
>
> Although it makes no performance difference to use a bigger/smaller one.
I guess we want a larger value.
With your hashed lock approach, we can get away with
larger values - they will not penalize other locks
the same way a single value per cpu might have.
--
All rights reversed
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [RFC PATCH 3/3 -v2] x86,smp: auto tune spinlock backoff delay factor
2012-12-27 19:31 ` Rik van Riel
@ 2012-12-29 0:42 ` Eric Dumazet
0 siblings, 0 replies; 53+ messages in thread
From: Eric Dumazet @ 2012-12-29 0:42 UTC (permalink / raw)
To: Rik van Riel
Cc: Michel Lespinasse, Steven Rostedt, linux-kernel, aquini, lwoodman,
jeremy, Jan Beulich, Thomas Gleixner, Tom Herbert
On Thu, 2012-12-27 at 14:31 -0500, Rik van Riel wrote:
> to use a bigger/smaller one.
>
> I guess we want a larger value.
>
> With your hashed lock approach, we can get away with
> larger values - they will not penalize other locks
> the same way a single value per cpu might have.
Then we absolutely want to detect hash collisions and clear the (now
wrong) estimate, or else we might 'pollute' a spinlock with the delay of
a very slow spinlock.
In my tests, the mm zone lock can be held for very long for example...
[ 657.439995] cpu 18 lock ffff88067fffeb40 delay 6906
[ 657.444855] ------------[ cut here ]------------
[ 657.444859] WARNING: at arch/x86/kernel/smp.c:170
ticket_spin_lock_wait+0xf9/0x100()
[ 657.444860] Hardware name: TBG,ICH10
[ 657.444861] Modules linked in: msr cpuid genrtc mlx4_en ib_uverbs
mlx4_ib ib_sa ib_mad ib_core mlx4_core e1000e bnx2x libcrc32c mdio ipv6
[ 657.444871] Pid: 24942, comm: hotplug Tainted: G W
3.8.0-smp-DEV #31
[ 657.444872] Call Trace:
[ 657.444876] [<ffffffff8108634f>] warn_slowpath_common+0x7f/0xc0
[ 657.444878] [<ffffffff810863aa>] warn_slowpath_null+0x1a/0x20
[ 657.444881] [<ffffffff81070089>] ticket_spin_lock_wait+0xf9/0x100
[ 657.444885] [<ffffffff815ad58f>] _raw_spin_lock_irqsave+0x2f/0x40
[ 657.444887] [<ffffffff8113f5d0>] release_pages+0x160/0x220
[ 657.444891] [<ffffffff8116cffe>] free_pages_and_swap_cache+0x9e/0xc0
[ 657.444893] [<ffffffff8107f8a8>] ? flush_tlb_mm_range+0x48/0x220
[ 657.444896] [<ffffffff81158ee7>] tlb_flush_mmu+0x67/0xb0
[ 657.444898] [<ffffffff81158f4c>] tlb_finish_mmu+0x1c/0x50
[ 657.444900] [<ffffffff81163346>] exit_mmap+0xf6/0x170
[ 657.444903] [<ffffffff81083737>] mmput+0x47/0xf0
[ 657.444906] [<ffffffff8108baed>] do_exit+0x24d/0xa20
[ 657.444908] [<ffffffff810986af>] ? recalc_sigpending+0x1f/0x60
[ 657.444910] [<ffffffff81098db7>] ? __set_task_blocked+0x37/0x80
[ 657.444913] [<ffffffff8108c354>] do_group_exit+0x44/0xa0
[ 657.444915] [<ffffffff8108c3c7>] sys_exit_group+0x17/0x20
[ 657.444918] [<ffffffff815b68a5>] sysenter_dispatch+0x7/0x1a
[ 657.444920] ---[ end trace a460fe18a5578dda ]---
My current function looks like :
/*
+ * Wait on a congested ticket spinlock.
+ */
+#define MIN_SPINLOCK_DELAY 1
+#define MAX_SPINLOCK_DELAY 10000
+#define DELAY_HASH_SHIFT 6
+struct delay_entry {
+ u32 hash;
+ u32 delay;
+};
+static DEFINE_PER_CPU(struct delay_entry [1 << DELAY_HASH_SHIFT], spinlock_delay) = {
+ [0 ... (1 << DELAY_HASH_SHIFT) - 1] = {
+ .hash = 0,
+ .delay = MIN_SPINLOCK_DELAY,
+ },
+};
+static DEFINE_PER_CPU(u16, maxdelay);
+
+void ticket_spin_lock_wait(arch_spinlock_t *lock, struct __raw_tickets inc)
+{
+ u32 hash = hash32_ptr(lock);
+ u32 slot = hash_32(hash, DELAY_HASH_SHIFT);
+ struct delay_entry *ent = &__get_cpu_var(spinlock_delay[slot]);
+ u32 delay = (ent->hash == hash) ? ent->delay : MIN_SPINLOCK_DELAY;
+
+ for (;;) {
+ u32 loops = delay * (__ticket_t)(inc.tail - inc.head);
+
+ loops -= delay >> 1;
+ while (loops--)
+ cpu_relax();
+
+ inc.head = ACCESS_ONCE(lock->tickets.head);
+
+ if (inc.head == inc.tail) {
+ /* Decrease the delay, since we may have overslept. */
+ if (delay > MIN_SPINLOCK_DELAY)
+ delay--;
+ break;
+ }
+
+ /*
+ * The lock is still busy, the delay was not long enough.
+ * Going through here 2.7 times will, on average, cancel
+ * out the decrement above. Using a non-integer number
+ * gets rid of performance artifacts and reduces oversleeping.
+ */
+ if (delay < MAX_SPINLOCK_DELAY &&
+ (!(inc.head & 3) == 0 || (inc.head & 7) == 1))
+ delay++;
+ }
+ if (__this_cpu_read(maxdelay) < delay) {
+ pr_err("cpu %d lock %p delay %d\n", smp_processor_id(), lock, delay);
+ __this_cpu_write(maxdelay, delay);
+ WARN_ON(1);
+ }
+ ent->hash = hash;
+ ent->delay = delay;
+}
+
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [RFC PATCH 3/3 -v2] x86,smp: auto tune spinlock backoff delay factor
2012-12-26 19:10 ` Eric Dumazet
2012-12-26 19:27 ` Eric Dumazet
2012-12-26 19:51 ` Rik van Riel
@ 2012-12-29 10:27 ` Michel Lespinasse
2013-01-03 18:17 ` Eric Dumazet
2 siblings, 1 reply; 53+ messages in thread
From: Michel Lespinasse @ 2012-12-29 10:27 UTC (permalink / raw)
To: Eric Dumazet
Cc: Rik van Riel, Steven Rostedt, linux-kernel, aquini, lwoodman,
jeremy, Jan Beulich, Thomas Gleixner, Tom Herbert
On Wed, Dec 26, 2012 at 11:10 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> I did some tests with your patches with following configuration :
>
> tc qdisc add dev eth0 root htb r2q 1000 default 3
> (to force a contention on qdisc lock, even with a multi queue net
> device)
>
> and 24 concurrent "netperf -t UDP_STREAM -H other_machine -- -m 128"
>
> Machine : 2 Intel(R) Xeon(R) CPU X5660 @ 2.80GHz
> (24 threads), and a fast NIC (10Gbps)
>
> Resulting in a 13 % regression (676 Mbits -> 595 Mbits)
I've been trying to use this workload on a similar machine. I am
getting some confusing results however:
with 24 concurrent netperf -t UDP_STREAM -H $target -- -m 128 -R 1 , I
am seeing some non-trivial run-to-run performance variation - about 5%
in the v3.7 baseline, but very significant after applying Rik's 3 patches.
My last few runs gave me results of 890.92, 1073.74, 963.13, 1234.41,
754.18, 893.82. This is generally better than what I'm getting with
baseline, but the variance is huge (which is somewhat surprising given
that rik's patches don't have the issue of hash collisions). Also,
this is significant in that I am not seeing the regression you were
observing with just these 3 patches.
If I add a 1 second delay in the netperf command line (netperf -t
UDP_STREAM -s 1 -H lpk18 -- -m 128 -R 1), I am seeing a very constant
660 Mbps result, but then I don't see any benefit from applying rik's
patches. I have no explanation for these results, but I am getting
them very consistently...
> In this workload we have at least two contended spinlocks, with
> different delays. (spinlocks are not held for the same duration)
Just to confirm, I believe you are refering to qdisc->q.lock and
qdisc->busylock ?
--
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.
^ permalink raw reply [flat|nested] 53+ messages in thread
* ticket spinlock proportional backoff experiments
2012-12-21 23:49 [RFC PATCH 0/3] x86,smp: make ticket spinlock proportional backoff w/ auto tuning Rik van Riel
` (2 preceding siblings ...)
2012-12-21 23:51 ` [RFC PATCH 3/3] x86,smp: auto tune spinlock backoff delay factor Rik van Riel
@ 2013-01-02 0:06 ` Michel Lespinasse
2013-01-02 0:09 ` [PATCH 1/2] x86,smp: simplify __ticket_spin_lock Michel Lespinasse
2013-01-02 0:10 ` [PATCH 2/2] x86,smp: proportional backoff for ticket spinlocks Michel Lespinasse
3 siblings, 2 replies; 53+ messages in thread
From: Michel Lespinasse @ 2013-01-02 0:06 UTC (permalink / raw)
To: Rik van Riel
Cc: linux-kernel, aquini, lwoodman, jeremy, Jan Beulich,
Thomas Gleixner, Eric Dumazet, Tom Herbert
[-- Attachment #1: Type: text/plain, Size: 6845 bytes --]
On Fri, Dec 21, 2012 at 3:49 PM, Rik van Riel <riel@redhat.com> wrote:
> Many spinlocks are embedded in data structures; having many CPUs
> pounce on the cache line the lock is in will slow down the lock
> holder, and can cause system performance to fall off a cliff.
>
> The paper "Non-scalable locks are dangerous" is a good reference:
> http://pdos.csail.mit.edu/papers/linux:lock.pdf
>
> Proportional delay in ticket locks is delaying the time between
> checking the ticket based on a delay factor, and the number of
> CPUs ahead of us in the queue for this lock. Checking the lock
> less often allows the lock holder to continue running, resulting
> in better throughput and preventing performance from dropping
> off a cliff.
So as I mentioned before, I had a few ideas about trying to remove any
per-spinlock tunings from the equation and see if we can get good
performance with just one per-system tuning value. I recently had time
to run these experiments, so here are my current results.
I have based my experiments on v3.7 and used the synchro-test module
from Andrew's tree as an easily understandable synthetic workload (I
had to extend synchro-test to add a spinlock test, which I posted
separately to lkml a couple days ago: "[PATCH 0/2] extend synchro-test
module to test spinlocks too")
I tested 3 software configurations:
base: v3.7 + synchro-test ("mutex subsystem, synchro-test module",
"mutex-subsystem-synchro-test-module-fix",
"mutex-subsystem-synchro-test-module-fix-2", "kernel: fix
synchro-test.c printk format warrnings" and "add spinlock test to
synchro-test module")
auto: base + Rik's proposed patches ("x86,smp: move waiting on
contended ticket lock out of line", "x86,smp: proportional backoff for
ticket spinlocks", "x86,smp: auto tune spinlock backoff delay factor")
mine: base + Rik's "x86,smp: move waiting on contended ticket lock out
of line" + my two patches (which I'll attach as replies) "x86,smp:
simplify __ticket_spin_lock" (should be just cosmetic) and my own
version of "x86,smp: proportional backoff for ticket spinlocks". This
version differs from Rik's principally by exporting the system-wide
tuning value through debugfs, and by using a delay of (ticket - head -
1) * spinlock_delay instead of just (ticket - head) *
spinlock_delay. Since spinlock_delay is a tunable parameter in my
experiments, I have been annotating my results with the spinlock_delay
value used in the experiment.
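As a concrete example of that formula: with spinlock_delay = 75 (the
value that turned out to work best on mach2, below), a waiter with
ticket - head = 4 -- the lock holder plus three waiters ahead of it --
spins through roughly (4 - 1) * 75 = 225 cpu_relax() iterations between
reads of the lock word, while the waiter that is next in line
(ticket - head = 1) just keeps polling.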
I have been testing on 3 machines:
mach1: 4 socket AMD Barcelona (16 cores total)
mach2: 2 socket Intel Westmere (12 cores / 24 threads total)
mach3: 2 socket Intel Sandybridge (16 cores / 32 threads total)
I ran two kinds of synthetic workloads using the synchro-test module:
noload: just N threads taking/releasing a single spinlock as fast as possible
load: N threads taking/releasing a single spinlock with 2uS loops
between each operation (2uS with spinlock held, 2uS with spinlock
released).
The test script looks like the following:
for i in 1 2 3 4 6 8 12 16 20 24 28 32; do
  printf "%2d spinlockers: " $i
  modprobe synchro-test sp=$i load=0 interval=0 do_sched=1 v=1 2>/dev/null || \
    true
  dmesg | sed 's/.*] //' | awk '/^spinlocks taken: / {sp=$3} END {print sp}'
done
(This is for the noload workload; the load workload is similar except for the load=0 interval=0 parameters being removed)
For each machine, I collected the noload performance results
first. When testing my patches, I did several runs and manually
adjusted the spinlock_delay parameter in order to get the best
performance. There is a bit of a line to walk there: too high and we
oversleep, thus reducing the lock utilization; too low and we do
excessive polling which reduces the lock throughput. However the tests
show that there is a small range of parameter values (~375 on mach1,
~75 on mach2, ~45 on mach3) which result in good performance
independently of the number of spinlocker threads.
mach3 was actually the tougher machine to tune for here. A tuning
value of 40 would have actually resulted in slightly higher
performance at 24-32 threads, and a value of 60 would have helped a
bit in the 6-12 threads range. However, we are only talking about
small percentage differences here - I think the main goal is to avoid
any large performance cliffs, and we certainly achieve that (note that
mach3 was already the best behaved of the 3 here)
I then collected the "load" performance results (with a 2uS load with
spinlock held, and a 2uS interval with spinlock released). On each
machine, I collected baseline, auto (rik's proposal), my patches with
the spinlock_delay value that had been determined best for the
"noload" workload, and then I tried to manually tune the
spinlock_delay value for best performance on the "load" workload.
Analysis:
- In the "noload" workload, my proposal can (with some hand tuning)
result in better performance than rik's autotuned approach on mach1
and mach2, and be at rough parity on mach3. I believe the benefits
on mach1 and mach2 are due to achieving even lower coherency traffic
than rik's autotuned target of ~3.7 accesses per contended spinlock
acquisition. As it turns out, it was possible to find a single
spinlock_delay value that works well no matter how many spinlocker
threads are running. On this point, I only hit partial success on
mach3, as I was able to find a "decent all-around" value for
spinlock_delay that reached parity with Rik's proposal, but did not
beat it (and it seems that the optimal value did depend on the
amount of spinlocker threads in use)
- In the "load" workload, Rik's proposal performs poorly. This is
especially apparent on mach2 and mach3, where the performance
degradation due to coherency traffic in the baseline run wasn't very
high in the first place; the extra delays introduced to try and
avoid that coherency traffic end up hurting more than they help.
- In my proposal, tuning for the "noload" workload and applying that
to the "load" workload seems to work well enough. It may be possible
to get a few extra percent performance by using per-spinlock
tunings, but the difference is low enough that I would think it's
not worth the extra complexity. I like the simplicity of having one
single per-system tuning value that works well for any spinlocks it
might get applied to.
Of course, one major downside in my proposal is that I haven't figured
out an automatic way to find the most appropriate spinlock_delay
system tunable. So there is clearly some more experimentation needed
there. However, IMO the important result here is that our goal of
avoiding performance cliffs seems to be reachable without the
complexity (and IMO, risk) of per-spinlock tuning values.
--
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.
[-- Attachment #2: mach1-load.png --]
[-- Type: image/png, Size: 4335 bytes --]
[-- Attachment #3: mach1-noload.png --]
[-- Type: image/png, Size: 4116 bytes --]
[-- Attachment #4: mach2-load.png --]
[-- Type: image/png, Size: 4428 bytes --]
[-- Attachment #5: mach2-noload.png --]
[-- Type: image/png, Size: 4403 bytes --]
[-- Attachment #6: mach3-load.png --]
[-- Type: image/png, Size: 4399 bytes --]
[-- Attachment #7: mach3-noload.png --]
[-- Type: image/png, Size: 3927 bytes --]
^ permalink raw reply [flat|nested] 53+ messages in thread
* [PATCH 1/2] x86,smp: simplify __ticket_spin_lock
2013-01-02 0:06 ` ticket spinlock proportional backoff experiments Michel Lespinasse
@ 2013-01-02 0:09 ` Michel Lespinasse
2013-01-02 15:31 ` Rik van Riel
2013-01-02 0:10 ` [PATCH 2/2] x86,smp: proportional backoff for ticket spinlocks Michel Lespinasse
1 sibling, 1 reply; 53+ messages in thread
From: Michel Lespinasse @ 2013-01-02 0:09 UTC (permalink / raw)
To: Rik van Riel
Cc: linux-kernel, aquini, lwoodman, jeremy, Jan Beulich,
Thomas Gleixner, Eric Dumazet, Tom Herbert
Just cosmetic - avoid an unnecessary goto construct
Signed-off-by: Michel Lespinasse <walken@google.com>
---
arch/x86/include/asm/spinlock.h | 7 ++-----
1 files changed, 2 insertions(+), 5 deletions(-)
diff --git a/arch/x86/include/asm/spinlock.h b/arch/x86/include/asm/spinlock.h
index 2a45eb0cdb2c..19e8a36b3b83 100644
--- a/arch/x86/include/asm/spinlock.h
+++ b/arch/x86/include/asm/spinlock.h
@@ -55,11 +55,8 @@ static __always_inline void __ticket_spin_lock(arch_spinlock_t *lock)
inc = xadd(&lock->tickets, inc);
- if (inc.head == inc.tail)
- goto out;
-
- ticket_spin_lock_wait(lock, inc);
- out:
+ if (inc.head != inc.tail)
+ ticket_spin_lock_wait(lock, inc);
barrier(); /* make sure nothing creeps before the lock is taken */
}
--
1.7.7.3
^ permalink raw reply related [flat|nested] 53+ messages in thread
* [PATCH 2/2] x86,smp: proportional backoff for ticket spinlocks
2013-01-02 0:06 ` ticket spinlock proportional backoff experiments Michel Lespinasse
2013-01-02 0:09 ` [PATCH 1/2] x86,smp: simplify __ticket_spin_lock Michel Lespinasse
@ 2013-01-02 0:10 ` Michel Lespinasse
1 sibling, 0 replies; 53+ messages in thread
From: Michel Lespinasse @ 2013-01-02 0:10 UTC (permalink / raw)
To: Rik van Riel
Cc: linux-kernel, aquini, lwoodman, jeremy, Jan Beulich,
Thomas Gleixner, Eric Dumazet, Tom Herbert
Simple fixed value proportional backoff for ticket spinlocks.
By pounding on the cacheline with the spin lock less often,
bus traffic is reduced. In cases of a data structure with
embedded spinlock, the lock holder has a better chance of
making progress.
Note that when a thread notices that it's at the head of the line for
acquiring the spinlock, this thread has already touched the spinlock's
cacheline and now holds it in shared state. At this point, extra reads
to the cache line are local to the processor and do not generate any
extra coherency traffic, until another thread (probably the spinlock
owner) writes to it. When that write occurs, the writing thread will
get exclusive access to the cacheline first, and then the waiting
thread will ask to get its shared access again. It is expected that in
many cases, the writing thread will release the spinlock before the
waiting thread gets its shared access back. For these reasons, it
seems unproductive for the head waiter to throttle its accesses;
however we do want to throttle the other waiter threads so that they
don't generate any extra coherency traffic until they can acquire the
spinlock, or at least reach the head position among waiters.
Signed-off-by: Michel Lespinasse <walken@google.com>
---
arch/x86/include/asm/spinlock.h | 2 ++
arch/x86/kernel/smp.c | 33 +++++++++++++++++++++++++++------
2 files changed, 29 insertions(+), 6 deletions(-)
diff --git a/arch/x86/include/asm/spinlock.h b/arch/x86/include/asm/spinlock.h
index 19e8a36b3b83..b49ae57a62c8 100644
--- a/arch/x86/include/asm/spinlock.h
+++ b/arch/x86/include/asm/spinlock.h
@@ -34,6 +34,8 @@
# define UNLOCK_LOCK_PREFIX
#endif
+extern unsigned int __read_mostly spinlock_delay;
+
extern void ticket_spin_lock_wait(arch_spinlock_t *, struct __raw_tickets);
/*
diff --git a/arch/x86/kernel/smp.c b/arch/x86/kernel/smp.c
index 20da35427bd5..eb2c49c6cc08 100644
--- a/arch/x86/kernel/smp.c
+++ b/arch/x86/kernel/smp.c
@@ -23,6 +23,7 @@
#include <linux/interrupt.h>
#include <linux/cpu.h>
#include <linux/gfp.h>
+#include <linux/debugfs.h>
#include <asm/mtrr.h>
#include <asm/tlbflush.h>
@@ -111,20 +112,40 @@
static atomic_t stopping_cpu = ATOMIC_INIT(-1);
static bool smp_no_nmi_ipi = false;
+unsigned int __read_mostly spinlock_delay = 1;
/*
* Wait on a congested ticket spinlock.
*/
void ticket_spin_lock_wait(arch_spinlock_t *lock, struct __raw_tickets inc)
{
- for (;;) {
- cpu_relax();
- inc.head = ACCESS_ONCE(lock->tickets.head);
+ __ticket_t head = inc.head, ticket = inc.tail;
+ __ticket_t waiters_ahead;
+ unsigned delay;
+ do {
+ waiters_ahead = ticket - head - 1;
+ if (!waiters_ahead) {
+ do
+ cpu_relax();
+ while (ACCESS_ONCE(lock->tickets.head) != ticket);
+ return;
+ }
+ delay = waiters_ahead * spinlock_delay;
+ do
+ cpu_relax();
+ while (delay--);
+ head = ACCESS_ONCE(lock->tickets.head);
+ } while (head != ticket);
+}
- if (inc.head == inc.tail)
- break;
- }
+#ifdef CONFIG_DEBUG_FS
+static __init int spinlock_delay_init_debug(void)
+{
+ debugfs_create_u32("spinlock_delay", 0644, NULL, &spinlock_delay);
+ return 0;
}
+late_initcall(spinlock_delay_init_debug);
+#endif
/*
* this function sends a 'reschedule' IPI to another CPU.
--
1.7.7.3
^ permalink raw reply related [flat|nested] 53+ messages in thread
* Re: [PATCH 1/2] x86,smp: simplify __ticket_spin_lock
2013-01-02 0:09 ` [PATCH 1/2] x86,smp: simplify __ticket_spin_lock Michel Lespinasse
@ 2013-01-02 15:31 ` Rik van Riel
0 siblings, 0 replies; 53+ messages in thread
From: Rik van Riel @ 2013-01-02 15:31 UTC (permalink / raw)
To: Michel Lespinasse
Cc: linux-kernel, aquini, lwoodman, jeremy, Jan Beulich,
Thomas Gleixner, Eric Dumazet, Tom Herbert
On 01/01/2013 07:09 PM, Michel Lespinasse wrote:
> Just cosmetic - avoid an unnecessary goto construct
>
> Signed-off-by: Michel Lespinasse <walken@google.com>
I have had this in my code base since you first suggested
it on December 21st. Thank you for sending in more improvements,
though :)
--
All rights reversed
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [RFC PATCH 3/3 -v2] x86,smp: auto tune spinlock backoff delay factor
2012-12-27 19:09 ` Rik van Riel
@ 2013-01-03 9:05 ` Jan Beulich
2013-01-03 13:24 ` Steven Rostedt
0 siblings, 1 reply; 53+ messages in thread
From: Jan Beulich @ 2013-01-03 9:05 UTC (permalink / raw)
To: Rik van Riel
Cc: eric.dumazet, rostedt, therbert, walken, jeremy, tglx, aquini,
lwoodman, linux-kernel
>>> On 27.12.12 at 20:09, Rik van Riel <riel@redhat.com> wrote:
> On 12/27/2012 01:41 PM, Jan Beulich wrote:
>>>>> Rik van Riel <riel@redhat.com> 12/27/12 4:01 PM >>>
>>> On 12/27/2012 09:27 AM, Eric Dumazet wrote:
>>>> So the hash sounds good to me, because the hash key could mix both lock
>>>> address and caller IP ( __builtin_return_address(1) in
>>>> ticket_spin_lock_wait())
>>>
>>> The lock acquisition time depends on the holder of the lock,
>>> and what the CPUs ahead of us in line will do with the lock,
>>> not on the caller IP of the spinner.
>>
>> The lock holder could supply its __builtin_return_address(0) for use
>> in eventual hashing.
>>
>> Also, with all of this - did you evaluate the alternative of using
>> monitor/mwait instead?
>
> How much bus traffic do monitor/mwait cause behind the scenes?
I would suppose that this just snoops the bus for writes, but the
amount of bus traffic involved in this isn't explicitly documented.
One downside of course is that unless a spin lock is made occupy
exactly a cache line, false wakeups are possible.
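Purely as an illustration of that point (not something the series
proposes): avoiding the false wakeups would mean giving the lock a
cache line of its own, e.g.

	struct padded_spinlock {
		arch_spinlock_t lock;
	} ____cacheline_aligned_in_smp;

which of course conflicts with the usual practice of embedding the
spinlock right next to the data it protects.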
Jan
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [RFC PATCH 3/3 -v2] x86,smp: auto tune spinlock backoff delay factor
2013-01-03 9:05 ` Jan Beulich
@ 2013-01-03 13:24 ` Steven Rostedt
2013-01-03 13:35 ` Eric Dumazet
0 siblings, 1 reply; 53+ messages in thread
From: Steven Rostedt @ 2013-01-03 13:24 UTC (permalink / raw)
To: Jan Beulich
Cc: Rik van Riel, eric.dumazet, therbert, walken, jeremy, tglx,
aquini, lwoodman, linux-kernel
On Thu, 2013-01-03 at 09:05 +0000, Jan Beulich wrote:
> > How much bus traffic do monitor/mwait cause behind the scenes?
>
> I would suppose that this just snoops the bus for writes, but the
> amount of bus traffic involved in this isn't explicitly documented.
>
> One downside of course is that unless a spin lock is made occupy
> exactly a cache line, false wakeups are possible.
And that would probably be very likely, as the whole purpose of Rik's
patches was to lower cache stalls due to other CPUs pounding on spin
locks that share the cache line of what is being protected (and
modified).
-- Steve
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [RFC PATCH 3/3 -v2] x86,smp: auto tune spinlock backoff delay factor
2013-01-03 13:24 ` Steven Rostedt
@ 2013-01-03 13:35 ` Eric Dumazet
2013-01-03 15:32 ` Steven Rostedt
0 siblings, 1 reply; 53+ messages in thread
From: Eric Dumazet @ 2013-01-03 13:35 UTC (permalink / raw)
To: Steven Rostedt
Cc: Jan Beulich, Rik van Riel, therbert, walken, jeremy, tglx, aquini,
lwoodman, linux-kernel
On Thu, 2013-01-03 at 08:24 -0500, Steven Rostedt wrote:
> On Thu, 2013-01-03 at 09:05 +0000, Jan Beulich wrote:
>
> > > How much bus traffic do monitor/mwait cause behind the scenes?
> >
> > I would suppose that this just snoops the bus for writes, but the
> > amount of bus traffic involved in this isn't explicitly documented.
> >
> > One downside of course is that unless a spin lock is made occupy
> > exactly a cache line, false wakeups are possible.
>
> And that would probably be very likely, as the whole purpose of Rik's
> patches was to lower cache stalls due to other CPUs pounding on spin
> locks that share the cache line of what is being protected (and
> modified).
A monitor/mwait would be an option only if using MCS (or K42 variant)
locks, where each cpu would wait on a private and dedicated cache line.
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [RFC PATCH 3/3 -v2] x86,smp: auto tune spinlock backoff delay factor
2013-01-03 13:35 ` Eric Dumazet
@ 2013-01-03 15:32 ` Steven Rostedt
2013-01-03 16:10 ` Eric Dumazet
0 siblings, 1 reply; 53+ messages in thread
From: Steven Rostedt @ 2013-01-03 15:32 UTC (permalink / raw)
To: Eric Dumazet
Cc: Jan Beulich, Rik van Riel, therbert, walken, jeremy, tglx, aquini,
lwoodman, linux-kernel
On Thu, 2013-01-03 at 05:35 -0800, Eric Dumazet wrote:
> On Thu, 2013-01-03 at 08:24 -0500, Steven Rostedt wrote:
> > On Thu, 2013-01-03 at 09:05 +0000, Jan Beulich wrote:
> >
> > > > How much bus traffic do monitor/mwait cause behind the scenes?
> > >
> > > I would suppose that this just snoops the bus for writes, but the
> > > amount of bus traffic involved in this isn't explicitly documented.
> > >
> > > One downside of course is that unless a spin lock is made to occupy
> > > exactly a cache line, false wakeups are possible.
> >
> > And that would probably be very likely, as the whole purpose of Rik's
> > patches was to lower cache stalls due to other CPUs pounding on spin
> > locks that share the cache line of what is being protected (and
> > modified).
>
> A monitor/mwait would be an option only if using MCS (or K42 variant)
> locks, where each cpu would wait on a private and dedicated cache line.
But then would the problem even exist? If the lock is on its own cache
line, it shouldn't cause a performance issue if other CPUs are spinning
on it. Would it?
-- Steve
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [RFC PATCH 3/3 -v2] x86,smp: auto tune spinlock backoff delay factor
2013-01-03 15:32 ` Steven Rostedt
@ 2013-01-03 16:10 ` Eric Dumazet
2013-01-03 16:45 ` Steven Rostedt
0 siblings, 1 reply; 53+ messages in thread
From: Eric Dumazet @ 2013-01-03 16:10 UTC (permalink / raw)
To: Steven Rostedt
Cc: Jan Beulich, Rik van Riel, therbert, walken, jeremy, tglx, aquini,
lwoodman, linux-kernel
On Thu, 2013-01-03 at 10:32 -0500, Steven Rostedt wrote:
> On Thu, 2013-01-03 at 05:35 -0800, Eric Dumazet wrote:
> > On Thu, 2013-01-03 at 08:24 -0500, Steven Rostedt wrote:
> > > On Thu, 2013-01-03 at 09:05 +0000, Jan Beulich wrote:
> > >
> > > > > How much bus traffic do monitor/mwait cause behind the scenes?
> > > >
> > > > I would suppose that this just snoops the bus for writes, but the
> > > > amount of bus traffic involved in this isn't explicitly documented.
> > > >
> > > > One downside of course is that unless a spin lock is made to occupy
> > > > exactly a cache line, false wakeups are possible.
> > >
> > > And that would probably be very likely, as the whole purpose of Rik's
> > > patches was to lower cache stalls due to other CPUs pounding on spin
> > > locks that share the cache line of what is being protected (and
> > > modified).
> >
> > A monitor/mwait would be an option only if using MCS (or K42 variant)
> > locks, where each cpu would wait on a private and dedicated cache line.
>
>
> But then would the problem even exist? If the lock is on its own cache
> line, it shouldn't cause a performance issue if other CPUs are spinning
> on it. Would it?
Not sure I understand the question.
The lock itself would not consume a whole cache line; only the items
chained on it would be per-cpu and cache-line aligned.
http://www.cs.rochester.edu/research/synchronization/pseudocode/ss.html#mcs
Instead of busy-spinning in
    repeat while I->next = nil
this part could use monitor/mwait.
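(Purely as an editorial illustration of that structure, a condensed C sketch
of an MCS-style queue lock; the names are made up and this is not an existing
kernel API. Each waiter spins on a field in its own cache-line-aligned node,
so monitor/mwait could be armed on a private line at either marked spin site.)

#include <linux/cache.h>      /* ____cacheline_aligned */
#include <linux/compiler.h>   /* ACCESS_ONCE()         */
#include <asm/cmpxchg.h>      /* xchg(), cmpxchg()     */
#include <asm/processor.h>    /* cpu_relax()           */

struct mcs_node {
	struct mcs_node *next;
	int locked;
} ____cacheline_aligned;

static void mcs_lock(struct mcs_node **lock, struct mcs_node *node)
{
	struct mcs_node *prev;

	node->next = NULL;
	prev = xchg(lock, node);		/* enqueue ourselves atomically */
	if (prev) {
		node->locked = 0;
		ACCESS_ONCE(prev->next) = node;
		while (!ACCESS_ONCE(node->locked))	/* spin 1: private line */
			cpu_relax();
	}
}

static void mcs_unlock(struct mcs_node **lock, struct mcs_node *node)
{
	if (!node->next) {
		/* no visible successor: try to release the lock outright */
		if (cmpxchg(lock, node, NULL) == node)
			return;
		while (!ACCESS_ONCE(node->next))	/* spin 2: "repeat while I->next = nil" */
			cpu_relax();
	}
	ACCESS_ONCE(node->next->locked) = 1;	/* hand the lock to the successor */
}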
But:
1) We don't have such a lock implementation.
2) Trying to save power while waiting on a spinlock would be a clear
sign something is wrong in the implementation. A spinlock should not
protect a long critical section.
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [RFC PATCH 3/3 -v2] x86,smp: auto tune spinlock backoff delay factor
2013-01-03 16:10 ` Eric Dumazet
@ 2013-01-03 16:45 ` Steven Rostedt
2013-01-03 17:54 ` Eric Dumazet
0 siblings, 1 reply; 53+ messages in thread
From: Steven Rostedt @ 2013-01-03 16:45 UTC (permalink / raw)
To: Eric Dumazet
Cc: Jan Beulich, Rik van Riel, therbert, walken, jeremy, tglx, aquini,
lwoodman, linux-kernel
On Thu, 2013-01-03 at 08:10 -0800, Eric Dumazet wrote:
> > But then would the problem even exist? If the lock is on its own cache
> > line, it shouldn't cause a performance issue if other CPUs are spinning
> > on it. Would it?
>
> Not sure I understand the question.
>
I'll explain my question better.
I thought the whole point of Rik's patches was to solve a performance
problem caused by contention on a lock that shares a cache line with
data.
In the ideal case, locks won't be contended, and are taken and released
quickly (being from the RT world, I know this isn't true :-( ). In this
case, it's also advantageous to keep the lock on the same cache line as
the data that's being updated. This way, the process of grabbing the
lock also pulls in the data that you will soon be using.
But then the problem occurs when you have a bunch of other CPUs trying
to take this lock in a tight spin. Every time the owner of the lock
touches the data, the other CPUs doing a LOCK read on the spinlock
cause bus contention on the owner CPU, because the data shares the
cache line and needs to be synced: the owner CPU has just touched the
very cache line that the other CPUs are hammering with a tight loop of
LOCK reads. By adding the delays, the CPU with the lock doesn't stall
at every update of the data protected by the lock.
Thus, if monitor/mwait is only suitable for locks that sit on their own
cache line, it is pointless for the locks that are causing the issue we
are trying to fix.
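(The delay loop described above is, roughly, the following editorial sketch.
It paraphrases the idea rather than reproducing Rik's actual patch; the
function name is made up and "delay" stands for the fixed or auto-tuned
per-waiter delay factor.)

static void ticket_spin_wait(arch_spinlock_t *lock, __ticket_t my_ticket,
			     unsigned int delay)
{
	for (;;) {
		__ticket_t head = ACCESS_ONCE(lock->tickets.head);
		unsigned int waiters_ahead = (__ticket_t)(my_ticket - head);
		unsigned int loops;

		if (!waiters_ahead)
			return;		/* head reached our ticket: the lock is ours */

		/* back off longer the more CPUs are queued ahead of us */
		for (loops = delay * waiters_ahead; loops; loops--)
			cpu_relax();
	}
}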
-- Steve
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [RFC PATCH 3/3 -v2] x86,smp: auto tune spinlock backoff delay factor
2013-01-03 16:45 ` Steven Rostedt
@ 2013-01-03 17:54 ` Eric Dumazet
0 siblings, 0 replies; 53+ messages in thread
From: Eric Dumazet @ 2013-01-03 17:54 UTC (permalink / raw)
To: Steven Rostedt
Cc: Jan Beulich, Rik van Riel, therbert, walken, jeremy, tglx, aquini,
lwoodman, linux-kernel
On Thu, 2013-01-03 at 11:45 -0500, Steven Rostedt wrote:
> On Thu, 2013-01-03 at 08:10 -0800, Eric Dumazet wrote:
>
> > > But then would the problem even exist? If the lock is on its own cache
> > > line, it shouldn't cause a performance issue if other CPUs are spinning
> > > on it. Would it?
> >
> > Not sure I understand the question.
> >
>
> I'll explain my question better.
>
> I thought the whole point of Rik's patches was to solve a performance
> problem caused by contention on a lock that shares a cache line with
> data.
>
> In the ideal case, locks won't be contended, and are taken and released
> quickly (being from the RT world, I know this isn't true :-( ). In this
> case, it's also advantageous to keep the lock on the same cache line as
> the data that's being updated. This way, the process of grabbing the
> lock also pulls in the data that you will soon be using.
>
> But then the problem occurs when you have a bunch of other CPUs trying
> to take this lock in a tight spin. Every time the owner of the lock
> touches the data, the other CPUs doing a LOCK read on the spinlock
> cause bus contention on the owner CPU, because the data shares the
> cache line and needs to be synced: the owner CPU has just touched the
> very cache line that the other CPUs are hammering with a tight loop of
> LOCK reads. By adding the delays, the CPU with the lock doesn't stall
> at every update of the data protected by the lock.
>
> Thus, if monitor/mwait is only suitable for locks that sit on their own
> cache line, it is pointless for the locks that are causing the issue we
> are trying to fix.
I think you misunderstood the monitor/mwait usage I was speaking of:
it is only for an MCS-type lock, where each CPU spins on its own
busy/locked bit.
Of course, if we use a ticket spinlock with no additional storage, we
have to spin while making as few memory references as possible, and
that's what Rik's patch does, using this idea:
http://www.cs.rochester.edu/research/synchronization/pseudocode/ss.html#ticket
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [RFC PATCH 3/3 -v2] x86,smp: auto tune spinlock backoff delay factor
2012-12-29 10:27 ` Michel Lespinasse
@ 2013-01-03 18:17 ` Eric Dumazet
0 siblings, 0 replies; 53+ messages in thread
From: Eric Dumazet @ 2013-01-03 18:17 UTC (permalink / raw)
To: Michel Lespinasse
Cc: Rik van Riel, Steven Rostedt, linux-kernel, aquini, lwoodman,
jeremy, Jan Beulich, Thomas Gleixner, Tom Herbert
On Sat, 2012-12-29 at 02:27 -0800, Michel Lespinasse wrote:
> On Wed, Dec 26, 2012 at 11:10 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > I did some tests with your patches with following configuration :
> >
> > tc qdisc add dev eth0 root htb r2q 1000 default 3
> > (to force a contention on qdisc lock, even with a multi queue net
> > device)
> >
> > and 24 concurrent "netperf -t UDP_STREAM -H other_machine -- -m 128"
> >
> > Machine : 2 Intel(R) Xeon(R) CPU X5660 @ 2.80GHz
> > (24 threads), and a fast NIC (10Gbps)
> >
> > Resulting in a 13 % regression (676 Mbits -> 595 Mbits)
>
> I've been trying to use this workload on a similar machine. I am
> getting some confusing results however:
>
> with 24 concurrent netperf -t UDP_STREAM -H $target -- -m 128 -R 1, I
> am seeing some non-trivial run-to-run performance variation - about 5%
> in v3.7 baseline, but very significant after applying Rik's 3 patches.
> My last few runs gave me results of 890.92, 1073.74, 963.13, 1234.41,
> 754.18, 893.82. This is generally better than what I'm getting with
> baseline, but the variance is huge (which is somewhat surprising given
> that Rik's patches don't have the issue of hash collisions).
You mean that with Rik's patch there definitely is an issue, as it has a
single bucket, so chances of collisions are high.
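(For illustration of the single-bucket point, a hedged sketch of what a
hashed per-cpu delay table could look like; the names, sizes and layout
chosen here are made up, not taken from Rik's patches.)

#include <linux/hash.h>      /* hash_ptr()                         */
#include <linux/percpu.h>    /* DEFINE_PER_CPU(), __get_cpu_var()  */

#define DELAY_HASH_BITS  5
#define DELAY_HASH_SIZE  (1 << DELAY_HASH_BITS)

struct delay_entry {
	void *lock;		/* lock this delay was last tuned for */
	unsigned int delay;	/* current backoff delay factor       */
};

static DEFINE_PER_CPU(struct delay_entry, spinlock_delay[DELAY_HASH_SIZE]);

/* caller is busy-waiting on a spinlock, so preemption is already off */
static struct delay_entry *lock_delay_entry(void *lock)
{
	unsigned long idx = hash_ptr(lock, DELAY_HASH_BITS);

	/*
	 * Collisions are still possible, but two heavily contended locks
	 * (say qdisc->q.lock and qdisc->busylock) are unlikely to end up
	 * sharing one tuning value across 32 buckets.
	 */
	return &__get_cpu_var(spinlock_delay)[idx];
}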
Your numbers being very erratic, I suspect you might be hitting another
limit. My tests involved a NIC with 24 transmit queues, to take the
per-TX-queue lock out of the bench equation.
My guess is you use a NIC with 4 or 8 TX queues.
"ethtool -S eth0" would probably give some hints.
> Also,
> this is significant in that I am not seeing the regression you were
> observing with just these 3 patches.
>
> If I add a 1 second delay in the netperf command line (netperf -t
> UDP_STREAM -s 1 -H lpk18 -- -m 128 -R 1), I am seeing a very constant
> 660 Mbps result, but then I don't see any benefit from applying Rik's
> patches. I have no explanation for these results, but I am getting
> them very consistently...
>
> > In this workload we have at least two contended spinlocks, with
> > different delays. (spinlocks are not held for the same duration)
>
> Just to confirm, I believe you are referring to qdisc->q.lock and
> qdisc->busylock ?
>
Yes.
^ permalink raw reply [flat|nested] 53+ messages in thread
end of thread, other threads:[~2013-01-03 18:23 UTC | newest]
Thread overview: 53+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-12-21 23:49 [RFC PATCH 0/3] x86,smp: make ticket spinlock proportional backoff w/ auto tuning Rik van Riel
2012-12-21 23:50 ` [RFC PATCH 1/3] x86,smp: move waiting on contended lock out of line Rik van Riel
2012-12-22 3:05 ` Steven Rostedt
2012-12-22 4:40 ` Michel Lespinasse
2012-12-22 4:48 ` Rik van Riel
2012-12-23 22:52 ` Rafael Aquini
2012-12-21 23:51 ` [RFC PATCH 2/3] x86,smp: proportional backoff for ticket spinlocks Rik van Riel
2012-12-22 3:07 ` Steven Rostedt
2012-12-22 3:14 ` Steven Rostedt
2012-12-22 3:47 ` Rik van Riel
2012-12-22 4:44 ` Michel Lespinasse
2012-12-23 22:55 ` Rafael Aquini
2012-12-21 23:51 ` [RFC PATCH 3/3] x86,smp: auto tune spinlock backoff delay factor Rik van Riel
2012-12-21 23:56 ` [RFC PATCH 3/3 -v2] " Rik van Riel
2012-12-22 0:18 ` Eric Dumazet
2012-12-22 2:43 ` Rik van Riel
2012-12-22 0:48 ` Eric Dumazet
2012-12-22 2:57 ` Rik van Riel
2012-12-22 3:29 ` Eric Dumazet
2012-12-22 3:44 ` Rik van Riel
2012-12-22 3:33 ` Steven Rostedt
2012-12-22 3:50 ` Rik van Riel
2012-12-26 19:10 ` Eric Dumazet
2012-12-26 19:27 ` Eric Dumazet
2012-12-26 19:51 ` Rik van Riel
2012-12-27 6:07 ` Michel Lespinasse
2012-12-27 14:27 ` Eric Dumazet
2012-12-27 14:35 ` Rik van Riel
2012-12-27 18:41 ` Jan Beulich
2012-12-27 19:09 ` Rik van Riel
2013-01-03 9:05 ` Jan Beulich
2013-01-03 13:24 ` Steven Rostedt
2013-01-03 13:35 ` Eric Dumazet
2013-01-03 15:32 ` Steven Rostedt
2013-01-03 16:10 ` Eric Dumazet
2013-01-03 16:45 ` Steven Rostedt
2013-01-03 17:54 ` Eric Dumazet
2012-12-27 18:49 ` Eric Dumazet
2012-12-27 19:31 ` Rik van Riel
2012-12-29 0:42 ` Eric Dumazet
2012-12-29 10:27 ` Michel Lespinasse
2013-01-03 18:17 ` Eric Dumazet
2012-12-22 0:47 ` [RFC PATCH 3/3] " David Daney
2012-12-22 2:51 ` Rik van Riel
2012-12-22 3:49 ` Steven Rostedt
2012-12-22 3:58 ` Rik van Riel
2012-12-23 23:08 ` Rafael Aquini
2012-12-22 5:42 ` Michel Lespinasse
2012-12-22 14:32 ` Rik van Riel
2013-01-02 0:06 ` ticket spinlock proportional backoff experiments Michel Lespinasse
2013-01-02 0:09 ` [PATCH 1/2] x86,smp: simplify __ticket_spin_lock Michel Lespinasse
2013-01-02 15:31 ` Rik van Riel
2013-01-02 0:10 ` [PATCH 2/2] x86,smp: proportional backoff for ticket spinlocks Michel Lespinasse