public inbox for linux-kernel@vger.kernel.org
* Spinlocks waiting with interrupts disabled / preempt disabled.
@ 2008-05-06 19:26 Christoph Lameter
  2008-05-07  7:30 ` Ingo Molnar
  0 siblings, 1 reply; 35+ messages in thread
From: Christoph Lameter @ 2008-05-06 19:26 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: linux-kernel

We just had a couple of cases of systems failing because interrupts were 
not serviced. It turned out that we were trying to acquire the treelock for 
write while lots of readers were starving the writer. A device 
timed out before the write lock was acquired.

If interrupts were enabled while busy waiting, then such scenarios 
could be avoided.

Could we make the wait cases more interrupt / preempt friendly by 
re-enabling interrupts / preempt while waiting? We had this interrupt-
friendly behavior for a long time on IA64. If we add a special busy-wait 
path then we can also avoid atomic ops and acquire the cacheline only 
for read.

For example:

Index: linux/kernel/spinlock.c
===================================================================
--- linux.orig/kernel/spinlock.c	2008-05-05 12:22:16.000000000 -0500
+++ linux/kernel/spinlock.c	2008-05-06 14:05:43.953016660 -0500
@@ -132,10 +132,14 @@ unsigned long __lockfunc _write_lock_irq
 {
 	unsigned long flags;
 
+retry:
 	local_irq_save(flags);
-	preempt_disable();
-	_raw_write_lock(lock);
-	return flags;
+	if (_write_trylock(lock))
+		return flags;
+	local_irq_restore(flags);
+	while (!write_can_lock(lock))
+		cpu_relax();
+	goto retry;
 }
 EXPORT_SYMBOL(_write_lock_irqsave);
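For illustration, the retry loop in the patch can be modelled in plain C11 in userspace. Everything here is a stand-in, not kernel API: fake_irq_save()/fake_irq_restore() emulate the irq flags, and an atomic_bool plays the raw write lock.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-ins for local_irq_save()/local_irq_restore(). */
static bool irqs_on = true;
static unsigned long fake_irq_save(void)
{
	unsigned long flags = irqs_on;
	irqs_on = false;
	return flags;
}
static void fake_irq_restore(unsigned long flags)
{
	irqs_on = flags;
}

/* Stand-ins for _write_trylock()/write_can_lock() on an atomic_bool. */
static bool write_trylock(atomic_bool *lock)
{
	return !atomic_exchange(lock, true);
}
static bool write_can_lock(atomic_bool *lock)
{
	return !atomic_load(lock);
}

/* The loop from the patch: only the trylock runs with "irqs" off;
 * the busy-wait happens with them re-enabled, using plain reads. */
static unsigned long write_lock_irqsave_friendly(atomic_bool *lock)
{
	unsigned long flags;
retry:
	flags = fake_irq_save();
	if (write_trylock(lock))
		return flags;		/* acquired: irqs stay disabled */
	fake_irq_restore(flags);	/* lost: spin with irqs enabled */
	while (!write_can_lock(lock))
		;			/* cpu_relax() stand-in */
	goto retry;
}
```

Note how the busy-wait uses write_can_lock(), a plain load, so the cacheline is only acquired for read while spinning; the atomic exchange happens only when the lock looks free.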
 

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Spinlocks waiting with interrupts disabled / preempt disabled.
  2008-05-06 19:26 Spinlocks waiting with interrupts disabled / preempt disabled Christoph Lameter
@ 2008-05-07  7:30 ` Ingo Molnar
  2008-05-07 17:04   ` Christoph Lameter
  2008-05-09 16:35   ` Spinlocks waiting with interrupts disabled / preempt disabled Olaf Weber
  0 siblings, 2 replies; 35+ messages in thread
From: Ingo Molnar @ 2008-05-07  7:30 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: linux-kernel, Peter Zijlstra


* Christoph Lameter <clameter@sgi.com> wrote:

> @@ -132,10 +132,14 @@ unsigned long __lockfunc _write_lock_irq
>  {
>  	unsigned long flags;
>  
> +retry:
>  	local_irq_save(flags);
> -	preempt_disable();
> -	_raw_write_lock(lock);
> -	return flags;
> +	if (_write_trylock(lock))
> +		return flags;
> +	local_irq_restore(flags);
> +	while (!write_can_lock(lock))
> +		cpu_relax();
> +	goto retry;
>  }
>  EXPORT_SYMBOL(_write_lock_irqsave);

hm, this is done at too high a level and will turn off some debugging 
code. I.e. if we don't just loop long but truly deadlock here, we won't 
call lib/spinlock_debug.c's _raw_write_lock() code that does some sanity 
checks in the debug case.

so how about doing this at a deeper level and adding a new 
__raw_write_lock_flags() primitive that would look at the flags value 
and could enable interrupts in the lowlevel code?

	Ingo


* Re: Spinlocks waiting with interrupts disabled / preempt disabled.
  2008-05-07  7:30 ` Ingo Molnar
@ 2008-05-07 17:04   ` Christoph Lameter
  2008-05-07 17:24     ` Christoph Lameter
  2008-05-09 16:35   ` Spinlocks waiting with interrupts disabled / preempt disabled Olaf Weber
  1 sibling, 1 reply; 35+ messages in thread
From: Christoph Lameter @ 2008-05-07 17:04 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: linux-kernel, Peter Zijlstra

On Wed, 7 May 2008, Ingo Molnar wrote:

> > +		return flags;
> > +	local_irq_restore(flags);
> > +	while (!write_can_lock(lock))
> > +		cpu_relax();
> > +	goto retry;
> >  }
> >  EXPORT_SYMBOL(_write_lock_irqsave);
> 
> hm, this is done at too high a level and will turn off some debugging 
> code. I.e. if we don't just loop long but truly deadlock here, we won't 
> call lib/spinlock_debug.c's _raw_write_lock() code that does some sanity 
> checks in the debug case.

Right. I guessed as much, given the gazillion helper functions, and wanted 
to know the right way to address this.

> so how about doing this on a deeper level and adding a new 
> __raw_write_lock_flags() primitive that would look at the flags value 
> and could enable interrupts in the lowlevel code?

Ok, I will look at that. Note that this is not unique to _write_lock_irqsave; 
all the other lock variants that disable interrupts seem to have the same 
issue. 

We are likely going to duplicate a lot of functions.




* Re: Spinlocks waiting with interrupts disabled / preempt disabled.
  2008-05-07 17:04   ` Christoph Lameter
@ 2008-05-07 17:24     ` Christoph Lameter
  2008-05-07 18:49       ` Spinlocks: Factor out GENERIC_LOCKBREAK in order to avoid spinning with irqs disabled Christoph Lameter
  0 siblings, 1 reply; 35+ messages in thread
From: Christoph Lameter @ 2008-05-07 17:24 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: linux-kernel, Peter Zijlstra

Looks like an issue with old code in SLES10. Upstream has:

/*
 * If lockdep is enabled then we use the non-preemption spin-ops
 * even on CONFIG_PREEMPT, because lockdep assumes that interrupts are
 * not re-enabled during lock-acquire (which the preempt-spin-ops do):
 */
#if !defined(CONFIG_GENERIC_LOCKBREAK) || defined(CONFIG_DEBUG_LOCK_ALLOC)

... followed by function definitions that disable interrupts before 
spinning.

#else

...

unsigned long __lockfunc _##op##_lock_irqsave(locktype##_t *lock)       \
{                                                                       \
        unsigned long flags;                                            \
                                                                        \
        for (;;) {                                                      \
                preempt_disable();                                      \
                local_irq_save(flags);                                  \
                if (likely(_raw_##op##_trylock(lock)))                  \
                        break;                                          \
                local_irq_restore(flags);                               \
                preempt_enable();                                       \
                                                                        \
                if (!(lock)->break_lock)                                \
                        (lock)->break_lock = 1;                         \
                while (!op##_can_lock(lock) && (lock)->break_lock)      \
                        _raw_##op##_relax(&lock->raw_lock);             \
        }                                                               \
        (lock)->break_lock = 0;                                         \
        return flags;                                                   \
}                                                                       \
                                       
So this is only an issue if certain debug options are enabled.



* Spinlocks: Factor out GENERIC_LOCKBREAK in order to avoid spinning with irqs disabled
  2008-05-07 17:24     ` Christoph Lameter
@ 2008-05-07 18:49       ` Christoph Lameter
  2008-05-09 10:26         ` Ingo Molnar
  2008-06-23 17:19         ` Petr Tesarik
  0 siblings, 2 replies; 35+ messages in thread
From: Christoph Lameter @ 2008-05-07 18:49 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: linux-kernel, Peter Zijlstra

Subject: Spinlocks: Factor out GENERIC_LOCKBREAK in order to avoid spinning with irqs disabled

The nice spinlock functions that enable interrupts while spinning are only
usable if GENERIC_LOCKBREAK is set (and lock debugging is off, but lock
debugging would not be used in a production configuration anyway).

Factor out the dependencies on the ->break_lock field from the nice functions
into a set of BREAKLOCK macros (these cannot be functions since the type of
the lock variable varies).

The nice spinlock functions can then also be used if !GENERIC_LOCKBREAK.

Signed-off-by: Christoph Lameter <clameter@sgi.com>

---
 kernel/spinlock.c |   40 ++++++++++++++++++++++++++++++----------
 1 file changed, 30 insertions(+), 10 deletions(-)

Index: linux-2.6/kernel/spinlock.c
===================================================================
--- linux-2.6.orig/kernel/spinlock.c	2008-05-07 11:19:31.000000000 -0700
+++ linux-2.6/kernel/spinlock.c	2008-05-07 11:40:56.000000000 -0700
@@ -65,7 +65,7 @@ EXPORT_SYMBOL(_write_trylock);
  * even on CONFIG_PREEMPT, because lockdep assumes that interrupts are
  * not re-enabled during lock-acquire (which the preempt-spin-ops do):
  */
-#if !defined(CONFIG_GENERIC_LOCKBREAK) || defined(CONFIG_DEBUG_LOCK_ALLOC)
+#ifdef CONFIG_DEBUG_LOCK_ALLOC
 
 void __lockfunc _read_lock(rwlock_t *lock)
 {
@@ -192,7 +192,7 @@ void __lockfunc _write_lock(rwlock_t *lo
 
 EXPORT_SYMBOL(_write_lock);
 
-#else /* CONFIG_PREEMPT: */
+#else /* CONFIG_DEBUG_LOCK_ALLOC: */
 
 /*
  * This could be a long-held lock. We both prepare to spin for a long
@@ -201,6 +201,28 @@ EXPORT_SYMBOL(_write_lock);
  *
  * (We do this in a function because inlining it would be excessive.)
  */
+#ifdef CONFIG_GENERIC_LOCKBREAK
+
+#define BREAKLOCK(lock) (lock)->break_lock
+
+#define BREAKLOCK_ENABLE(lock)						\
+do {									\
+	if (!(lock)->break_lock)					\
+		(lock)->break_lock = 1;					\
+} while (0)
+
+#define BREAKLOCK_DISABLE(lock)						\
+do {									\
+	 (lock)->break_lock = 0;					\
+} while (0)
+
+#else
+
+#define BREAKLOCK(lock)	0
+#define BREAKLOCK_ENABLE(lock)	do { } while (0)
+#define BREAKLOCK_DISABLE(lock) do { } while (0)
+
+#endif
 
 #define BUILD_LOCK_OPS(op, locktype)					\
 void __lockfunc _##op##_lock(locktype##_t *lock)			\
@@ -211,12 +233,11 @@ void __lockfunc _##op##_lock(locktype##_
 			break;						\
 		preempt_enable();					\
 									\
-		if (!(lock)->break_lock)				\
-			(lock)->break_lock = 1;				\
-		while (!op##_can_lock(lock) && (lock)->break_lock)	\
+		BREAKLOCK_ENABLE(lock);					\
+		while (!op##_can_lock(lock) && BREAKLOCK(lock))		\
 			_raw_##op##_relax(&lock->raw_lock);		\
 	}								\
-	(lock)->break_lock = 0;						\
+	BREAKLOCK_DISABLE(lock);					\
 }									\
 									\
 EXPORT_SYMBOL(_##op##_lock);						\
@@ -233,12 +254,11 @@ unsigned long __lockfunc _##op##_lock_ir
 		local_irq_restore(flags);				\
 		preempt_enable();					\
 									\
-		if (!(lock)->break_lock)				\
-			(lock)->break_lock = 1;				\
-		while (!op##_can_lock(lock) && (lock)->break_lock)	\
+		BREAKLOCK_ENABLE(lock);					\
+		while (!op##_can_lock(lock) && BREAKLOCK(lock))		\
 			_raw_##op##_relax(&lock->raw_lock);		\
 	}								\
-	(lock)->break_lock = 0;						\
+	BREAKLOCK_DISABLE(lock);					\
 	return flags;							\
 }									\
 									\



* Re: Spinlocks: Factor out GENERIC_LOCKBREAK in order to avoid spinning with irqs disabled
  2008-05-07 18:49       ` Spinlocks: Factor out GENERIC_LOCKBREAK in order to avoid spinning with irqs disabled Christoph Lameter
@ 2008-05-09 10:26         ` Ingo Molnar
  2008-05-09 16:28           ` Christoph Lameter
  2008-06-23 17:19         ` Petr Tesarik
  1 sibling, 1 reply; 35+ messages in thread
From: Ingo Molnar @ 2008-05-09 10:26 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: linux-kernel, Peter Zijlstra


* Christoph Lameter <clameter@sgi.com> wrote:

> Subject: Spinlocks: Factor out GENERIC_LOCKBREAK in order to avoid 
> spinning with irqs disabled
> 
> The nice spinlock functions that enable interrupts while spinning are 
> only usable if GENERIC_LOCKBREAK is set (and lock debugging is off, but 
> lock debugging would not be used in a production configuration anyway).
> 
> Factor out the dependencies on the ->break_lock field from the nice 
> functions into a set of BREAKLOCK macros (cannot be functions since 
> the type of the lock variable varies).
> 
> The nice spinlock functions can then also be used if !GENERIC_LOCKBREAK.

hm, there was some lockdep complication in this area. I guess we could 
use the 'nice' variants too if their irq-enabling/disabling was properly 
lockdep annotated and tracked by the irqflags machinery?

	Ingo


* Re: Spinlocks: Factor out GENERIC_LOCKBREAK in order to avoid spinning with irqs disabled
  2008-05-09 10:26         ` Ingo Molnar
@ 2008-05-09 16:28           ` Christoph Lameter
  0 siblings, 0 replies; 35+ messages in thread
From: Christoph Lameter @ 2008-05-09 16:28 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: linux-kernel, Peter Zijlstra

On Fri, 9 May 2008, Ingo Molnar wrote:

> hm, there was some lockdep complication in this area. I guess we could 
> use the 'nice' variants too if their irq-enabling/disabling was properly 
> lockdep annotated and tracked by the irqflags machinery?

I thought that we already fell back to the functions that spin with 
interrupts disabled when lockdep is on?

One issue: the BREAKLOCK value needs to be 1 in the fallback case, 
otherwise we needlessly run atomic ops.

---
 kernel/spinlock.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Index: linux-2.6/kernel/spinlock.c
===================================================================
--- linux-2.6.orig/kernel/spinlock.c	2008-05-07 16:00:06.000000000 -0700
+++ linux-2.6/kernel/spinlock.c	2008-05-07 16:00:29.000000000 -0700
@@ -218,7 +218,7 @@ do {									\
 
 #else
 
-#define BREAKLOCK(lock)	0
+#define BREAKLOCK(lock)	1
 #define BREAKLOCK_ENABLE(lock)	do { } while (0)
 #define BREAKLOCK_DISABLE(lock) do { } while (0)
 


* Re: Spinlocks waiting with interrupts disabled / preempt disabled.
  2008-05-07  7:30 ` Ingo Molnar
  2008-05-07 17:04   ` Christoph Lameter
@ 2008-05-09 16:35   ` Olaf Weber
  2008-05-09 17:56     ` Peter Zijlstra
  1 sibling, 1 reply; 35+ messages in thread
From: Olaf Weber @ 2008-05-09 16:35 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Christoph Lameter, linux-kernel, Peter Zijlstra

Ingo Molnar writes:
> * Christoph Lameter <clameter@sgi.com> wrote:

>> @@ -132,10 +132,14 @@ unsigned long __lockfunc _write_lock_irq
>> {
>> unsigned long flags;
>> 
>> +retry:
>> local_irq_save(flags);
>> -	preempt_disable();
>> -	_raw_write_lock(lock);
>> -	return flags;
>> +	if (_write_trylock(lock))
>> +		return flags;
>> +	local_irq_restore(flags);
>> +	while (!write_can_lock(lock))
>> +		cpu_relax();
>> +	goto retry;
>> }
>> EXPORT_SYMBOL(_write_lock_irqsave);

> hm, this is done on a too high level and will turn off some debugging 
> code. I.e. if we dont just loop long but truly deadlock here we wont 
> call lib/spinlock_debug.c's _raw_write_lock() code that does some sanity 
> checks in the debug case.

> so how about doing this on a deeper level and adding a new 
> __raw_write_lock_flags() primitive that would look at the flags value 
> and could enable interrupts in the lowlevel code?

Thinking of an alternative approach, I'd like to revisit the original
problem for a moment: on a (big) system running a kernel based on
2.6.16 we found that in the following stack trace, pdflush has just
enabled irqs again on its cpu after running with irqs disabled for
about 2 minutes.

>> bt 0xe0000168f4378000
================================================================
STACK TRACE FOR TASK: 0xe0000168f4378000 (pdflush)

 0 kdba_main_loop+0x14c [0xa0000001004190cc]
 1 kdb+0x11fc [0xa0000001002a639c]
 2 kdb_ipi+0x11c [0xa0000001002a12dc]
 3 handle_IPI+0x27c [0xa00000010005671c]
 4 handle_IRQ_event+0x8c [0xa0000001000fa40c]
 5 __do_IRQ+0x12c [0xa0000001000fa5cc]
 6 ia64_handle_irq+0x15c [0xa00000010001213c]
 7 ia64_leave_kernel [0xa00000010000cb20]
 8 _write_unlock_irqrestore+0x40 [0xa000000100558500]
 9 test_set_page_writeback+0x18c [0xa000000100114b6c]
10 xfs_start_page_writeback+0x2c [0xa0000002e4ab914c]
11 xfs_page_state_convert+0xa0c [0xa0000002e4abcbac]
12 xfs_vm_writepage+0x19c [0xa0000002e4abd13c]
13 mpage_writepages+0x3fc [0xa0000001001bcb5c]
14 xfs_vm_writepages+0xac [0xa0000002e4ab9eac]
15 do_writepages+0xac [0xa0000001001136cc]
16 __writeback_single_inode+0x47c [0xa0000001001b8f7c]
17 sync_sb_inodes+0x4ac [0xa0000001001b9cec]
18 writeback_inodes+0x19c [0xa0000001001baafc]
19 wb_kupdate+0x26c [0xa000000100113bcc]
20 pdflush+0x38c [0xa00000010011582c]
21 kthread+0x22c [0xa0000001000c9dac]
22 kernel_thread_helper+0xcc [0xa000000100012f6c]
23 start_kernel_thread+0x1c [0xa0000001000094bc]
================================================================

There is, as far as I could tell, no code in the critical region of
test_set_page_writeback that could take this much time, except for the
call to _write_lock_irqsave itself.  The root of the problem
seems to have been that many threads went after the read lock, and an
rwlock_t will allow readers in while readers hold the lock, even if a
writer needs it, thus potentially starving the writer indefinitely (up
to 2 minutes in the case under discussion).
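The starvation mode described above can be captured as a tiny admission-rule model. This is a sketch, not kernel code; the two functions paraphrase how the stock rwlock_t behaves.

```c
#include <assert.h>

/* Admission rules of a reader-preference rwlock like rwlock_t:
 * a new reader is admitted whenever no writer *holds* the lock
 * (a merely waiting writer is ignored), while a writer needs
 * the lock completely idle. */
static int reader_admitted(int active_readers, int writer_active,
			   int writer_waiting)
{
	(void)active_readers;	/* other readers never block a reader */
	(void)writer_waiting;	/* a waiting writer doesn't either */
	return !writer_active;
}

static int writer_admitted(int active_readers, int writer_active)
{
	return !writer_active && active_readers == 0;
}
```

As long as readers keep overlapping, reader_admitted() stays true and writer_admitted() stays false, so the writer can spin indefinitely.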

Re-enabling irqs when looping in _write_lock_irqsave (and *_irq and
*_bh) works here because _write_lock_irqsave() wasn't called with irqs
disabled.  But pdflush will still spend a long time getting the lock.

The reason we're pursuing this nevertheless is that it is easy to do
starting from the current code-base, and doesn't impose API or ABI
changes.


An alternative approach would be to view this as a bug in the rwlock_t
code in particular, and fix it by keeping new readers from acquiring
an rwlock_t if a writer is also trying to lock it.  For x86 at least,
I can sketch an implementation that is based on the new ticket-based
spinlocks:
 - use the ticket-based lock for writers
 - add a reader count for readers (16 bits out of 32 are free)
 - readers get in if the write lock is idle
 - if the write lock wasn't idle, readers loop until it is
 - on obtaining the ticket-based lock, the writer waits for active
   readers to leave (e.g. the reader count going to zero)
Contention by writers now keeps readers out, which may be a problem.
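A single-threaded sketch of that scheme in C11 atomics (struct layout and names are mine, not a proposed kernel API; memory-ordering subtleties and the cross-CPU races of a real implementation are glossed over):

```c
#include <assert.h>
#include <stdatomic.h>

/* A ticket lock gates writers; a separate count tracks readers. */
struct wt_rwlock {
	atomic_uint next_ticket;	/* writer ticket dispenser */
	atomic_uint owner_ticket;	/* ticket currently served */
	atomic_uint readers;		/* active readers */
};

/* The write lock is idle when no writer ticket is outstanding. */
static int write_idle(struct wt_rwlock *l)
{
	return atomic_load(&l->next_ticket) == atomic_load(&l->owner_ticket);
}

static void read_lock(struct wt_rwlock *l)
{
	for (;;) {
		while (!write_idle(l))	/* writers keep new readers out */
			;
		atomic_fetch_add(&l->readers, 1);
		if (write_idle(l))	/* recheck: no writer slipped in */
			return;
		atomic_fetch_sub(&l->readers, 1);
	}
}

static void read_unlock(struct wt_rwlock *l)
{
	atomic_fetch_sub(&l->readers, 1);
}

static void write_lock(struct wt_rwlock *l)
{
	unsigned int t = atomic_fetch_add(&l->next_ticket, 1);

	while (atomic_load(&l->owner_ticket) != t)	/* wait our turn */
		;
	while (atomic_load(&l->readers) != 0)		/* drain readers */
		;
}

static void write_unlock(struct wt_rwlock *l)
{
	atomic_fetch_add(&l->owner_ticket, 1);
}
```

The recheck in read_lock() matters: a writer could take its ticket between the idle test and the reader-count increment, so the reader must back out and retry in that case.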

Advantages of such an approach:
 - Writers are "fair" amongst themselves.
 - Readers cannot hold off writers.

Disadvantages:
 - Contention among writers can hold off readers indefinitely.
   This scheme only works if the write lock is not "permanently"
   contended, otherwise it goes from frying pan to fire.
 - Requires changes to the asm-coded locking primitives, which means
   it is likely to be done for x86 and ia64 only, unless the platform
   maintainers step in.

Given the above, is this worth pursuing for inclusion in mainline?


A third solution would be to look at this problem on a case-by-case
basis.  In the case under discussion, there may be other kernel bugs
that aggravate the problem (it is an old kernel after all) and maybe
this just means the address_space.tree_lock ought to be replaced with
something else (though I wouldn't claim to know what).

-- 
Olaf Weber                 SGI               Phone:  +31(0)30-6696752
                           Veldzigt 2b       Fax:    +31(0)30-6696799
Technical Lead             3454 PW de Meern  Vnet:   955-7151
Storage Software           The Netherlands   Email:  olaf@sgi.com


* Re: Spinlocks waiting with interrupts disabled / preempt disabled.
  2008-05-09 16:35   ` Spinlocks waiting with interrupts disabled / preempt disabled Olaf Weber
@ 2008-05-09 17:56     ` Peter Zijlstra
  2008-05-09 18:00       ` Christoph Lameter
  2008-05-09 20:01       ` Olaf Weber
  0 siblings, 2 replies; 35+ messages in thread
From: Peter Zijlstra @ 2008-05-09 17:56 UTC (permalink / raw)
  To: Olaf Weber; +Cc: Ingo Molnar, Christoph Lameter, linux-kernel

On Fri, 2008-05-09 at 18:35 +0200, Olaf Weber wrote:
> Ingo Molnar writes:
> > * Christoph Lameter <clameter@sgi.com> wrote:
> 
> >> @@ -132,10 +132,14 @@ unsigned long __lockfunc _write_lock_irq
> >> {
> >> unsigned long flags;
> >> 
> >> +retry:
> >> local_irq_save(flags);
> >> -	preempt_disable();
> >> -	_raw_write_lock(lock);
> >> -	return flags;
> >> +	if (_write_trylock(lock))
> >> +		return flags;
> >> +	local_irq_restore(flags);
> >> +	while (!write_can_lock(lock))
> >> +		cpu_relax();
> >> +	goto retry;
> >> }
> >> EXPORT_SYMBOL(_write_lock_irqsave);
> 
> > hm, this is done at too high a level and will turn off some debugging 
> > code. I.e. if we don't just loop long but truly deadlock here, we won't 
> > call lib/spinlock_debug.c's _raw_write_lock() code that does some sanity 
> > checks in the debug case.
> 
> > so how about doing this on a deeper level and adding a new 
> > __raw_write_lock_flags() primitive that would look at the flags value 
> > and could enable interrupts in the lowlevel code?
> 
> Thinking of an alternative approach, I'd like to revisit the original
> problem for a moment: on a (big) system running a kernel based on
> 2.6.16 we found that in the following stack trace, pdflush has just
> enabled irqs again on its cpu after running with irqs disabled for
> about 2 minutes.
> 
> >> bt 0xe0000168f4378000
> ================================================================
> STACK TRACE FOR TASK: 0xe0000168f4378000 (pdflush)
> 
>  0 kdba_main_loop+0x14c [0xa0000001004190cc]
>  1 kdb+0x11fc [0xa0000001002a639c]
>  2 kdb_ipi+0x11c [0xa0000001002a12dc]
>  3 handle_IPI+0x27c [0xa00000010005671c]
>  4 handle_IRQ_event+0x8c [0xa0000001000fa40c]
>  5 __do_IRQ+0x12c [0xa0000001000fa5cc]
>  6 ia64_handle_irq+0x15c [0xa00000010001213c]
>  7 ia64_leave_kernel [0xa00000010000cb20]
>  8 _write_unlock_irqrestore+0x40 [0xa000000100558500]
>  9 test_set_page_writeback+0x18c [0xa000000100114b6c]
> 10 xfs_start_page_writeback+0x2c [0xa0000002e4ab914c]
> 11 xfs_page_state_convert+0xa0c [0xa0000002e4abcbac]
> 12 xfs_vm_writepage+0x19c [0xa0000002e4abd13c]
> 13 mpage_writepages+0x3fc [0xa0000001001bcb5c]
> 14 xfs_vm_writepages+0xac [0xa0000002e4ab9eac]
> 15 do_writepages+0xac [0xa0000001001136cc]
> 16 __writeback_single_inode+0x47c [0xa0000001001b8f7c]
> 17 sync_sb_inodes+0x4ac [0xa0000001001b9cec]
> 18 writeback_inodes+0x19c [0xa0000001001baafc]
> 19 wb_kupdate+0x26c [0xa000000100113bcc]
> 20 pdflush+0x38c [0xa00000010011582c]
> 21 kthread+0x22c [0xa0000001000c9dac]
> 22 kernel_thread_helper+0xcc [0xa000000100012f6c]
> 23 start_kernel_thread+0x1c [0xa0000001000094bc]
> ================================================================
> 
> There is, as far as I could tell, no code in the critical region of
> test_set_page_writeback that could take this much time, except for the
> call to _write_lock_irqsave itself.  There the root of the problem
> seems to have been that many threads went after the read lock, and an
> rwlock_t will allow readers in while readers hold the lock, even if a
> writer needs it, thus potentially starving the writer indefinitely (up
> to 2 minutes in the case under discussion).
> 
> Re-enabling irqs when looping in _write_lock_irqsave (and *_irq and
> *_bh) works here because _write_lock_irqsave() wasn't called with irqs
> disabled.  But pdflush will still spend a long time getting the lock.
> 
> The reason we're pursuing this nevertheless is that it is easy to do
> starting from the current code-base, and doesn't impose API or ABI
> changes.

This would be your best option for your kABI-constrained 2.6.16 kernel.

> An alternative approach would be to view this as a bug in the rwlock_t
> code in particular, and fix it by keeping new readers from acquiring
> an rwlock_t if a writer is also trying to lock it.  For x86 at least,
> I can sketch an implementation that is based on the new ticket-based
> spinlocks:
>  - use the ticket-based lock for writers
>  - add a reader count for readers (16 bits out of 32 are free)
>  - readers get in if the write lock is idle
>  - if the write lock wasn't idle, readers loop until it is
>  - on obtaining the ticket-based lock, the writer waits for active
>    readers to leave (e.g. the reader count going to zero)
> Contention by writers now keeps readers out, which may be a problem.
> 
> Advantages of such an approach:
>  - Writers are "fair" amongst themselves.
>  - Readers cannot hold off writers.
> 
> Disadvantages:
>  - Contention among writers can hold off readers indefinitely.
>    This scheme only works if the write lock is not "permanently"
>    contended, otherwise it goes from frying pan to fire.
>  - Requires changes to the asm-coded locking primitives, which means
>    it is likely to be done for x86 and ia64 only, unless the platform
>    maintainers step in.
> 
> Given the above, is this worth pursuing for inclusion in mainline?

Keep in mind that rwlock_t must support recursive read locks. Much code
depends on this.

> A third solution would be to look at this problem on a case-by-case
> basis.  In the case under discussion, there may be other kernel bugs
> that aggravate the problem (it is an old kernel after all) and maybe
> this just means the address_space.tree_lock ought to be replaced with
> something else (though I wouldn't claim to know what).

That has been done by Nick Piggin, the lockless pagecache work removes
the read side of the tree_lock.




* Re: Spinlocks waiting with interrupts disabled / preempt disabled.
  2008-05-09 17:56     ` Peter Zijlstra
@ 2008-05-09 18:00       ` Christoph Lameter
  2008-05-09 18:06         ` Peter Zijlstra
  2008-05-09 20:01       ` Olaf Weber
  1 sibling, 1 reply; 35+ messages in thread
From: Christoph Lameter @ 2008-05-09 18:00 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Olaf Weber, Ingo Molnar, linux-kernel, npiggin

On Fri, 9 May 2008, Peter Zijlstra wrote:

> > A third solution would be to look at this problem on a case-by-case
> > basis.  In the case under discussion, there may be other kernel bugs
> > that aggravate the problem (it is an old kernel after all) and maybe
> > this just means the address_space.tree_lock ought to be replaced with
> > something else (though I wouldn't claim to know what).
> 
> That has been done by Nick Piggin, the lockless pagecache work removes
> the read side of the tree_lock.

When is that going to get merged? find_get_page is still:

struct page * find_get_page(struct address_space *mapping, pgoff_t offset)
{
        struct page *page;

        read_lock_irq(&mapping->tree_lock);
        page = radix_tree_lookup(&mapping->page_tree, offset);
        if (page)
                page_cache_get(page);
        read_unlock_irq(&mapping->tree_lock);
        return page;
}
EXPORT_SYMBOL(find_get_page);
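The lockless replacement Peter refers to drops the read_lock in favour of a speculative reference: bump the page's refcount only if it is still nonzero, then re-check that the slot still points at the same page. A rough userspace sketch of the idea (the names mirror the kernel's, but this is model code, not the actual implementation, and the kernel's RCU-based radix tree walk is reduced to a single atomic slot):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

struct page {
	atomic_int refcount;
};

/* inc-not-zero: never resurrect a page whose count already hit 0 */
static int get_page_unless_zero(struct page *p)
{
	int c = atomic_load(&p->refcount);

	while (c != 0)
		if (atomic_compare_exchange_weak(&p->refcount, &c, c + 1))
			return 1;
	return 0;
}

static struct page *find_get_page_lockless(struct page *_Atomic *slot)
{
	for (;;) {
		struct page *p = atomic_load(slot);

		if (!p)
			return NULL;
		if (!get_page_unless_zero(p))
			continue;	/* page was being freed: retry */
		if (atomic_load(slot) == p)
			return p;	/* slot unchanged: reference is good */
		atomic_fetch_sub(&p->refcount, 1);	/* lost a race: retry */
	}
}
```

The re-check of the slot after taking the reference is what lets readers proceed without the tree_lock: if the page was replaced in the meantime, the reference is dropped and the lookup retried.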



* Re: Spinlocks waiting with interrupts disabled / preempt disabled.
  2008-05-09 18:00       ` Christoph Lameter
@ 2008-05-09 18:06         ` Peter Zijlstra
  0 siblings, 0 replies; 35+ messages in thread
From: Peter Zijlstra @ 2008-05-09 18:06 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Olaf Weber, Ingo Molnar, linux-kernel, npiggin, Andrew Morton,
	Linus Torvalds

On Fri, 2008-05-09 at 11:00 -0700, Christoph Lameter wrote:
> On Fri, 9 May 2008, Peter Zijlstra wrote:
> 
> > > A third solution would be to look at this problem on a case-by-case
> > > basis.  In the case under discussion, there may be other kernel bugs
> > > that aggravate the problem (it is an old kernel after all) and maybe
> > > this just means the address_space.tree_lock ought to be replaced with
> > > something else (though I wouldn't claim to know what).
> > 
> > That has been done by Nick Piggin, the lockless pagecache work removes
> > the read side of the tree_lock.
> 
> When is that going to get merged? find_get_page is still:
> 
> struct page * find_get_page(struct address_space *mapping, pgoff_t offset)
> {
>         struct page *page;
> 
>         read_lock_irq(&mapping->tree_lock);
>         page = radix_tree_lookup(&mapping->page_tree, offset);
>         if (page)
>                 page_cache_get(page);
>         read_unlock_irq(&mapping->tree_lock);
>         return page;
> }
> EXPORT_SYMBOL(find_get_page);

Yeah, no idea when - although it looks like Nick has another
speculative-reference user lined up, which would greatly help its case.

Nick, can you post a series of patches against something recent that has
the latest lockless pagecache and all the fast_gup work in it, including
the power bits that use speculative get?

We can then go over all this stuff and perhaps manage to trick Andrew
and Linus into taking it for .27.

Linus seemed interested enough in the fast_gup bits.

And just in case someone forgot, the lockless pagecache has been in -rt
for a long long time - so far I've never seen issues with it.



* Re: Spinlocks waiting with interrupts disabled / preempt disabled.
  2008-05-09 17:56     ` Peter Zijlstra
  2008-05-09 18:00       ` Christoph Lameter
@ 2008-05-09 20:01       ` Olaf Weber
  1 sibling, 0 replies; 35+ messages in thread
From: Olaf Weber @ 2008-05-09 20:01 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Ingo Molnar, Christoph Lameter, linux-kernel

Peter Zijlstra writes:
> On Fri, 2008-05-09 at 18:35 +0200, Olaf Weber wrote:

>> An alternative approach would be to view this as a bug in the rwlock_t
>> code in particular, and fix it by keeping new readers from acquiring
>> an rwlock_t if a writer is also trying to lock it.  For x86 at least,
>> I can sketch an implementation that is based on the new ticket-based
>> spinlocks:
>>  - use the ticket-based lock for writers
>>  - add a reader count for readers (16 bits out of 32 are free)
>>  - readers get in if the write lock is idle
>>  - if the write lock wasn't idle, readers loop until it is
>>  - on obtaining the ticket-based lock, the writer waits for active
>>    readers to leave (e.g. the reader count going to zero)
>> Contention by writers now keeps readers out, which may be a problem.
>> 
>> Advantages of such an approach:
>>  - Writers are "fair" amongst themselves.
>>  - Readers cannot hold off writers.
>> 
>> Disadvantages:
>>  - Contention among writers can hold off readers indefinitely.
>>    This scheme only works if the write lock is not "permanently"
>>    contended, otherwise it goes from frying pan to fire.
>>  - Requires changes to the asm-coded locking primitives, which means
>>    it is likely to be done for x86 and ia64 only, unless the platform
>>    maintainers step in.
>> 
>> Given the above, is this worth pursuing for inclusion in mainline?

> Keep in mind that rwlock_t must support recursive read locks. Much code
> depends on this.

Ouch.  Given the possibility of something like this:

    thread A                thread B
    read_lock         
                            write_lock -- attempt, will spin
    read_lock -- recursive

we'd have to make the write_lock attempt a barrier to new read_locks 
but not to the recursive read_lock.  That's not possible without 
keeping track of read_lock owners, which would greatly complicate the 
locks.

IMHO recursive locking is moderately evil at the best of times, but if 
it's there, it's there.  (And if it's necessary, it has to be allowed.)
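The deadlock in the two-thread scenario above can be made concrete with a minimal state model (a sketch, not kernel code; the admission rule paraphrases the writer-preference scheme under discussion):

```c
#include <assert.h>

/* Under the writer-preference rule, a read_lock is admitted only
 * when no writer holds *or* waits for the lock.  With no owner
 * tracking, a recursive read_lock looks exactly like a new one. */
struct rw_state {
	int active_readers;
	int writer_active;
	int writer_waiting;
};

static int read_lock_admitted(const struct rw_state *s)
{
	return !s->writer_active && !s->writer_waiting;
}
```

Once thread B's write_lock attempt sets writer_waiting, thread A's recursive read_lock is refused, while B in turn waits for A's active_readers count to drop: neither side can make progress.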



>> A third solution would be to look at this problem on a case-by-case
>> basis.  In the case under discussion, there may be other kernel bugs
>> that aggravate the problem (it is an old kernel after all) and maybe
>> this just means the address_space.tree_lock ought to be replaced with
>> something else (though I wouldn't claim to know what).

> That has been done by Nick Piggin, the lockless pagecache work removes
> the read side of the tree_lock.

Now that is good news.

-- 
Olaf Weber                 SGI               Phone:  +31(0)30-6696752
                           Veldzigt 2b       Fax:    +31(0)30-6696799
Technical Lead             3454 PW de Meern  Vnet:   955-7151
Storage Software           The Netherlands   Email:  olaf@sgi.com


* Re: Spinlocks: Factor our GENERIC_LOCKBREAK in order to avoid spin with irqs disable
  2008-05-07 18:49       ` Spinlocks: Factor our GENERIC_LOCKBREAK in order to avoid spin with irqs disable Christoph Lameter
  2008-05-09 10:26         ` Ingo Molnar
@ 2008-06-23 17:19         ` Petr Tesarik
  2008-06-23 17:54           ` Peter Zijlstra
  2008-06-23 18:20           ` Christoph Lameter
  1 sibling, 2 replies; 35+ messages in thread
From: Petr Tesarik @ 2008-06-23 17:19 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: Ingo Molnar, linux-kernel, Peter Zijlstra

On Wed, 2008-05-07 at 11:49 -0700, Christoph Lameter wrote:
> Subject: Spinlocks: Factor our GENERIC_LOCKBREAK in order to avoid spin with irqs disabled
> 
> The nice spinlock functions that enable interrupts while spinning are only
> usable if GENERIC_LOCKBREAK is set (and we do not debug locks but lock
> debugging would not be used for a production configuration).
> 
> Factor out the dependencies on the ->lock_break field from the nice functions
> into a set of BREAKLOCK macros (cannot be functions since the type of the lock
> variable varies).
> 
> The nice spinlock function can then also be used if !GENERIC_LOCKBREAK
> 
> Signed-off-by: Christoph Lameter <clameter@sgi.com>
> 
> ---
>  kernel/spinlock.c |   40 ++++++++++++++++++++++++++++++----------
>  1 file changed, 30 insertions(+), 10 deletions(-)
> 
> Index: linux-2.6/kernel/spinlock.c
> ===================================================================
> --- linux-2.6.orig/kernel/spinlock.c	2008-05-07 11:19:31.000000000 -0700
> +++ linux-2.6/kernel/spinlock.c	2008-05-07 11:40:56.000000000 -0700
> @@ -65,7 +65,7 @@ EXPORT_SYMBOL(_write_trylock);
>   * even on CONFIG_PREEMPT, because lockdep assumes that interrupts are
>   * not re-enabled during lock-acquire (which the preempt-spin-ops do):
>   */
> -#if !defined(CONFIG_GENERIC_LOCKBREAK) || defined(CONFIG_DEBUG_LOCK_ALLOC)
> +#ifdef CONFIG_DEBUG_LOCK_ALLOC

Maybe I'm just blind, but doesn't this change effectively disable any
arch-specific optimized code for _raw_*_lock?

If CONFIG_DEBUG_LOCK_ALLOC is set, then CONFIG_DEBUG_SPINLOCK must also
be set, so in that case the debugging versions of _raw_*_lock are used.
But if CONFIG_DEBUG_LOCK_ALLOC is _not_ set, then the locks are built
with _trylock and _can_lock primitives.

What am I missing here?

Petr Tesarik




* Re: Spinlocks: Factor our GENERIC_LOCKBREAK in order to avoid spin with irqs disable
  2008-06-23 17:19         ` Petr Tesarik
@ 2008-06-23 17:54           ` Peter Zijlstra
  2008-06-23 18:20           ` Christoph Lameter
  1 sibling, 0 replies; 35+ messages in thread
From: Peter Zijlstra @ 2008-06-23 17:54 UTC (permalink / raw)
  To: Petr Tesarik; +Cc: Christoph Lameter, Ingo Molnar, linux-kernel

On Mon, 2008-06-23 at 19:19 +0200, Petr Tesarik wrote:
> On Wed, 2008-05-07 at 11:49 -0700, Christoph Lameter wrote:
> > Subject: Spinlocks: Factor our GENERIC_LOCKBREAK in order to avoid spin with irqs disabled
> > 
> > The nice spinlock functions that enable interrupts while spinning are only
> > usable if GENERIC_LOCKBREAK is set (and we do not debug locks but lock
> > debugging would not be used for a production configuration).
> > 
> > Factor out the dependencies on the ->lock_break field from the nice functions
> > into a set of BREAKLOCK macros (cannot be functions since the type of the lock
> > variable varies).
> > 
> > The nice spinlock function can then also be used if !GENERIC_LOCKBREAK
> > 
> > Signed-off-by: Christoph Lameter <clameter@sgi.com>
> > 
> > ---
> >  kernel/spinlock.c |   40 ++++++++++++++++++++++++++++++----------
> >  1 file changed, 30 insertions(+), 10 deletions(-)
> > 
> > Index: linux-2.6/kernel/spinlock.c
> > ===================================================================
> > --- linux-2.6.orig/kernel/spinlock.c	2008-05-07 11:19:31.000000000 -0700
> > +++ linux-2.6/kernel/spinlock.c	2008-05-07 11:40:56.000000000 -0700
> > @@ -65,7 +65,7 @@ EXPORT_SYMBOL(_write_trylock);
> >   * even on CONFIG_PREEMPT, because lockdep assumes that interrupts are
> >   * not re-enabled during lock-acquire (which the preempt-spin-ops do):
> >   */
> > -#if !defined(CONFIG_GENERIC_LOCKBREAK) || defined(CONFIG_DEBUG_LOCK_ALLOC)
> > +#ifdef CONFIG_DEBUG_LOCK_ALLOC
> 
> Maybe I'm just blind, but doesn't this change effectively disable any
> arch-specific optimized code for _raw_*_lock?
> 
> If CONFIG_DEBUG_LOCK_ALLOC is set, then CONFIG_DEBUG_SPINLOCK must also
> be set, so in that case the debugging versions of _raw_*_lock are used.
> But if CONFIG_DEBUG_LOCK_ALLOC is _not_ set, then the locks are built
> with _trylock and _can_lock primitives.
> 
> What am I missing here?

No, I think you're right.

I've been playing with these patches:

  http://programming.kicks-ass.net/kernel-patches/spinlocks/

But as it stands it breaks !CONFIG_PREEMPT... ought to find some other
cycles to spend on it..





* Re: Spinlocks: Factor our GENERIC_LOCKBREAK in order to avoid spin with irqs disable
  2008-06-23 17:19         ` Petr Tesarik
  2008-06-23 17:54           ` Peter Zijlstra
@ 2008-06-23 18:20           ` Christoph Lameter
  2008-06-23 20:39             ` Peter Zijlstra
  1 sibling, 1 reply; 35+ messages in thread
From: Christoph Lameter @ 2008-06-23 18:20 UTC (permalink / raw)
  To: Petr Tesarik; +Cc: Ingo Molnar, linux-kernel, Peter Zijlstra

On Mon, 23 Jun 2008, Petr Tesarik wrote:

> Maybe I'm just blind, but doesn't this change effectively disable any
> arch-specific optimized code for _raw_*_lock?

True.  Only the __raw_xxx_trylock is still used.

> If CONFIG_DEBUG_LOCK_ALLOC is set, then CONFIG_DEBUG_SPINLOCK must also
> be set, so in that case the debugging versions of _raw_*_lock are used.
> But if CONFIG_DEBUG_LOCK_ALLOC is _not_ set, then the locks are built
> with _trylock and _can_lock primitives.
> 
> What am I missing here?

It is good that the locks are built with _trylock and _can_lock, because 
then we can re-enable interrupts while spinning.
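As a self-contained sketch of that pattern (irq handling is stubbed so it runs in userspace; none of these names are the real kernel primitives):

```c
/* Sketch of the trylock + can_lock spin pattern: take the lock with
 * irqs off, but do the actual waiting with irqs enabled.
 * local_irq_save()/restore() are userspace stubs so the
 * interrupt-window bookkeeping can be checked. */
#include <assert.h>
#include <stdbool.h>

static bool irqs_enabled = true;
static int irq_open_windows;	/* times irqs were re-enabled while waiting */

static void local_irq_save(bool *flags)
{
	*flags = irqs_enabled;
	irqs_enabled = false;
}

static void local_irq_restore(bool flags)
{
	irqs_enabled = flags;
	if (irqs_enabled)
		irq_open_windows++;
}

struct lock { bool locked; int contended_polls; };

static bool trylock(struct lock *l)
{
	if (l->locked)
		return false;
	l->locked = true;
	return true;
}

/* Model the current holder releasing after a few can_lock() polls. */
static bool can_lock(struct lock *l)
{
	if (l->locked && ++l->contended_polls >= 3)
		l->locked = false;
	return !l->locked;
}

static bool lock_irqsave(struct lock *l)
{
	bool flags;
retry:
	local_irq_save(&flags);
	if (trylock(l))
		return flags;		/* acquired with irqs disabled */
	local_irq_restore(flags);	/* spin with irqs enabled */
	while (!can_lock(l))
		;			/* cpu_relax() in the real code */
	goto retry;
}
```

The waiting loop never touches the lock word with an atomic op, so the cacheline is only acquired for read while spinning, and interrupts can be serviced during the wait.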




* Re: Spinlocks: Factor our GENERIC_LOCKBREAK in order to avoid spin with irqs disable
  2008-06-23 18:20           ` Christoph Lameter
@ 2008-06-23 20:39             ` Peter Zijlstra
  2008-06-23 20:45               ` Christoph Lameter
  0 siblings, 1 reply; 35+ messages in thread
From: Peter Zijlstra @ 2008-06-23 20:39 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: Petr Tesarik, Ingo Molnar, linux-kernel

On Mon, 2008-06-23 at 11:20 -0700, Christoph Lameter wrote:
> On Mon, 23 Jun 2008, Petr Tesarik wrote:
> 
> > Maybe I'm just blind, but doesn't this change effectively disable any
> > arch-specific optimized code for _raw_*_lock?
> 
> True.  Only the __raw_xxx_trylock is still used.
> 
> > If CONFIG_DEBUG_LOCK_ALLOC is set, then CONFIG_DEBUG_SPINLOCK must also
> > be set, so in that case the debugging versions of _raw_*_lock are used.
> > But if CONFIG_DEBUG_LOCK_ALLOC is _not_ set, then the locks are built
> > with _trylock and _can_lock primitives.
> > 
> > What am I missing here?
> 
> It is good that the locks are build with _trylock and _can_lock because 
> then we can reenable interrupts while spinning.

Well, good and bad; the flip side is that fairness schemes like ticket
locks are utterly defeated.



* Re: Spinlocks: Factor our GENERIC_LOCKBREAK in order to avoid spin with irqs disable
  2008-06-23 20:39             ` Peter Zijlstra
@ 2008-06-23 20:45               ` Christoph Lameter
  2008-06-23 20:58                 ` Peter Zijlstra
  0 siblings, 1 reply; 35+ messages in thread
From: Christoph Lameter @ 2008-06-23 20:45 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Petr Tesarik, Ingo Molnar, linux-kernel

On Mon, 23 Jun 2008, Peter Zijlstra wrote:

> > It is good that the locks are build with _trylock and _can_lock because 
> > then we can reenable interrupts while spinning.
> 
> Well, good and bad, the turn side is that fairness schemes like ticket
> locks are utterly defeated.

True. But maybe we can make these fairness schemes more generic so that 
they can go into core code?



* Re: Spinlocks: Factor our GENERIC_LOCKBREAK in order to avoid spin with irqs disable
  2008-06-23 20:45               ` Christoph Lameter
@ 2008-06-23 20:58                 ` Peter Zijlstra
  2008-06-26  2:51                   ` Jeremy Fitzhardinge
  0 siblings, 1 reply; 35+ messages in thread
From: Peter Zijlstra @ 2008-06-23 20:58 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: Petr Tesarik, Ingo Molnar, linux-kernel

On Mon, 2008-06-23 at 13:45 -0700, Christoph Lameter wrote:
> On Mon, 23 Jun 2008, Peter Zijlstra wrote:
> 
> > > It is good that the locks are build with _trylock and _can_lock because 
> > > then we can reenable interrupts while spinning.
> > 
> > Well, good and bad, the turn side is that fairness schemes like ticket
> > locks are utterly defeated.
> 
> True. But maybe we can make these fairness schemes more generic so that 
> they can go into core code?

The trouble with ticket locks is that they can't handle waiters going
away - or in this case getting preempted by irq handlers. The one who
took the ticket must pass it on, so if you're preempted it just sits
there being idle, until you get back to deal with the lock.
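A minimal model of a ticket lock makes the hand-off constraint visible (illustrative only, not the arch implementation):

```c
/* Minimal ticket-lock model: each waiter takes a ticket and the lock
 * is served strictly in ticket order, so the hand-off is dictated in
 * advance rather than going to whoever happens to be running. */
#include <assert.h>
#include <stdbool.h>

struct ticket_lock {
	unsigned next;	/* next ticket to hand out */
	unsigned owner;	/* ticket currently allowed to hold the lock */
};

static unsigned take_ticket(struct ticket_lock *l)
{
	return l->next++;
}

/* A waiter may only proceed when its ticket comes up... */
static bool my_turn(const struct ticket_lock *l, unsigned ticket)
{
	return l->owner == ticket;
}

/* ...and unlock passes the lock to one *specific* successor.  If that
 * waiter has gone away -- preempted by an irq handler, say -- the
 * ticket just sits there and everyone behind it keeps spinning. */
static void ticket_unlock(struct ticket_lock *l)
{
	l->owner++;
}
```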

But yeah, perhaps another fairness scheme might work in the generic
code..



* Re: Spinlocks: Factor our GENERIC_LOCKBREAK in order to avoid spin with irqs disable
  2008-06-23 20:58                 ` Peter Zijlstra
@ 2008-06-26  2:51                   ` Jeremy Fitzhardinge
  2008-06-26  6:51                     ` Peter Zijlstra
                                       ` (3 more replies)
  0 siblings, 4 replies; 35+ messages in thread
From: Jeremy Fitzhardinge @ 2008-06-26  2:51 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Christoph Lameter, Petr Tesarik, Ingo Molnar, linux-kernel,
	Nick Piggin

Peter Zijlstra wrote:
> On Mon, 2008-06-23 at 13:45 -0700, Christoph Lameter wrote:
>   
>> On Mon, 23 Jun 2008, Peter Zijlstra wrote:
>>
>>     
>>>> It is good that the locks are build with _trylock and _can_lock because 
>>>> then we can reenable interrupts while spinning.
>>>>         
>>> Well, good and bad, the turn side is that fairness schemes like ticket
>>> locks are utterly defeated.
>>>       
>> True. But maybe we can make these fairness schemes more generic so that 
>> they can go into core code?
>>     
>
> The trouble with ticket locks is that they can't handle waiters going
> away - or in this case getting preempted by irq handlers. The one who
> took the ticket must pass it on, so if you're preempted it just sits
> there being idle, until you get back to deal with the lock.
>
> But yeah, perhaps another fairness scheme might work in the generic
> code..

Thomas Friebel presented results at the Xen Summit this week showing 
that ticket locks are an absolute disaster for scalability in a virtual 
environment, for a similar reason.  It's a bit irritating if the lock 
holder vcpu gets preempted by the hypervisor, but it's much worse when 
they release the lock: unless the vcpu scheduler gives a cpu to the vcpu 
with the next ticket, it can waste up to N timeslices spinning.

I'm experimenting with adding a pvops hook to allow you to put in new 
spinlock implementations on the fly.  If nothing else, it will be useful 
for experimenting with different algorithms.  But it definitely seems 
like the old unfair lock algorithm played much better with a virtual 
environment, because the next cpu to get the lock is the next one the 
scheduler gives time, rather than dictating an order - and the scheduler 
should mitigate the unfairness that ticket locks were designed to solve.

    J


* Re: Spinlocks: Factor our GENERIC_LOCKBREAK in order to avoid spin with irqs disable
  2008-06-26  2:51                   ` Jeremy Fitzhardinge
@ 2008-06-26  6:51                     ` Peter Zijlstra
  2008-06-26 15:49                       ` Jeremy Fitzhardinge
  2008-06-26  9:17                     ` Petr Tesarik
                                       ` (2 subsequent siblings)
  3 siblings, 1 reply; 35+ messages in thread
From: Peter Zijlstra @ 2008-06-26  6:51 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Peter Zijlstra, Christoph Lameter, Petr Tesarik, Ingo Molnar,
	linux-kernel, Nick Piggin

On Wed, 2008-06-25 at 19:51 -0700, Jeremy Fitzhardinge wrote:
> Peter Zijlstra wrote:
> > On Mon, 2008-06-23 at 13:45 -0700, Christoph Lameter wrote:
> >   
> >> On Mon, 23 Jun 2008, Peter Zijlstra wrote:
> >>
> >>     
> >>>> It is good that the locks are build with _trylock and _can_lock because 
> >>>> then we can reenable interrupts while spinning.
> >>>>         
> >>> Well, good and bad, the turn side is that fairness schemes like ticket
> >>> locks are utterly defeated.
> >>>       
> >> True. But maybe we can make these fairness schemes more generic so that 
> >> they can go into core code?
> >>     
> >
> > The trouble with ticket locks is that they can't handle waiters going
> > away - or in this case getting preempted by irq handlers. The one who
> > took the ticket must pass it on, so if you're preempted it just sits
> > there being idle, until you get back to deal with the lock.
> >
> > But yeah, perhaps another fairness scheme might work in the generic
> > code..
> 
> Thomas Friebel presented results at the Xen Summit this week showing 
> that ticket locks are an absolute disaster for scalability in a virtual 
> environment, for a similar reason.  It's a bit irritating if the lock 
> holder vcpu gets preempted by the hypervisor, but its much worse when 
> they release the lock: unless the vcpu scheduler gives a cpu to the vcpu 
> with the next ticket, it can waste up to N timeslices spinning.
> 
> I'm experimenting with adding pvops hook to allow you to put in new 
> spinlock implementations on the fly.  If nothing else, it will be useful 
> for experimenting with different algorithms.  But it definitely seems 
> like the old unfair lock algorithm played much better with a virtual 
> environment, because the next cpu to get the lock is the next one the 
> scheduler gives time, rather than dictating an order - and the scheduler 
> should mitigate the unfairness that ticket locks were designed to solve.

Paravirt spinlocks sounds like a good idea anyway, that way you can make
them scheduling locks (from the host's POV) when the lock owner (vcpu)
isn't running.

Burning time spinning on !running vcpus seems like a waste to me.

As for the scheduler solving the unfairness that ticket locks solve,
that cannot be done. The ticket lock solves intra-cpu fairness for a
resource other than time. The cpu scheduler only cares about fairness in
time, and its intra-cpu fairness is on a larger scale than most spinlock
hold times - so even if time and the locked resource would overlap it
wouldn't work.

The simple scenario is running N tasks on N cpus that all pound the same
lock, cache issues will make it unlikely the lock would migrate away
from whatever cpu it's on, essentially starving all the other N-1 cpus.

Ticket locks solve that exact issue, all the scheduler can do is ensure
they're all spending an equal amount of time on the cpu, whether that is
spinning for lock acquisition or getting actual work done is beyond its
scope.





* Re: Spinlocks: Factor our GENERIC_LOCKBREAK in order to avoid spin with irqs disable
  2008-06-26  2:51                   ` Jeremy Fitzhardinge
  2008-06-26  6:51                     ` Peter Zijlstra
@ 2008-06-26  9:17                     ` Petr Tesarik
  2008-06-26 17:02                       ` Christoph Lameter
  2008-07-07 11:50                     ` Nick Piggin
  2008-07-07 19:46                     ` Rik van Riel
  3 siblings, 1 reply; 35+ messages in thread
From: Petr Tesarik @ 2008-06-26  9:17 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Peter Zijlstra, Christoph Lameter, Ingo Molnar, linux-kernel,
	Nick Piggin

On Wed, 2008-06-25 at 19:51 -0700, Jeremy Fitzhardinge wrote:
> [...]
> I'm experimenting with adding pvops hook to allow you to put in new 
> spinlock implementations on the fly.  If nothing else, it will be useful 
> for experimenting with different algorithms.  But it definitely seems 
> like the old unfair lock algorithm played much better with a virtual 
> environment, because the next cpu to get the lock is the next one the 
> scheduler gives time, rather than dictating an order - and the scheduler 
> should mitigate the unfairness that ticket locks were designed to solve.

We really should paravirtualize spin locks, because there's always
something better to do than just burn time spinning. But in a
non-virtualized environment, tickets (or a similar scheme) should be
preserved.

We should probably re-think the whole locking scheme, because spinlocks
were designed to be held for a short period of time. This was a fair
assumption when they were introduced, but obviously it is now false in
many cases (such as virtualization).

Ticket-based spinlocks have actually already changed the original
design, so why not implement a generic "lock scheduler" on top of
spinlock_t and rwlock_t?

Petr Tesarik




* Re: Spinlocks: Factor our GENERIC_LOCKBREAK in order to avoid spin with irqs disable
  2008-06-26  6:51                     ` Peter Zijlstra
@ 2008-06-26 15:49                       ` Jeremy Fitzhardinge
  0 siblings, 0 replies; 35+ messages in thread
From: Jeremy Fitzhardinge @ 2008-06-26 15:49 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Peter Zijlstra, Christoph Lameter, Petr Tesarik, Ingo Molnar,
	linux-kernel, Nick Piggin

Peter Zijlstra wrote:
> Paravirt spinlocks sounds like a good idea anyway, that way you can make
> them scheduling locks (from the host's POV) when the lock owner (vcpu)
> isn't running.
>
> Burning time spinning on !running vcpus seems like a waste to me.
>   

In theory.  But in practice Linux locks are so low-contention that not 
much time seems to get wasted.  I've been doing experiments with 
spin-a-while-then-block locks, but they never got to the -then-block 
part in my test.  The burning cycles spinning only gets expensive if the 
lock-holder vcpu gets preempted, and there's other cpus spinning on that 
lock; but if locks are held only briefly, then there's little chance 
being preempted while holding the lock.

At least that's at the scale I've been testing, with only two cores.  I 
expect things look different with 8 or 16 cores and similarly scaled guests.

> As for the scheduler solving the unfairness that ticket locks solve,
>   

No, I never said the scheduler would solve the problem, merely mitigate it.

> that cannot be done. The ticket lock solves intra-cpu fairness for a
> resource other than time. The cpu scheduler only cares about fairness in
> time, and its intra-cpu fairness is on a larger scale than most spinlock
> hold times - so even if time and the locked resource would overlap it
> wouldn't work.
>
> The simple scenario is running N tasks on N cpus that all pound the same
> lock, cache issues will make it unlikely the lock would migrate away
> from whatever cpu its on, essentially starving all the other N-1 cpus.
>   

Yep.  But in practice, the scheduler will steal the real cpu from under 
the vcpu dominating the lock and upset the pathological pattern.  I'm 
not saying it's ideal, but the starvation case that ticket locks solve is 
pretty rare in the large scheme of things.

Also, ticket locks don't help either, if the lock is always 
transitioning between locked->unlocked->locked on all cpus.  It only 
helps in the case of one cpu doing rapid lock->unlock transitions while 
others wait on the lock.

> Ticket locks solve that exact issue, all the scheduler can do is ensure
> they're all spending an equal amount of time on the cpu, whether that is
> spinning for lock acquisition or getting actual work done is beyond its
> scope.
>   

Yes.  But the problem with ticket locks is that they dictate a 
scheduling order, and if you fail to schedule in that order vast amounts 
of time are wasted.  You can get into this state:

   1. vcpu A takes a lock
   2. vcpu A is preempted, effectively making a 5us lock be held for 30ms
   3. vcpus E,D,C,B try to take the lock in that order
   4. they all spin, wasting time.  bad, but no worse than the old lock
      algorithm
   5. vcpu A eventually runs again and releases the lock
   6. vcpu B runs, spinning until preempted
   7. vcpu C runs, spinning until preempted
   8. vcpu D runs, spinning until preempted
   9. vcpu E runs, and takes the lock and releases it
  10. (repeat spinning on B,C,D until D gets the lock)
  11. (repeat spinning on B,C until C gets the lock)
  12. B finally gets the lock

Steps 6-12 are all caused by ticket locks, and the situation is 
exacerbated by vcpus F-Z trying to get the lock in the meantime while 
its all tangled up handing out tickets in the right order.

The problem is that the old lock-byte locks made no fairness guarantees, 
and interacted badly with the hardware causing severe starvation in some 
cases.  Ticket locks are too fair, and absolutely dictate the order in 
which the lock is taken.  Really, all that's needed is the weaker 
assertion that "when I release the lock, any current spinner should get 
the lock".
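A back-of-envelope model of steps 6-12 (a hypothetical round-robin vcpu scheduler, four waiters holding tickets in reverse schedule order; all numbers are toy assumptions) counts the timeslices lost to spinning:

```c
/* Four waiters hold tickets in the order E,D,C,B while the scheduler
 * runs them round-robin as B,C,D,E.  A timeslice is "wasted" when the
 * scheduled vcpu only spins because its ticket is not up yet. */
#include <assert.h>

static int wasted_timeslices(void)
{
	/* schedule order index 0..3 = B,C,D,E; tickets: E=0,D=1,C=2,B=3 */
	const int ticket[4] = { 3, 2, 1, 0 };
	int owner = 0, done = 0, wasted = 0;

	for (int slice = 0; done < 4; slice++) {
		int vcpu = slice % 4;		/* round-robin scheduler */

		if (ticket[vcpu] == owner) {
			owner++;		/* takes and releases the lock */
			done++;
		} else if (ticket[vcpu] > owner) {
			wasted++;		/* spins for its whole slice */
		}
		/* ticket[vcpu] < owner: already done, runs real work */
	}
	return wasted;
}
```

In this toy run the four hand-offs burn six whole timeslices in spinning; a lock granted to whichever spinner happens to be running would burn none.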

    J



* Re: Spinlocks: Factor our GENERIC_LOCKBREAK in order to avoid spin with irqs disable
  2008-06-26  9:17                     ` Petr Tesarik
@ 2008-06-26 17:02                       ` Christoph Lameter
  2008-06-26 17:48                         ` Peter Zijlstra
  0 siblings, 1 reply; 35+ messages in thread
From: Christoph Lameter @ 2008-06-26 17:02 UTC (permalink / raw)
  To: Petr Tesarik
  Cc: Jeremy Fitzhardinge, Peter Zijlstra, Ingo Molnar, linux-kernel,
	Nick Piggin

On Thu, 26 Jun 2008, Petr Tesarik wrote:

> We should probably re-think the whole locking scheme, because spinlocks
> were designed to be held for a short period of time. This was a fair
> assumption when they were introduced, but obviously it is now false in
> many cases (such as virtualization).

And NUMA.

> Ticket-based spinlocks have actually already changed the original
> design, so why not implement a generic "lock scheduler" on top of
> spinlock_t and rwlock_t?

How about making the semaphore / mutex code as fast as spinlocks and just 
use that?



* Re: Spinlocks: Factor our GENERIC_LOCKBREAK in order to avoid spin with irqs disable
  2008-06-26 17:02                       ` Christoph Lameter
@ 2008-06-26 17:48                         ` Peter Zijlstra
  0 siblings, 0 replies; 35+ messages in thread
From: Peter Zijlstra @ 2008-06-26 17:48 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Petr Tesarik, Jeremy Fitzhardinge, Peter Zijlstra, Ingo Molnar,
	linux-kernel, Nick Piggin

On Thu, 2008-06-26 at 10:02 -0700, Christoph Lameter wrote:
> On Thu, 26 Jun 2008, Petr Tesarik wrote:
> 
> > We should probably re-think the whole locking scheme, because spinlocks
> > were designed to be held for a short period of time. This was a fair
> > assumption when they were introduced, but obviously it is now false in
> > many cases (such as virtualization).
> 
> And NUMA.
> 
> > Ticket-based spinlocks have actually already changed the original
> > design, so why not implement a generic "lock scheduler" on top of
> > spinlock_t and rwlock_t?
> 
> How about making the semaphore / mutex code as fast as spinlocks and just 
> use that?

Hehe, please have a look at a recent -rt tree and possibly test some of
that if you're interested.

It has adaptive spins for rt_mutex. Which is to say, as long as the lock
owner is current on its cpu, we'll spin trying to acquire the lock. The
reasoning here is that as long as the owner is running, there is a fair
chance it'll release the lock soonish.

This is another case where semaphores can never match, due to their lack
of ownership.
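The decision rule can be sketched like this (hypothetical names, not the actual -rt rt_mutex code; the point is only that the waiter's choice keys off whether the owner is on a cpu):

```c
/* Adaptive-spin sketch: spin only while the lock owner is running on
 * a cpu, otherwise block.  Requires ownership tracking, which is why
 * plain semaphores can't do this. */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct task { bool on_cpu; };

struct adaptive_mutex { struct task *owner; };

enum lock_action { ACQUIRED, SPIN, BLOCK };

static enum lock_action adaptive_lock_step(struct adaptive_mutex *m,
					   struct task *self)
{
	struct task *owner = m->owner;

	if (!owner) {			/* uncontended: take it */
		m->owner = self;
		return ACQUIRED;
	}
	if (owner->on_cpu)		/* owner running: may release soon */
		return SPIN;
	return BLOCK;			/* owner preempted: sleep instead */
}
```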



* Re: Spinlocks: Factor our GENERIC_LOCKBREAK in order to avoid spin with irqs disable
  2008-06-26  2:51                   ` Jeremy Fitzhardinge
  2008-06-26  6:51                     ` Peter Zijlstra
  2008-06-26  9:17                     ` Petr Tesarik
@ 2008-07-07 11:50                     ` Nick Piggin
  2008-07-07 11:52                       ` Nick Piggin
  2008-07-07 15:53                       ` Jeremy Fitzhardinge
  2008-07-07 19:46                     ` Rik van Riel
  3 siblings, 2 replies; 35+ messages in thread
From: Nick Piggin @ 2008-07-07 11:50 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Peter Zijlstra, Christoph Lameter, Petr Tesarik, Ingo Molnar,
	linux-kernel

On Thursday 26 June 2008 12:51, Jeremy Fitzhardinge wrote:
> Peter Zijlstra wrote:
> > On Mon, 2008-06-23 at 13:45 -0700, Christoph Lameter wrote:
> >> On Mon, 23 Jun 2008, Peter Zijlstra wrote:
> >>>> It is good that the locks are build with _trylock and _can_lock
> >>>> because then we can reenable interrupts while spinning.
> >>>
> >>> Well, good and bad, the turn side is that fairness schemes like ticket
> >>> locks are utterly defeated.
> >>
> >> True. But maybe we can make these fairness schemes more generic so that
> >> they can go into core code?
> >
> > The trouble with ticket locks is that they can't handle waiters going
> > away - or in this case getting preempted by irq handlers. The one who
> > took the ticket must pass it on, so if you're preempted it just sits
> > there being idle, until you get back to deal with the lock.
> >
> > But yeah, perhaps another fairness scheme might work in the generic
> > code..
>
> Thomas Friebel presented results at the Xen Summit this week showing
> that ticket locks are an absolute disaster for scalability in a virtual
> environment, for a similar reason.  It's a bit irritating if the lock
> holder vcpu gets preempted by the hypervisor, but its much worse when
> they release the lock: unless the vcpu scheduler gives a cpu to the vcpu
> with the next ticket, it can waste up to N timeslices spinning.

I didn't realise it is good practice to run multiple "virtual CPUs"
of the same guest on a single physical CPU on the host...


> I'm experimenting with adding pvops hook to allow you to put in new
> spinlock implementations on the fly.  If nothing else, it will be useful
> for experimenting with different algorithms.  But it definitely seems
> like the old unfair lock algorithm played much better with a virtual
> environment, because the next cpu to get the lock is the next one the
> scheduler gives time, rather than dictating an order - and the scheduler
> should mitigate the unfairness that ticket locks were designed to solve.

... if it is good practice, then, virtualizing spinlocks I guess is
reasonable. If not, then "don't do that". Considering that probably
many bare metal systems will run pv kernels, every little cost adds
up.



* Re: Spinlocks: Factor our GENERIC_LOCKBREAK in order to avoid spin with irqs disable
  2008-07-07 11:50                     ` Nick Piggin
@ 2008-07-07 11:52                       ` Nick Piggin
  2008-07-07 15:56                         ` Jeremy Fitzhardinge
  2008-07-07 15:53                       ` Jeremy Fitzhardinge
  1 sibling, 1 reply; 35+ messages in thread
From: Nick Piggin @ 2008-07-07 11:52 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Peter Zijlstra, Christoph Lameter, Petr Tesarik, Ingo Molnar,
	linux-kernel

On Monday 07 July 2008 21:50, Nick Piggin wrote:
> On Thursday 26 June 2008 12:51, Jeremy Fitzhardinge wrote:

> > Thomas Friebel presented results at the Xen Summit this week showing
> > that ticket locks are an absolute disaster for scalability in a virtual
> > environment, for a similar reason.  It's a bit irritating if the lock
> > holder vcpu gets preempted by the hypervisor, but its much worse when
> > they release the lock: unless the vcpu scheduler gives a cpu to the vcpu
> > with the next ticket, it can waste up to N timeslices spinning.
>
> I didn't realise it is good practice to run multiple "virtual CPUs"
> of the same guest on a single physical CPU on the host...
>
> > I'm experimenting with adding pvops hook to allow you to put in new
> > spinlock implementations on the fly.  If nothing else, it will be useful
> > for experimenting with different algorithms.  But it definitely seems
> > like the old unfair lock algorithm played much better with a virtual
> > environment, because the next cpu to get the lock is the next one the
> > scheduler gives time, rather than dictating an order - and the scheduler
> > should mitigate the unfairness that ticket locks were designed to solve.
>
> ... if it is good practice, then, virtualizing spinlocks I guess is
> reasonable. If not, then "don't do that". Considering that probably
> many bare metal systems will run pv kernels, every little cost adds
> up.

Although, you wouldn't need to oversubscribe physical CPUs to hit
suboptimal behaviour.

Basically, I just ask for the performance improvement to be measured
with some "realistic" configuration; then it should be easier to
justify.


* Re: Spinlocks: Factor our GENERIC_LOCKBREAK in order to avoid spin with irqs disable
  2008-07-07 11:50                     ` Nick Piggin
  2008-07-07 11:52                       ` Nick Piggin
@ 2008-07-07 15:53                       ` Jeremy Fitzhardinge
  1 sibling, 0 replies; 35+ messages in thread
From: Jeremy Fitzhardinge @ 2008-07-07 15:53 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Peter Zijlstra, Christoph Lameter, Petr Tesarik, Ingo Molnar,
	linux-kernel

Nick Piggin wrote:
> On Thursday 26 June 2008 12:51, Jeremy Fitzhardinge wrote:
>   
>> Peter Zijlstra wrote:
>>     
>>> On Mon, 2008-06-23 at 13:45 -0700, Christoph Lameter wrote:
>>>       
>>>> On Mon, 23 Jun 2008, Peter Zijlstra wrote:
>>>>         
>>>>>> It is good that the locks are build with _trylock and _can_lock
>>>>>> because then we can reenable interrupts while spinning.
>>>>>>             
>>>>> Well, good and bad, the turn side is that fairness schemes like ticket
>>>>> locks are utterly defeated.
>>>>>           
>>>> True. But maybe we can make these fairness schemes more generic so that
>>>> they can go into core code?
>>>>         
>>> The trouble with ticket locks is that they can't handle waiters going
>>> away - or in this case getting preempted by irq handlers. The one who
>>> took the ticket must pass it on, so if you're preempted it just sits
>>> there being idle, until you get back to deal with the lock.
>>>
>>> But yeah, perhaps another fairness scheme might work in the generic
>>> code..
>>>       
>> Thomas Friebel presented results at the Xen Summit this week showing
>> that ticket locks are an absolute disaster for scalability in a virtual
>> environment, for a similar reason.  It's a bit irritating if the lock
>> holder vcpu gets preempted by the hypervisor, but its much worse when
>> they release the lock: unless the vcpu scheduler gives a cpu to the vcpu
>> with the next ticket, it can waste up to N timeslices spinning.
>>     
>
> I didn't realise it is good practice to run multiple "virtual CPUs"
> of the same guest on a single physical CPU on the host...
>   

It isn't.  It makes no sense at all to give a guest more vcpus than 
physical cpus, so that kind of contention won't happen in general.  But 
the bad locking scenario happens when there's any system-wide 
contention, so it could happen if some other virtual machine preempts a 
vcpu holding a lock.  And once a lock ends up being (effectively) held 
for 30ms rather than 30us, the likelihood of going into contention goes 
way up, and you can enter the catastrophic N^2 unlock->relock state.
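
To make that failure mode concrete, here is a minimal user-space sketch of a ticket lock (illustrative only, not the kernel's x86 implementation): unlock hands the lock to one specific ticket number, so if the vcpu holding the next ticket is descheduled, every other spinner just burns its timeslice.

```c
#include <stdatomic.h>

/* Illustrative ticket lock.  Unlock advances "owner" by one, so only
 * the holder of the next ticket can make progress -- if that vcpu is
 * descheduled, all other waiters spin uselessly until it runs again. */
struct ticket_lock {
	atomic_uint next;	/* next ticket to hand out */
	atomic_uint owner;	/* ticket currently allowed in */
};

static void ticket_lock(struct ticket_lock *l)
{
	unsigned int me = atomic_fetch_add(&l->next, 1);

	while (atomic_load(&l->owner) != me)
		;	/* cpu_relax() in the kernel */
}

static void ticket_unlock(struct ticket_lock *l)
{
	atomic_fetch_add(&l->owner, 1);	/* strict FIFO handoff */
}
```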

My measurements show that reverting to the old lock-byte algorithm 
avoids the worst case, and just results in a bit of excessive spinning.  
Replacing it with a smarter spin-then-block-vcpu algorithm doesn't 
really benefit the specific guest VM very much (kernbench elapsed time 
is only slightly improved), but its consumption of physical cpu time can 
go down by ~10%.

>> I'm experimenting with adding a pvops hook to allow you to put in new
>> spinlock implementations on the fly.  If nothing else, it will be useful
>> for experimenting with different algorithms.  But it definitely seems
>> like the old unfair lock algorithm played much better with a virtual
>> environment, because the next cpu to get the lock is the next one the
>> scheduler gives time, rather than dictating an order - and the scheduler
>> should mitigate the unfairness that ticket locks were designed to solve.
>>     
>
> ... if it is good practice, then, virtualizing spinlocks I guess is
> reasonable. If not, then "don't do that". Considering that probably
> many bare metal systems will run pv kernels, every little cost adds
> up

I'm aware of that.  In my current implementation the overhead amounts to 
an extra direct call in the lock/unlock path, but that can be eliminated 
with a small amount of restructuring (by making spin_lock/unlock() 
inline functions, and having the call to raw_spin_lock/unlock within 
it).  The pvops patching machinery removes any indirect calls or jumps.

    J

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Spinlocks: Factor our GENERIC_LOCKBREAK in order to avoid spin with irqs disable
  2008-07-07 11:52                       ` Nick Piggin
@ 2008-07-07 15:56                         ` Jeremy Fitzhardinge
  2008-07-08  2:08                           ` Nick Piggin
  0 siblings, 1 reply; 35+ messages in thread
From: Jeremy Fitzhardinge @ 2008-07-07 15:56 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Peter Zijlstra, Christoph Lameter, Petr Tesarik, Ingo Molnar,
	linux-kernel

Nick Piggin wrote:
> Although, you wouldn't need to oversubscribe physical CPUs to hit
> suboptimal behaviour.
>
> Basically, I just ask for performance improvement to be measured
> with some "realistic" configuration, then it should be easier to
> justify.

Overcommitting cpus system-wide is pretty common.  A use-case is 
consolidating a pile of underused non-performance-critical servers into 
one physical machine with a lot of VMs.  It doesn't take much load to 
cause a dhcp request to the dhcp server to steal a cpu away from the 
mysql server while it holds a lock...

I'll mail out my patches so we can have a more concrete discussion shortly.

    J

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Spinlocks: Factor our GENERIC_LOCKBREAK in order to avoid spin with irqs disable
  2008-06-26  2:51                   ` Jeremy Fitzhardinge
                                       ` (2 preceding siblings ...)
  2008-07-07 11:50                     ` Nick Piggin
@ 2008-07-07 19:46                     ` Rik van Riel
  2008-07-07 20:14                       ` Jeremy Fitzhardinge
  3 siblings, 1 reply; 35+ messages in thread
From: Rik van Riel @ 2008-07-07 19:46 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Peter Zijlstra, Christoph Lameter, Petr Tesarik, Ingo Molnar,
	linux-kernel, Nick Piggin

On Wed, 25 Jun 2008 19:51:12 -0700
Jeremy Fitzhardinge <jeremy@goop.org> wrote:

> Thomas Friebel presented results at the Xen Summit this week showing 
> that ticket locks are an absolute disaster for scalability in a virtual 
> environment, for a similar reason.  It's a bit irritating if the lock 
> holder vcpu gets preempted by the hypervisor, but it's much worse when
> they release the lock: unless the vcpu scheduler gives a cpu to the vcpu 
> with the next ticket, it can waste up to N timeslices spinning.
> 
> I'm experimenting with adding a pvops hook to allow you to put in new
> spinlock implementations on the fly.

Alternatively, the guest could tell the host which vcpus
are next in line for a ticket spinlock, or a vcpu that gets
scheduled but is not supposed to grab the lock yet can give
some CPU time to the vcpu that should get the lock next.

I believe the IBM PPC64 people have done some work to implement
just that.

-- 
All Rights Reversed

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Spinlocks: Factor our GENERIC_LOCKBREAK in order to avoid spin with irqs disable
  2008-07-07 19:46                     ` Rik van Riel
@ 2008-07-07 20:14                       ` Jeremy Fitzhardinge
  2008-07-08  2:07                         ` Nick Piggin
  0 siblings, 1 reply; 35+ messages in thread
From: Jeremy Fitzhardinge @ 2008-07-07 20:14 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Peter Zijlstra, Christoph Lameter, Petr Tesarik, Ingo Molnar,
	linux-kernel, Nick Piggin

Rik van Riel wrote:
> Alternatively, the guest could tell the host which vcpus
> are next in line for a ticket spinlock, or a vcpu that gets
> scheduled but is not supposed to grab the lock yet can give
> some CPU time to the vcpu that should get the lock next.
>   

Those are possible, but would require either 1) hypervisor changes, 
and/or 2) changes no less extensive than the ones I had to make anyway.

Thomas's proposal was to modify the scheduler to try to avoid 
preempting vcpus while they're in kernel mode.  That's nice because it 
requires no guest changes, and seems at least somewhat successful at 
mitigating the problem.  But it can't completely solve the problem, and 
you end up with a bunch of heuristics in the hypervisor to decide who to 
preempt.

The other point, of course, is that ticket locks are massive overkill 
for the problem they're trying to solve.  It's one thing to introduce an 
element of fairness into spinlocks, it's another to impose strict FIFO 
ordering.  It would be enough to make the locks "polite" by preventing a 
new lock-holder from taking the lock while it's under contention.  
Something like:

	union lock {
		unsigned short word;
		struct { unsigned char lock, count; };
	};

spin_lock:		# ebx - lock pointer
	movw	$0x0001, %ax	# add 1 to lock, 0 to count
	xaddw	%ax, (%ebx)	# attempt to take lock and test user count
	testw	%ax,%ax
	jnz	slow

taken:	ret

# slow path
slow:	lock incb 1(%ebx)	# inc count

1:	rep;nop
	cmpb	$0,(%ebx)
	jnz	1b		# wait for unlocked
	
	movb	$1,%al		# attempt to take lock (count already increased)
	xchgb	%al,(%ebx)
	testb	%al,%al
	jnz	1b

	lock decb 1(%ebx)	# drop count
	jmp	taken

spin_unlock:
	movb	$0,(%ebx)
	ret


The uncontended fastpath is similar to the pre-ticket locks, but it 
refuses to take the lock if there are other waiters, even if the lock is 
not currently held.  This prevents the rapid lock-unlock cycle on one 
CPU from starving another CPU, which I understand was the original 
problem ticket locks were trying to solve.

But it also means that all the contended spinners get the lock in 
whatever order the system decides to give it to them, rather than 
imposing a strict order.
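
For readers less fluent in asm, a rough C equivalent of the sketch above (illustrative only; the asm's single xaddw takes the lock and tests the waiter count in one operation, which the fast path here approximates with a load plus a compare-exchange, so it is not byte-for-byte equivalent):

```c
#include <stdatomic.h>

/* "Polite" byte lock plus waiter count: a free lock is only taken on
 * the fast path when nobody else is already contending for it. */
struct polite_lock {
	atomic_uchar lock;	/* 0 = free, 1 = held */
	atomic_uchar count;	/* contended waiters */
};

static void polite_spin_lock(struct polite_lock *l)
{
	unsigned char unlocked = 0;

	/* Fast path: take a free lock only if there are no waiters. */
	if (atomic_load(&l->count) == 0 &&
	    atomic_compare_exchange_strong(&l->lock, &unlocked, 1))
		return;

	/* Slow path: register as a waiter, then spin. */
	atomic_fetch_add(&l->count, 1);
	for (;;) {
		while (atomic_load(&l->lock) != 0)
			;	/* rep;nop */
		if (atomic_exchange(&l->lock, 1) == 0)
			break;
	}
	atomic_fetch_sub(&l->count, 1);
}

static void polite_spin_unlock(struct polite_lock *l)
{
	atomic_store(&l->lock, 0);
}
```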

> I believe the IBM PPC64 people have done some work to implement
> just that.
>   

Do you have any references?

    J

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Spinlocks: Factor our GENERIC_LOCKBREAK in order to avoid spin with irqs disable
  2008-07-07 20:14                       ` Jeremy Fitzhardinge
@ 2008-07-08  2:07                         ` Nick Piggin
  2008-07-08  5:57                           ` Jeremy Fitzhardinge
  0 siblings, 1 reply; 35+ messages in thread
From: Nick Piggin @ 2008-07-08  2:07 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Rik van Riel, Peter Zijlstra, Christoph Lameter, Petr Tesarik,
	Ingo Molnar, linux-kernel

On Tuesday 08 July 2008 06:14, Jeremy Fitzhardinge wrote:

> The other point, of course, is that ticket locks are massive overkill
> for the problem they're trying to solve.

No they aren't.


> It's one thing to introduce an 
> element of fairness into spinlocks, it's another to impose strict FIFO
> ordering.  It would be enough to make the locks "polite" by preventing a
> new lock-holder from taking the lock while it's under contention.
> Something like:
>
> 	union lock {
> 		unsigned short word;
> 		struct { unsigned char lock, count; };
> 	};
>
> spin_lock:		# ebx - lock pointer
> 	movw	$0x0001, %ax	# add 1 to lock, 0 to count
> 	xaddw	%ax, (%ebx)	# attempt to take lock and test user count
> 	testw	%ax,%ax
> 	jnz	slow
>
> taken:	ret
>
> # slow path
> slow:	lock incb 1(%ebx)	# inc count
>
> 1:	rep;nop
> 	cmpb	$0,(%ebx)
> 	jnz	1b		# wait for unlocked
>
> 	movb	$1,%al		# attempt to take lock (count already increased)
> 	xchgb	%al,(%ebx)
> 	testb	%al,%al
> 	jnz	1b
>
> 	lock decb 1(%ebx)	# drop count
> 	jmp	taken
>
> spin_unlock:
> 	movb	$0,(%ebx)
> 	ret
>
>
> The uncontended fastpath is similar to the pre-ticket locks, but it
> refuses to take the lock if there are other waiters, even if the lock is
> not currently held.  This prevents the rapid lock-unlock cycle on one
> CPU from starving another CPU, which I understand was the original
> problem ticket locks were trying to solve.

They prevent lots of unfairness and starvation problems. The most
prominent one (ie. actually observed in Linux) was a single CPU
being totally starved by N others (to the point where lockup timers
would kick in).

As an aside, these locks you propose are also a lot more costly in
the contended path. 4 vs 1 atomic operations on the lock cacheline
is not so great.


> But it also means that all the contended spinners get the lock in
> whatever order the system decides to give it to them, rather than
> imposing a strict order.

The exact problem is that the system actively does the wrong thing
when you allow it to decide.

Unlike simple cacheline access, I don't believe it is such a good
idea to batch up locks many times on the same CPU for example. While
it surely could improve performance in specific situations, I think
that if code is taking and releasing a lock many times, then it most
probably should be either reworked to hold the lock for longer, or
changed completely. And once locks become *really* contended, then
the cost of moving the critical section to another CPU is really
drowned out by the cost of contention itself (all the idle time,
and taking/releasing the lock cacheline).

So far my theory has held up (except for virtualized systems).


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Spinlocks: Factor our GENERIC_LOCKBREAK in order to avoid spin with irqs disable
  2008-07-07 15:56                         ` Jeremy Fitzhardinge
@ 2008-07-08  2:08                           ` Nick Piggin
  0 siblings, 0 replies; 35+ messages in thread
From: Nick Piggin @ 2008-07-08  2:08 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Peter Zijlstra, Christoph Lameter, Petr Tesarik, Ingo Molnar,
	linux-kernel

On Tuesday 08 July 2008 01:56, Jeremy Fitzhardinge wrote:
> Nick Piggin wrote:
> > Although, you wouldn't need to oversubscribe physical CPUs to hit
> > suboptimal behaviour.
> >
> > Basically, I just ask for performance improvement to be measured
> > with some "realistic" configuration, then it should be easier to
> > justify.
>
> Overcommitting cpus system-wide is pretty common.

(yes, sorry I meant: oversubscribing with a single guest)

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Spinlocks: Factor our GENERIC_LOCKBREAK in order to avoid spin with irqs disable
  2008-07-08  2:07                         ` Nick Piggin
@ 2008-07-08  5:57                           ` Jeremy Fitzhardinge
  2008-07-08  8:41                             ` Nick Piggin
  0 siblings, 1 reply; 35+ messages in thread
From: Jeremy Fitzhardinge @ 2008-07-08  5:57 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Rik van Riel, Peter Zijlstra, Christoph Lameter, Petr Tesarik,
	Ingo Molnar, linux-kernel

Nick Piggin wrote:
> On Tuesday 08 July 2008 06:14, Jeremy Fitzhardinge wrote:
>
>   
>> The other point, of course, is that ticket locks are massive overkill
>> for the problem they're trying to solve.
>>     
>
> No they aren't.
>
>
>   
>> It's one thing to introduce an 
>> element of fairness into spinlocks, it's another to impose strict FIFO
>> ordering.  It would be enough to make the locks "polite" by preventing a
>> new lock-holder from taking the lock while it's under contention.
>> Something like:
>>
>> 	union lock {
>> 		unsigned short word;
>> 		struct { unsigned char lock, count; };
>> 	};
>>
>> spin_lock:		# ebx - lock pointer
>> 	movw	$0x0001, %ax	# add 1 to lock, 0 to count
>> 	xaddw	%ax, (%ebx)	# attempt to take lock and test user count
>> 	testw	%ax,%ax
>> 	jnz	slow
>>
>> taken:	ret
>>
>> # slow path
>> slow:	lock incb 1(%ebx)	# inc count
>>
>> 1:	rep;nop
>> 	cmpb	$0,(%ebx)
>> 	jnz	1b		# wait for unlocked
>>
>> 	movb	$1,%al		# attempt to take lock (count already increased)
>> 	xchgb	%al,(%ebx)
>> 	testb	%al,%al
>> 	jnz	1b
>>
>> 	lock decb 1(%ebx)	# drop count
>> 	jmp	taken
>>
>> spin_unlock:
>> 	movb	$0,(%ebx)
>> 	ret
>>
>>
>> The uncontended fastpath is similar to the pre-ticket locks, but it
>> refuses to take the lock if there are other waiters, even if the lock is
>> not currently held.  This prevents the rapid lock-unlock cycle on one
>> CPU from starving another CPU, which I understand was the original
>> problem ticket locks were trying to solve.
>>     
>
> They prevent lots of unfairness and starvation problems. The most
> prominent one (ie. actually observed in Linux) was a single CPU
> being totally starved by N others (to the point where lockup timers
> would kick in).
>   

Yep.  My understanding was that the specific case was that cpu A was 
repeatedly taking and releasing the lock, while other cpus spin waiting 
for it, and that the cache coherence logic kept the cacheline owned by A 
(presumably because it kept modifying it).  Ticket locks work well in 
this case because, as you say, they enforce a fairness policy that the 
hardware doesn't implement.  Are there other cases that ticket locks 
help with?  Does the algorithm above solve the starvation issue?

> As an aside, these locks you propose are also a lot more costly in
> the contended path. 4 vs 1 atomic operations on the lock cacheline
> is not so great.
>   

Yep, that's not great.  But it doesn't bounce cache lines around as much 
either, so perhaps it doesn't make much difference.

But really, I was being my own devil's advocate, to see if there's some 
other lock algorithm which satisfies both the normal ticket-lock case 
and the virtualization case.

I have no real objections to ticket locks, so long as I can turn them off ;)

    J

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Spinlocks: Factor our GENERIC_LOCKBREAK in order to avoid spin with irqs disable
  2008-07-08  5:57                           ` Jeremy Fitzhardinge
@ 2008-07-08  8:41                             ` Nick Piggin
  2008-07-08 15:58                               ` Jeremy Fitzhardinge
  0 siblings, 1 reply; 35+ messages in thread
From: Nick Piggin @ 2008-07-08  8:41 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Rik van Riel, Peter Zijlstra, Christoph Lameter, Petr Tesarik,
	Ingo Molnar, linux-kernel

On Tuesday 08 July 2008 15:57, Jeremy Fitzhardinge wrote:
> Nick Piggin wrote:

> > They prevent lots of unfairness and starvation problems. The most
> > prominent one (ie. actually observed in Linux) was a single CPU
> > being totally starved by N others (to the point where lockup timers
> > would kick in).
>
> Yep.  My understanding was that the specific case was that cpu A was
> repeatedly taking and releasing the lock, while other cpus spin waiting
> for it, and that the cache coherence logic kept the cacheline owned by A
> (presumably because it kept modifying it).

In that case, the CPUs I've tested seem to do a reasonable job of
avoiding hogging the lock. And they all obviously make something of
an attempt at starvation/fairness of simple cacheline access...

But spinlocks require cacheline access *at exactly the right time*
in order to make progress. So simple cacheline fairness can break
down quite easily.


> Ticket locks work well in 
> this case because, as you say, they enforce a fairness policy that the
> hardware doesn't implement.  Are there other cases that ticket locks
> help with?

Well just starvation and fairness mainly.

Interestingly, it actually has some better scalability properties under
high contention sometimes (because the lock aquisition slowpath requires
no further stores to the cacheline).

If a single CPU is allowed to take and release a contended lock
multiple times, I don't consider that in itself a bad property of
a lock implementation. I just reasoned that (maybe a little counter
intuitively) it isn't such an important thing to have either...


> Does the algorithm above solve the starvation issue? 

I don't believe so because there is still no guarantee you get out of
the slowpath (it's fairly similar to old spinlocks in that regard).


> > As an aside, these locks you propose are also a lot more costly in
> > the contended path. 4 vs 1 atomic operations on the lock cacheline
> > is not so great.
>
> Yep, that's not great.  But it doesn't bounce cache lines around as much
> either, so perhaps it doesn't make much difference.
>
> But really, I was being my own devil's advocate, to see if there's some
> other lock algorithm which satisfies both the normal ticket-lock case
> and the virtualization case.

Yeah, I'm not sure. I expect it would be hard to match ticket lock's
simplicity and guarantees.


> I have no real objections to ticket locks, so long as I can turn them off
> ;)

Sure. Btw. I have no problems with your patchset, but if SMP guests
are seriously used on x86, just remember the fairness issue. At some
point you might find you'll need an even more advanced lock for Xen.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Spinlocks: Factor our GENERIC_LOCKBREAK in order to avoid spin with irqs disable
  2008-07-08  8:41                             ` Nick Piggin
@ 2008-07-08 15:58                               ` Jeremy Fitzhardinge
  0 siblings, 0 replies; 35+ messages in thread
From: Jeremy Fitzhardinge @ 2008-07-08 15:58 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Rik van Riel, Peter Zijlstra, Christoph Lameter, Petr Tesarik,
	Ingo Molnar, linux-kernel

Nick Piggin wrote:
> Sure. Btw. I have no problems with your patchset, but if SMP guests
> are seriously used on x86, just remember the fairness issue. At some
> point you might find you'll need an even more advanced lock for Xen.
>   

Perhaps.  The spin lock has a bounded number of iterations before it 
goes into the vcpu-blocking path.  At that point, it's no longer subject 
to the whims of the cache coherency protocol, and is explicitly woken.  
But it may still have issues when competing with a rapid lock-unlocker.

A ticket-lock variant with some kind of directed yield might be the way 
to go if it turns out to be a problem.
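
A minimal sketch of the spin-then-block idea described above (hypothetical names: vcpu_block() stands in for whatever hypervisor primitive is available -- in the Xen patches it is an event-channel poll -- and SPIN_THRESHOLD is an arbitrary bound, not a tuned value):

```c
#include <stdatomic.h>

/* Spin for a bounded number of iterations, then give the cpu back to
 * the hypervisor instead of spinning uselessly while the holder is
 * descheduled. */
#define SPIN_THRESHOLD 1024

static void vcpu_block(void)
{
	/* Placeholder: block this vcpu until the unlocker explicitly
	 * wakes it (hypervisor-specific). */
}

static void spin_then_block_lock(atomic_uchar *lock)
{
	for (;;) {
		int spins;

		for (spins = 0; spins < SPIN_THRESHOLD; spins++) {
			unsigned char unlocked = 0;

			if (atomic_compare_exchange_weak(lock, &unlocked, 1))
				return;
		}
		vcpu_block();	/* bounded spinning, then sleep */
	}
}

static void spin_then_block_unlock(atomic_uchar *lock)
{
	atomic_store(lock, 0);
	/* A real implementation would also kick any blocked vcpus here. */
}
```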

    J

^ permalink raw reply	[flat|nested] 35+ messages in thread

end of thread, other threads:[~2008-07-08 15:58 UTC | newest]

Thread overview: 35+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2008-05-06 19:26 Spinlocks waiting with interrupts disabled / preempt disabled Christoph Lameter
2008-05-07  7:30 ` Ingo Molnar
2008-05-07 17:04   ` Christoph Lameter
2008-05-07 17:24     ` Christoph Lameter
2008-05-07 18:49       ` Spinlocks: Factor our GENERIC_LOCKBREAK in order to avoid spin with irqs disable Christoph Lameter
2008-05-09 10:26         ` Ingo Molnar
2008-05-09 16:28           ` Christoph Lameter
2008-06-23 17:19         ` Petr Tesarik
2008-06-23 17:54           ` Peter Zijlstra
2008-06-23 18:20           ` Christoph Lameter
2008-06-23 20:39             ` Peter Zijlstra
2008-06-23 20:45               ` Christoph Lameter
2008-06-23 20:58                 ` Peter Zijlstra
2008-06-26  2:51                   ` Jeremy Fitzhardinge
2008-06-26  6:51                     ` Peter Zijlstra
2008-06-26 15:49                       ` Jeremy Fitzhardinge
2008-06-26  9:17                     ` Petr Tesarik
2008-06-26 17:02                       ` Christoph Lameter
2008-06-26 17:48                         ` Peter Zijlstra
2008-07-07 11:50                     ` Nick Piggin
2008-07-07 11:52                       ` Nick Piggin
2008-07-07 15:56                         ` Jeremy Fitzhardinge
2008-07-08  2:08                           ` Nick Piggin
2008-07-07 15:53                       ` Jeremy Fitzhardinge
2008-07-07 19:46                     ` Rik van Riel
2008-07-07 20:14                       ` Jeremy Fitzhardinge
2008-07-08  2:07                         ` Nick Piggin
2008-07-08  5:57                           ` Jeremy Fitzhardinge
2008-07-08  8:41                             ` Nick Piggin
2008-07-08 15:58                               ` Jeremy Fitzhardinge
2008-05-09 16:35   ` Spinlocks waiting with interrupts disabled / preempt disabled Olaf Weber
2008-05-09 17:56     ` Peter Zijlstra
2008-05-09 18:00       ` Christoph Lameter
2008-05-09 18:06         ` Peter Zijlstra
2008-05-09 20:01       ` Olaf Weber

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox