* Spinlocks waiting with interrupts disabled / preempt disabled.
@ 2008-05-06 19:26 Christoph Lameter
2008-05-07 7:30 ` Ingo Molnar
0 siblings, 1 reply; 35+ messages in thread
From: Christoph Lameter @ 2008-05-06 19:26 UTC (permalink / raw)
To: Ingo Molnar; +Cc: linux-kernel
We just had a couple of cases of systems failing because interrupts were
not serviced. It turned out that we were trying to acquire the tree_lock
for write while lots of readers were starving the writer. A device
timed out before the write lock was acquired.
Such scenarios could be avoided if interrupts were enabled while busy
waiting.
Could we make the wait cases more interrupt/preempt friendly by
re-enabling interrupts and preemption while waiting? We had this
interrupt-friendly behavior on IA64 for a long time. With a dedicated
busy-wait path we could also avoid atomic ops and acquire the cacheline
only for read while polling.
F.e.
Index: linux/kernel/spinlock.c
===================================================================
--- linux.orig/kernel/spinlock.c 2008-05-05 12:22:16.000000000 -0500
+++ linux/kernel/spinlock.c 2008-05-06 14:05:43.953016660 -0500
@@ -132,10 +132,14 @@ unsigned long __lockfunc _write_lock_irq
{
unsigned long flags;
+retry:
local_irq_save(flags);
- preempt_disable();
- _raw_write_lock(lock);
- return flags;
+ if (_write_trylock(lock))
+ return flags;
+ local_irq_restore(flags);
+ while (!write_can_lock(lock))
+ cpu_relax();
+ goto retry;
}
EXPORT_SYMBOL(_write_lock_irqsave);
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: Spinlocks waiting with interrupts disabled / preempt disabled.
2008-05-06 19:26 Spinlocks waiting with interrupts disabled / preempt disabled Christoph Lameter
@ 2008-05-07 7:30 ` Ingo Molnar
2008-05-07 17:04 ` Christoph Lameter
2008-05-09 16:35 ` Spinlocks waiting with interrupts disabled / preempt disabled Olaf Weber
0 siblings, 2 replies; 35+ messages in thread
From: Ingo Molnar @ 2008-05-07 7:30 UTC (permalink / raw)
To: Christoph Lameter; +Cc: linux-kernel, Peter Zijlstra
* Christoph Lameter <clameter@sgi.com> wrote:
> @@ -132,10 +132,14 @@ unsigned long __lockfunc _write_lock_irq
> {
> unsigned long flags;
>
> +retry:
> local_irq_save(flags);
> - preempt_disable();
> - _raw_write_lock(lock);
> - return flags;
> + if (_write_trylock(lock))
> + return flags;
> + local_irq_restore(flags);
> + while (!write_can_lock(lock))
> + cpu_relax();
> + goto retry;
> }
> EXPORT_SYMBOL(_write_lock_irqsave);
hm, this is done on a too high level and will turn off some debugging
code. I.e. if we dont just loop long but truly deadlock here we wont
call lib/spinlock_debug.c's _raw_write_lock() code that does some sanity
checks in the debug case.
so how about doing this on a deeper level and adding a new
__raw_write_lock_flags() primitive that would look at the flags value
and could enable interrupts in the lowlevel code?
Ingo
* Re: Spinlocks waiting with interrupts disabled / preempt disabled.
2008-05-07 7:30 ` Ingo Molnar
@ 2008-05-07 17:04 ` Christoph Lameter
2008-05-07 17:24 ` Christoph Lameter
2008-05-09 16:35 ` Spinlocks waiting with interrupts disabled / preempt disabled Olaf Weber
1 sibling, 1 reply; 35+ messages in thread
From: Christoph Lameter @ 2008-05-07 17:04 UTC (permalink / raw)
To: Ingo Molnar; +Cc: linux-kernel, Peter Zijlstra
On Wed, 7 May 2008, Ingo Molnar wrote:
> > + return flags;
> > + local_irq_restore(flags);
> > + while (!write_can_lock(lock))
> > + cpu_relax();
> > + goto retry;
> > }
> > EXPORT_SYMBOL(_write_lock_irqsave);
>
> hm, this is done on a too high level and will turn off some debugging
> code. I.e. if we dont just loop long but truly deadlock here we wont
> call lib/spinlock_debug.c's _raw_write_lock() code that does some sanity
> checks in the debug case.
Right. I guessed as much, given the gazillion helper functions, and
wanted to know the right way to address this.
> so how about doing this on a deeper level and adding a new
> __raw_write_lock_flags() primitive that would look at the flags value
> and could enable interrupts in the lowlevel code?
Ok, will look at that. Note that this is not unique to _write_lock_irqsave;
all the other lock variants that disable interrupts seem to have the same
issue. We are likely going to duplicate a lot of functions.
* Re: Spinlocks waiting with interrupts disabled / preempt disabled.
2008-05-07 17:04 ` Christoph Lameter
@ 2008-05-07 17:24 ` Christoph Lameter
2008-05-07 18:49 ` Spinlocks: Factor out GENERIC_LOCKBREAK in order to avoid spin with irqs disabled Christoph Lameter
0 siblings, 1 reply; 35+ messages in thread
From: Christoph Lameter @ 2008-05-07 17:24 UTC (permalink / raw)
To: Ingo Molnar; +Cc: linux-kernel, Peter Zijlstra
Looks like an issue with old code in SLES10. Upstream has:
/*
* If lockdep is enabled then we use the non-preemption spin-ops
* even on CONFIG_PREEMPT, because lockdep assumes that interrupts are
* not re-enabled during lock-acquire (which the preempt-spin-ops do):
*/
#if !defined(CONFIG_GENERIC_LOCKBREAK) || defined(CONFIG_DEBUG_LOCK_ALLOC)
... followed by function definitions that disable interrupts before
spinning.
#else
...
unsigned long __lockfunc _##op##_lock_irqsave(locktype##_t *lock) \
{ \
unsigned long flags; \
\
for (;;) { \
preempt_disable(); \
local_irq_save(flags); \
if (likely(_raw_##op##_trylock(lock))) \
break; \
local_irq_restore(flags); \
preempt_enable(); \
\
if (!(lock)->break_lock) \
(lock)->break_lock = 1; \
while (!op##_can_lock(lock) && (lock)->break_lock) \
_raw_##op##_relax(&lock->raw_lock); \
} \
(lock)->break_lock = 0; \
return flags; \
} \
So this is only an issue if certain debug options are enabled.
* Spinlocks: Factor out GENERIC_LOCKBREAK in order to avoid spin with irqs disabled
2008-05-07 17:24 ` Christoph Lameter
@ 2008-05-07 18:49 ` Christoph Lameter
2008-05-09 10:26 ` Ingo Molnar
2008-06-23 17:19 ` Petr Tesarik
0 siblings, 2 replies; 35+ messages in thread
From: Christoph Lameter @ 2008-05-07 18:49 UTC (permalink / raw)
To: Ingo Molnar; +Cc: linux-kernel, Peter Zijlstra
Subject: Spinlocks: Factor out GENERIC_LOCKBREAK in order to avoid spin with irqs disabled
The nice spinlock functions that enable interrupts while spinning are only
usable if GENERIC_LOCKBREAK is set (and lock debugging is off, but lock
debugging would not be used in a production configuration anyway).
Factor out the dependencies on the ->break_lock field from the nice functions
into a set of BREAKLOCK macros (these cannot be functions since the type of
the lock variable varies).
The nice spinlock functions can then also be used if !GENERIC_LOCKBREAK.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
---
kernel/spinlock.c | 40 ++++++++++++++++++++++++++++++----------
1 file changed, 30 insertions(+), 10 deletions(-)
Index: linux-2.6/kernel/spinlock.c
===================================================================
--- linux-2.6.orig/kernel/spinlock.c 2008-05-07 11:19:31.000000000 -0700
+++ linux-2.6/kernel/spinlock.c 2008-05-07 11:40:56.000000000 -0700
@@ -65,7 +65,7 @@ EXPORT_SYMBOL(_write_trylock);
* even on CONFIG_PREEMPT, because lockdep assumes that interrupts are
* not re-enabled during lock-acquire (which the preempt-spin-ops do):
*/
-#if !defined(CONFIG_GENERIC_LOCKBREAK) || defined(CONFIG_DEBUG_LOCK_ALLOC)
+#ifdef CONFIG_DEBUG_LOCK_ALLOC
void __lockfunc _read_lock(rwlock_t *lock)
{
@@ -192,7 +192,7 @@ void __lockfunc _write_lock(rwlock_t *lo
EXPORT_SYMBOL(_write_lock);
-#else /* CONFIG_PREEMPT: */
+#else /* CONFIG_DEBUG_LOCK_ALLOC: */
/*
* This could be a long-held lock. We both prepare to spin for a long
@@ -201,6 +201,28 @@ EXPORT_SYMBOL(_write_lock);
*
* (We do this in a function because inlining it would be excessive.)
*/
+#ifdef CONFIG_GENERIC_LOCKBREAK
+
+#define BREAKLOCK(lock) (lock)->break_lock
+
+#define BREAKLOCK_ENABLE(lock) \
+do { \
+ if (!(lock)->break_lock) \
+ (lock)->break_lock = 1; \
+} while (0)
+
+#define BREAKLOCK_DISABLE(lock) \
+do { \
+ (lock)->break_lock = 0; \
+} while (0)
+
+#else
+
+#define BREAKLOCK(lock) 0
+#define BREAKLOCK_ENABLE(lock) do { } while (0)
+#define BREAKLOCK_DISABLE(lock) do { } while (0)
+
+#endif
#define BUILD_LOCK_OPS(op, locktype) \
void __lockfunc _##op##_lock(locktype##_t *lock) \
@@ -211,12 +233,11 @@ void __lockfunc _##op##_lock(locktype##_
break; \
preempt_enable(); \
\
- if (!(lock)->break_lock) \
- (lock)->break_lock = 1; \
- while (!op##_can_lock(lock) && (lock)->break_lock) \
+ BREAKLOCK_ENABLE(lock); \
+ while (!op##_can_lock(lock) && BREAKLOCK(lock)) \
_raw_##op##_relax(&lock->raw_lock); \
} \
- (lock)->break_lock = 0; \
+ BREAKLOCK_DISABLE(lock); \
} \
\
EXPORT_SYMBOL(_##op##_lock); \
@@ -233,12 +254,11 @@ unsigned long __lockfunc _##op##_lock_ir
local_irq_restore(flags); \
preempt_enable(); \
\
- if (!(lock)->break_lock) \
- (lock)->break_lock = 1; \
- while (!op##_can_lock(lock) && (lock)->break_lock) \
+ BREAKLOCK_ENABLE(lock); \
+ while (!op##_can_lock(lock) && BREAKLOCK(lock)) \
_raw_##op##_relax(&lock->raw_lock); \
} \
- (lock)->break_lock = 0; \
+ BREAKLOCK_DISABLE(lock); \
return flags; \
} \
\
* Re: Spinlocks: Factor out GENERIC_LOCKBREAK in order to avoid spin with irqs disabled
2008-05-07 18:49 ` Spinlocks: Factor out GENERIC_LOCKBREAK in order to avoid spin with irqs disabled Christoph Lameter
@ 2008-05-09 10:26 ` Ingo Molnar
2008-05-09 16:28 ` Christoph Lameter
2008-06-23 17:19 ` Petr Tesarik
1 sibling, 1 reply; 35+ messages in thread
From: Ingo Molnar @ 2008-05-09 10:26 UTC (permalink / raw)
To: Christoph Lameter; +Cc: linux-kernel, Peter Zijlstra
* Christoph Lameter <clameter@sgi.com> wrote:
> Subject: Spinlocks: Factor out GENERIC_LOCKBREAK in order to avoid
> spin with irqs disabled
>
> The nice spinlock functions that enable interrupts while spinning are
> only usable if GENERIC_LOCKBREAK is set (and we do not debug locks but
> lock debugging would not be used for a production configuration).
>
> Factor out the dependencies on the ->break_lock field from the nice
> functions into a set of BREAKLOCK macros (cannot be functions since
> the type of the lock variable varies).
>
> The nice spinlock function can then also be used if !GENERIC_LOCKBREAK
hm, there was some lockdep complication in this area. I guess we could
use the 'nice' variants too if their irq-enabling/disabling was properly
lockdep annotated and tracked by the irqflags machinery?
Ingo
* Re: Spinlocks: Factor out GENERIC_LOCKBREAK in order to avoid spin with irqs disabled
2008-05-09 10:26 ` Ingo Molnar
@ 2008-05-09 16:28 ` Christoph Lameter
0 siblings, 0 replies; 35+ messages in thread
From: Christoph Lameter @ 2008-05-09 16:28 UTC (permalink / raw)
To: Ingo Molnar; +Cc: linux-kernel, Peter Zijlstra
On Fri, 9 May 2008, Ingo Molnar wrote:
> hm, there was some lockdep complication in this area. I guess we could
> use the 'nice' variants too if their irq-enabling/disabling was properly
> lockdep annotated and tracked by the irqflags machinery?
I thought that we already fall back to the functions that spin while
having interrupts disabled if lockdep is on?
One issue: BREAKLOCK needs to expand to 1 in the fallback case, otherwise
the relax loop never spins and we needlessly hammer the lock with atomic
trylock ops.
---
kernel/spinlock.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
Index: linux-2.6/kernel/spinlock.c
===================================================================
--- linux-2.6.orig/kernel/spinlock.c 2008-05-07 16:00:06.000000000 -0700
+++ linux-2.6/kernel/spinlock.c 2008-05-07 16:00:29.000000000 -0700
@@ -218,7 +218,7 @@ do { \
#else
-#define BREAKLOCK(lock) 0
+#define BREAKLOCK(lock) 1
#define BREAKLOCK_ENABLE(lock) do { } while (0)
#define BREAKLOCK_DISABLE(lock) do { } while (0)
* Re: Spinlocks waiting with interrupts disabled / preempt disabled.
2008-05-07 7:30 ` Ingo Molnar
2008-05-07 17:04 ` Christoph Lameter
@ 2008-05-09 16:35 ` Olaf Weber
2008-05-09 17:56 ` Peter Zijlstra
1 sibling, 1 reply; 35+ messages in thread
From: Olaf Weber @ 2008-05-09 16:35 UTC (permalink / raw)
To: Ingo Molnar; +Cc: Christoph Lameter, linux-kernel, Peter Zijlstra
Ingo Molnar writes:
> * Christoph Lameter <clameter@sgi.com> wrote:
>> @@ -132,10 +132,14 @@ unsigned long __lockfunc _write_lock_irq
>> {
>> unsigned long flags;
>>
>> +retry:
>> local_irq_save(flags);
>> - preempt_disable();
>> - _raw_write_lock(lock);
>> - return flags;
>> + if (_write_trylock(lock))
>> + return flags;
>> + local_irq_restore(flags);
>> + while (!write_can_lock(lock))
>> + cpu_relax();
>> + goto retry;
>> }
>> EXPORT_SYMBOL(_write_lock_irqsave);
> hm, this is done on a too high level and will turn off some debugging
> code. I.e. if we dont just loop long but truly deadlock here we wont
> call lib/spinlock_debug.c's _raw_write_lock() code that does some sanity
> checks in the debug case.
> so how about doing this on a deeper level and adding a new
> __raw_write_lock_flags() primitive that would look at the flags value
> and could enable interrupts in the lowlevel code?
Thinking of an alternative approach, I'd like to revisit the original
problem for a moment: on a (big) system running a kernel based on
2.6.16 we found that in the following stack trace, pdflush has just
enabled irqs again on its cpu after running with irqs disabled for
about 2 minutes.
>> bt 0xe0000168f4378000
================================================================
STACK TRACE FOR TASK: 0xe0000168f4378000 (pdflush)
0 kdba_main_loop+0x14c [0xa0000001004190cc]
1 kdb+0x11fc [0xa0000001002a639c]
2 kdb_ipi+0x11c [0xa0000001002a12dc]
3 handle_IPI+0x27c [0xa00000010005671c]
4 handle_IRQ_event+0x8c [0xa0000001000fa40c]
5 __do_IRQ+0x12c [0xa0000001000fa5cc]
6 ia64_handle_irq+0x15c [0xa00000010001213c]
7 ia64_leave_kernel [0xa00000010000cb20]
8 _write_unlock_irqrestore+0x40 [0xa000000100558500]
9 test_set_page_writeback+0x18c [0xa000000100114b6c]
10 xfs_start_page_writeback+0x2c [0xa0000002e4ab914c]
11 xfs_page_state_convert+0xa0c [0xa0000002e4abcbac]
12 xfs_vm_writepage+0x19c [0xa0000002e4abd13c]
13 mpage_writepages+0x3fc [0xa0000001001bcb5c]
14 xfs_vm_writepages+0xac [0xa0000002e4ab9eac]
15 do_writepages+0xac [0xa0000001001136cc]
16 __writeback_single_inode+0x47c [0xa0000001001b8f7c]
17 sync_sb_inodes+0x4ac [0xa0000001001b9cec]
18 writeback_inodes+0x19c [0xa0000001001baafc]
19 wb_kupdate+0x26c [0xa000000100113bcc]
20 pdflush+0x38c [0xa00000010011582c]
21 kthread+0x22c [0xa0000001000c9dac]
22 kernel_thread_helper+0xcc [0xa000000100012f6c]
23 start_kernel_thread+0x1c [0xa0000001000094bc]
================================================================
There is, as far as I could tell, no code in the critical region of
test_set_page_writeback that could take this much time, except for the
call to _write_lock_irqsave itself. The root of the problem seems to
have been that many threads went after the read lock, and an rwlock_t
will let new readers in while other readers hold the lock, even if a
writer needs it, thus potentially starving the writer indefinitely (up
to 2 minutes in the case under discussion).
Re-enabling irqs when looping in _write_lock_irqsave (and *_irq and
*_bh) works here because _write_lock_irqsave() wasn't called with irqs
disabled. But pdflush will still spend a long time getting the lock.
The reason we're pursuing this nevertheless is that it is easy to do
starting from the current code-base, and doesn't impose API or ABI
changes.
An alternative approach would be to view this as a bug in the rwlock_t
code in particular, and fix it by keeping new readers from acquiring
an rwlock_t while a writer is also trying to lock it. For x86 at least,
I can sketch an implementation that is based on the new ticket-based
spinlocks:
- use the ticket-based lock for writers
- add a reader count for readers (16 bits out of 32 are free)
- readers get in if the write lock is idle
- if the write lock wasn't idle, readers loop until it is
- on obtaining the ticket-based lock, the writer waits for active
readers to leave (e.g. for the reader count to drop to zero)
Contention by writers now keeps readers out, which may be a problem.
Advantages of such an approach:
- Writers are "fair" amongst themselves.
- Readers cannot hold off writers.
Disadvantages:
- Contention among writers can hold off readers indefinitely.
This scheme only works if the write lock is not "permanently"
contended, otherwise it goes from frying pan to fire.
- Requires changes to the asm-coded locking primitives, which means
it is likely to be done for x86 and ia64 only, unless the platform
maintainers step in.
Given the above, is this worth pursuing for inclusion in mainline?
A third solution would be to look at this problem on a case-by-case
basis. In the case under discussion, there may be other kernel bugs
that aggravate the problem (it is an old kernel after all) and maybe
this just means the address_space.tree_lock ought to be replaced with
something else (though I wouldn't claim to know what).
--
Olaf Weber SGI Phone: +31(0)30-6696752
Veldzigt 2b Fax: +31(0)30-6696799
Technical Lead 3454 PW de Meern Vnet: 955-7151
Storage Software The Netherlands Email: olaf@sgi.com
* Re: Spinlocks waiting with interrupts disabled / preempt disabled.
2008-05-09 16:35 ` Spinlocks waiting with interrupts disabled / preempt disabled Olaf Weber
@ 2008-05-09 17:56 ` Peter Zijlstra
2008-05-09 18:00 ` Christoph Lameter
2008-05-09 20:01 ` Olaf Weber
0 siblings, 2 replies; 35+ messages in thread
From: Peter Zijlstra @ 2008-05-09 17:56 UTC (permalink / raw)
To: Olaf Weber; +Cc: Ingo Molnar, Christoph Lameter, linux-kernel
On Fri, 2008-05-09 at 18:35 +0200, Olaf Weber wrote:
> Ingo Molnar writes:
> > * Christoph Lameter <clameter@sgi.com> wrote:
>
> >> @@ -132,10 +132,14 @@ unsigned long __lockfunc _write_lock_irq
> >> {
> >> unsigned long flags;
> >>
> >> +retry:
> >> local_irq_save(flags);
> >> - preempt_disable();
> >> - _raw_write_lock(lock);
> >> - return flags;
> >> + if (_write_trylock(lock))
> >> + return flags;
> >> + local_irq_restore(flags);
> >> + while (!write_can_lock(lock))
> >> + cpu_relax();
> >> + goto retry;
> >> }
> >> EXPORT_SYMBOL(_write_lock_irqsave);
>
> > hm, this is done on a too high level and will turn off some debugging
> > code. I.e. if we dont just loop long but truly deadlock here we wont
> > call lib/spinlock_debug.c's _raw_write_lock() code that does some sanity
> > checks in the debug case.
>
> > so how about doing this on a deeper level and adding a new
> > __raw_write_lock_flags() primitive that would look at the flags value
> > and could enable interrupts in the lowlevel code?
>
> Thinking of an alternative approach, I'd like to revisit the original
> problem for a moment: on a (big) system running a kernel based on
> 2.6.16 we found that in the following stack trace, pdflush has just
> enabled irqs again on its cpu after running with irqs disabled for
> about 2 minutes.
>
> >> bt 0xe0000168f4378000
> ================================================================
> STACK TRACE FOR TASK: 0xe0000168f4378000 (pdflush)
>
> 0 kdba_main_loop+0x14c [0xa0000001004190cc]
> 1 kdb+0x11fc [0xa0000001002a639c]
> 2 kdb_ipi+0x11c [0xa0000001002a12dc]
> 3 handle_IPI+0x27c [0xa00000010005671c]
> 4 handle_IRQ_event+0x8c [0xa0000001000fa40c]
> 5 __do_IRQ+0x12c [0xa0000001000fa5cc]
> 6 ia64_handle_irq+0x15c [0xa00000010001213c]
> 7 ia64_leave_kernel [0xa00000010000cb20]
> 8 _write_unlock_irqrestore+0x40 [0xa000000100558500]
> 9 test_set_page_writeback+0x18c [0xa000000100114b6c]
> 10 xfs_start_page_writeback+0x2c [0xa0000002e4ab914c]
> 11 xfs_page_state_convert+0xa0c [0xa0000002e4abcbac]
> 12 xfs_vm_writepage+0x19c [0xa0000002e4abd13c]
> 13 mpage_writepages+0x3fc [0xa0000001001bcb5c]
> 14 xfs_vm_writepages+0xac [0xa0000002e4ab9eac]
> 15 do_writepages+0xac [0xa0000001001136cc]
> 16 __writeback_single_inode+0x47c [0xa0000001001b8f7c]
> 17 sync_sb_inodes+0x4ac [0xa0000001001b9cec]
> 18 writeback_inodes+0x19c [0xa0000001001baafc]
> 19 wb_kupdate+0x26c [0xa000000100113bcc]
> 20 pdflush+0x38c [0xa00000010011582c]
> 21 kthread+0x22c [0xa0000001000c9dac]
> 22 kernel_thread_helper+0xcc [0xa000000100012f6c]
> 23 start_kernel_thread+0x1c [0xa0000001000094bc]
> ================================================================
>
> There is, as far as I could tell, no code in the critical region of
> test_set_page_writeback that could take this much time, except for the
> call to _write_lock_irqsave itself. There the root of the problem
> seems to have been that many threads went after the read lock, and an
> rwlock_t will allow readers in while readers hold the lock, even if a
> writer needs it, thus potentially starving the writer indefinitely (up
> to 2 minutes in the case under discussion).
>
> Re-enabling irqs when looping in _write_lock_irqsave (and *_irq and
> *_bh) works here because _write_lock_irqsave() wasn't called with irqs
> disabled. But pdflush will still spend a long time getting the lock.
>
> The reason we're pursuing this nevertheless is that it is easy to do
> starting from the current code-base, and doesn't impose API or ABI
> changes.
This would be the best option for your kABI-constrained .16 kernel.
> An alternative approach would be to view this as a bug in the rwlock_t
> code in particular, and fix it by keeping new readers from acquiring
> an rwlock_t if a writer is also trying to lock it. For x86 at least,
> I can sketch an implementation that is based on the new ticket-based
> spinlocks:
> - use the ticket-based lock for writers
> - add a reader count for readers (16 bits out of 32 are free)
> - readers get in if the write lock is idle
> - if the write lock wasn't idle, readers loop until it is
> - on obtaining the ticket-based lock, the writer waits for active
> readers to leave (eg reader count to go to zero)
> Contention by writers now keeps readers out, which may be a problem.
>
> Advantages of such an approach:
> - Writers are "fair" amongst themselves.
> - Readers cannot hold off writers.
>
> Disadvantages:
> - Contention among writers can hold off readers indefinitely.
> This scheme only works if the write lock is not "permanently"
> contended, otherwise it goes from frying pan to fire.
> - Requires changes to the asm-coded locking primitives, which means
> it is likely to be done for x86 and ia64 only, unless the platform
> maintainers step in.
>
> Given the above, is this worth pursuing for inclusion in mainline?
Keep in mind that rwlock_t must support recursive read locks. Much code
depends on this.
> A third solution would be to look at this problem on a case-by-case
> basis. In the case under discussion, there may be other kernel bugs
> that aggravate the problem (it is an old kernel after all) and maybe
> this just means the address_space.tree_lock ought to be replaced with
> something else (though I wouldn't claim to know what).
That has been done by Nick Piggin, the lockless pagecache work removes
the read side of the tree_lock.
* Re: Spinlocks waiting with interrupts disabled / preempt disabled.
2008-05-09 17:56 ` Peter Zijlstra
@ 2008-05-09 18:00 ` Christoph Lameter
2008-05-09 18:06 ` Peter Zijlstra
2008-05-09 20:01 ` Olaf Weber
1 sibling, 1 reply; 35+ messages in thread
From: Christoph Lameter @ 2008-05-09 18:00 UTC (permalink / raw)
To: Peter Zijlstra; +Cc: Olaf Weber, Ingo Molnar, linux-kernel, npiggin
On Fri, 9 May 2008, Peter Zijlstra wrote:
> > A third solution would be to look at this problem on a case-by-case
> > basis. In the case under discussion, there may be other kernel bugs
> > that aggravate the problem (it is an old kernel after all) and maybe
> > this just means the address_space.tree_lock ought to be replaced with
> > something else (though I wouldn't claim to know what).
>
> That has been done by Nick Piggin, the lockless pagecache work removes
> the read side of the tree_lock.
When is that going to get merged? find_get_page is still:
struct page * find_get_page(struct address_space *mapping, pgoff_t offset)
{
struct page *page;
read_lock_irq(&mapping->tree_lock);
page = radix_tree_lookup(&mapping->page_tree, offset);
if (page)
page_cache_get(page);
read_unlock_irq(&mapping->tree_lock);
return page;
}
EXPORT_SYMBOL(find_get_page);
* Re: Spinlocks waiting with interrupts disabled / preempt disabled.
2008-05-09 18:00 ` Christoph Lameter
@ 2008-05-09 18:06 ` Peter Zijlstra
0 siblings, 0 replies; 35+ messages in thread
From: Peter Zijlstra @ 2008-05-09 18:06 UTC (permalink / raw)
To: Christoph Lameter
Cc: Olaf Weber, Ingo Molnar, linux-kernel, npiggin, Andrew Morton,
Linus Torvalds
On Fri, 2008-05-09 at 11:00 -0700, Christoph Lameter wrote:
> On Fri, 9 May 2008, Peter Zijlstra wrote:
>
> > > A third solution would be to look at this problem on a case-by-case
> > > basis. In the case under discussion, there may be other kernel bugs
> > > that aggravate the problem (it is an old kernel after all) and maybe
> > > this just means the address_space.tree_lock ought to be replaced with
> > > something else (though I wouldn't claim to know what).
> >
> > That has been done by Nick Piggin, the lockless pagecache work removes
> > the read side of the tree_lock.
>
> When is that going to get merged? find_get_page is still:
>
> struct page * find_get_page(struct address_space *mapping, pgoff_t offset)
> {
> struct page *page;
>
> read_lock_irq(&mapping->tree_lock);
> page = radix_tree_lookup(&mapping->page_tree, offset);
> if (page)
> page_cache_get(page);
> read_unlock_irq(&mapping->tree_lock);
> return page;
> }
> EXPORT_SYMBOL(find_get_page);
Yeah, no idea when - although it looks like Nick has got another
speculative reference user lined up, which would greatly help its case.
Nick, can you post a series of patches against something recent that has
the latest lockless pagecache and all the fast_gup work in it, including
the power bits that use speculative get?
We can then go over all this stuff and perhaps manage to trick Andrew
and Linus into taking it for .27.
Linus seemed interested enough in the fast_gup bits.
And just in case someone forgot, the lockless pagecache has been in -rt
for a long long time - so far I've never seen issues with it.
* Re: Spinlocks waiting with interrupts disabled / preempt disabled.
2008-05-09 17:56 ` Peter Zijlstra
2008-05-09 18:00 ` Christoph Lameter
@ 2008-05-09 20:01 ` Olaf Weber
1 sibling, 0 replies; 35+ messages in thread
From: Olaf Weber @ 2008-05-09 20:01 UTC (permalink / raw)
To: Peter Zijlstra; +Cc: Ingo Molnar, Christoph Lameter, linux-kernel
Peter Zijlstra writes:
> On Fri, 2008-05-09 at 18:35 +0200, Olaf Weber wrote:
>> An alternative approach would be to view this as a bug in the rwlock_t
>> code in particular, and fix it by keeping new readers from acquiring
>> an rwlock_t if a writer is also trying to lock it. For x86 at least,
>> I can sketch an implementation that is based on the new ticket-based
>> spinlocks:
>> - use the ticket-based lock for writers
>> - add a reader count for readers (16 bits out of 32 are free)
>> - readers get in if the write lock is idle
>> - if the write lock wasn't idle, readers loop until it is
>> - on obtaining the ticket-based lock, the writer waits for active
>> readers to leave (eg reader count to go to zero)
>> Contention by writers now keeps readers out, which may be a problem.
>>
>> Advantages of such an approach:
>> - Writers are "fair" amongst themselves.
>> - Readers cannot hold off writers.
>>
>> Disadvantages:
>> - Contention among writers can hold off readers indefinitely.
>> This scheme only works if the write lock is not "permanently"
>> contended, otherwise it goes from frying pan to fire.
>> - Requires changes to the asm-coded locking primitives, which means
>> it is likely to be done for x86 and ia64 only, unless the platform
>> maintainers step in.
>>
>> Given the above, is this worth pursuing for inclusion in mainline?
> Keep in mind that rwlock_t must support recusive read locks. Much code
> depends on this.
Ouch. Given the possibility of something like this:
    thread A                      thread B
    read_lock
                                  write_lock  -- attempt, will spin
    read_lock -- recursive
the waiting writer must not act as a barrier to the recursive read_lock,
yet must be a barrier to any new read_lock. That is not possible without
keeping track of read_lock owners, which would greatly complicate the
locks.
IMHO recursive locking is moderately evil at the best of times, but if
it's there, it's there. (And if it's necessary, it has to be allowed.)
>> A third solution would be to look at this problem on a case-by-case
>> basis. In the case under discussion, there may be other kernel bugs
>> that aggravate the problem (it is an old kernel after all) and maybe
>> this just means the address_space.tree_lock ought to be replaced with
>> something else (though I wouldn't claim to know what).
> That has been done by Nick Piggin, the lockless pagecache work removes
> the read side of the tree_lock.
Now that is good news.
--
Olaf Weber SGI Phone: +31(0)30-6696752
Veldzigt 2b Fax: +31(0)30-6696799
Technical Lead 3454 PW de Meern Vnet: 955-7151
Storage Software The Netherlands Email: olaf@sgi.com
* Re: Spinlocks: Factor out GENERIC_LOCKBREAK in order to avoid spin with irqs disabled
2008-05-07 18:49 ` Spinlocks: Factor out GENERIC_LOCKBREAK in order to avoid spin with irqs disabled Christoph Lameter
2008-05-09 10:26 ` Ingo Molnar
@ 2008-06-23 17:19 ` Petr Tesarik
2008-06-23 17:54 ` Peter Zijlstra
2008-06-23 18:20 ` Christoph Lameter
1 sibling, 2 replies; 35+ messages in thread
From: Petr Tesarik @ 2008-06-23 17:19 UTC (permalink / raw)
To: Christoph Lameter; +Cc: Ingo Molnar, linux-kernel, Peter Zijlstra
On Wed, 2008-05-07 at 11:49 -0700, Christoph Lameter wrote:
> Subject: Spinlocks: Factor out GENERIC_LOCKBREAK in order to avoid spin with irqs disabled
>
> The nice spinlock functions that enable interrupts while spinning are only
> usable if GENERIC_LOCKBREAK is set (and we do not debug locks but lock
> debugging would not be used for a production configuration).
>
> Factor out the dependencies on the ->break_lock field from the nice functions
> into a set of BREAKLOCK macros (cannot be functions since the type of the lock
> variable varies).
>
> The nice spinlock function can then also be used if !GENERIC_LOCKBREAK
>
> Signed-off-by: Christoph Lameter <clameter@sgi.com>
>
> ---
> kernel/spinlock.c | 40 ++++++++++++++++++++++++++++++----------
> 1 file changed, 30 insertions(+), 10 deletions(-)
>
> Index: linux-2.6/kernel/spinlock.c
> ===================================================================
> --- linux-2.6.orig/kernel/spinlock.c 2008-05-07 11:19:31.000000000 -0700
> +++ linux-2.6/kernel/spinlock.c 2008-05-07 11:40:56.000000000 -0700
> @@ -65,7 +65,7 @@ EXPORT_SYMBOL(_write_trylock);
> * even on CONFIG_PREEMPT, because lockdep assumes that interrupts are
> * not re-enabled during lock-acquire (which the preempt-spin-ops do):
> */
> -#if !defined(CONFIG_GENERIC_LOCKBREAK) || defined(CONFIG_DEBUG_LOCK_ALLOC)
> +#ifdef CONFIG_DEBUG_LOCK_ALLOC
Maybe I'm just blind, but doesn't this change effectively disable any
arch-specific optimized code for _raw_*_lock?
If CONFIG_DEBUG_LOCK_ALLOC is set, then CONFIG_DEBUG_SPINLOCK must also
be set, so in that case the debugging versions of _raw_*_lock are used.
But if CONFIG_DEBUG_LOCK_ALLOC is _not_ set, then the locks are built
with _trylock and _can_lock primitives.
What am I missing here?
Petr Tesarik
* Re: Spinlocks: Factor out GENERIC_LOCKBREAK in order to avoid spin with irqs disabled
2008-06-23 17:19 ` Petr Tesarik
@ 2008-06-23 17:54 ` Peter Zijlstra
2008-06-23 18:20 ` Christoph Lameter
1 sibling, 0 replies; 35+ messages in thread
From: Peter Zijlstra @ 2008-06-23 17:54 UTC (permalink / raw)
To: Petr Tesarik; +Cc: Christoph Lameter, Ingo Molnar, linux-kernel
On Mon, 2008-06-23 at 19:19 +0200, Petr Tesarik wrote:
> On Wed, 2008-05-07 at 11:49 -0700, Christoph Lameter wrote:
> > Subject: Spinlocks: Factor our GENERIC_LOCKBREAK in order to avoid spin with irqs disabled
> >
> > The nice spinlock functions that enable interrupts while spinning are only
> > usable if GENERIC_LOCKBREAK is set (and we do not debug locks but lock
> > debugging would not be used for a production configuration).
> >
> > Factor out the dependencies on the ->lock_break field from the nice functions
> > into a set of BREAKLOCK macros (cannot be functions since the type of the lock
> > variable varies).
> >
> > The nice spinlock function can then also be used if !GENERIC_LOCKBREAK
> >
> > Signed-off-by: Christoph Lameter <clameter@sgi.com>
> >
> > ---
> > kernel/spinlock.c | 40 ++++++++++++++++++++++++++++++----------
> > 1 file changed, 30 insertions(+), 10 deletions(-)
> >
> > Index: linux-2.6/kernel/spinlock.c
> > ===================================================================
> > --- linux-2.6.orig/kernel/spinlock.c 2008-05-07 11:19:31.000000000 -0700
> > +++ linux-2.6/kernel/spinlock.c 2008-05-07 11:40:56.000000000 -0700
> > @@ -65,7 +65,7 @@ EXPORT_SYMBOL(_write_trylock);
> > * even on CONFIG_PREEMPT, because lockdep assumes that interrupts are
> > * not re-enabled during lock-acquire (which the preempt-spin-ops do):
> > */
> > -#if !defined(CONFIG_GENERIC_LOCKBREAK) || defined(CONFIG_DEBUG_LOCK_ALLOC)
> > +#ifdef CONFIG_DEBUG_LOCK_ALLOC
>
> Maybe I'm just blind, but doesn't this change effectively disable any
> arch-specific optimized code for _raw_*_lock?
>
> If CONFIG_DEBUG_LOCK_ALLOC is set, then CONFIG_DEBUG_SPINLOCK must also
> be set, so in that case the debugging versions of _raw_*_lock are used.
> But if CONFIG_DEBUG_LOCK_ALLOC is _not_ set, then the locks are built
> with _trylock and _can_lock primitives.
>
> What am I missing here?
No, I think you're right.
I've been playing with these patches:
http://programming.kicks-ass.net/kernel-patches/spinlocks/
But as it stands it breaks !CONFIG_PREEMPT... ought to find some other
cycles to spend on it..
* Re: Spinlocks: Factor our GENERIC_LOCKBREAK in order to avoid spin with irqs disable
2008-06-23 17:19 ` Petr Tesarik
2008-06-23 17:54 ` Peter Zijlstra
@ 2008-06-23 18:20 ` Christoph Lameter
2008-06-23 20:39 ` Peter Zijlstra
1 sibling, 1 reply; 35+ messages in thread
From: Christoph Lameter @ 2008-06-23 18:20 UTC (permalink / raw)
To: Petr Tesarik; +Cc: Ingo Molnar, linux-kernel, Peter Zijlstra
On Mon, 23 Jun 2008, Petr Tesarik wrote:
> Maybe I'm just blind, but doesn't this change effectively disable any
> arch-specific optimized code for _raw_*_lock?
True. Only the __raw_xxx_trylock is still used.
> If CONFIG_DEBUG_LOCK_ALLOC is set, then CONFIG_DEBUG_SPINLOCK must also
> be set, so in that case the debugging versions of _raw_*_lock are used.
> But if CONFIG_DEBUG_LOCK_ALLOC is _not_ set, then the locks are built
> with _trylock and _can_lock primitives.
>
> What am I missing here?
It is good that the locks are built with _trylock and _can_lock because
then we can reenable interrupts while spinning.
* Re: Spinlocks: Factor our GENERIC_LOCKBREAK in order to avoid spin with irqs disable
2008-06-23 18:20 ` Christoph Lameter
@ 2008-06-23 20:39 ` Peter Zijlstra
2008-06-23 20:45 ` Christoph Lameter
0 siblings, 1 reply; 35+ messages in thread
From: Peter Zijlstra @ 2008-06-23 20:39 UTC (permalink / raw)
To: Christoph Lameter; +Cc: Petr Tesarik, Ingo Molnar, linux-kernel
On Mon, 2008-06-23 at 11:20 -0700, Christoph Lameter wrote:
> On Mon, 23 Jun 2008, Petr Tesarik wrote:
>
> > Maybe I'm just blind, but doesn't this change effectively disable any
> > arch-specific optimized code for _raw_*_lock?
>
> True. Only the __raw_xxx_trylock is still used.
>
> > If CONFIG_DEBUG_LOCK_ALLOC is set, then CONFIG_DEBUG_SPINLOCK must also
> > be set, so in that case the debugging versions of _raw_*_lock are used.
> > But if CONFIG_DEBUG_LOCK_ALLOC is _not_ set, then the locks are built
> > with _trylock and _can_lock primitives.
> >
> > What am I missing here?
>
> It is good that the locks are build with _trylock and _can_lock because
> then we can reenable interrupts while spinning.
Well, good and bad; the flip side is that fairness schemes like ticket
locks are utterly defeated.
* Re: Spinlocks: Factor our GENERIC_LOCKBREAK in order to avoid spin with irqs disable
2008-06-23 20:39 ` Peter Zijlstra
@ 2008-06-23 20:45 ` Christoph Lameter
2008-06-23 20:58 ` Peter Zijlstra
0 siblings, 1 reply; 35+ messages in thread
From: Christoph Lameter @ 2008-06-23 20:45 UTC (permalink / raw)
To: Peter Zijlstra; +Cc: Petr Tesarik, Ingo Molnar, linux-kernel
On Mon, 23 Jun 2008, Peter Zijlstra wrote:
> > It is good that the locks are build with _trylock and _can_lock because
> > then we can reenable interrupts while spinning.
>
> Well, good and bad, the turn side is that fairness schemes like ticket
> locks are utterly defeated.
True. But maybe we can make these fairness schemes more generic so that
they can go into core code?
* Re: Spinlocks: Factor our GENERIC_LOCKBREAK in order to avoid spin with irqs disable
2008-06-23 20:45 ` Christoph Lameter
@ 2008-06-23 20:58 ` Peter Zijlstra
2008-06-26 2:51 ` Jeremy Fitzhardinge
0 siblings, 1 reply; 35+ messages in thread
From: Peter Zijlstra @ 2008-06-23 20:58 UTC (permalink / raw)
To: Christoph Lameter; +Cc: Petr Tesarik, Ingo Molnar, linux-kernel
On Mon, 2008-06-23 at 13:45 -0700, Christoph Lameter wrote:
> On Mon, 23 Jun 2008, Peter Zijlstra wrote:
>
> > > It is good that the locks are build with _trylock and _can_lock because
> > > then we can reenable interrupts while spinning.
> >
> > Well, good and bad, the turn side is that fairness schemes like ticket
> > locks are utterly defeated.
>
> True. But maybe we can make these fairness schemes more generic so that
> they can go into core code?
The trouble with ticket locks is that they can't handle waiters going
away - or in this case getting preempted by irq handlers. The one who
took the ticket must pass it on, so if you're preempted it just sits
there being idle, until you get back to deal with the lock.
But yeah, perhaps another fairness scheme might work in the generic
code..
* Re: Spinlocks: Factor our GENERIC_LOCKBREAK in order to avoid spin with irqs disable
2008-06-23 20:58 ` Peter Zijlstra
@ 2008-06-26 2:51 ` Jeremy Fitzhardinge
2008-06-26 6:51 ` Peter Zijlstra
` (3 more replies)
0 siblings, 4 replies; 35+ messages in thread
From: Jeremy Fitzhardinge @ 2008-06-26 2:51 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Christoph Lameter, Petr Tesarik, Ingo Molnar, linux-kernel,
Nick Piggin
Peter Zijlstra wrote:
> On Mon, 2008-06-23 at 13:45 -0700, Christoph Lameter wrote:
>
>> On Mon, 23 Jun 2008, Peter Zijlstra wrote:
>>
>>
>>>> It is good that the locks are build with _trylock and _can_lock because
>>>> then we can reenable interrupts while spinning.
>>>>
>>> Well, good and bad, the turn side is that fairness schemes like ticket
>>> locks are utterly defeated.
>>>
>> True. But maybe we can make these fairness schemes more generic so that
>> they can go into core code?
>>
>
> The trouble with ticket locks is that they can't handle waiters going
> away - or in this case getting preempted by irq handlers. The one who
> took the ticket must pass it on, so if you're preempted it just sits
> there being idle, until you get back to deal with the lock.
>
> But yeah, perhaps another fairness scheme might work in the generic
> code..
Thomas Friebel presented results at the Xen Summit this week showing
that ticket locks are an absolute disaster for scalability in a virtual
environment, for a similar reason. It's a bit irritating if the lock
holder vcpu gets preempted by the hypervisor, but it's much worse when
they release the lock: unless the vcpu scheduler gives a cpu to the vcpu
with the next ticket, it can waste up to N timeslices spinning.
I'm experimenting with adding a pvops hook to allow you to put in new
spinlock implementations on the fly. If nothing else, it will be useful
for experimenting with different algorithms. But it definitely seems
like the old unfair lock algorithm played much better with a virtual
environment, because the next cpu to get the lock is the next one the
scheduler gives time, rather than dictating an order - and the scheduler
should mitigate the unfairness that ticket locks were designed to solve.
J
* Re: Spinlocks: Factor our GENERIC_LOCKBREAK in order to avoid spin with irqs disable
2008-06-26 2:51 ` Jeremy Fitzhardinge
@ 2008-06-26 6:51 ` Peter Zijlstra
2008-06-26 15:49 ` Jeremy Fitzhardinge
2008-06-26 9:17 ` Petr Tesarik
` (2 subsequent siblings)
3 siblings, 1 reply; 35+ messages in thread
From: Peter Zijlstra @ 2008-06-26 6:51 UTC (permalink / raw)
To: Jeremy Fitzhardinge
Cc: Peter Zijlstra, Christoph Lameter, Petr Tesarik, Ingo Molnar,
linux-kernel, Nick Piggin
On Wed, 2008-06-25 at 19:51 -0700, Jeremy Fitzhardinge wrote:
> Peter Zijlstra wrote:
> > On Mon, 2008-06-23 at 13:45 -0700, Christoph Lameter wrote:
> >
> >> On Mon, 23 Jun 2008, Peter Zijlstra wrote:
> >>
> >>
> >>>> It is good that the locks are build with _trylock and _can_lock because
> >>>> then we can reenable interrupts while spinning.
> >>>>
> >>> Well, good and bad, the turn side is that fairness schemes like ticket
> >>> locks are utterly defeated.
> >>>
> >> True. But maybe we can make these fairness schemes more generic so that
> >> they can go into core code?
> >>
> >
> > The trouble with ticket locks is that they can't handle waiters going
> > away - or in this case getting preempted by irq handlers. The one who
> > took the ticket must pass it on, so if you're preempted it just sits
> > there being idle, until you get back to deal with the lock.
> >
> > But yeah, perhaps another fairness scheme might work in the generic
> > code..
>
> Thomas Friebel presented results at the Xen Summit this week showing
> that ticket locks are an absolute disaster for scalability in a virtual
> environment, for a similar reason. It's a bit irritating if the lock
> holder vcpu gets preempted by the hypervisor, but its much worse when
> they release the lock: unless the vcpu scheduler gives a cpu to the vcpu
> with the next ticket, it can waste up to N timeslices spinning.
>
> I'm experimenting with adding pvops hook to allow you to put in new
> spinlock implementations on the fly. If nothing else, it will be useful
> for experimenting with different algorithms. But it definitely seems
> like the old unfair lock algorithm played much better with a virtual
> environment, because the next cpu to get the lock is the next one the
> scheduler gives time, rather than dictating an order - and the scheduler
> should mitigate the unfairness that ticket locks were designed to solve.
Paravirt spinlocks sounds like a good idea anyway, that way you can make
them scheduling locks (from the host's POV) when the lock owner (vcpu)
isn't running.
Burning time spinning on !running vcpus seems like a waste to me.
As for the scheduler solving the unfairness that ticket locks solve,
that cannot be done. The ticket lock solves inter-cpu fairness for a
resource other than time. The cpu scheduler only cares about fairness in
time, and its inter-cpu fairness is on a larger scale than most spinlock
hold times - so even if time and the locked resource would overlap it
wouldn't work.
The simple scenario is running N tasks on N cpus that all pound the same
lock; cache issues will make it unlikely the lock would migrate away
from whatever cpu it's on, essentially starving all the other N-1 cpus.
Ticket locks solve that exact issue; all the scheduler can do is ensure
they're all spending an equal amount of time on the cpu, and whether that
time is spent spinning for lock acquisition or getting actual work done
is beyond its scope.
* Re: Spinlocks: Factor our GENERIC_LOCKBREAK in order to avoid spin with irqs disable
2008-06-26 2:51 ` Jeremy Fitzhardinge
2008-06-26 6:51 ` Peter Zijlstra
@ 2008-06-26 9:17 ` Petr Tesarik
2008-06-26 17:02 ` Christoph Lameter
2008-07-07 11:50 ` Nick Piggin
2008-07-07 19:46 ` Rik van Riel
3 siblings, 1 reply; 35+ messages in thread
From: Petr Tesarik @ 2008-06-26 9:17 UTC (permalink / raw)
To: Jeremy Fitzhardinge
Cc: Peter Zijlstra, Christoph Lameter, Ingo Molnar, linux-kernel,
Nick Piggin
On Wed, 2008-06-25 at 19:51 -0700, Jeremy Fitzhardinge wrote:
> [...]
> I'm experimenting with adding pvops hook to allow you to put in new
> spinlock implementations on the fly. If nothing else, it will be useful
> for experimenting with different algorithms. But it definitely seems
> like the old unfair lock algorithm played much better with a virtual
> environment, because the next cpu to get the lock is the next one the
> scheduler gives time, rather than dictating an order - and the scheduler
> should mitigate the unfairness that ticket locks were designed to solve.
We really should paravirtualize spin locks, because there's always
something better to do than just burn time spinning. But in a
non-virtualized environment, tickets (or a similar scheme) should be
preserved.
We should probably re-think the whole locking scheme, because spinlocks
were designed to be held for a short period of time. This was a fair
assumption when they were introduced, but obviously it is now false in
many cases (such as virtualization).
Ticket-based spinlocks have actually already changed the original
design, so why not implement a generic "lock scheduler" on top of
spinlock_t and rwlock_t?
Petr Tesarik
* Re: Spinlocks: Factor our GENERIC_LOCKBREAK in order to avoid spin with irqs disable
2008-06-26 6:51 ` Peter Zijlstra
@ 2008-06-26 15:49 ` Jeremy Fitzhardinge
0 siblings, 0 replies; 35+ messages in thread
From: Jeremy Fitzhardinge @ 2008-06-26 15:49 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Peter Zijlstra, Christoph Lameter, Petr Tesarik, Ingo Molnar,
linux-kernel, Nick Piggin
Peter Zijlstra wrote:
> Paravirt spinlocks sounds like a good idea anyway, that way you can make
> them scheduling locks (from the host's POV) when the lock owner (vcpu)
> isn't running.
>
> Burning time spinning on !running vcpus seems like a waste to me.
>
In theory. But in practice Linux locks are so low-contention that not
much time seems to get wasted. I've been doing experiments with
spin-a-while-then-block locks, but they never got to the -then-block
part in my test. Burning cycles spinning only gets expensive if the
lock-holder vcpu gets preempted and there are other cpus spinning on that
lock; but if locks are held only briefly, then there's little chance of
being preempted while holding the lock.
At least that's at the scale I've been testing, with only two cores. I
expect things look different with 8 or 16 cores and similarly scaled guests.
> As for the scheduler solving the unfairness that ticket locks solve,
>
No, I never said the scheduler would solve the problem, merely mitigate it.
> that cannot be done. The ticket lock solves intra-cpu fairness for a
> resource other than time. The cpu scheduler only cares about fairness in
> time, and its intra-cpu fairness is on a larger scale than most spinlock
> hold times - so even if time and the locked resource would overlap it
> wouldn't work.
>
> The simple scenario is running N tasks on N cpus that all pound the same
> lock, cache issues will make it unlikely the lock would migrate away
> from whatever cpu its on, essentially starving all the other N-1 cpus.
>
Yep. But in practice, the scheduler will steal the real cpu from under
the vcpu dominating the lock and upset the pathological pattern. I'm
not saying it's ideal, but the starvation case that ticket locks solve is
pretty rare in the grand scheme of things.
Also, ticket locks don't help either if the lock is always
transitioning between locked->unlocked->locked on all cpus. They only
help in the case of one cpu doing rapid lock->unlock transitions while
others wait on the lock.
> Ticket locks solve that exact issue, all the scheduler can do is ensure
> they're all spending an equal amount of time on the cpu, whether that is
> spinning for lock acquisition or getting actual work done is beyond its
> scope.
>
Yes. But the problem with ticket locks is that they dictate a
scheduling order, and if you fail to schedule in that order vast amounts
of time are wasted. You can get into this state:
1. vcpu A takes a lock
2. vcpu A is preempted, effectively making a 5us lock be held for 30ms
3. vcpus E,D,C,B try to take the lock in that order
4. they all spin, wasting time. bad, but no worse than the old lock
algorithm
5. vcpu A eventually runs again and releases the lock
6. vcpu B runs, spinning until preempted
7. vcpu C runs, spinning until preempted
8. vcpu D runs, spinning until preempted
9. vcpu E runs, and takes the lock and releases it
10. (repeat spinning on B,C,D until D gets the lock)
11. (repeat spinning on B,C until C gets the lock)
12. B finally gets the lock
Steps 6-12 are all caused by ticket locks, and the situation is
exacerbated by vcpus F-Z trying to get the lock in the meantime while
it's all tangled up handing out tickets in the right order.
The problem is that the old lock-byte locks made no fairness guarantees,
and interacted badly with the hardware causing severe starvation in some
cases. Ticket locks are too fair, and absolutely dictate the order in
which the lock is taken. Really, all that's needed is the weaker
assertion that "when I release the lock, any current spinner should get
the lock".
J
* Re: Spinlocks: Factor our GENERIC_LOCKBREAK in order to avoid spin with irqs disable
2008-06-26 9:17 ` Petr Tesarik
@ 2008-06-26 17:02 ` Christoph Lameter
2008-06-26 17:48 ` Peter Zijlstra
0 siblings, 1 reply; 35+ messages in thread
From: Christoph Lameter @ 2008-06-26 17:02 UTC (permalink / raw)
To: Petr Tesarik
Cc: Jeremy Fitzhardinge, Peter Zijlstra, Ingo Molnar, linux-kernel,
Nick Piggin
On Thu, 26 Jun 2008, Petr Tesarik wrote:
> We should probably re-think the whole locking scheme, because spinlocks
> were designed to be held for a short period of time. This was a fair
> assumption when they were introduced, but obviously it is now false in
> many cases (such as virtualization).
And NUMA.
> Ticket-based spinlocks have actually already changed the original
> design, so why not implement a generic "lock scheduler" on top of
> spinlock_t and rwlock_t?
How about making the semaphore / mutex code as fast as spinlocks and just
use that?
* Re: Spinlocks: Factor our GENERIC_LOCKBREAK in order to avoid spin with irqs disable
2008-06-26 17:02 ` Christoph Lameter
@ 2008-06-26 17:48 ` Peter Zijlstra
0 siblings, 0 replies; 35+ messages in thread
From: Peter Zijlstra @ 2008-06-26 17:48 UTC (permalink / raw)
To: Christoph Lameter
Cc: Petr Tesarik, Jeremy Fitzhardinge, Peter Zijlstra, Ingo Molnar,
linux-kernel, Nick Piggin
On Thu, 2008-06-26 at 10:02 -0700, Christoph Lameter wrote:
> On Thu, 26 Jun 2008, Petr Tesarik wrote:
>
> > We should probably re-think the whole locking scheme, because spinlocks
> > were designed to be held for a short period of time. This was a fair
> > assumption when they were introduced, but obviously it is now false in
> > many cases (such as virtualization).
>
> And NUMA.
>
> > Ticket-based spinlocks have actually already changed the original
> > design, so why not implement a generic "lock scheduler" on top of
> > spinlock_t and rwlock_t?
>
> How about making the semaphore / mutex code as fast as spinlocks and just
> use that?
Hehe, please have a look at a recent -rt tree and possibly test some of
that if you're interested.
It has adaptive spins for rt_mutex. Which is to say, as long as the lock
owner is current on its cpu, we'll spin trying to acquire the lock. The
reasoning here is that as long as the owner is running, there is a fair
chance it'll release the lock soonish.
This is another case where semaphores can never match, due to their lack
of ownership.
* Re: Spinlocks: Factor our GENERIC_LOCKBREAK in order to avoid spin with irqs disable
2008-06-26 2:51 ` Jeremy Fitzhardinge
2008-06-26 6:51 ` Peter Zijlstra
2008-06-26 9:17 ` Petr Tesarik
@ 2008-07-07 11:50 ` Nick Piggin
2008-07-07 11:52 ` Nick Piggin
2008-07-07 15:53 ` Jeremy Fitzhardinge
2008-07-07 19:46 ` Rik van Riel
3 siblings, 2 replies; 35+ messages in thread
From: Nick Piggin @ 2008-07-07 11:50 UTC (permalink / raw)
To: Jeremy Fitzhardinge
Cc: Peter Zijlstra, Christoph Lameter, Petr Tesarik, Ingo Molnar,
linux-kernel
On Thursday 26 June 2008 12:51, Jeremy Fitzhardinge wrote:
> Peter Zijlstra wrote:
> > On Mon, 2008-06-23 at 13:45 -0700, Christoph Lameter wrote:
> >> On Mon, 23 Jun 2008, Peter Zijlstra wrote:
> >>>> It is good that the locks are build with _trylock and _can_lock
> >>>> because then we can reenable interrupts while spinning.
> >>>
> >>> Well, good and bad, the turn side is that fairness schemes like ticket
> >>> locks are utterly defeated.
> >>
> >> True. But maybe we can make these fairness schemes more generic so that
> >> they can go into core code?
> >
> > The trouble with ticket locks is that they can't handle waiters going
> > away - or in this case getting preempted by irq handlers. The one who
> > took the ticket must pass it on, so if you're preempted it just sits
> > there being idle, until you get back to deal with the lock.
> >
> > But yeah, perhaps another fairness scheme might work in the generic
> > code..
>
> Thomas Friebel presented results at the Xen Summit this week showing
> that ticket locks are an absolute disaster for scalability in a virtual
> environment, for a similar reason. It's a bit irritating if the lock
> holder vcpu gets preempted by the hypervisor, but its much worse when
> they release the lock: unless the vcpu scheduler gives a cpu to the vcpu
> with the next ticket, it can waste up to N timeslices spinning.
I didn't realise it is good practice to run multiple "virtual CPUs"
of the same guest on a single physical CPU on the host...
> I'm experimenting with adding pvops hook to allow you to put in new
> spinlock implementations on the fly. If nothing else, it will be useful
> for experimenting with different algorithms. But it definitely seems
> like the old unfair lock algorithm played much better with a virtual
> environment, because the next cpu to get the lock is the next one the
> scheduler gives time, rather than dictating an order - and the scheduler
> should mitigate the unfairness that ticket locks were designed to solve.
... if it is good practice, then, virtualizing spinlocks I guess is
reasonable. If not, then "don't do that". Considering that probably
many bare metal systems will run pv kernels, every little cost adds
up.
* Re: Spinlocks: Factor our GENERIC_LOCKBREAK in order to avoid spin with irqs disable
2008-07-07 11:50 ` Nick Piggin
@ 2008-07-07 11:52 ` Nick Piggin
2008-07-07 15:56 ` Jeremy Fitzhardinge
2008-07-07 15:53 ` Jeremy Fitzhardinge
1 sibling, 1 reply; 35+ messages in thread
From: Nick Piggin @ 2008-07-07 11:52 UTC (permalink / raw)
To: Jeremy Fitzhardinge
Cc: Peter Zijlstra, Christoph Lameter, Petr Tesarik, Ingo Molnar,
linux-kernel
On Monday 07 July 2008 21:50, Nick Piggin wrote:
> On Thursday 26 June 2008 12:51, Jeremy Fitzhardinge wrote:
> > Thomas Friebel presented results at the Xen Summit this week showing
> > that ticket locks are an absolute disaster for scalability in a virtual
> > environment, for a similar reason. It's a bit irritating if the lock
> > holder vcpu gets preempted by the hypervisor, but its much worse when
> > they release the lock: unless the vcpu scheduler gives a cpu to the vcpu
> > with the next ticket, it can waste up to N timeslices spinning.
>
> I didn't realise it is good practice to run multiple "virtual CPUs"
> of the same guest on a single physical CPU on the host...
>
> > I'm experimenting with adding pvops hook to allow you to put in new
> > spinlock implementations on the fly. If nothing else, it will be useful
> > for experimenting with different algorithms. But it definitely seems
> > like the old unfair lock algorithm played much better with a virtual
> > environment, because the next cpu to get the lock is the next one the
> > scheduler gives time, rather than dictating an order - and the scheduler
> > should mitigate the unfairness that ticket locks were designed to solve.
>
> ... if it is good practice, then, virtualizing spinlocks I guess is
> reasonable. If not, then "don't do that". Considering that probably
> many bare metal systems will run pv kernels, every little cost adds
> up.
Although, you wouldn't need to oversubscribe physical CPUs to hit
suboptimal behaviour.
Basically, I just ask for the performance improvement to be measured
with some "realistic" configuration; then it should be easier to
justify.
* Re: Spinlocks: Factor our GENERIC_LOCKBREAK in order to avoid spin with irqs disable
2008-07-07 11:50 ` Nick Piggin
2008-07-07 11:52 ` Nick Piggin
@ 2008-07-07 15:53 ` Jeremy Fitzhardinge
1 sibling, 0 replies; 35+ messages in thread
From: Jeremy Fitzhardinge @ 2008-07-07 15:53 UTC (permalink / raw)
To: Nick Piggin
Cc: Peter Zijlstra, Christoph Lameter, Petr Tesarik, Ingo Molnar,
linux-kernel
Nick Piggin wrote:
> On Thursday 26 June 2008 12:51, Jeremy Fitzhardinge wrote:
>
>> Peter Zijlstra wrote:
>>
>>> On Mon, 2008-06-23 at 13:45 -0700, Christoph Lameter wrote:
>>>
>>>> On Mon, 23 Jun 2008, Peter Zijlstra wrote:
>>>>
>>>>>> It is good that the locks are build with _trylock and _can_lock
>>>>>> because then we can reenable interrupts while spinning.
>>>>>>
>>>>> Well, good and bad, the turn side is that fairness schemes like ticket
>>>>> locks are utterly defeated.
>>>>>
>>>> True. But maybe we can make these fairness schemes more generic so that
>>>> they can go into core code?
>>>>
>>> The trouble with ticket locks is that they can't handle waiters going
>>> away - or in this case getting preempted by irq handlers. The one who
>>> took the ticket must pass it on, so if you're preempted it just sits
>>> there being idle, until you get back to deal with the lock.
>>>
>>> But yeah, perhaps another fairness scheme might work in the generic
>>> code..
>>>
>> Thomas Friebel presented results at the Xen Summit this week showing
>> that ticket locks are an absolute disaster for scalability in a virtual
>> environment, for a similar reason. It's a bit irritating if the lock
>> holder vcpu gets preempted by the hypervisor, but its much worse when
>> they release the lock: unless the vcpu scheduler gives a cpu to the vcpu
>> with the next ticket, it can waste up to N timeslices spinning.
>>
>
> I didn't realise it is good practice to run multiple "virtual CPUs"
> of the same guest on a single physical CPU on the host...
>
It isn't. It makes no sense at all to give a guest more vcpus than
physical cpus, so that kind of contention won't happen in general. But
the bad locking scenario happens when there's any system-wide
contention, so it could happen if some other virtual machine preempts a
vcpu holding a lock. And once a lock ends up being (effectively) held
for 30ms rather than 30us, the likelihood of going into contention goes
way up, and you can enter the catastrophic N^2 unlock->relock state.
My measurements show that reverting to the old lock-byte algorithm
avoids the worst case, and just results in a bit of excessive spinning.
Replacing it with a smarter spin-then-block-vcpu algorithm doesn't
really benefit the specific guest VM very much (kernbench elapsed time
is only slightly improved), but its consumption of physical cpu time can
go down by ~10%.
>> I'm experimenting with adding pvops hook to allow you to put in new
>> spinlock implementations on the fly. If nothing else, it will be useful
>> for experimenting with different algorithms. But it definitely seems
>> like the old unfair lock algorithm played much better with a virtual
>> environment, because the next cpu to get the lock is the next one the
>> scheduler gives time, rather than dictating an order - and the scheduler
>> should mitigate the unfairness that ticket locks were designed to solve.
>>
>
> ... if it is good practice, then, virtualizing spinlocks I guess is
> reasonable. If not, then "don't do that". Considering that probably
> many bare metal systems will run pv kernels, every little cost adds
> up
I'm aware of that. In my current implementation the overhead amounts to
an extra direct call in the lock/unlock path, but that can be eliminated
with a small amount of restructuring (by making spin_lock/unlock()
inline functions, and having the call to raw_spin_lock/unlock within
it). The pvops patching machinery removes any indirect calls or jumps.
J
* Re: Spinlocks: Factor our GENERIC_LOCKBREAK in order to avoid spin with irqs disable
2008-07-07 11:52 ` Nick Piggin
@ 2008-07-07 15:56 ` Jeremy Fitzhardinge
2008-07-08 2:08 ` Nick Piggin
0 siblings, 1 reply; 35+ messages in thread
From: Jeremy Fitzhardinge @ 2008-07-07 15:56 UTC (permalink / raw)
To: Nick Piggin
Cc: Peter Zijlstra, Christoph Lameter, Petr Tesarik, Ingo Molnar,
linux-kernel
Nick Piggin wrote:
> Although, you wouldn't need to oversubscribe physical CPUs to hit
> suboptimal behaviour.
>
> Basically, I just ask for performance improvement to be measured
> with some "realistic" configuration, then it should be easier to
> justify.
Overcommitting cpus system-wide is pretty common. A use-case is
consolidating a pile of underused non-performance-critical servers into
one physical machine with a lot of VMs. It doesn't take much load to
cause a dhcp request to the dhcp server to steal a cpu away from the
mysql server while it holds a lock...
I'll mail out my patches so we can have a more concrete discussion shortly.
J
* Re: Spinlocks: Factor our GENERIC_LOCKBREAK in order to avoid spin with irqs disable
2008-06-26 2:51 ` Jeremy Fitzhardinge
` (2 preceding siblings ...)
2008-07-07 11:50 ` Nick Piggin
@ 2008-07-07 19:46 ` Rik van Riel
2008-07-07 20:14 ` Jeremy Fitzhardinge
3 siblings, 1 reply; 35+ messages in thread
From: Rik van Riel @ 2008-07-07 19:46 UTC (permalink / raw)
To: Jeremy Fitzhardinge
Cc: Peter Zijlstra, Christoph Lameter, Petr Tesarik, Ingo Molnar,
linux-kernel, Nick Piggin
On Wed, 25 Jun 2008 19:51:12 -0700
Jeremy Fitzhardinge <jeremy@goop.org> wrote:
> Thomas Friebel presented results at the Xen Summit this week showing
> that ticket locks are an absolute disaster for scalability in a virtual
> environment, for a similar reason. It's a bit irritating if the lock
> holder vcpu gets preempted by the hypervisor, but its much worse when
> they release the lock: unless the vcpu scheduler gives a cpu to the vcpu
> with the next ticket, it can waste up to N timeslices spinning.
>
> I'm experimenting with adding pvops hook to allow you to put in new
> spinlock implementations on the fly.
Alternatively, the guest could tell the host which vcpus
are next in line for a ticket spinlock, or a vcpu that gets
scheduled but is not supposed to grab the lock yet can give
some CPU time to the vcpu that should get the lock next.
I believe the IBM PPC64 people have done some work to implement
just that.
--
All Rights Reversed
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: Spinlocks: Factor our GENERIC_LOCKBREAK in order to avoid spin with irqs disable
2008-07-07 19:46 ` Rik van Riel
@ 2008-07-07 20:14 ` Jeremy Fitzhardinge
2008-07-08 2:07 ` Nick Piggin
0 siblings, 1 reply; 35+ messages in thread
From: Jeremy Fitzhardinge @ 2008-07-07 20:14 UTC (permalink / raw)
To: Rik van Riel
Cc: Peter Zijlstra, Christoph Lameter, Petr Tesarik, Ingo Molnar,
linux-kernel, Nick Piggin
Rik van Riel wrote:
> Alternatively, the guest could tell the host which vcpus
> are next in line for a ticket spinlock, or a vcpu that gets
> scheduled but is not supposed to grab the lock yet can give
> some CPU time to the vcpu that should get the lock next.
>
Those are possible, but would either 1) require hypervisor changes,
and/or 2) require changes no less extensive than the ones I had to make anyway.
Thomas's proposal was to modify the scheduler to try to avoid
preempting vcpus while they're in kernel mode. That's nice because it
requires no guest changes, and seems at least somewhat successful at
mitigating the problem. But it can't completely solve the problem, and
you end up with a bunch of heuristics in the hypervisor to decide who to
preempt.
The other point, of course, is that ticket locks are massive overkill
for the problem they're trying to solve. It's one thing to introduce an
element of fairness into spinlocks, it's another to impose strict FIFO
ordering. It would be enough to make the locks "polite" by preventing a
new lock-holder from taking the lock while it's under contention.
Something like:
union lock {
unsigned short word;
struct { unsigned char lock, count; };
};
spin_lock: # ebx - lock pointer
movw $0x0001, %ax # add 1 to lock, 0 to count
xaddw %ax, (%ebx) # attempt to take lock and test user count
testw %ax,%ax
jnz slow
taken: ret
# slow path
slow: lock incb 1(%ebx) # inc count
1: rep;nop
cmpb $0,(%ebx)
jnz 1b # wait for unlocked
movb $1,%al # attempt to take lock (count already increased)
xchgb %al,(%ebx)
testb %al,%al
jnz 1b
lock decb 1(%ebx) # drop count
jmp taken
spin_unlock:
movb $0,(%ebx)
ret
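In C11 atomics, that algorithm comes out roughly as below (an illustrative sketch, not kernel code: the names are made up, the layout assumes the little-endian byte order of the x86 original, and the fast path uses a compare-exchange instead of the xadd so a failed attempt leaves the lock byte untouched):

```c
#include <stdatomic.h>

/* Sketch of the "polite" byte lock above. Low byte = lock,
 * high byte = waiter count (little-endian, as on x86); a fresh
 * acquirer gets the fast path only if both bytes are zero. */
union polite_lock {
    _Atomic unsigned short word;
    struct {
        _Atomic unsigned char lock;   /* 0 = free, 1 = held */
        _Atomic unsigned char count;  /* spinning waiters */
    };
};

static void polite_lock_acquire(union polite_lock *l)
{
    unsigned short expected = 0;

    /* Fast path: take the lock only if free AND uncontended
     * (whole word == 0 -> set lock byte to 1). */
    if (atomic_compare_exchange_strong(&l->word, &expected, 1))
        return;

    atomic_fetch_add(&l->count, 1);        /* register as a waiter */
    for (;;) {
        while (atomic_load(&l->lock))      /* wait for unlocked */
            ;                              /* cpu_relax() would go here */
        if (atomic_exchange(&l->lock, 1) == 0)
            break;                         /* got it */
    }
    atomic_fetch_sub(&l->count, 1);        /* drop waiter count */
}

static void polite_lock_release(union polite_lock *l)
{
    atomic_store(&l->lock, 0);
}
```

The slow path mirrors the asm: register as a waiter, spin reading the lock byte, then try an exchange; the nonzero waiter count is what keeps fresh arrivals out of the fast path while the lock is contended.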
The uncontended fastpath is similar to the pre-ticket locks, but it
refuses to take the lock if there are other waiters, even if the lock is
not currently held. This prevents the rapid lock-unlock cycle on one
CPU from starving another CPU, which I understand was the original
problem ticket locks were trying to solve.
But it also means that all the contended spinners get the lock in
whatever order the system decides to give it to them, rather than
imposing a strict order.
> I believe the IBM PPC64 people have done some work to implement
> just that.
>
Do you have any references?
J
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: Spinlocks: Factor our GENERIC_LOCKBREAK in order to avoid spin with irqs disable
2008-07-07 20:14 ` Jeremy Fitzhardinge
@ 2008-07-08 2:07 ` Nick Piggin
2008-07-08 5:57 ` Jeremy Fitzhardinge
0 siblings, 1 reply; 35+ messages in thread
From: Nick Piggin @ 2008-07-08 2:07 UTC (permalink / raw)
To: Jeremy Fitzhardinge
Cc: Rik van Riel, Peter Zijlstra, Christoph Lameter, Petr Tesarik,
Ingo Molnar, linux-kernel
On Tuesday 08 July 2008 06:14, Jeremy Fitzhardinge wrote:
> The other point, of course, is that ticket locks are massive overkill
> for the problem they're trying to solve.
No they aren't.
> It's one thing to introduce an
> element of fairness into spinlocks, it's another to impose strict FIFO
> ordering. It would be enough to make the locks "polite" by preventing a
> new lock-holder from taking the lock while it's under contention.
> Something like:
>
> union lock {
> unsigned short word;
> struct { unsigned char lock, count; };
> };
>
> spin_lock: # ebx - lock pointer
> movw $0x0001, %ax # add 1 to lock, 0 to count
> xaddw %ax, (%ebx) # attempt to take lock and test user count
> testw %ax,%ax
> jnz slow
>
> taken: ret
>
> # slow path
> slow: lock incb 1(%ebx) # inc count
>
> 1: rep;nop
> cmpb $0,(%ebx)
> jnz 1b # wait for unlocked
>
> movb $1,%al # attempt to take lock (count already increased)
> xchgb %al,(%ebx)
> testb %al,%al
> jnz 1b
>
> lock decb 1(%ebx) # drop count
> jmp taken
>
> spin_unlock:
> movb $0,(%ebx)
> ret
>
>
> The uncontended fastpath is similar to the pre-ticket locks, but it
> refuses to take the lock if there are other waiters, even if the lock is
> not currently held. This prevents the rapid lock-unlock cycle on one
> CPU from starving another CPU, which I understand was the original
> problem ticket locks were trying to solve.
They prevent lots of unfairness and starvation problems. The most
prominent one (ie. actually observed in Linux) was a single CPU
being totally starved by N others (to the point where lockup timers
would kick in).
As an aside, these locks you propose are also a lot more costly in
the contended path. 4 vs 1 atomic operations on the lock cacheline
is not so great.
> But it also means that all the contended spinners get the lock in
> whatever order the system decides to give it to them, rather than
> imposing a strict order.
The exact problem is that the system actively does the wrong thing
when you allow it to decide.
Unlike simple cacheline access, I don't believe it is such a good
idea to batch up locks many times on the same CPU for example. While
it surely could improve performance in specific situations, I think
that if code is taking and releasing a lock many times, then it most
probably should be either reworked to hold the lock for longer, or
changed completely. And once locks become *really* contended, then
the cost of moving the critical section to another CPU is really
drowned out by the cost of contention itself (all the idle time,
and taking/releasing the lock cacheline).
So far my theory has held up (except for virtualized systems).
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: Spinlocks: Factor our GENERIC_LOCKBREAK in order to avoid spin with irqs disable
2008-07-07 15:56 ` Jeremy Fitzhardinge
@ 2008-07-08 2:08 ` Nick Piggin
0 siblings, 0 replies; 35+ messages in thread
From: Nick Piggin @ 2008-07-08 2:08 UTC (permalink / raw)
To: Jeremy Fitzhardinge
Cc: Peter Zijlstra, Christoph Lameter, Petr Tesarik, Ingo Molnar,
linux-kernel
On Tuesday 08 July 2008 01:56, Jeremy Fitzhardinge wrote:
> Nick Piggin wrote:
> > Although, you wouldn't need to oversubscribe physical CPUs to hit
> > suboptimal behaviour.
> >
> > Basically, I just ask for performance improvement to be measured
> > with some "realistic" configuration, then it should be easier to
> > justify.
>
> Overcommitting cpus system-wide is pretty common.
(yes, sorry I meant: oversubscribing with a single guest)
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: Spinlocks: Factor our GENERIC_LOCKBREAK in order to avoid spin with irqs disable
2008-07-08 2:07 ` Nick Piggin
@ 2008-07-08 5:57 ` Jeremy Fitzhardinge
2008-07-08 8:41 ` Nick Piggin
0 siblings, 1 reply; 35+ messages in thread
From: Jeremy Fitzhardinge @ 2008-07-08 5:57 UTC (permalink / raw)
To: Nick Piggin
Cc: Rik van Riel, Peter Zijlstra, Christoph Lameter, Petr Tesarik,
Ingo Molnar, linux-kernel
Nick Piggin wrote:
> On Tuesday 08 July 2008 06:14, Jeremy Fitzhardinge wrote:
>
>
>> The other point, of course, is that ticket locks are massive overkill
>> for the problem they're trying to solve.
>>
>
> No they aren't.
>
>
>
>> It's one thing to introduce an
>> element of fairness into spinlocks, it's another to impose strict FIFO
>> ordering. It would be enough to make the locks "polite" by preventing a
>> new lock-holder from taking the lock while it's under contention.
>> Something like:
>>
>> union lock {
>> unsigned short word;
>> struct { unsigned char lock, count; };
>> };
>>
>> spin_lock: # ebx - lock pointer
>> movw $0x0001, %ax # add 1 to lock, 0 to count
>> xaddw %ax, (%ebx) # attempt to take lock and test user count
>> testw %ax,%ax
>> jnz slow
>>
>> taken: ret
>>
>> # slow path
>> slow: lock incb 1(%ebx) # inc count
>>
>> 1: rep;nop
>> cmpb $0,(%ebx)
>> jnz 1b # wait for unlocked
>>
>> movb $1,%al # attempt to take lock (count already increased)
>> xchgb %al,(%ebx)
>> testb %al,%al
>> jnz 1b
>>
>> lock decb 1(%ebx) # drop count
>> jmp taken
>>
>> spin_unlock:
>> movb $0,(%ebx)
>> ret
>>
>>
>> The uncontended fastpath is similar to the pre-ticket locks, but it
>> refuses to take the lock if there are other waiters, even if the lock is
>> not currently held. This prevents the rapid lock-unlock cycle on one
>> CPU from starving another CPU, which I understand was the original
>> problem ticket locks were trying to solve.
>>
>
> They prevent lots of unfairness and starvation problems. The most
> prominent one (ie. actually observed in Linux) was a single CPU
> being totally starved by N others (to the point where lockup timers
> would kick in).
>
Yep. My understanding was that the specific case was that cpu A was
repeatedly taking and releasing the lock, while other cpus spin waiting
for it, and that the cache coherence logic kept the cacheline owned by A
(presumably because it kept modifying it). Ticket locks work well in
this case because, as you say, they enforce a fairness policy that the
hardware doesn't implement. Are there other cases that ticket locks
help with? Does the algorithm above solve the starvation issue?
> As an aside, these locks you propose are also a lot more costly in
> the contended path. 4 vs 1 atomic operations on the lock cacheline
> is not so great.
>
Yep, that's not great. But it doesn't bounce cache lines around as much
either, so perhaps it doesn't make much difference.
But really, I was being my own devil's advocate, to see if there's some
other lock algorithm which satisfies both the normal ticket-lock case
and the virtualization case.
I have no real objections to ticket locks, so long as I can turn them off ;)
J
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: Spinlocks: Factor our GENERIC_LOCKBREAK in order to avoid spin with irqs disable
2008-07-08 5:57 ` Jeremy Fitzhardinge
@ 2008-07-08 8:41 ` Nick Piggin
2008-07-08 15:58 ` Jeremy Fitzhardinge
0 siblings, 1 reply; 35+ messages in thread
From: Nick Piggin @ 2008-07-08 8:41 UTC (permalink / raw)
To: Jeremy Fitzhardinge
Cc: Rik van Riel, Peter Zijlstra, Christoph Lameter, Petr Tesarik,
Ingo Molnar, linux-kernel
On Tuesday 08 July 2008 15:57, Jeremy Fitzhardinge wrote:
> Nick Piggin wrote:
> > They prevent lots of unfairness and starvation problems. The most
> > prominent one (ie. actually observed in Linux) was a single CPU
> > being totally starved by N others (to the point where lockup timers
> > would kick in).
>
> Yep. My understanding was that the specific case was that cpu A was
> repeatedly taking and releasing the lock, while other cpus spin waiting
> for it, and that the cache coherence logic kept the cacheline owned by A
> (presumably because it kept modifying it).
In that case, the CPUs I've tested seem to do a reasonable job of
avoiding hogging the lock. And they all obviously make some attempt
at fairness and starvation avoidance for simple cacheline access...
But spinlocks require cacheline access *at exactly the right time*
in order to make progress. So simple cacheline fairness can break
down quite easily.
> Ticket locks work well in
> this case because, as you say, they enforce a fairness policy that the
> hardware doesn't implement. Are there other cases that ticket locks
> help with?
Well just starvation and fairness mainly.
Interestingly, it actually has some better scalability properties under
high contention sometimes (because the lock acquisition slowpath requires
no further stores to the cacheline).
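For comparison, a minimal ticket lock sketched in C11 atomics (illustrative only; the kernel's real implementation is arch-specific asm) makes that property visible: a contended acquirer issues exactly one store to take its ticket, then only reads until its turn comes up:

```c
#include <stdatomic.h>

/* Minimal ticket-lock sketch. Each acquirer atomically grabs the
 * next ticket; the slowpath then only *reads* "owner" until its
 * ticket is being served, so no further stores hit the cacheline. */
struct ticket_lock {
    _Atomic unsigned int next;   /* next ticket to hand out */
    _Atomic unsigned int owner;  /* ticket currently being served */
};

static void ticket_lock_acquire(struct ticket_lock *l)
{
    unsigned int me = atomic_fetch_add(&l->next, 1);  /* the one store */

    while (atomic_load(&l->owner) != me)              /* read-only spin */
        ;                                             /* cpu_relax() here */
}

static void ticket_lock_release(struct ticket_lock *l)
{
    /* Hand the lock to the next ticket, in strict FIFO order. */
    atomic_fetch_add(&l->owner, 1);
}
```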
If a single CPU is allowed to take and release a contended lock
multiple times, I don't consider that in itself as a bad property of
a lock implementation. I just reasoned that (maybe a little
counterintuitively) it isn't such an important thing to have either...
> Does the algorithm above solve the starvation issue?
I don't believe so because there is still no guarantee you get out of
the slowpath (it's fairly similar to old spinlocks in that regard).
> > As an aside, these locks you propose are also a lot more costly in
> > the contended path. 4 vs 1 atomic operations on the lock cacheline
> > is not so great.
>
> Yep, that's not great. But it doesn't bounce cache lines around as much
> either, so perhaps it doesn't make much difference.
>
> But really, I was being my own devil's advocate, to see if there's some
> other lock algorithm which satisfies both the normal ticket-lock case
> and the virtualization case.
Yeah, I'm not sure. I expect it would be hard to match ticket lock's
simplicity and guarantees.
> I have no real objections to ticket locks, so long as I can turn them off
> ;)
Sure. Btw. I have no problems with your patchset, but if SMP guests
are seriously used on x86, just remember the fairness issue. At some
point you might find you'll need an even more advanced lock for Xen.
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: Spinlocks: Factor our GENERIC_LOCKBREAK in order to avoid spin with irqs disable
2008-07-08 8:41 ` Nick Piggin
@ 2008-07-08 15:58 ` Jeremy Fitzhardinge
0 siblings, 0 replies; 35+ messages in thread
From: Jeremy Fitzhardinge @ 2008-07-08 15:58 UTC (permalink / raw)
To: Nick Piggin
Cc: Rik van Riel, Peter Zijlstra, Christoph Lameter, Petr Tesarik,
Ingo Molnar, linux-kernel
Nick Piggin wrote:
> Sure. Btw. I have no problems with your patchset, but if SMP guests
> are seriously used on x86, just remember the fairness issue. At some
> point you might find you'll need an even more advanced lock for Xen.
>
Perhaps. The spinlock spins for a bounded number of iterations before it
goes into the vcpu-blocking path. At that point, it's no longer subject
to the whims of the cache coherency protocol, and is explicitly woken.
But it may still have issues when competing with a rapid lock-unlocker.
A ticket-lock variant with some kind of directed yield might be the way
to go if it turns out to be a problem.
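The spin-then-block shape described here looks roughly like the sketch below (stand-in names throughout: hv_block_self/hv_kick_waiters are placeholder stubs for whatever hypercalls a real pv implementation would use, and SPIN_THRESHOLD is an arbitrary illustrative bound):

```c
#include <stdatomic.h>

#define SPIN_THRESHOLD 1024   /* illustrative bound, not a tuned value */

/* Placeholder stubs: a real pv lock would park this vcpu in the
 * hypervisor and have the unlocker kick the waiters awake. */
static void hv_block_self(void)   { }
static void hv_kick_waiters(void) { }

static _Atomic int pv_lock;   /* 0 = free, 1 = held */

static void pv_spin_lock(void)
{
    for (;;) {
        /* Spin a bounded number of times first... */
        for (int i = 0; i < SPIN_THRESHOLD; i++)
            if (!atomic_load(&pv_lock) && !atomic_exchange(&pv_lock, 1))
                return;
        /* ...then stop burning the timeslice: block in the
         * hypervisor until explicitly woken, instead of being
         * subject to the whims of the cache coherency protocol. */
        hv_block_self();
    }
}

static void pv_spin_unlock(void)
{
    atomic_store(&pv_lock, 0);
    hv_kick_waiters();
}
```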
J
^ permalink raw reply [flat|nested] 35+ messages in thread
end of thread, other threads:[~2008-07-08 15:58 UTC | newest]
Thread overview: 35+ messages
2008-05-06 19:26 Spinlocks waiting with interrupts disabled / preempt disabled Christoph Lameter
2008-05-07 7:30 ` Ingo Molnar
2008-05-07 17:04 ` Christoph Lameter
2008-05-07 17:24 ` Christoph Lameter
2008-05-07 18:49 ` Spinlocks: Factor our GENERIC_LOCKBREAK in order to avoid spin with irqs disable Christoph Lameter
2008-05-09 10:26 ` Ingo Molnar
2008-05-09 16:28 ` Christoph Lameter
2008-06-23 17:19 ` Petr Tesarik
2008-06-23 17:54 ` Peter Zijlstra
2008-06-23 18:20 ` Christoph Lameter
2008-06-23 20:39 ` Peter Zijlstra
2008-06-23 20:45 ` Christoph Lameter
2008-06-23 20:58 ` Peter Zijlstra
2008-06-26 2:51 ` Jeremy Fitzhardinge
2008-06-26 6:51 ` Peter Zijlstra
2008-06-26 15:49 ` Jeremy Fitzhardinge
2008-06-26 9:17 ` Petr Tesarik
2008-06-26 17:02 ` Christoph Lameter
2008-06-26 17:48 ` Peter Zijlstra
2008-07-07 11:50 ` Nick Piggin
2008-07-07 11:52 ` Nick Piggin
2008-07-07 15:56 ` Jeremy Fitzhardinge
2008-07-08 2:08 ` Nick Piggin
2008-07-07 15:53 ` Jeremy Fitzhardinge
2008-07-07 19:46 ` Rik van Riel
2008-07-07 20:14 ` Jeremy Fitzhardinge
2008-07-08 2:07 ` Nick Piggin
2008-07-08 5:57 ` Jeremy Fitzhardinge
2008-07-08 8:41 ` Nick Piggin
2008-07-08 15:58 ` Jeremy Fitzhardinge
2008-05-09 16:35 ` Spinlocks waiting with interrupts disabled / preempt disabled Olaf Weber
2008-05-09 17:56 ` Peter Zijlstra
2008-05-09 18:00 ` Christoph Lameter
2008-05-09 18:06 ` Peter Zijlstra
2008-05-09 20:01 ` Olaf Weber