* Re: [patch] increase spinlock-debug looping timeouts (write_lock and NMI)
       [not found] ` <fa.Zp589GPrIISmAAheRowfRgZ1jgs@ifi.uio.no>
@ 2006-06-20  5:35   ` Dave Olson
  2006-06-20  6:39     ` Andrew Morton
  0 siblings, 1 reply; 23+ messages in thread
From: Dave Olson @ 2006-06-20  5:35 UTC
  To: Andrew Morton; +Cc: Ingo Molnar, ccb, linux-kernel

On Mon, 19 Jun 2006, Andrew Morton wrote:

| On Mon, 19 Jun 2006 13:39:44 +0200
| Ingo Molnar <mingo@elte.hu> wrote:
| 
| > 
| > * Andrew Morton <akpm@osdl.org> wrote:
| > 
| > > > The write_trylock + __delay in the loop is not a problem or a bug, as 
| > > > the trylock will at most _increase_ the delay - and our goal is to not 
| > > > have a false positive, not to be absolutely accurate about the 
| > > > measurement here.
| > > 
| > > Precisely.  We have delays of over a second (but we don't know how 
| > > much more than a second).  Let's say two seconds.  The NMI watchdog 
| > > timeout is, what?  Five seconds?
| > 
| > i dont see the problem.
| 
| It's taking over a second to acquire a write_lock.  A lock which is
| unlikely to be held for more than a microsecond anywhere.  That's really
| bad, isn't it?  Being on the edge of an NMI watchdog induced system crash
| is bad, too.
| 
| > We'll have tried that lock hundreds of thousands 
| > of times before this happens. The NMI watchdog will only trigger if we 
| > do this with IRQs disabled.
| 
| tree_lock uses write_lock_irq().
| 
| > And it's not like the normal 
| > __write_lock_failed codepath would be any different: for heavily 
| > contended workloads the overhead is likely in the cacheline bouncing, 
| > not in the __delay().
| 
| Yes, it might also happen with !CONFIG_DEBUG_SPINLOCK.  We need to find out
| if that's so and if so, why.
| 
| > > That's getting too close.  The result will be a total system crash.  
| > > And RH are shipping this.
| > 
| > I dont see a connection. Pretty much the only thing the loop condition 
| > impacts is the condition under which we print out a 'i think we 
| > deadlocked' message.
| 
| I'm assuming that the additional delay in the debug code has worsened the
| situation.
| 
| > Have i missed your point perhaps?
| 
| I get that impression ;) If it takes 1-2 seconds to get this lock then it
| can take five seconds.  a) that's just gross and b) the NMI watchdog will
| nuke the box.
| 
| Why is it taking so long to get the lock?
| 
| Does it happen in non-debug mode?
| 
| What do we do about it?

It seems possible that this might be the cause of problems we've had
with our InfiniPath hardware/software, and also Mellanox/OpenIB hardware/software
on some quad-socket/dual core opteron systems (8 cpu cores).

We'll see very long delays when 8 MPI processes exit "simultaneously", and sometimes
get NMI, sometimes system hangs, and sometimes just hung up for many seconds (and
often in that state, doing sysrq-P or sysrq-T will make things happy again).

A typical trace looks like this (on an fc4 2.6.16 kernel):

[root@quad-00 ~]# NMI Watchdog detected LOCKUP on CPU 0
CPU 0                                                  
Modules linked in: nfs nfsd exportfs lockd nfs_acl ipv6 autofs4 sunrpc ib_sdp(U)
ib_cm(U) ib_umad(U) ib_uverbs(U) ib_ipoib(U) ib_sa(U) ib_ipath(U) ib_mad(U)
ib_core(U) video button battery ac i2c_nforce2 i2c_core ipath_core(U) e1000
floppy sg dm_snapshot dm_zero dm_mirror ext3 jbd dm_mod sata_nv libata aic79xx
scsi_transport_spi sd_mod scsi_mod
Pid: 4239, comm: mpi_multibw Not tainted 2.6.16-1.2096_FC4.rootsmp #1
RIP: 0010:[<ffffffff80213a30>] <ffffffff80213a30>{_raw_write_lock+161}
RSP: 0018:ffff810078e07c18  EFLAGS: 00000086                          
RAX: 000000008f100300 RBX: ffff81007b7bea58 RCX: 00000000002dc5a0
RDX: 0000000000927efd RSI: 0000000000000001 RDI: ffff81007b7bea58
RBP: ffff81007b7bea40 R08: ffff810002e3ae80 R09: 00000000fffffffa
R10: 0000000000000003 R11: ffffffff801644e2 R12: ffff81007b7bea58
R13: 00002aaaad800000 R14: ffff810002e3aec0 R15: 00002aaabba6f000
FS:  0000000040a00960(0000) GS:ffffffff80514000(0000) knlGS:00000000f7fc86c0
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b                           
CR2: 00000033f38bdaf0 CR3: 0000000000101000 CR4: 00000000000006e0
Process mpi_multibw (pid: 4239, threadinfo ffff810078e06000, task ffff810079d8a040)
Stack: ffff810002e3aec0 ffffffff8016452b 0000000078ebb067 00002aaaad757000 
       ffff810078dccab8 ffffffff8016b840 0000000000000000 ffff810078e07d38 
       ffffffffffffffff 0000000000000000                                   
Call Trace: <ffffffff8016452b>{__set_page_dirty_nobuffers+73}
       <ffffffff8016b840>{unmap_vmas+1042} <ffffffff8016e638>{exit_mmap+124}
       <ffffffff80132b07>{mmput+37} <ffffffff80138373>{do_exit+584}         
       <ffffffff801416dc>{__dequeue_signal+459} <ffffffff80138af0>{sys_exit_group+0}
       <ffffffff80142af3>{get_signal_to_deliver+1568}
<ffffffff8010a14a>{do_signal+116}
       <ffffffff80195dc1>{__pollwait+0} <ffffffff80196b0c>{sys_select+934}
       <ffffffff8010aa87>{sysret_signal+28}
<ffffffff8010ad73>{ptregscall_common+103}

Code: 84 c0 75 7f f0 81 03 00 00 00 01 f3 90 48 83 c1 01 48 8b 15 
Kernel panic - not syncing: nmi watchdog               

And a sysrq-P often looks like this:

SysRq : Show Regs
CPU 0:
Modules linked in: nfs nfsd exportfs lockd nfs_acl ipv6 autofs4 sunrpc ib_sdp(U)
ib_cm(U) ib_umad(U) ib_uverbs(U) ib_ipoib(U) ib_sa(U) ib_ipath(U) ib_mad(U)
ib_core(U) video button battery ac i2c_nforce2 i2c_core ipath_core(U) e1000
floppy sg dm_snapshot dm_zero dm_mirror ext3 jbd dm_mod sata_nv libata aic79xx
scsi_transport_spi sd_mod scsi_mod
Pid: 3702, comm: mpi_multibw Not tainted 2.6.16-1.2096_FC4.rootsmp #1
RIP: 0010:[<ffffffff8013b0fe>] <ffffffff8013b0fe>{__do_softirq+81}
RSP: 0018:ffffffff8048f868  EFLAGS: 00000206
RAX: 0000000000000022 RBX: 0000000000000022 RCX: 0000000000000300
RDX: 0000000000000000 RSI: 00000000000000c0 RDI: ffff81007a60a860
RBP: ffffffff8052bf80 R08: 0000000000000211 R09: 0000000000000008
R10: ffff8100dff92ac8 R11: 0000000000000000 R12: ffffffff8057ad80
R13: 0000000000000000 R14: 000000000000000a R15: 00002aaabba6f000
FS:  00002aaaabc5bd60(0000) GS:ffffffff80514000(0000) knlGS:00000000f7f9dbb0
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00002aaaab9b1af0 CR3: 00000000de70a000 CR4: 00000000000006e0

Call Trace: <IRQ> <ffffffff8010be46>{call_softirq+30}
       <ffffffff8010cdcc>{do_softirq+44}
<ffffffff8010b7a0>{apic_timer_interrupt+132} <EOI>
       <ffffffff80354016>{_write_unlock_irq+14}
<ffffffff80164599>{__set_page_dirty_nobuffers+183}
       <ffffffff8016b840>{unmap_vmas+1042} <ffffffff8016e638>{exit_mmap+124}
       <ffffffff80132b07>{mmput+37} <ffffffff80138373>{do_exit+584}
       <ffffffff801416dc>{__dequeue_signal+459} <ffffffff80138af0>{sys_exit_group+0}
       <ffffffff80142af3>{get_signal_to_deliver+1568}
<ffffffff8010a14a>{do_signal+116}
       <ffffffff80195dc1>{__pollwait+0} <ffffffff80196b0c>{sys_select+934}
       <ffffffff8010aa87>{sysret_signal+28}
<ffffffff8010ad73>{ptregscall_common+103}

Dave Olson
olson@unixfolk.com
http://www.unixfolk.com/dave


* Re: [patch] increase spinlock-debug looping timeouts (write_lock and NMI)
  2006-06-20  5:35   ` [patch] increase spinlock-debug looping timeouts (write_lock and NMI) Dave Olson
@ 2006-06-20  6:39     ` Andrew Morton
  2006-06-20  6:53       ` Dave Jones
                         ` (2 more replies)
  0 siblings, 3 replies; 23+ messages in thread
From: Andrew Morton @ 2006-06-20  6:39 UTC
  To: Dave Olson; +Cc: mingo, ccb, linux-kernel, Nick Piggin

On Mon, 19 Jun 2006 22:35:46 -0700 (PDT)
Dave Olson <olson@unixfolk.com> wrote:

> | 
> | I get that impression ;) If it takes 1-2 seconds to get this lock then it
> | can take five seconds.  a) that's just gross and b) the NMI watchdog will
> | nuke the box.
> | 
> | Why is it taking so long to get the lock?
> | 
> | Does it happen in non-debug mode?
> | 
> | What do we do about it?
> 
> It seems possible that this might be the cause of problems we've had
> with our InfiniPath hardware/software, and also Mellanox/OpenIB hardware/software
> on some quad-socket/dual core opteron systems (8 cpu cores).
> 
> We'll see very long delays when 8 MPI processes exit "simultaneously", and sometimes
> get NMI, sometimes system hangs, and sometimes just hung up for many seconds (and
> often in that state, doing sysrq-P or sysrq-T will make things happy again).
> 

OK.  I assume these processes have done a mmap(MAP_SHARED) of a lot of
memory?

> A typical trace looks like this (on an fc4 2.6.16 kernel):

fc4?  You seem to have an RH-FCx which doesn't enable
CONFIG_DEBUG_SPINLOCK.  Or maybe we didn't have all that debug code in
2.6.16.  Doesn't matter, really.

> [root@quad-00 ~]# NMI Watchdog detected LOCKUP on CPU 0
> CPU 0                                                  
> Modules linked in: nfs nfsd exportfs lockd nfs_acl ipv6 autofs4 sunrpc ib_sdp(U)
> ib_cm(U) ib_umad(U) ib_uverbs(U) ib_ipoib(U) ib_sa(U) ib_ipath(U) ib_mad(U)
> ib_core(U) video button battery ac i2c_nforce2 i2c_core ipath_core(U) e1000
> floppy sg dm_snapshot dm_zero dm_mirror ext3 jbd dm_mod sata_nv libata aic79xx
> scsi_transport_spi sd_mod scsi_mod
> Pid: 4239, comm: mpi_multibw Not tainted 2.6.16-1.2096_FC4.rootsmp #1
> RIP: 0010:[<ffffffff80213a30>] <ffffffff80213a30>{_raw_write_lock+161}
> RSP: 0018:ffff810078e07c18  EFLAGS: 00000086                          
> RAX: 000000008f100300 RBX: ffff81007b7bea58 RCX: 00000000002dc5a0
> RDX: 0000000000927efd RSI: 0000000000000001 RDI: ffff81007b7bea58
> RBP: ffff81007b7bea40 R08: ffff810002e3ae80 R09: 00000000fffffffa
> R10: 0000000000000003 R11: ffffffff801644e2 R12: ffff81007b7bea58
> R13: 00002aaaad800000 R14: ffff810002e3aec0 R15: 00002aaabba6f000
> FS:  0000000040a00960(0000) GS:ffffffff80514000(0000) knlGS:00000000f7fc86c0
> CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b                           
> CR2: 00000033f38bdaf0 CR3: 0000000000101000 CR4: 00000000000006e0
> Process mpi_multibw (pid: 4239, threadinfo ffff810078e06000, task ffff810079d8a040)
> Stack: ffff810002e3aec0 ffffffff8016452b 0000000078ebb067 00002aaaad757000 
>        ffff810078dccab8 ffffffff8016b840 0000000000000000 ffff810078e07d38 
>        ffffffffffffffff 0000000000000000                                   
> Call Trace: <ffffffff8016452b>{__set_page_dirty_nobuffers+73}
>        <ffffffff8016b840>{unmap_vmas+1042} <ffffffff8016e638>{exit_mmap+124}
>        <ffffffff80132b07>{mmput+37} <ffffffff80138373>{do_exit+584}         
>        <ffffffff801416dc>{__dequeue_signal+459} <ffffffff80138af0>{sys_exit_group+0}
>        <ffffffff80142af3>{get_signal_to_deliver+1568}
> <ffffffff8010a14a>{do_signal+116}
>        <ffffffff80195dc1>{__pollwait+0} <ffffffff80196b0c>{sys_select+934}
>        <ffffffff8010aa87>{sysret_signal+28}
> <ffffffff8010ad73>{ptregscall_common+103}
>      
> Code: 84 c0 75 7f f0 81 03 00 00 00 01 f3 90 48 83 c1 01 48 8b 15 
> Kernel panic - not syncing: nmi watchdog               

blam, dead box, that's the one, thanks.

With our current rwlock semantics I don't know if this is fixable. 
Probably we need to go back to a spinlock on tree_lock.

With a -stable backport.  I suspect this is triggerable on demand.

<tries it, fails>



* Re: [patch] increase spinlock-debug looping timeouts (write_lock and NMI)
  2006-06-20  6:39     ` Andrew Morton
@ 2006-06-20  6:53       ` Dave Jones
  2006-06-20  7:37       ` Nick Piggin
  2006-06-20 16:11       ` Dave Olson
  2 siblings, 0 replies; 23+ messages in thread
From: Dave Jones @ 2006-06-20  6:53 UTC
  To: Andrew Morton; +Cc: Dave Olson, mingo, ccb, linux-kernel, Nick Piggin

On Mon, Jun 19, 2006 at 11:39:47PM -0700, Andrew Morton wrote:

 > fc4?  You seem to have an RH-FCx which doesn't enable
 > CONFIG_DEBUG_SPINLOCK.  Or maybe we didn't have all that debug code in
 > 2.6.16.  Doesn't matter, really.

From the uname, it looks like a recompiled Fedora kernel
(probably with that option turned off).

 > > Pid: 4239, comm: mpi_multibw Not tainted 2.6.16-1.2096_FC4.rootsmp #1

We helpfully appended the whoami output to it at buildtime.

		Dave

-- 
http://www.codemonkey.org.uk


* Re: [patch] increase spinlock-debug looping timeouts (write_lock and NMI)
  2006-06-20  6:39     ` Andrew Morton
  2006-06-20  6:53       ` Dave Jones
@ 2006-06-20  7:37       ` Nick Piggin
  2006-06-20  8:03         ` Andrew Morton
                           ` (2 more replies)
  2006-06-20 16:11       ` Dave Olson
  2 siblings, 3 replies; 23+ messages in thread
From: Nick Piggin @ 2006-06-20  7:37 UTC
  To: Andrew Morton; +Cc: Dave Olson, mingo, ccb, linux-kernel

Andrew Morton wrote:
> On Mon, 19 Jun 2006 22:35:46 -0700 (PDT)
> Dave Olson <olson@unixfolk.com> wrote:
> 
> 
>>| 
>>| I get that impression ;) If it takes 1-2 seconds to get this lock then it
>>| can take five seconds.  a) that's just gross and b) the NMI watchdog will
>>| nuke the box.
>>| 
>>| Why is it taking so long to get the lock?
>>| 
>>| Does it happen in non-debug mode?
>>| 
>>| What do we do about it?
>>
>>It seems possible that this might be the cause of problems we've had
>>with our InfiniPath hardware/software, and also Mellanox/OpenIB hardware/software
>>on some quad-socket/dual core opteron systems (8 cpu cores).
>>
>>We'll see very long delays when 8 MPI processes exit "simultaneously", and sometimes
>>get NMI, sometimes system hangs, and sometimes just hung up for many seconds (and
>>often in that state, doing sysrq-P or sysrq-T will make things happy again).
>>
> 
> 
> OK.  I assume these processes have done a mmap(MAP_SHARED) of a lot of
> memory?
> 
> 
>>A typical trace looks like this (on an fc4 2.6.16 kernel):
> 
> 
> fc4?  You seem to have an RH-FCx which doesn't enable
> CONFIG_DEBUG_SPINLOCK.  Or maybe we didn't have all that debug code in
> 2.6.16.  Doesn't matter, really.
> 
> 
>>[root@quad-00 ~]# NMI Watchdog detected LOCKUP on CPU 0
>>CPU 0                                                  
>>Modules linked in: nfs nfsd exportfs lockd nfs_acl ipv6 autofs4 sunrpc ib_sdp(U)
>>ib_cm(U) ib_umad(U) ib_uverbs(U) ib_ipoib(U) ib_sa(U) ib_ipath(U) ib_mad(U)
>>ib_core(U) video button battery ac i2c_nforce2 i2c_core ipath_core(U) e1000
>>floppy sg dm_snapshot dm_zero dm_mirror ext3 jbd dm_mod sata_nv libata aic79xx
>>scsi_transport_spi sd_mod scsi_mod
>>Pid: 4239, comm: mpi_multibw Not tainted 2.6.16-1.2096_FC4.rootsmp #1
>>RIP: 0010:[<ffffffff80213a30>] <ffffffff80213a30>{_raw_write_lock+161}
>>RSP: 0018:ffff810078e07c18  EFLAGS: 00000086                          
>>RAX: 000000008f100300 RBX: ffff81007b7bea58 RCX: 00000000002dc5a0
>>RDX: 0000000000927efd RSI: 0000000000000001 RDI: ffff81007b7bea58
>>RBP: ffff81007b7bea40 R08: ffff810002e3ae80 R09: 00000000fffffffa
>>R10: 0000000000000003 R11: ffffffff801644e2 R12: ffff81007b7bea58
>>R13: 00002aaaad800000 R14: ffff810002e3aec0 R15: 00002aaabba6f000
>>FS:  0000000040a00960(0000) GS:ffffffff80514000(0000) knlGS:00000000f7fc86c0
>>CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b                           
>>CR2: 00000033f38bdaf0 CR3: 0000000000101000 CR4: 00000000000006e0
>>Process mpi_multibw (pid: 4239, threadinfo ffff810078e06000, task ffff810079d8a040)
>>Stack: ffff810002e3aec0 ffffffff8016452b 0000000078ebb067 00002aaaad757000 
>>       ffff810078dccab8 ffffffff8016b840 0000000000000000 ffff810078e07d38 
>>       ffffffffffffffff 0000000000000000                                   
>>Call Trace: <ffffffff8016452b>{__set_page_dirty_nobuffers+73}
>>       <ffffffff8016b840>{unmap_vmas+1042} <ffffffff8016e638>{exit_mmap+124}
>>       <ffffffff80132b07>{mmput+37} <ffffffff80138373>{do_exit+584}         
>>       <ffffffff801416dc>{__dequeue_signal+459} <ffffffff80138af0>{sys_exit_group+0}
>>       <ffffffff80142af3>{get_signal_to_deliver+1568}
>><ffffffff8010a14a>{do_signal+116}
>>       <ffffffff80195dc1>{__pollwait+0} <ffffffff80196b0c>{sys_select+934}
>>       <ffffffff8010aa87>{sysret_signal+28}
>><ffffffff8010ad73>{ptregscall_common+103}
>>     
>>Code: 84 c0 75 7f f0 81 03 00 00 00 01 f3 90 48 83 c1 01 48 8b 15 
>>Kernel panic - not syncing: nmi watchdog 

Any ideas what it might be waiting on?


> 
> 
> blam, dead box, that's the one, thanks.
> 
> With our current rwlock semantics I don't know if this is fixable. 
> Probably we need to go back to a spinlock on tree_lock.

Lockless pagecache makes most of the readside locks go away, so I have
converted tree_lock back to a spinlock in my tree. I've just started
working on it again with a view for submitting it (or at least the
RCU radix tree, to start with)... been having fun with a userspace RCU
for rtth ;)

Otherwise, a straight rwlock->spinlock conversion will have a few more
scalability issues, but I'd guess it wouldn't be a problem  at all for
most workloads on most systems.

-- 
SUSE Labs, Novell Inc.


* Re: [patch] increase spinlock-debug looping timeouts (write_lock and NMI)
  2006-06-20  7:37       ` Nick Piggin
@ 2006-06-20  8:03         ` Andrew Morton
  2006-06-20  8:33         ` Ingo Molnar
  2006-06-20  8:43         ` Arjan van de Ven
  2 siblings, 0 replies; 23+ messages in thread
From: Andrew Morton @ 2006-06-20  8:03 UTC
  To: Nick Piggin; +Cc: olson, mingo, ccb, linux-kernel

On Tue, 20 Jun 2006 17:37:32 +1000
Nick Piggin <nickpiggin@yahoo.com.au> wrote:

> >>Kernel panic - not syncing: nmi watchdog 
> 
> Any ideas what it might be waiting on?

Readers, I guess.  When there's a spinning writer, nothing prevents _new_
readers from getting into the read-locked region.  If we have enough
readers and the read-locked region is long enough, there's always at least
one reader in the read-locked region and the writer is permanently starved.
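The starvation described above can be illustrated with a small userspace toy model. This is only a sketch under stated assumptions, not kernel code: time is divided into ticks, each reader holds the lock for `hold` ticks and stays out for `gap` ticks, reader `r` is shifted by `r * stagger` ticks, and the writer can only get in on a tick where no reader holds the lock. The names `reader_holds` and `writer_wait_ticks` are made up for this illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only: does reader r hold the read lock at tick t?
 * Each reader holds for `hold` ticks, stays out for `gap` ticks,
 * and reader r is shifted by r * stagger ticks. */
static bool reader_holds(int r, long t, int hold, int gap, int stagger)
{
    long period = (long)hold + gap;
    long phase = (t + (long)r * stagger) % period;
    return phase < hold;
}

/* Returns the first tick at which no reader holds the lock, i.e. the
 * first tick at which a write_trylock()-style attempt could succeed,
 * or -1 if the writer is still starved after max_ticks. */
long writer_wait_ticks(int readers, int hold, int gap, int stagger,
                       long max_ticks)
{
    assert(readers >= 0 && hold > 0 && gap >= 0 && stagger >= 0);
    for (long t = 0; t < max_ticks; t++) {
        bool any_reader = false;
        for (int r = 0; r < readers; r++) {
            if (reader_holds(r, t, hold, gap, stagger)) {
                any_reader = true;
                break;
            }
        }
        if (!any_reader)
            return t;   /* the writer would get the lock here */
    }
    return -1;          /* readers overlapped for the whole window */
}
```

In this model, `writer_wait_ticks(1, 3, 1, 2, 1000)` returns 3 because the single reader leaves a one-tick hole, while adding a second reader staggered by two ticks, `writer_wait_ticks(2, 3, 1, 2, 1000)`, returns -1: the read-side windows overlap and the writer never sees a reader count of zero.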

> Otherwise, a straight rwlock->spinlock conversion will have a few more
> scalability issues, but I'd guess it wouldn't be a problem  at all for
> most workloads on most systems.

It beats crashing.


* Re: [patch] increase spinlock-debug looping timeouts (write_lock and NMI)
  2006-06-20  7:37       ` Nick Piggin
  2006-06-20  8:03         ` Andrew Morton
@ 2006-06-20  8:33         ` Ingo Molnar
  2006-06-20  9:37           ` Nick Piggin
  2006-06-20  8:43         ` Arjan van de Ven
  2 siblings, 1 reply; 23+ messages in thread
From: Ingo Molnar @ 2006-06-20  8:33 UTC
  To: Nick Piggin; +Cc: Andrew Morton, Dave Olson, ccb, linux-kernel


* Nick Piggin <nickpiggin@yahoo.com.au> wrote:

> Otherwise, a straight rwlock->spinlock conversion will have a few more 
> scalability issues, but I'd guess it wouldn't be a problem at all for 
> most workloads on most systems.

curious, do you have any (relatively-) simple to run testcase that 
clearly shows the "scalability issues" you mention above, when going 
from rwlocks to spinlocks? I'd like to give it a try on an 8-way box.

	Ingo


* Re: [patch] increase spinlock-debug looping timeouts (write_lock and NMI)
  2006-06-20  8:33         ` Ingo Molnar
@ 2006-06-20  9:37           ` Nick Piggin
  2006-06-20  9:51             ` Ingo Molnar
  0 siblings, 1 reply; 23+ messages in thread
From: Nick Piggin @ 2006-06-20  9:37 UTC
  To: Ingo Molnar
  Cc: Andrew Morton, Dave Olson, ccb, linux-kernel, Peter Chubb,
	Arjan van de Ven

Ingo Molnar wrote:
> curious, do you have any (relatively-) simple to run testcase that 
> clearly shows the "scalability issues" you mention above, when going 
> from rwlocks to spinlocks? I'd like to give it a try on an 8-way box.

Arjan van de Ven wrote:
 > I'm curious what scalability advantage you see for rw spinlocks vs real
 > spinlocks ... since for any kind of moderate hold time the opposite is
 > expected ;)

It actually surprised me too, but Peter Chubb (who IIRC provided the
motivation to merge the patch) showed some fairly significant improvement
at 12-way.

https://www.gelato.unsw.edu.au/archives/scalability/2005-March/000069.html

Not sure what exactly would be going on at 8-way and above. Single
threaded lock hold times for find_lock_page should be fairly short... At
a wild guess, I'd say average lock transfer times are creeping up to the
point that spin lockers are taking multiple cacheline transfers to obtain
the lock, and the interconnect is getting saturated (read lockers should
only need one cacheline transfer in the absence of write lockers).

I thought Peter had a wider range of test cases than just reaim, but
perhaps that was for demonstrating some other problem.

Before that, Bill Irwin made some noises about Oracle improvements with
their VLM mode... that's not such a simple one to reproduce.

I'm sure SGI would be mortified too, but that's a given ;)

-- 
SUSE Labs, Novell Inc.


* Re: [patch] increase spinlock-debug looping timeouts (write_lock and NMI)
  2006-06-20  9:37           ` Nick Piggin
@ 2006-06-20  9:51             ` Ingo Molnar
  2006-06-20 10:59               ` Nick Piggin
  0 siblings, 1 reply; 23+ messages in thread
From: Ingo Molnar @ 2006-06-20  9:51 UTC
  To: Nick Piggin
  Cc: Andrew Morton, Dave Olson, ccb, linux-kernel, Peter Chubb,
	Arjan van de Ven

* Nick Piggin <nickpiggin@yahoo.com.au> wrote:

> Ingo Molnar wrote:
> >curious, do you have any (relatively-) simple to run testcase that 
> >clearly shows the "scalability issues" you mention above, when going 
> >from rwlocks to spinlocks? I'd like to give it a try on an 8-way box.
> 
> Arjan van de Ven wrote:
> > I'm curious what scalability advantage you see for rw spinlocks vs real
> > spinlocks ... since for any kind of moderate hold time the opposite is
> > expected ;)
> 
> It actually surprised me too, but Peter Chubb (who IIRC provided the 
> motivation to merge the patch) showed some fairly significant 
> improvement at 12-way.
> 
> https://www.gelato.unsw.edu.au/archives/scalability/2005-March/000069.html

i think that workload wasnt analyzed well enough (by us, not by Peter, 
who sent a reasonable analysis and suggested a reasonable change), and 
we went with whatever magic change appeared to make a difference, 
without fully understanding the underlying reasons. Quote:

  "I'm not sure what's happening in the 4-processor case."

Now history appears to be repeating itself, just in the other direction 
;) And we didnt get one inch closer to understanding the situation for 
real. I'd vote for putting a change-moratorium on tree-lock and only 
allow a patch that tweaks it that fully analyzes the workload :-)

one thing off the top of my mind: doesnt lockstat introduce significant 
overhead? Is this reproducible with lockstat turned off too? Is the same 
scalability problem visible if all read_lock()s are changed to 
write_lock()? [like i did in my patch] I.e. can other explanations (like 
unlucky alignment of certain rwlock data structures / functions) be 
excluded.

another thing: average hold times in the spinlock case on that workload 
are below 1 microsecond - probably on the range of cachemiss bounce 
costs on such a system. I.e. it's the worst possible case for a 
spinlock->rwlock conversion! The only reason i can believe this to make 
a difference are cycle level races and small random micro-differences 
that cause heavier bouncing in the spinlock workload but happen to avoid 
it in the read-lock case. Not due to any fundamental advantage of 
rwlocks.

	Ingo


* Re: [patch] increase spinlock-debug looping timeouts (write_lock and NMI)
  2006-06-20  9:51             ` Ingo Molnar
@ 2006-06-20 10:59               ` Nick Piggin
  2006-06-20 13:04                 ` Arjan van de Ven
  0 siblings, 1 reply; 23+ messages in thread
From: Nick Piggin @ 2006-06-20 10:59 UTC
  To: Ingo Molnar
  Cc: Andrew Morton, Dave Olson, ccb, linux-kernel, Peter Chubb,
	Arjan van de Ven

[Corrected Arjan's address I messed up earlier.]

Ingo Molnar wrote:
> * Nick Piggin <nickpiggin@yahoo.com.au> wrote:
> 
> 
>>Ingo Molnar wrote:
>>
>>>curious, do you have any (relatively-) simple to run testcase that 
>>>clearly shows the "scalability issues" you mention above, when going 
>>>from rwlocks to spinlocks? I'd like to give it a try on an 8-way box.
>>
>>Arjan van de Ven wrote:
>>
>>>I'm curious what scalability advantage you see for rw spinlocks vs real
>>>spinlocks ... since for any kind of moderate hold time the opposite is
>>>expected ;)
>>
>>It actually surprised me too, but Peter Chubb (who IIRC provided the 
>>motivation to merge the patch) showed some fairly significant 
>>improvement at 12-way.
>>
>>https://www.gelato.unsw.edu.au/archives/scalability/2005-March/000069.html
> 
> 
> i think that workload wasnt analyzed well enough (by us, not by Peter, 
> who sent a reasonable analysis and suggested a reasonable change), and 
> we went with whatever magic change appeared to make a difference, 
> without fully understanding the underlying reasons. Quote:
> 
>   "I'm not sure what's happening in the 4-processor case."
> 
> Now history appears to be repeating itself, just in the other direction 
> ;) And we didnt get one inch closer to understanding the situation for 
> real. I'd vote for putting a change-moratorium on tree-lock and only 
> allow a patch that tweaks it that fully analyzes the workload :-)
> 
> one thing off the top of my mind: doesnt lockstat introduce significant 
> overhead? Is this reproducible with lockstat turned off too? Is the same 
> scalability problem visible if all read_lock()s are changed to 
> write_lock()? [like i did in my patch] I.e. can other explanations (like 
> unlucky alignment of certain rwlock data structures / functions) be 
> excluded.

Yes, it would need re-testing.

> 
> another thing: average hold times in the spinlock case on that workload 
> are below 1 microsecond - probably on the range of cachemiss bounce 
> costs on such a system. 

It's the wait time that I'd be more worried about. As I said, my wild
guess is that the wait times are creeping up.

> I.e. it's the worst possible case for a 
> spinlock->rwlock conversion! The only reason i can believe this to make 
> a difference are cycle level races and small random micro-differences 
> that cause heavier bouncing in the spinlock workload but happen to avoid 
> it in the read-lock case. Not due to any fundamental advantage of 
> rwlocks.

I'd say the 12 way results show that there is a fundamental advantage
(although that's pending whether or not lockstat is wrecking the results).
I'd even go out on a limb ;) and say that it will only become more
pronounced at higher cpu counts.

Correct me if I'm wrong, but... a read-lock requires at most a single
cacheline transfer per lock acq and a single per release, no matter the
concurrency on the lock (so long as it is read only).

A spinlock is going to take more. If the hardware perfectly round-robins
the cacheline, it will take lockers+1 transfers per lock+unlock. Of
course, hardware might be pretty unfair for efficiency, but there will
still be some probability of the cacheline bouncing to other lockers
while it is locked. And that probability will increase proportionally to
the number of lockers.
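As a worked example of the counting argument above, the two cases can be written down as a tiny model. This is only a sketch of the estimate as stated in the text (read lock: at most one transfer to acquire and one to release; spinlock with a perfectly round-robin cacheline: lockers + 1 transfers per lock+unlock). The function names are made up for illustration, and real hardware will not behave this regularly.

```c
#include <assert.h>

/* Read lock, read-only contention: one cacheline transfer to acquire
 * and one to release, regardless of how many other readers exist.
 * This is the upper bound given in the text. */
unsigned long read_lock_transfers(unsigned long lockers)
{
    assert(lockers > 0);
    return 2;
}

/* Spinlock, assuming the hardware perfectly round-robins the cacheline
 * among all contenders: lockers + 1 transfers per lock+unlock. */
unsigned long spinlock_transfers(unsigned long lockers)
{
    assert(lockers > 0);
    return lockers + 1;
}
```

For 12 contending CPUs this model gives 13 transfers per lock+unlock for the spinlock against 2 for the read lock, and the gap grows linearly with the number of lockers.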

-- 
SUSE Labs, Novell Inc.


* Re: [patch] increase spinlock-debug looping timeouts (write_lock and NMI)
  2006-06-20 10:59               ` Nick Piggin
@ 2006-06-20 13:04                 ` Arjan van de Ven
  2006-06-20 13:28                   ` update pci device id cckuo
  2006-06-20 13:36                   ` [patch] increase spinlock-debug looping timeouts (write_lock and NMI) Nick Piggin
  0 siblings, 2 replies; 23+ messages in thread
From: Arjan van de Ven @ 2006-06-20 13:04 UTC
  To: Nick Piggin
  Cc: Ingo Molnar, Andrew Morton, Dave Olson, ccb, linux-kernel,
	Peter Chubb


> Correct me if I'm wrong, but... a read-lock requires at most a single
> cacheline transfer per lock acq and a single per release, no matter the
> concurrency on the lock (so long as it is read only).
> 
> A spinlock is going to take more. If the hardware perfectly round-robins
> the cacheline, it will take lockers+1 transfers per lock+unlock.

This is a bit too simplistic a view; shared cachelines are cheap, it's
getting the cacheline exclusive (or transitioning to/from exclusive)
that is the expensive part...

(note that our spinlocks are fixed nowadays to only do the slowpath side
of things for read, eg allow shared cachelines there)
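The idea of spinning read-only in the slowpath is commonly known as a test-and-test-and-set lock. The following is a minimal userspace sketch using C11 atomics, not the kernel's actual spinlock implementation; the `ttas_*` names are hypothetical. Only the exchange needs the cacheline exclusive, while waiters spin on plain loads so the line can stay shared among them until the holder releases.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

typedef struct {
    atomic_int locked;   /* 0 = free, 1 = held */
} ttas_lock;

static void ttas_init(ttas_lock *l)
{
    atomic_init(&l->locked, 0);
}

/* Fast path: a single atomic exchange, which needs the cacheline
 * in exclusive state. Returns true if the lock was taken. */
static bool ttas_trylock(ttas_lock *l)
{
    return atomic_exchange_explicit(&l->locked, 1,
                                    memory_order_acquire) == 0;
}

static void ttas_lock_acquire(ttas_lock *l)
{
    while (!ttas_trylock(l)) {
        /* Slow path: read-only spinning. Plain loads let every waiter
         * keep a shared copy of the line instead of bouncing it. */
        while (atomic_load_explicit(&l->locked,
                                    memory_order_relaxed) != 0)
            ;
    }
}

static void ttas_unlock(ttas_lock *l)
{
    assert(atomic_load_explicit(&l->locked, memory_order_relaxed) == 1);
    atomic_store_explicit(&l->locked, 0, memory_order_release);
}
```

The release still invalidates every waiter's copy, so this reduces traffic while the lock is held but does not remove the burst of transfers at each hand-over.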




* update pci device id
  2006-06-20 13:04                 ` Arjan van de Ven
@ 2006-06-20 13:28                   ` cckuo
  2006-06-20 14:06                     ` Arjan van de Ven
  2006-06-20 13:36                   ` [patch] increase spinlock-debug looping timeouts (write_lock and NMI) Nick Piggin
  1 sibling, 1 reply; 23+ messages in thread
From: cckuo @ 2006-06-20 13:28 UTC
  To: linux-kernel

Dear All:
Recently my company, SiS, released some new platforms for Intel socket 775
and AMD socket 939. I have read the MAINTAINERS file and cannot find anyone
who can help me add the PCI device IDs.
If someone knows who is in charge of this, please let me know.

Thanks a lot!!
cckuo


* Re: update pci device id
  2006-06-20 13:28                   ` update pci device id cckuo
@ 2006-06-20 14:06                     ` Arjan van de Ven
  0 siblings, 0 replies; 23+ messages in thread
From: Arjan van de Ven @ 2006-06-20 14:06 UTC
  To: cckuo; +Cc: linux-kernel

On Tue, 2006-06-20 at 21:28 +0800, cckuo wrote:
> Dear All:
> Recently my company, SiS, released some new platforms for Intel socket 775
> and AMD socket 939. I have read the MAINTAINERS file and cannot find anyone
> who can help me add the PCI device IDs.
> If someone knows who is in charge of this, please let me know.

Hi,

the answer is a bit complex; I'll try to break it down in steps ...

step 1) the text database as used by lspci
This is quite easy, just go to http://pciids.sourceforge.net/ and add
your pci ids to the database

step 2) the drivers
this is a bit more work, basically this is about teaching individual
drivers that support your hardware about your new IDs.

for example the i810-like audio driver has a table like this:

 static struct pci_device_id snd_intel8x0_ids[] __devinitdata = {
        { 0x8086, 0x2415, PCI_ANY_ID, PCI_ANY_ID, 0, 0, DEVICE_INTEL }, /* 82801AA */
        { 0x8086, 0x2425, PCI_ANY_ID, PCI_ANY_ID, 0, 0, DEVICE_INTEL }, /* 82901AB */
        { 0x8086, 0x2445, PCI_ANY_ID, PCI_ANY_ID, 0, 0, DEVICE_INTEL }, /* 82801BA */
        { 0x8086, 0x2485, PCI_ANY_ID, PCI_ANY_ID, 0, 0, DEVICE_INTEL }, /* ICH3 */


if your hw is compatible, just add your ID, test the new driver (since
you have the hardware) and send a patch to the driver maintainer and/or the
lkml mailing list
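To make step 2 concrete, here is a userspace mock of how such an ID table is extended and matched. It is a sketch only: the `mock_*` names are invented, `MOCK_ANY_ID` stands in for the kernel's PCI_ANY_ID wildcard, and the device ID 0xabcd in the added entry is a made-up placeholder (0x1039 is the SiS vendor ID). A real patch would add the entry to the driver's own `pci_device_id` table instead.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MOCK_ANY_ID 0xffffffffu /* stand-in for the kernel's PCI_ANY_ID */

struct mock_pci_id {
    uint32_t vendor, device, subvendor, subdevice;
};

static const struct mock_pci_id mock_ids[] = {
    { 0x8086, 0x2415, MOCK_ANY_ID, MOCK_ANY_ID }, /* 82801AA */
    { 0x8086, 0x2445, MOCK_ANY_ID, MOCK_ANY_ID }, /* 82801BA */
    /* New entry: SiS vendor ID with a hypothetical device ID. */
    { 0x1039, 0xabcd, MOCK_ANY_ID, MOCK_ANY_ID },
    { 0, 0, 0, 0 }                                /* terminator */
};

/* A table field matches if it is the wildcard or equals the device's value. */
static int id_field_matches(uint32_t wanted, uint32_t actual)
{
    return wanted == MOCK_ANY_ID || wanted == actual;
}

/* Returns the first table entry matching the device, or NULL if the
 * driver does not claim it. */
const struct mock_pci_id *mock_match(uint32_t vendor, uint32_t device,
                                     uint32_t subvendor, uint32_t subdevice)
{
    assert(vendor != 0);
    for (const struct mock_pci_id *id = mock_ids; id->vendor != 0; id++) {
        if (id_field_matches(id->vendor, vendor) &&
            id_field_matches(id->device, device) &&
            id_field_matches(id->subvendor, subvendor) &&
            id_field_matches(id->subdevice, subdevice))
            return id;
    }
    return NULL;
}
```

Because the subsystem fields are wildcards, one added line is enough for the mock driver to claim every board built around that vendor/device pair.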

Does this answer your question?

Greetings,
   Arjan van de Ven



* Re: [patch] increase spinlock-debug looping timeouts (write_lock and NMI)
  2006-06-20 13:04                 ` Arjan van de Ven
  2006-06-20 13:28                   ` update pci device id cckuo
@ 2006-06-20 13:36                   ` Nick Piggin
  2006-06-20 14:53                     ` Arjan van de Ven
  1 sibling, 1 reply; 23+ messages in thread
From: Nick Piggin @ 2006-06-20 13:36 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Ingo Molnar, Andrew Morton, Dave Olson, ccb, linux-kernel,
	Peter Chubb

Arjan van de Ven wrote:
>>Correct me if I'm wrong, but... a read-lock requires at most a single
>>cacheline transfer per lock acq and a single per release, no matter the
>>concurrency on the lock (so long as it is read only).
>>
>>A spinlock is going to take more. If the hardware perfectly round-robins
>>the cacheline, it will take lockers+1 transfers per lock+unlock.
> 
> 
> This is a bit too simplistic view; shared cachelines are cheap, it's
> getting the cacheline exclusive (or transitioning to/from exclusive)
> that is the expensive part...

Taking the lock is going to transition the cacheline to exclusive. If
the next locker tries to take the lock, they transfer the cacheline and
exclusive access and fail. If they have already tried to take the lock
earlier, they might only request a readonly state, but it still requires
a cacheline transfer (which is the expensive part).

The only way it is simplistic is that hardware will be unfair and give
the same, or "close" requesters priority for some time, so the cacheline
stays close.

At some point, when it gets transferred away, there is no guarantee that
the spinlock will be unlocked. Quite likely the opposite, if there is
large contention for it and/or its cacheline.

> 
> (note that our spinlocks are fixed nowadays to only do the slowpath side
> of things for read, eg allow shared cachelines there)

To put it another way, when 1 CPU takes or releases the lock, the cachelines
of 11 others are invalidated. In a perfect round-robin, if 12 queue up at the
same time, 1 will go through and 11 will fail (= 12 cacheline transfers). So
in this situation, the reader lock has a factor of 12 better acquisition
throughput.

Now the situation is simplistic (all queueing at the same time, perfectly
fair hardware), but the cacheline transfer costs are accurate *for this
situation*.

So I think rwlocks do have a fundamental advantage over spinlocks (aside
from the multiple concurrent readers advantage, although the two properties
are obviously fundamentally related). It is yet to be shown whether that is
actually the cause of Peter's performance improvement, but that is my
guess.

-- 
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com 


* Re: [patch] increase spinlock-debug looping timeouts (write_lock and NMI)
  2006-06-20 13:36                   ` [patch] increase spinlock-debug looping timeouts (write_lock and NMI) Nick Piggin
@ 2006-06-20 14:53                     ` Arjan van de Ven
  2006-06-20 15:16                       ` Nick Piggin
  0 siblings, 1 reply; 23+ messages in thread
From: Arjan van de Ven @ 2006-06-20 14:53 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Ingo Molnar, Andrew Morton, Dave Olson, ccb, linux-kernel,
	Peter Chubb


> Taking the lock is going to transition the cacheline to exclusive. If
> the next locker tries to take the lock, they transfer the cacheline and
> exclusive access and fail. If they have already tried to take the lock
> earlier, they might only request a readonly state, but it still requires
> a cacheline transfer (which is the expensive part).

the "which is the expensive part" isn't entirely true on modern hardware
(and for sure not on multicore systems); due to various bus snooping
tricks and other "pass-the-cacheline" tricks this is relatively cheap;
not free obviously but not nearly as expensive as the exclusive part.




* Re: [patch] increase spinlock-debug looping timeouts (write_lock and NMI)
  2006-06-20 14:53                     ` Arjan van de Ven
@ 2006-06-20 15:16                       ` Nick Piggin
  2006-06-20 16:27                         ` Nick Piggin
  0 siblings, 1 reply; 23+ messages in thread
From: Nick Piggin @ 2006-06-20 15:16 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Ingo Molnar, Andrew Morton, Dave Olson, ccb, linux-kernel,
	Peter Chubb

Arjan van de Ven wrote:
>>Taking the lock is going to transition the cacheline to exclusive. If
>>the next locker tries to take the lock, they transfer the cacheline and
>>exclusive access and fail. If they have already tried to take the lock
>>earlier, they might only request a readonly state, but it still requires
>>a cacheline transfer (which is the expensive part).
> 
> 
> the "which is the expensive part" isn't entirely true on modern hardware
> (and for sure not on multicore systems); due to various bus snooping
> tricks and other "pass-the-cacheline" tricks this is relatively cheap;
> not free obviously but not nearly as expensive as the exclusive part.

Yes, I meant the entire process of getting the cacheline. The cache
coherency is the larger part of the cost (except maybe with shared
cache multi cores), however you make it sound like getting exclusive
access is fundamentally more expensive than getting shared access.
(ok, once you *have* shared access, no problems, but *getting* it is
still expensive).

With broadcast snooping cache coherency with MESI, the "getting
exclusive" shouldn't be hugely more expensive than "getting shared".
Either way the owner has to write out the line and cause the requester
to retry iff it was dirty. Then, in the former case, the owner should
probably mark their line invalid, in the latter, just shared.

With MOESI, both cases will still have to do the broadcast snoop
AFAIK.

Not sure about fancier snoop filters or directory protocols. I can't
see why getting E would be that much more expensive than getting S in
this situation (sure, in situations where lots of entities are also in
S, getting an E might require more invalidates to be sent out...).

And either way, spinlocks are still much more costly than rwlocks,
because they still have that first exclusive request, whose
effectiveness deteriorates under load. That you *also* have these
follow on shared accesses (which will need to be invalidated somehow
later anyway), doesn't make them better than read locks.

-- 
SUSE Labs, Novell Inc.


* Re: [patch] increase spinlock-debug looping timeouts (write_lock and NMI)
  2006-06-20 15:16                       ` Nick Piggin
@ 2006-06-20 16:27                         ` Nick Piggin
  0 siblings, 0 replies; 23+ messages in thread
From: Nick Piggin @ 2006-06-20 16:27 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Ingo Molnar, Andrew Morton, Dave Olson, ccb, linux-kernel,
	Peter Chubb

Nick Piggin wrote:

> And either way, spinlocks are still much more costly than rwlocks,
> because they still have that first exclusive request, whose
> effectiveness deteriorates under load. That you *also* have these
> follow on shared accesses (which will need to be invalidated somehow
> later anyway), doesn't make them better than read locks.

I shouldn't say much more costly.... Much more costly when looking
at the limit case (and we traditionally rather look at the common
case in Linux, and in those cases spinlocks _can_ be faster).

However in the limit, spinlocks scale O(N), while readlocks scale
O(1), where N is the number of CPUs trying to take the lock.
AFAIKS.

But if they're causing stability problems, I have no arguments
against converting them to spinlocks. It might even result in a
worldwide net saving of CPU cycles, for what that's worth ;)

-- 
SUSE Labs, Novell Inc.



* Re: [patch] increase spinlock-debug looping timeouts (write_lock and NMI)
  2006-06-20  7:37       ` Nick Piggin
  2006-06-20  8:03         ` Andrew Morton
  2006-06-20  8:33         ` Ingo Molnar
@ 2006-06-20  8:43         ` Arjan van de Ven
  2 siblings, 0 replies; 23+ messages in thread
From: Arjan van de Ven @ 2006-06-20  8:43 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Andrew Morton, Dave Olson, mingo, ccb, linux-kernel

On Tue, 2006-06-20 at 17:37 +1000,
> 
> Otherwise, a straight rwlock->spinlock conversion will have a few more
> scalability issues, but I'd guess it wouldn't be a problem at all for
> most workloads on most systems.

I'm curious what scalability advantage you see for rw spinlocks vs real
spinlocks ... since for any kind of moderate hold time the opposite is
expected ;)




* Re: [patch] increase spinlock-debug looping timeouts (write_lock and NMI)
  2006-06-20  6:39     ` Andrew Morton
  2006-06-20  6:53       ` Dave Jones
  2006-06-20  7:37       ` Nick Piggin
@ 2006-06-20 16:11       ` Dave Olson
  2006-06-20 21:10         ` Andrew Morton
  2 siblings, 1 reply; 23+ messages in thread
From: Dave Olson @ 2006-06-20 16:11 UTC (permalink / raw)
  To: Andrew Morton; +Cc: mingo, ccb, linux-kernel, Nick Piggin

On Mon, 19 Jun 2006, Andrew Morton wrote:
| > We'll see very long delays when 8 MPI processes exit "simultaneously", and sometimes
| > get NMI, sometimes system hangs, and sometimes just hung up for many seconds (and
| > often in that state, doing sysrq-P or sysrq-T will make things happy again).
| > 
| 
| OK.  I assume these processes have done a mmap(MAP_SHARED) of a lot of
| memory?

Yep.  Some shared with kernel modules, some of device address space.

| > A typical trace looks like this (on an fc4 2.6.16 kernel):
| 
| fc4?  You seem to have an RH-FCx which doesn't enable
| CONFIG_DEBUG_SPINLOCK.  Or maybe we didn't have all that debug code in
| 2.6.16.  Doesn't matter, really.

Intended to be more or less stock fc4 but with CONFIG_PCI_MSI=y and
2.6.17-based patch so the 8131 MSI quirk isn't enabled.

From the config file:
	CONFIG_DEBUG_SPINLOCK=y
	CONFIG_DEBUG_SPINLOCK_SLEEP=y

| With a -stable backport.  I suspect this is triggerable on demand.

So far we've only got the one test case, but it's quite reliable.
We hit one of the 3 cases (long "hangs" of > 60 seconds at exit,
NMI, or dead system hang) every time we run the test case (well,
perhaps 1 out of 20 times everything is "just fine", probably
something perturbs it enough to let one or more processes get
through the critical section ahead of the whole gang).

Dave Olson
olson@unixfolk.com
http://www.unixfolk.com/dave


* Re: [patch] increase spinlock-debug looping timeouts (write_lock and NMI)
  2006-06-20 16:11       ` Dave Olson
@ 2006-06-20 21:10         ` Andrew Morton
  0 siblings, 0 replies; 23+ messages in thread
From: Andrew Morton @ 2006-06-20 21:10 UTC (permalink / raw)
  To: Dave Olson; +Cc: mingo, ccb, linux-kernel, nickpiggin

On Tue, 20 Jun 2006 09:11:36 -0700 (PDT)
Dave Olson <olson@unixfolk.com> wrote:

> On Mon, 19 Jun 2006, Andrew Morton wrote:
> | > We'll see very long delays when 8 MPI processes exit "simultaneously", and sometimes
> | > get NMI, sometimes system hangs, and sometimes just hung up for many seconds (and
> | > often in that state, doing sysrq-P or sysrq-T will make things happy again).
> | > 
> | 
> | OK.  I assume these processes have done a mmap(MAP_SHARED) of a lot of
> | memory?
> 
> Yep.  Some shared with kernel modules, some of device address space.
> 
> | > A typical trace looks like this (on an fc4 2.6.16 kernel):
> | 
> | fc4?  You seem to have an RH-FCx which doesn't enable
> | CONFIG_DEBUG_SPINLOCK.  Or maybe we didn't have all that debug code in
> | 2.6.16.  Doesn't matter, really.
> 
> Intended to be more or less stock fc4 but with CONFIG_PCI_MSI=y and
> 2.6.17-based patch so the 8131 MSI quirk isn't enabled.
> 
> From the config file:
> 	CONFIG_DEBUG_SPINLOCK=y
> 	CONFIG_DEBUG_SPINLOCK_SLEEP=y

OK, I goofed again.

It would be super-interesting to know whether CONFIG_DEBUG_SPINLOCK=n
improves things.

> | With a -stable backport.  I suspect this is triggerable on demand.
> 
> So far we've only got the one test case, but it's quite reliable.
> We hit one of the 3 cases (long > 60 seconds) "hangs" at exit,
> NMI, or dead system hang, every time we run the test case (well,
> perhaps 1 out of 20 times everything is "just fine", probably
> something perturbs it enough to let one or more processes get
> through the critical section ahead of the whole gang).

Reproducibility is a win.

You should have complained earlier!


* Re: [patch] increase spinlock-debug looping timeouts (write_lock and NMI)
@ 2006-06-22  5:45 Dave Olson
  2006-06-22  5:57 ` Andrew Morton
  2006-06-23  7:57 ` Ingo Molnar
  0 siblings, 2 replies; 23+ messages in thread
From: Dave Olson @ 2006-06-22  5:45 UTC (permalink / raw)
  To: Andrew Morton; +Cc: mingo, ccb, linux-kernel, nickpiggin

On Tue, 20 Jun 2006, Andrew Morton wrote:
| > Intended to be more or less stock fc4 but with CONFIG_PCI_MSI=y and
| > 2.6.17-based patch so the 8131 MSI quirk isn't enabled.
| > 
| > From the config file:
| > 	CONFIG_DEBUG_SPINLOCK=y
| > 	CONFIG_DEBUG_SPINLOCK_SLEEP=y
| 
| OK, I goofed again.
| 
| It would be super-interesting to know whether CONFIG_DEBUG_SPINLOCK=n
| improves things.

It does.   No stalls, hangs, or NMIs in several hours of running the
test that previously failed on almost every run (with long stalls, system
hangs, or NMI watchdogs), on the same hardware.

I made no other changes to the kernel config than turning both of
the above off.

Dave Olson
olson@unixfolk.com
http://www.unixfolk.com/dave


* Re: [patch] increase spinlock-debug looping timeouts (write_lock and NMI)
  2006-06-22  5:45 Dave Olson
@ 2006-06-22  5:57 ` Andrew Morton
  2006-06-23  7:57 ` Ingo Molnar
  1 sibling, 0 replies; 23+ messages in thread
From: Andrew Morton @ 2006-06-22  5:57 UTC (permalink / raw)
  To: Dave Olson; +Cc: mingo, ccb, linux-kernel, nickpiggin

On Wed, 21 Jun 2006 22:45:49 -0700 (PDT)
Dave Olson <olson@unixfolk.com> wrote:

> On Tue, 20 Jun 2006, Andrew Morton wrote:
> | > Intended to be more or less stock fc4 but with CONFIG_PCI_MSI=y and
> | > 2.6.17-based patch so the 8131 MSI quirk isn't enabled.
> | > 
> | > From the config file:
> | > 	CONFIG_DEBUG_SPINLOCK=y
> | > 	CONFIG_DEBUG_SPINLOCK_SLEEP=y
> | 
> | OK, I goofed again.
> | 
> | It would be super-interesting to know whether CONFIG_DEBUG_SPINLOCK=n
> | improves things.
> 
> It does.   No stalls, hangs, or nmi's in several hours of running the
> test that previously failed on almost every run (with long stalls, system
> hangs, or NMI watchdogs), on the same hardware.
> 
> I made no other changes to the kernel config than turning both of
> the above off.
> 

Well isn't that interesting, thanks.   We have our 2.6.17.x patch.

Now we need to work out why.


* Re: [patch] increase spinlock-debug looping timeouts (write_lock and NMI)
  2006-06-22  5:45 Dave Olson
  2006-06-22  5:57 ` Andrew Morton
@ 2006-06-23  7:57 ` Ingo Molnar
  1 sibling, 0 replies; 23+ messages in thread
From: Ingo Molnar @ 2006-06-23  7:57 UTC (permalink / raw)
  To: Dave Olson; +Cc: Andrew Morton, ccb, linux-kernel, nickpiggin


* Dave Olson <olson@unixfolk.com> wrote:

> | >     CONFIG_DEBUG_SPINLOCK=y
> | >     CONFIG_DEBUG_SPINLOCK_SLEEP=y
> | It would be super-interesting to know whether 
> | CONFIG_DEBUG_SPINLOCK=n improves things.
> 
> It does.  No stalls, hangs, or nmi's in several hours of running the 
> test that previously failed on almost every run (with long stalls, 
> system hangs, or NMI watchdogs), on the same hardware.
> 
> I made no other changes to the kernel config than turning both of the 
> above off.

we really need to figure out what's happening here! Could you re-enable 
spinlock debugging and try the patch below - do the stalls/lockups still 
happen?

	Ingo

---
 lib/spinlock_debug.c |   12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

Index: linux/lib/spinlock_debug.c
===================================================================
--- linux.orig/lib/spinlock_debug.c
+++ linux/lib/spinlock_debug.c
@@ -104,10 +104,10 @@ static void __spin_lock_debug(spinlock_t
 	u64 i;
 
 	for (;;) {
-		for (i = 0; i < loops_per_jiffy * HZ; i++) {
+		for (;;) {
 			if (__raw_spin_trylock(&lock->raw_lock))
 				return;
-			__delay(1);
+			cpu_relax();
 		}
 		/* lockup suspected: */
 		if (print_once) {
@@ -169,10 +169,10 @@ static void __read_lock_debug(rwlock_t *
 	u64 i;
 
 	for (;;) {
-		for (i = 0; i < loops_per_jiffy * HZ; i++) {
+		for (;;) {
 			if (__raw_read_trylock(&lock->raw_lock))
 				return;
-			__delay(1);
+			cpu_relax();
 		}
 		/* lockup suspected: */
 		if (print_once) {
@@ -242,10 +242,10 @@ static void __write_lock_debug(rwlock_t 
 	u64 i;
 
 	for (;;) {
-		for (i = 0; i < loops_per_jiffy * HZ; i++) {
+		for (;;) {
 			if (__raw_write_trylock(&lock->raw_lock))
 				return;
-			__delay(1);
+			cpu_relax();
 		}
 		/* lockup suspected: */
 		if (print_once) {


* Re: [patch] increase spinlock-debug looping timeouts (write_lock and NMI)
@ 2006-06-23 16:27 Dave Olson
  0 siblings, 0 replies; 23+ messages in thread
From: Dave Olson @ 2006-06-23 16:27 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Andrew Morton, ccb, linux-kernel, nickpiggin

On Fri, 23 Jun 2006, Ingo Molnar wrote:
| we really need to figure out what's happening here! Could you re-enable 
| spinlock debugging and try the patch below - do the stalls/lockups still 
| happen?

I'll do that, but the 2.6.16 FC4 kernel already has cpu_relax()
rather than __delay(1), so I'm about 95% certain that this patch
isn't going to help anything.   The NMI watchdog will still fire,
because all we are going to do is wait even longer, etc.

| ===================================================================
| --- linux.orig/lib/spinlock_debug.c
| +++ linux/lib/spinlock_debug.c
| @@ -104,10 +104,10 @@ static void __spin_lock_debug(spinlock_t
|  	u64 i;
|  
|  	for (;;) {
| -		for (i = 0; i < loops_per_jiffy * HZ; i++) {
| +		for (;;) {
|  			if (__raw_spin_trylock(&lock->raw_lock))
|  				return;
| -			__delay(1);
| +			cpu_relax();
|  		}

etc.

Dave Olson
olson@unixfolk.com
http://www.unixfolk.com/dave


end of thread, other threads:[~2006-06-23 16:27 UTC | newest]

Thread overview: 23+ messages
     [not found] <fa.VT2rwoX1M/2O/aO5crhlRDNx4YA@ifi.uio.no>
     [not found] ` <fa.Zp589GPrIISmAAheRowfRgZ1jgs@ifi.uio.no>
2006-06-20  5:35   ` [patch] increase spinlock-debug looping timeouts (write_lock and NMI) Dave Olson
2006-06-20  6:39     ` Andrew Morton
2006-06-20  6:53       ` Dave Jones
2006-06-20  7:37       ` Nick Piggin
2006-06-20  8:03         ` Andrew Morton
2006-06-20  8:33         ` Ingo Molnar
2006-06-20  9:37           ` Nick Piggin
2006-06-20  9:51             ` Ingo Molnar
2006-06-20 10:59               ` Nick Piggin
2006-06-20 13:04                 ` Arjan van de Ven
2006-06-20 13:28                   ` update pci device id cckuo
2006-06-20 14:06                     ` Arjan van de Ven
2006-06-20 13:36                   ` [patch] increase spinlock-debug looping timeouts (write_lock and NMI) Nick Piggin
2006-06-20 14:53                     ` Arjan van de Ven
2006-06-20 15:16                       ` Nick Piggin
2006-06-20 16:27                         ` Nick Piggin
2006-06-20  8:43         ` Arjan van de Ven
2006-06-20 16:11       ` Dave Olson
2006-06-20 21:10         ` Andrew Morton
2006-06-22  5:45 Dave Olson
2006-06-22  5:57 ` Andrew Morton
2006-06-23  7:57 ` Ingo Molnar
  -- strict thread matches above, loose matches on Subject: below --
2006-06-23 16:27 Dave Olson
