public inbox for linux-kernel@vger.kernel.org
* Re: deadlock with 2.6.9
@ 2004-11-13  3:45 Chuck Ebbert
  2004-11-15 20:23 ` Bill Davidsen
  0 siblings, 1 reply; 10+ messages in thread
From: Chuck Ebbert @ 2004-11-13  3:45 UTC
  To: Bernd Eckenfels; +Cc: linux-kernel

On Sun, 07 Nov 2004 at 09:22:14 +0100 Bernd Eckenfels wrote:

> > Get a real RAID controller (3Ware, not some crappy pseudo-RAID junk.)  They are
> > much more reliable than software RAID.
>
> On what sample do you base this claim?
>
> Generally hardware raid sooner or later causes problems (especially if you
> run raid5 in degraded mode or try to rebuild by replacing a disk with one
> carrying a different/old signature). Also, bus hangs are commonly not
> handled very well by hw raid firmware or drivers.

  I had 28 mirror sets on Compaq SMART2/p controllers in one server (four
controllers, two SCSI channels each, seven disks per channel.)  All the disks
on channel A of each controller were mirrored to those on channel B, so even
complete failure of one channel didn't cause a problem.

  Once a disk was marked 'failed' in the controller NVRAM there was no way to
convince it that some newly-inserted disk contained valid data.

  Booting up with SCSI cables connected wrongly (channels A and B swapped) got you
a nice error message informing you of this fact.  Swapping SCSI IDs on different
disks on the same channel was also detected and reported nicely.

  And attempting to boot with a bad cable (bent pin) gave a message saying 'either
power down NOW and check cables or I will mark every disk on that channel as
failed.'

  Of course this system was 100% Compaq; even the disks had Compaq firmware
though the labels said IBM.  And it was very expensive...


--Chuck Ebbert  12-Nov-04  22:42:32

* Re: deadlock with 2.6.9
@ 2004-11-08  1:11 Chuck Ebbert
  0 siblings, 0 replies; 10+ messages in thread
From: Chuck Ebbert @ 2004-11-08  1:11 UTC
  To: Chris Stromsoe; +Cc: linux-kernel

On Sat, 6 Nov 2004 at 23:49:14 -0800 Chris Stromsoe wrote:


> Boot with nosmp or boot a kernel compiled for up?

nosmp should work, but I'd build a non-smp kernel to get rid of the
locking overhead.

>> Get a real RAID controller (3Ware, not some crappy pseudo-RAID junk.) 
>> They are much more reliable than software RAID.
>
> I've had more problems with reliable hardware raid controllers than I have 
> with software raid.  Your mileage may vary.

I've never had trouble with either kind, but recovery has been much less
painful with hardware RAID.  It takes just a few seconds to hotswap a drive.


--Chuck Ebbert  07-Nov-04  08:54:08

* Re: deadlock with 2.6.9
@ 2004-11-07  5:55 Chuck Ebbert
  2004-11-07  7:49 ` Chris Stromsoe
  2004-11-07  8:22 ` Bernd Eckenfels
  0 siblings, 2 replies; 10+ messages in thread
From: Chuck Ebbert @ 2004-11-07  5:55 UTC
  To: Chris Stromsoe; +Cc: linux-kernel

Chris Stromsoe wrote:

> I had a third lockup, this time not related to burning a dvd.  As before, 
> the bulk of the processes that were hung were cron

 Why so many cron processes?  Is this normal on your system, or does it
look like cron keeps spawning processes because it gets no response on the
sockets?

> The box is P3 SMP

 Can you try a uniprocessor kernel?

> syslog logs to a stripe of two mirrors, built with mdadm.

 Get a real RAID controller (3Ware, not some crappy pseudo-RAID junk.)  They are
much more reliable than software RAID.


--Chuck Ebbert  07-Nov-04  00:28:44

* deadlock with 2.6.9
@ 2004-11-02  7:03 Chris Stromsoe
  2004-11-04 17:01 ` Chris Stromsoe
  0 siblings, 1 reply; 10+ messages in thread
From: Chris Stromsoe @ 2004-11-02  7:03 UTC
  To: linux-kernel

The machine collects remote syslog, roughly 1000 packets per second. The 
logs are burned to dvd several times per week.  The hanging may have 
coincided with the completion of one of the log burning sessions.  Before 
the last crash I was using ide-scsi to do the burning.  Since the last 
crash, I switched over to directly using the ide device and have disabled 
ide-scsi.

The machine did not crash.  Remote access with ssh would hang after 
authentication.  Already running processes that didn't fork anything 
responded fine to the network.  I could not fork a shell from a serial 
console.  I was able to use sysrq to pull debugging information.  I was 
also able to kill off all running processes (sysrq+e and sysrq+i), then 
log in from the serial console and restart things.

I had the same problem with 2.6.8.1.

Most of the processes seem to be stuck in schedule_timeout.

The full sysrq and .config are at http://hashbrown.cts.ucla.edu/deadlock/


cbs:~ > lsmod
Module                  Size  Used by
sg                     35040  0
sr_mod                 14884  0
cdrom                  38460  1 sr_mod
ide_scsi               15332  0
e100                   29760  0
bonding                64552  0



telnet> send brk
SysRq : Show Regs

Pid: 0, comm:              swapper
EIP: 0060:[<c01022cf>] CPU: 0
EIP is at default_idle+0x2f/0x40
  EFLAGS: 00000246    Not tainted  (2.6.9)
EAX: 00000000 EBX: c039b000 ECX: c01022a0 EDX: c039b000
ESI: c04080a0 EDI: c04081a0 EBP: c039bfc4 DS: 007b ES: 007b
CR0: 8005003b CR2: bffffd66 CR3: 0fb3f000 CR4: 000006d0
  [<c0102516>] show_regs+0x146/0x170
  [<c0207a81>] __handle_sysrq+0x71/0xf0
  [<c021a5aa>] receive_chars+0x11a/0x230
  [<c021a9bd>] serial8250_interrupt+0xdd/0xe0
  [<c0106c96>] handle_IRQ_event+0x36/0x70
  [<c0107063>] do_IRQ+0xe3/0x1b0
  [<c0104cf0>] common_interrupt+0x18/0x20
  [<c010235b>] cpu_idle+0x3b/0x50
  [<c039cb8b>] start_kernel+0x16b/0x190
  [<c0100211>] 0xc0100211



telnet> send brk
SysRq : Show Memory
Mem-info:
DMA per-cpu:
cpu 0 hot: low 2, high 6, batch 1
cpu 0 cold: low 0, high 2, batch 1
Normal per-cpu:
cpu 0 hot: low 30, high 90, batch 15
cpu 0 cold: low 0, high 30, batch 15
HighMem per-cpu: empty

Free pages:        1348kB (0kB HighMem)
Active:26008 inactive:2118 dirty:10 writeback:0 unstable:0 free:337 
slab:16168 mapped:25917 pagetables:12564
DMA free:172kB min:32kB low:64kB high:96kB active:160kB inactive:48kB 
present:16384kB
protections[]: 0 0 0
Normal free:1176kB min:480kB low:960kB high:1440kB active:103872kB 
inactive:8424kB present:245760kB
protections[]: 0 0 0
HighMem free:0kB min:128kB low:256kB high:384kB active:0kB inactive:0kB 
present:0kB
protections[]: 0 0 0
DMA: 23*4kB 2*8kB 0*16kB 0*32kB 1*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 
0*2048kB 0*4096kB = 172kB
Normal: 54*4kB 8*8kB 14*16kB 5*32kB 2*64kB 1*128kB 1*256kB 0*512kB 
0*1024kB 0*2048kB 0*4096kB = 1176kB
HighMem: empty
Swap cache: add 66873, delete 65915, find 54916/55208, race 0+0
Free swap:       805560kB
65536 pages of RAM
0 pages of HIGHMEM
1786 reserved pages
600111 pages shared
958 pages swap cached
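
The per-zone free lists in the dump above can be cross-checked by hand: each
"count*sizekB" term is a number of free blocks of that size, and the terms
should add up to the total after the "=" sign. A minimal sketch of that
arithmetic follows; the helper name buddy_free_kb is made up for illustration
and is not part of any kernel tool.

```python
import re


def buddy_free_kb(line: str) -> int:
    """Sum the 'count*sizekB' terms of one per-zone free-list line."""
    # The trailing '= 172kB' total has no '*', so it is not matched here.
    terms = re.findall(r"(\d+)\*(\d+)kB", line)
    return sum(int(count) * int(size) for count, size in terms)


# The two zone lines from the SysRq 'Show Memory' output, rejoined.
dma = ("DMA: 23*4kB 2*8kB 0*16kB 0*32kB 1*64kB 0*128kB 0*256kB 0*512kB "
       "0*1024kB 0*2048kB 0*4096kB = 172kB")
normal = ("Normal: 54*4kB 8*8kB 14*16kB 5*32kB 2*64kB 1*128kB 1*256kB 0*512kB "
          "0*1024kB 0*2048kB 0*4096kB = 1176kB")

print(buddy_free_kb(dma), buddy_free_kb(normal))
```

The two sums, 172kB and 1176kB, match the reported per-zone totals, and
together they give the 1348kB shown on the "Free pages" line.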




cron          S C1205F60     0  2614    435          2628  2613 (NOTLB)
cde0bc9c 00000082 00000000 c1205f60 c12068c0 c4dba97c cde0bc94 c014584e
        c4dba940 c57d3468 b7fe7000 00000001 c1205f60 001aff4a d6c5bd65 0003d278
        c8c5e530 cf9aed40 7fffffff 00000001 cde0bcd8 c02f9f04 b7fe7000 00000001
Call Trace:
  [<c02f9f04>] schedule_timeout+0xb4/0xc0
  [<c02ef86e>] unix_wait_for_peer+0xbe/0xd0
  [<c02f028a>] unix_dgram_sendmsg+0x26a/0x500
  [<c0293cab>] sock_sendmsg+0xbb/0xe0
  [<c0295161>] sys_sendto+0xe1/0x100
  [<c02951b2>] sys_send+0x32/0x40
  [<c0295a1a>] sys_socketcall+0x13a/0x250
  [<c0104383>] syscall_call+0x7/0xb
cron          S C1205F60     0  2628    435          2640  2614 (NOTLB)
cc738c9c 00000086 c8c5e930 c1205f60 c82e1b7c c4dba73c cc738c94 00000000
        c4dba700 00000007 cc738cac cebb0690 c1205f60 00289ad9 af39036a 0003d2be
        c8c5ea90 cf9aed40 7fffffff 00000001 cc738cd8 c02f9f04 00000246 000000d0
Call Trace:
  [<c02f9f04>] schedule_timeout+0xb4/0xc0
  [<c02ef86e>] unix_wait_for_peer+0xbe/0xd0
  [<c02f028a>] unix_dgram_sendmsg+0x26a/0x500
  [<c0293cab>] sock_sendmsg+0xbb/0xe0
  [<c0295161>] sys_sendto+0xe1/0x100
  [<c02951b2>] sys_send+0x32/0x40
  [<c0295a1a>] sys_socketcall+0x13a/0x250
  [<c0104383>] syscall_call+0x7/0xb

cron          S C1205F60     0  2640    435          2655  2628 (NOTLB)
c2bfbc9c 00000082 c721b9d0 c1205f60 cd21db7c c4dba07c c2bfbc94 00000000
        c4dba040 00000007 cf06b200 cebb0690 c1205f60 0027ced4 87b993a6 0003d304
        c721bb30 cf9aed40 7fffffff 00000001 c2bfbcd8 c02f9f04 c138f6c0 c9bdea4c
Call Trace:
  [<c02f9f04>] schedule_timeout+0xb4/0xc0
  [<c02ef86e>] unix_wait_for_peer+0xbe/0xd0
  [<c02f028a>] unix_dgram_sendmsg+0x26a/0x500
  [<c0293cab>] sock_sendmsg+0xbb/0xe0
  [<c0295161>] sys_sendto+0xe1/0x100
  [<c02951b2>] sys_send+0x32/0x40
  [<c0295a1a>] sys_socketcall+0x13a/0x250
  [<c0104383>] syscall_call+0x7/0xb
sshd          S C1205F60     0  2653   1410          2654 17528 (NOTLB)
c163dc9c 00000086 c721b470 c1205f60 00000000 00000000 00000000 00000000
        c4dba4c0 00000007 cf06b560 c721af10 c1205f60 00039ddd 67910d34 0003d32a
        c721b5d0 cf9aed40 7fffffff 00000001 c163dcd8 c02f9f04 c138f6c0 c9bdea4c
Call Trace:
  [<c02f9f04>] schedule_timeout+0xb4/0xc0
  [<c02ef86e>] unix_wait_for_peer+0xbe/0xd0
  [<c02f028a>] unix_dgram_sendmsg+0x26a/0x500
  [<c0293cab>] sock_sendmsg+0xbb/0xe0
  [<c0295161>] sys_sendto+0xe1/0x100
  [<c02951b2>] sys_send+0x32/0x40
  [<c0295a1a>] sys_socketcall+0x13a/0x250
  [<c0104383>] syscall_call+0x7/0xb
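
All four traces above sleep in unix_dgram_sendmsg -> unix_wait_for_peer ->
schedule_timeout, which is where a blocking send on a Unix datagram socket
waits when the receiving socket is not draining its queue. The sketch below
reproduces that condition in user space on Linux. It is only an illustration
under assumptions: the socket path, payload size, and function name are made
up, and it does not claim that an unread syslog socket is what hung this
machine. The sender is non-blocking so that the demonstration stops with an
error instead of hanging.

```python
import errno
import os
import socket
import tempfile


def count_sends_until_full(payload: bytes = b"x" * 128,
                           limit: int = 100_000) -> int:
    """Send datagrams to a receiver that never reads; return how many fit."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "log.sock")  # hypothetical stand-in for a log socket
        receiver = socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM)
        sender = socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM)
        try:
            receiver.bind(path)
            sender.setblocking(False)  # fail with EAGAIN instead of sleeping
            sent = 0
            while sent < limit:
                try:
                    sender.sendto(payload, path)
                except OSError as exc:
                    # The kernel refuses the datagram: queue or buffer is full.
                    if exc.errno in (errno.EAGAIN, errno.EWOULDBLOCK,
                                     errno.ENOBUFS):
                        break
                    raise
                sent += 1
            return sent
        finally:
            sender.close()
            receiver.close()


if __name__ == "__main__":
    print("datagrams accepted before the send would block:",
          count_sends_until_full())
```

With a blocking sender the same sendto() call would sleep indefinitely rather
than fail, which is the state the cron and sshd processes are shown in.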





-Chris


Thread overview: 10+ messages
2004-11-13  3:45 deadlock with 2.6.9 Chuck Ebbert
2004-11-15 20:23 ` Bill Davidsen
  -- strict thread matches above, loose matches on Subject: below --
2004-11-08  1:11 Chuck Ebbert
2004-11-07  5:55 Chuck Ebbert
2004-11-07  7:49 ` Chris Stromsoe
2004-11-07  7:53   ` Chris Stromsoe
2004-11-07  8:22 ` Bernd Eckenfels
2004-11-02  7:03 Chris Stromsoe
2004-11-04 17:01 ` Chris Stromsoe
2004-11-06 22:36   ` Chris Stromsoe
