* 2.6.20.3 AMD64 oops in CFQ code
@ 2007-03-22 12:38 linux
2007-03-22 18:41 ` Jens Axboe
2007-03-22 18:43 ` Aristeu Sergio Rozanski Filho
0 siblings, 2 replies; 10+ messages in thread
From: linux @ 2007-03-22 12:38 UTC (permalink / raw)
To: linux-ide, linux-kernel; +Cc: axboe, linux
This is a uniprocessor AMD64 system running software RAID-5 and RAID-10
over multiple PCIe SiI3132 SATA controllers. The hardware has been very
stable for a long time, but has been acting up of late since I upgraded
to 2.6.20.3. ECC memory should preclude the possibility of bit-flip
errors.
Kernel 2.6.20.3 + linuxpps patches (confined to drivers/serial, and not
actually in use as I stole the serial port for a console).
It takes half a day to reproduce the problem, so bisecting would be painful.
BackupPC_dump mostly writes to a large (1.7 TB) ext3 RAID5 partition.
Here are two oopses, a few minutes (16:31, to be precise) apart.
Unusually, it oopsed twice *without* locking up the system. Usually,
I see this followed by an error from drivers/input/keyboard/atkbd.c:
printk(KERN_WARNING "atkbd.c: Spurious %s on %s. "
"Some program might be trying access hardware directly.\n",
emitted at 1 Hz with the keyboard LEDs flashing and the system
unresponsive to keyboard or pings.
(I think it was spurious ACK on serio/input0, but my memory may be faulty.)
If anyone has any suggestions, they'd be gratefully received.
Unable to handle kernel NULL pointer dereference at 0000000000000098 RIP:
[<ffffffff8031504a>] cfq_dispatch_insert+0x18/0x68
PGD 777e9067 PUD 78774067 PMD 0
Oops: 0000 [1]
CPU 0
Modules linked in: ecb
Pid: 2837, comm: BackupPC_dump Not tainted 2.6.20.3-g691f5333 #40
RIP: 0010:[<ffffffff8031504a>] [<ffffffff8031504a>] cfq_dispatch_insert+0x18/0x68
RSP: 0018:ffff8100770bbaf8 EFLAGS: 00010092
RAX: ffff81007fb36c80 RBX: 0000000000000000 RCX: 0000000000000001
RDX: 000000010003e4e7 RSI: 0000000000000000 RDI: 0000000000000000
RBP: ffff81007fb37a00 R08: 00000000ffffffff R09: ffff81005d390298
R10: ffff81007fcb4f80 R11: ffff81007fcb4f80 R12: ffff81007facd280
R13: 0000000000000004 R14: 0000000000000001 R15: 0000000000000000
FS: 00002b322d120d30(0000) GS:ffffffff805de000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000098 CR3: 000000007bcf0000 CR4: 00000000000006e0
Process BackupPC_dump (pid: 2837, threadinfo ffff8100770ba000, task ffff81007fc5d8e0)
Stack: 0000000000000000 ffff8100770f39f0 0000000000000000 0000000000000004
0000000000000001 ffffffff80315253 ffffffff803b2607 ffff81005da2bc40
ffff81007fac3800 ffff81007facd280 ffff81007facd280 ffff81005d390298
Call Trace:
[<ffffffff80315253>] cfq_dispatch_requests+0x152/0x512
[<ffffffff803b2607>] scsi_done+0x0/0x18
[<ffffffff8030d9f1>] elv_next_request+0x137/0x147
[<ffffffff803b7ce0>] scsi_request_fn+0x6a/0x33a
[<ffffffff8024d407>] generic_unplug_device+0xa/0xe
[<ffffffff80407ced>] unplug_slaves+0x5b/0x94
[<ffffffff80223d65>] sync_page+0x0/0x40
[<ffffffff80223d9b>] sync_page+0x36/0x40
[<ffffffff80256d45>] __wait_on_bit_lock+0x36/0x65
[<ffffffff80237496>] __lock_page+0x5e/0x64
[<ffffffff8028061d>] wake_bit_function+0x0/0x23
[<ffffffff802074de>] find_get_page+0xe/0x2d
[<ffffffff8020b38e>] do_generic_mapping_read+0x1c2/0x40d
[<ffffffff8020bd80>] file_read_actor+0x0/0x118
[<ffffffff8021422e>] generic_file_aio_read+0x15c/0x19e
[<ffffffff8020bafa>] do_sync_read+0xc9/0x10c
[<ffffffff80210342>] may_open+0x5b/0x1c6
[<ffffffff802805ef>] autoremove_wake_function+0x0/0x2e
[<ffffffff8020a857>] vfs_read+0xaa/0x152
[<ffffffff8020faf3>] sys_read+0x45/0x6e
[<ffffffff8025041e>] system_call+0x7e/0x83
Code: 4c 8b ae 98 00 00 00 4c 8b 70 08 e8 63 fe ff ff 8b 43 28 4c
RIP [<ffffffff8031504a>] cfq_dispatch_insert+0x18/0x68
RSP <ffff8100770bbaf8>
CR2: 0000000000000098
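As a quick cross-check of the oops above (my own decode, not part of the original report): the first bytes on the Code: line, `4c 8b ae 98 00 00 00`, are `mov r13, [rsi+0x98]`, and RSI is 0 in the register dump, so the faulting address should be exactly the 32-bit displacement embedded in the instruction, matching CR2:

```python
# Extract the disp32 from the faulting instruction and compare it to CR2.
# 4c 8b ae is REX.W mov r13, [rsi+disp32]; bytes 3..6 are the displacement.
code = bytes.fromhex("4c8bae98000000")          # from the "Code:" line
disp32 = int.from_bytes(code[3:7], "little")    # x86 immediates are little-endian
cr2 = 0x0000000000000098                        # from the oops

assert disp32 == cr2   # NULL request pointer in RSI, dereferenced at offset 0x98
print(hex(disp32))
```

This is consistent with cfq_dispatch_insert being handed a NULL request and reading a field at offset 0x98 of it.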
<1>Unable to handle kernel NULL pointer dereference at 0000000000000098 RIP:
[<ffffffff8031504a>] cfq_dispatch_insert+0x18/0x68
PGD 79bd2067 PUD 789f9067 PMD 0
Oops: 0000 [2]
CPU 0
Modules linked in: ecb
Pid: 2834, comm: BackupPC_dump Not tainted 2.6.20.3-g691f5333 #40
RIP: 0010:[<ffffffff8031504a>] [<ffffffff8031504a>] cfq_dispatch_insert+0x18/0x68
RSP: 0018:ffff8100789b5af8 EFLAGS: 00010092
RAX: ffff81007fb36c80 RBX: 0000000000000000 RCX: 0000000000000001
RDX: 000000010007ac16 RSI: 0000000000000000 RDI: 0000000000000000
RBP: ffff81007fb37a00 R08: 00000000ffffffff R09: ffff810064dd45e0
R10: ffff81007fcb4f80 R11: ffff81007fcb4f80 R12: ffff81007facd280
R13: 0000000000000004 R14: 0000000000000001 R15: 0000000000000000
FS: 00002b0a7c680d30(0000) GS:ffffffff805de000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000098 CR3: 0000000079d36000 CR4: 00000000000006e0
Process BackupPC_dump (pid: 2834, threadinfo ffff8100789b4000, task ffff81007a235140)
Stack: 0000000000000000 ffff81007b9ebbd0 0000000000000000 0000000000000004
0000000000000001 ffffffff80315253 ffffffff803b2607 ffff81000e67ba00
ffff81007fac3800 ffff81007facd280 ffff81007facd280 ffff810064dd45e0
Call Trace:
[<ffffffff80315253>] cfq_dispatch_requests+0x152/0x512
[<ffffffff803b2607>] scsi_done+0x0/0x18
[<ffffffff8030d9f1>] elv_next_request+0x137/0x147
[<ffffffff803b7ce0>] scsi_request_fn+0x6a/0x33a
[<ffffffff8024d407>] generic_unplug_device+0xa/0xe
[<ffffffff80407ced>] unplug_slaves+0x5b/0x94
[<ffffffff80223d65>] sync_page+0x0/0x40
[<ffffffff80223d9b>] sync_page+0x36/0x40
[<ffffffff80256d45>] __wait_on_bit_lock+0x36/0x65
[<ffffffff80237496>] __lock_page+0x5e/0x64
[<ffffffff8028061d>] wake_bit_function+0x0/0x23
[<ffffffff802074de>] find_get_page+0xe/0x2d
[<ffffffff8020b38e>] do_generic_mapping_read+0x1c2/0x40d
[<ffffffff8020bd80>] file_read_actor+0x0/0x118
[<ffffffff8021422e>] generic_file_aio_read+0x15c/0x19e
[<ffffffff8020bafa>] do_sync_read+0xc9/0x10c
[<ffffffff80210342>] may_open+0x5b/0x1c6
[<ffffffff802805ef>] autoremove_wake_function+0x0/0x2e
[<ffffffff8020a857>] vfs_read+0xaa/0x152
[<ffffffff8020faf3>] sys_read+0x45/0x6e
[<ffffffff8025041e>] system_call+0x7e/0x83
Code: 4c 8b ae 98 00 00 00 4c 8b 70 08 e8 63 fe ff ff 8b 43 28 4c
RIP [<ffffffff8031504a>] cfq_dispatch_insert+0x18/0x68
RSP <ffff8100789b5af8>
CR2: 0000000000000098
* Re: 2.6.20.3 AMD64 oops in CFQ code
  2007-03-22 12:38 2.6.20.3 AMD64 oops in CFQ code linux
@ 2007-03-22 18:41 ` Jens Axboe
  2007-03-22 18:43 ` Aristeu Sergio Rozanski Filho
  1 sibling, 0 replies; 10+ messages in thread
From: Jens Axboe @ 2007-03-22 18:41 UTC (permalink / raw)
To: linux; +Cc: linux-ide, linux-kernel

On Thu, Mar 22 2007, linux@horizon.com wrote:
> This is a uniprocessor AMD64 system running software RAID-5 and RAID-10
> over multiple PCIe SiI3132 SATA controllers. The hardware has been very
> stable for a long time, but has been acting up of late since I upgraded
> to 2.6.20.3. ECC memory should preclude the possibility of bit-flip
> errors.
[...]
> If anyone has any suggestions, they'd be gratefully received.
>
> [first oops quoted in full; snipped]

3 (I think) separate instances of this, each involving raid5. Is your
array degraded or fully operational?

--
Jens Axboe
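Jens's question about whether the array is degraded can be answered from /proc/mdstat: md prints an `[n/m]` pair per array (n configured disks, m active), and m < n means degraded. A small sketch (mine, not from the thread; the sample line is illustrative):

```python
import re

def degraded_arrays(mdstat_text):
    """Return md devices whose active-disk count is below the raid-disk count.

    Looks for the "[n/m]" status that md prints in /proc/mdstat:
    n = configured disks, m = currently active disks.
    """
    degraded = []
    current = None
    for line in mdstat_text.splitlines():
        m = re.match(r"^(md\d+)\s*:", line)
        if m:
            current = m.group(1)
        m = re.search(r"\[(\d+)/(\d+)\]", line)
        if m and current:
            total, active = int(m.group(1)), int(m.group(2))
            if active < total:
                degraded.append(current)
    return degraded

# Illustrative sample: a 6-disk RAID-5 running with one disk failed.
sample = """\
md5 : active raid5 sda4[0] sdb4[1] sdc4[2] sdd4[3] sdf4[5]
      1719155200 blocks level 5, 64k chunk, algorithm 2 [6/5] [UUUU_U]
"""
print(degraded_arrays(sample))
```

`cat /proc/mdstat` on the affected box, fed through this, would list any array running short a disk.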
* Re: 2.6.20.3 AMD64 oops in CFQ code
  2007-03-22 12:38 2.6.20.3 AMD64 oops in CFQ code linux
  2007-03-22 18:41 ` Jens Axboe
@ 2007-03-22 18:43 ` Aristeu Sergio Rozanski Filho
  1 sibling, 0 replies; 10+ messages in thread
From: Aristeu Sergio Rozanski Filho @ 2007-03-22 18:43 UTC (permalink / raw)
To: linux; +Cc: linux-ide, linux-kernel, axboe

> This is a uniprocessor AMD64 system running software RAID-5 and RAID-10
> over multiple PCIe SiI3132 SATA controllers. The hardware has been very
> stable for a long time, but has been acting up of late since I upgraded
> to 2.6.20.3. ECC memory should preclude the possibility of bit-flip
> errors.

Tried checking the memory with memtest86? Do you have the k8_edac module
loaded? If you don't, I'd recommend using it to get reports of
recoverable/unrecoverable memory errors; check http://bluesmoke.sf.net/
for the latest version.

--
Aristeu
[parent not found: <17923.8970.917375.917772@notabene.brown>]
* Re: 2.6.20.3 AMD64 oops in CFQ code
  [not found] <17923.8970.917375.917772@notabene.brown>
@ 2007-03-23 17:46 ` linux
  2007-04-03  5:49 ` Tejun Heo
  0 siblings, 1 reply; 10+ messages in thread
From: linux @ 2007-03-23 17:46 UTC (permalink / raw)
To: dan.j.williams, linux-ide, neilb
Cc: cebbert, jens.axboe, linux-kernel, linux-kernel, linux-raid, linux

As an additional data point, here's a libata problem I'm having trying
to rebuild the array.

I have six identical 400 GB drives (ST3400832AS), and one is giving me
hassles. I've run SMART short and long diagnostics, badblocks, and
Seagate's "seatools" diagnostic software, and none of these find
problems. It is the only one of the six with a non-zero reallocated
sector count (it's 26).

Anyway, the drive is partitioned into a 45G RAID-10 part and a 350G
RAID-5 part. The RAID-10 part integrated successfully, but the RAID-5
got to about 60% and then puked:

ata5.00: exception Emask 0x0 SAct 0x1ef SErr 0x0 action 0x2 frozen
ata5.00: cmd 61/c0:00:d2:d0:b9/00:00:1c:00:00/40 tag 0 cdb 0x0 data 98304 out
         res 40/00:01:01:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
ata5.00: cmd 61/40:08:92:d1:b9/00:00:1c:00:00/40 tag 1 cdb 0x0 data 32768 out
         res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
ata5.00: cmd 61/00:10:d2:d1:b9/01:00:1c:00:00/40 tag 2 cdb 0x0 data 131072 out
         res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
ata5.00: cmd 61/00:18:d2:d2:b9/01:00:1c:00:00/40 tag 3 cdb 0x0 data 131072 out
         res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
ata5.00: cmd 61/00:28:d2:d3:b9/01:00:1c:00:00/40 tag 5 cdb 0x0 data 131072 out
         res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
ata5.00: cmd 61/00:30:d2:d4:b9/01:00:1c:00:00/40 tag 6 cdb 0x0 data 131072 out
         res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
ata5.00: cmd 61/00:38:d2:d5:b9/01:00:1c:00:00/40 tag 7 cdb 0x0 data 131072 out
         res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
ata5.00: cmd 61/00:40:d2:d6:b9/01:00:1c:00:00/40 tag 8 cdb 0x0 data 131072 out
         res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
ata5: soft resetting port
ata5: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
ata5.00: configured for UDMA/100
ata5: EH complete
SCSI device sde: 781422768 512-byte hdwr sectors (400088 MB)
sde: Write Protect is off
SCSI device sde: write cache: enabled, read cache: enabled, doesn't support DPO or FUA
ata5.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
ata5.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0 cdb 0x0 data 0
         res 40/00:01:01:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
ata5: soft resetting port
ata5: softreset failed (timeout)
ata5: softreset failed, retrying in 5 secs
ata5: hard resetting port
ata5: softreset failed (timeout)
ata5: follow-up softreset failed, retrying in 5 secs
ata5: hard resetting port
ata5: softreset failed (timeout)
ata5: reset failed, giving up
ata5.00: disabled
ata5: EH complete
sd 4:0:0:0: SCSI error: return code = 0x00040000
end_request: I/O error, dev sde, sector 91795259
md: super_written gets error=-5, uptodate=0
raid10: Disk failure on sde3, disabling device.
	Operation continuing on 5 devices
sd 4:0:0:0: SCSI error: return code = 0x00040000
end_request: I/O error, dev sde, sector 481942994
raid5: Disk failure on sde4, disabling device. Operation continuing on 5 devices
sd 4:0:0:0: SCSI error: return code = 0x00040000
end_request: I/O error, dev sde, sector 481944018
md: md5: recovery done.
RAID10 conf printout:
 --- wd:5 rd:6
 disk 0, wo:0, o:1, dev:sdb3
 disk 1, wo:0, o:1, dev:sdc3
 disk 2, wo:0, o:1, dev:sdd3
 disk 3, wo:1, o:0, dev:sde3
 disk 4, wo:0, o:1, dev:sdf3
 disk 5, wo:0, o:1, dev:sda3
RAID10 conf printout:
 --- wd:5 rd:6
 disk 0, wo:0, o:1, dev:sdb3
 disk 1, wo:0, o:1, dev:sdc3
 disk 2, wo:0, o:1, dev:sdd3
 disk 4, wo:0, o:1, dev:sdf3
 disk 5, wo:0, o:1, dev:sda3
RAID5 conf printout:
 --- rd:6 wd:5
 disk 0, o:1, dev:sda4
 disk 1, o:1, dev:sdb4
 disk 2, o:1, dev:sdc4
 disk 3, o:1, dev:sdd4
 disk 4, o:0, dev:sde4
 disk 5, o:1, dev:sdf4
RAID5 conf printout:
 --- rd:6 wd:5
 disk 0, o:1, dev:sda4
 disk 1, o:1, dev:sdb4
 disk 2, o:1, dev:sdc4
 disk 3, o:1, dev:sdd4
 disk 5, o:1, dev:sdf4

The first error address is just barely inside the RAID-10 part (which
ends at sector 91,795,410), while the second and third errors (at
481,942,994) look like where the reconstruction was working.

Anyway, what's annoying is that I can't figure out how to bring the
drive back on line without resetting the box. It's in a hot-swap
enclosure, but power cycling the drive doesn't seem to help. I thought
libata hotplug was working? (SiI3132 card, using the sil24 driver.)

(H'm... after rebooting, reallocated sectors jumped from 26 to 39.
Something is up with that drive.)
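For readers decoding the `cmd` lines in logs like the above: libata's error handler prints the taskfile as `command/feature:nsect:lbal:lbam:lbah/hob_feature:hob_nsect:hob_lbal:hob_lbam:hob_lbah/device`, and for NCQ commands (0x60/0x61, READ/WRITE FPDMA QUEUED) the sector count lives in the feature bytes rather than nsect. A rough decoder (my sketch, not a kernel tool), checked against tag 0 of the failure log:

```python
def decode_ncq(taskfile):
    """Decode an NCQ taskfile from a libata EH "cmd aa/bb:cc:..." string.

    Field layout: cmd/feat:nsect:lbal:lbam:lbah/hfeat:hnsect:hlbal:hlbam:hlbah/dev
    For FPDMA QUEUED commands the sector count is feat | (hfeat << 8).
    """
    cmd, rest = taskfile.split("/", 1)
    low, high, dev = rest.split("/")
    feat, nsect, lbal, lbam, lbah = (int(x, 16) for x in low.split(":"))
    hfeat, hnsect, hlbal, hlbam, hlbah = (int(x, 16) for x in high.split(":"))
    # LBA48: the hob (high-order byte) registers hold bits 47..24.
    lba = ((hlbah << 40) | (hlbam << 32) | (hlbal << 24) |
           (lbah << 16) | (lbam << 8) | lbal)
    count = feat | (hfeat << 8)
    return {"cmd": int(cmd, 16), "lba": lba, "sectors": count,
            "bytes": count * 512}

# Tag 0 above: a 98304-byte write near the end of the RAID-5 partition.
tf = decode_ncq("61/c0:00:d2:d0:b9/00:00:1c:00:00/40")
print(tf["lba"], tf["bytes"])
```

The decoded byte count (192 sectors = 98304 bytes) agrees with the `data 98304 out` annotation on that line, and the LBA lands in the same region as the I/O-error sectors reported after the reset failed.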
* Re: 2.6.20.3 AMD64 oops in CFQ code
  2007-03-23 17:46 ` linux
@ 2007-04-03  5:49 ` Tejun Heo
  2007-04-03 13:03 ` linux
  2007-04-04 23:22 ` Bill Davidsen
  0 siblings, 2 replies; 10+ messages in thread
From: Tejun Heo @ 2007-04-03 5:49 UTC (permalink / raw)
To: linux
Cc: dan.j.williams, linux-ide, neilb, cebbert, jens.axboe, linux-kernel, linux-kernel, linux-raid

[resending. my mail service was down for more than a week and this
message didn't get delivered.]

linux@horizon.com wrote:
> Anyway, what's annoying is that I can't figure out how to bring the
> drive back on line without resetting the box. It's in a hot-swap
> enclosure, but power cycling the drive doesn't seem to help. I thought
> libata hotplug was working? (SiI3132 card, using the sil24 driver.)

Yeah, it's working, but failing resets are considered highly dangerous
(in that the controller status is unknown and may cause something
dangerous like screaming interrupts), and the port is muted after that.
The plan is to handle this with polling hotplug, such that libata tries
to revive the port if a PHY status change is detected by polling.
Patches are available but they need other things to be resolved to get
integrated. I think it'll happen before the summer.

Anyways, you can tell libata to retry the port by manually telling it to
rescan the port (echo - - - > /sys/class/scsi_host/hostX/scan).

> (H'm... after rebooting, reallocated sectors jumped from 26 to 39.
> Something is up with that drive.)

Yeap, seems like a broken drive to me.

Thanks.

--
tejun
* Re: 2.6.20.3 AMD64 oops in CFQ code
  2007-04-03  5:49 ` Tejun Heo
@ 2007-04-03 13:03 ` linux
  2007-04-03 13:11 ` Tejun Heo
  1 sibling, 1 reply; 10+ messages in thread
From: linux @ 2007-04-03 13:03 UTC (permalink / raw)
To: linux, teheo
Cc: cebbert, dan.j.williams, jens.axboe, linux-ide, linux-kernel, linux-kernel, linux-raid, neilb

linux@horizon.com wrote:
>> Anyway, what's annoying is that I can't figure out how to bring the
>> drive back on line without resetting the box. It's in a hot-swap
>> enclosure, but power cycling the drive doesn't seem to help. I thought
>> libata hotplug was working? (SiI3132 card, using the sil24 driver.)

> Yeah, it's working but failing resets are considered highly dangerous
> (in that the controller status is unknown and may cause something
> dangerous like screaming interrupts) and port is muted after that. The
> plan is to handle this with polling hotplug such that libata tries to
> revive the port if PHY status change is detected by polling. Patches
> are available but they need other things to resolved to get integrated.
> I think it'll happen before the summer.

> Anyways, you can tell libata to retry the port by manually telling it to
> rescan the port (echo - - - > /sys/class/scsi_host/hostX/scan).

Ah, thank you! I have to admit, that is at least as mysterious as any
Microsoft registry tweak.

>> (H'm... after rebooting, reallocated sectors jumped from 26 to 39.
>> Something is up with that drive.)

> Yeap, seems like a broken drive to me.

Actually, after a few rounds, the reallocated sectors stabilized at 56
and all is working well again. It's like there was a major problem with
error handling.

The problem is that I don't know where the blame lies.
In case it helps, here is the syslog associated with the first failure:

07:17:56: ata5.00: exception Emask 0x0 SAct 0x7fffff SErr 0x0 action 0x2 frozen
07:20:15: ata5.00: cmd 61/30:00:a2:d8:b9/00:00:1c:00:00/40 tag 0 cdb 0x0 data 24576 out
07:20:15:          res 40/00:01:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
07:20:15: ata5.00: cmd 60/08:08:9a:7c:b9/00:00:1c:00:00/40 tag 1 cdb 0x0 data 4096 in
07:20:15:          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
07:20:15: ata5.00: cmd 60/08:10:ca:7c:b9/00:00:1c:00:00/40 tag 2 cdb 0x0 data 4096 in
07:20:15:          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
07:20:15: ata5.00: cmd 60/20:18:b2:d6:b9/00:00:1c:00:00/40 tag 3 cdb 0x0 data 16384 in
07:20:15:          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
07:20:15: ata5.00: cmd 61/08:20:fa:a2:b6/00:00:1c:00:00/40 tag 4 cdb 0x0 data 4096 out
07:20:15:          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
07:20:15: ata5.00: cmd 60/40:28:1a:d7:b9/00:00:1c:00:00/40 tag 5 cdb 0x0 data 32768 in
07:20:16:          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
07:20:16: ata5.00: cmd 60/00:30:d2:11:c0/01:00:24:00:00/40 tag 6 cdb 0x0 data 131072 in
07:20:16:          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
07:20:16: ata5.00: cmd 60/28:38:2a:0b:b9/00:00:1c:00:00/40 tag 7 cdb 0x0 data 20480 in
07:20:16:          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
07:20:16: ata5.00: cmd 60/08:40:82:7c:b9/00:00:1c:00:00/40 tag 8 cdb 0x0 data 4096 in
07:20:16:          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
07:20:16: ata5.00: cmd 61/58:48:02:a3:b6/00:00:1c:00:00/40 tag 9 cdb 0x0 data 45056 out
07:20:16:          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
07:20:16: ata5.00: cmd 60/10:50:52:49:ba/00:00:1c:00:00/40 tag 10 cdb 0x0 data 8192 in
07:20:16:          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
07:20:16: ata5.00: cmd 60/08:58:6a:49:ba/00:00:1c:00:00/40 tag 11 cdb 0x0 data 4096 in
07:20:16:          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
07:20:16: ata5.00: cmd 60/08:60:7a:49:ba/00:00:1c:00:00/40 tag 12 cdb 0x0 data 4096 in
07:20:16:          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
07:20:16: ata5.00: cmd 60/10:68:8a:49:ba/00:00:1c:00:00/40 tag 13 cdb 0x0 data 8192 in
07:20:16:          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
07:20:16: ata5.00: cmd 60/08:70:a2:49:ba/00:00:1c:00:00/40 tag 14 cdb 0x0 data 4096 in
07:20:16:          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
07:20:16: ata5.00: cmd 60/08:78:ba:49:ba/00:00:1c:00:00/40 tag 15 cdb 0x0 data 4096 in
07:20:16:          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
07:20:16: ata5.00: cmd 60/18:80:0a:b0:b8/00:00:1c:00:00/40 tag 16 cdb 0x0 data 12288 in
07:20:16:          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
07:20:16: ata5.00: cmd 60/10:88:2a:b0:b8/00:00:1c:00:00/40 tag 17 cdb 0x0 data 8192 in
07:20:16:          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
07:20:16: ata5.00: cmd 60/08:90:52:0a:b9/00:00:1c:00:00/40 tag 18 cdb 0x0 data 4096 in
07:20:16:          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
07:20:16: ata5.00: cmd 61/40:98:d2:d6:b9/00:00:1c:00:00/40 tag 19 cdb 0x0 data 32768 out
07:20:16:          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
07:20:16: ata5.00: cmd 61/40:a0:92:d7:b9/00:00:1c:00:00/40 tag 20 cdb 0x0 data 32768 out
07:20:16:          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
07:20:16: ata5.00: cmd 61/50:a8:52:d8:b9/00:00:1c:00:00/40 tag 21 cdb 0x0 data 40960 out
07:20:16:          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
07:20:16: ata5.00: cmd 61/78:b0:da:d7:b9/00:00:1c:00:00/40 tag 22 cdb 0x0 data 61440 out
07:20:16:          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
07:20:16: ata5: soft resetting port
07:20:16: ata5: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
07:20:16: ata5.00: configured for UDMA/100
07:20:16: ata5: EH complete
07:20:16: SCSI device sde: 781422768 512-byte hdwr sectors (400088 MB)
07:20:16: sde: Write Protect is off
07:20:16: sde: Mode Sense: 00 3a 00 00
07:20:16: SCSI device sde: write cache: enabled, read cache: enabled, doesn't support DPO or FUA
07:20:16: ata5.00: exception Emask 0x0 SAct 0xb0000 SErr 0x0 action 0x2 frozen
07:20:16: ata5.00: cmd 60/00:80:d2:11:c0/01:00:24:00:00/40 tag 16 cdb 0x0 data 131072 in
07:20:16:          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
07:20:16: ata5.00: cmd 60/40:88:1a:d7:b9/00:00:1c:00:00/40 tag 17 cdb 0x0 data 32768 in
07:20:16:          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
07:20:16: ata5.00: cmd 60/20:98:b2:d6:b9/00:00:1c:00:00/40 tag 19 cdb 0x0 data 16384 in
07:20:16:          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
07:20:16: ata5: soft resetting port
07:20:16: ata5: softreset failed (timeout)
07:20:16: ata5: softreset failed, retrying in 5 secs
07:20:16: ata5: hard resetting port
07:20:16: ata5: softreset failed (timeout)
07:20:16: ata5: follow-up softreset failed, retrying in 5 secs
07:20:16: ata5: hard resetting port
07:20:16: ata5: softreset failed (timeout)
07:20:16: ata5: reset failed, giving up
07:20:16: ata5.00: disabled
07:20:16: ata5: EH complete
07:20:16: sd 4:0:0:0: SCSI error: return code = 0x00040000
07:20:16: end_request: I/O error, dev sde, sector 481941170
07:20:16: sd 4:0:0:0: SCSI error: return code = 0x00040000
07:20:16: end_request: I/O error, dev sde, sector 481941274
07:20:16: sd 4:0:0:0: SCSI error: return code = 0x00040000
07:20:16: end_request: I/O error, dev sde, sector 616567250
07:20:16: sd 4:0:0:0: SCSI error: return code = 0x00040000
07:20:16: end_request: I/O error, dev sde, sector 779457538
07:20:16: md: super_written gets error=-5, uptodate=0
07:20:16: raid5: Disk failure on sde4, disabling device. Operation continuing on 5 devices
07:20:16: sd 4:0:0:0: SCSI error: return code = 0x00040000
07:20:16: end_request: I/O error, dev sde, sector 91795259
07:20:16: md: super_written gets error=-5, uptodate=0
07:20:16: raid10: Disk failure on sde3, disabling device.
07:20:16: 	Operation continuing on 5 devices
07:20:16: RAID5 conf printout:
07:20:16:  --- rd:6 wd:5
07:20:16:  disk 0, o:1, dev:sda4
07:20:16:  disk 1, o:1, dev:sdb4
07:20:16:  disk 2, o:1, dev:sdc4
07:20:16:  disk 3, o:1, dev:sdd4
07:20:16:  disk 4, o:0, dev:sde4
07:20:16:  disk 5, o:1, dev:sdf4
07:20:16: RAID5 conf printout:
07:20:16:  --- rd:6 wd:5
07:20:16:  disk 0, o:1, dev:sda4
07:20:16:  disk 1, o:1, dev:sdb4
07:20:16:  disk 2, o:1, dev:sdc4
07:20:16:  disk 3, o:1, dev:sdd4
07:20:16:  disk 5, o:1, dev:sdf4
07:20:16: RAID10 conf printout:
07:20:16:  --- wd:5 rd:6
07:20:16:  disk 0, wo:0, o:1, dev:sdb3
07:20:16:  disk 1, wo:0, o:1, dev:sdc3
07:20:16:  disk 2, wo:0, o:1, dev:sdd3
07:20:16:  disk 3, wo:1, o:0, dev:sde3
07:20:16:  disk 4, wo:0, o:1, dev:sdf3
07:20:16:  disk 5, wo:0, o:1, dev:sda3
07:20:16: RAID10 conf printout:
07:20:16:  --- wd:5 rd:6
07:20:16:  disk 0, wo:0, o:1, dev:sdb3
07:20:16:  disk 1, wo:0, o:1, dev:sdc3
07:20:16:  disk 2, wo:0, o:1, dev:sdd3
07:20:16:  disk 4, wo:0, o:1, dev:sdf3
07:20:16:  disk 5, wo:0, o:1, dev:sda3

And another one while recovering. I notice that the reset timeouts seem
to take much longer this time. Or could the above have had similar
timeouts, but syslog was blocked by the error and didn't receive and
timestamp the messages until recovery was complete?

14:49:55: RAID5 conf printout:
14:49:55:  --- rd:6 wd:5
14:49:55:  disk 0, o:1, dev:sda4
14:49:55:  disk 1, o:1, dev:sdb4
14:49:55:  disk 2, o:1, dev:sdc4
14:49:55:  disk 3, o:1, dev:sdd4
14:49:55:  disk 4, o:1, dev:sde4
14:49:55:  disk 5, o:1, dev:sdf4
14:49:55: md: recovery of RAID array md5
14:49:55: md: minimum _guaranteed_ speed: 1000 KB/sec/disk.
14:49:55: md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for recovery.
14:49:55: md: using 128k window, over a total of 343831040 blocks.
14:56:13: ata5.00: exception Emask 0x0 SAct 0xff SErr 0x0 action 0x2 frozen
14:56:13: ata5.00: cmd 61/f8:00:da:d6:b9/00:00:1c:00:00/40 tag 0 cdb 0x0 data 126976 out
14:56:13:          res 40/00:01:09:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
14:56:13: ata5.00: cmd 61/90:08:d2:2f:ba/00:00:1c:00:00/40 tag 1 cdb 0x0 data 73728 out
14:56:13:          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
14:56:13: ata5.00: cmd 61/70:10:62:30:ba/01:00:1c:00:00/40 tag 2 cdb 0x0 data 188416 out
14:56:13:          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
14:56:13: ata5.00: cmd 61/00:18:d2:31:ba/01:00:1c:00:00/40 tag 3 cdb 0x0 data 131072 out
14:56:13:          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
14:56:13: ata5.00: cmd 61/00:20:d2:32:ba/01:00:1c:00:00/40 tag 4 cdb 0x0 data 131072 out
14:56:13:          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
14:56:13: ata5.00: cmd 61/00:28:d2:33:ba/01:00:1c:00:00/40 tag 5 cdb 0x0 data 131072 out
14:56:13:          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
14:56:13: ata5.00: cmd 61/00:30:d2:34:ba/01:00:1c:00:00/40 tag 6 cdb 0x0 data 131072 out
14:56:13:          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
14:56:13: ata5.00: cmd 61/00:38:d2:35:ba/01:00:1c:00:00/40 tag 7 cdb 0x0 data 131072 out
14:56:13:          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
14:56:13: ata5: soft resetting port
14:56:43: ata5: softreset failed (timeout)
14:56:43: ata5: softreset failed, retrying in 5 secs
14:56:48: ata5: hard resetting port
14:57:20: ata5: softreset failed (timeout)
14:57:20: ata5: follow-up softreset failed, retrying in 5 secs
14:57:25: ata5: hard resetting port
14:57:58: ata5: softreset failed (timeout)
14:57:58: ata5: reset failed, giving up
14:57:58: ata5.00: disabled
14:57:58: ata5: EH complete
14:57:58: sd 4:0:0:0: SCSI error: return code = 0x00040000
14:57:58: end_request: I/O error, dev sde, sector 481965522
14:57:58: raid5: Disk failure on sde4, disabling device. Operation continuing on 5 devices
14:57:58: sd 4:0:0:0: SCSI error: return code = 0x00040000
14:57:58: end_request: I/O error, dev sde, sector 481965266
14:57:58: sd 4:0:0:0: SCSI error: return code = 0x00040000
14:57:58: end_request: I/O error, dev sde, sector 481965010
14:57:58: sd 4:0:0:0: SCSI error: return code = 0x00040000
14:57:58: end_request: I/O error, dev sde, sector 481964754
14:57:58: sd 4:0:0:0: SCSI error: return code = 0x00040000
14:57:58: end_request: I/O error, dev sde, sector 481964498
14:57:58: sd 4:0:0:0: SCSI error: return code = 0x00040000
14:57:58: end_request: I/O error, dev sde, sector 481964130
14:57:58: sd 4:0:0:0: SCSI error: return code = 0x00040000
14:57:58: end_request: I/O error, dev sde, sector 481963986
14:57:58: sd 4:0:0:0: SCSI error: return code = 0x00040000
14:57:58: end_request: I/O error, dev sde, sector 481941210
14:57:58: md: md5: recovery done.
14:57:58: RAID5 conf printout:
14:57:58:  --- rd:6 wd:5
14:57:58:  disk 0, o:1, dev:sda4
14:57:58:  disk 1, o:1, dev:sdb4
14:57:58:  disk 2, o:1, dev:sdc4
14:57:58:  disk 3, o:1, dev:sdd4
14:57:58:  disk 4, o:0, dev:sde4
14:57:58:  disk 5, o:1, dev:sdf4
14:57:58: RAID5 conf printout:
14:57:58:  --- rd:6 wd:5
14:57:58:  disk 0, o:1, dev:sda4
14:57:58:  disk 1, o:1, dev:sdb4
14:57:58:  disk 2, o:1, dev:sdc4
14:57:58:  disk 3, o:1, dev:sdd4
14:57:58:  disk 5, o:1, dev:sdf4
* Re: 2.6.20.3 AMD64 oops in CFQ code
  2007-04-03 13:03 ` linux
@ 2007-04-03 13:11 ` Tejun Heo
  0 siblings, 0 replies; 10+ messages in thread
From: Tejun Heo @ 2007-04-03 13:11 UTC (permalink / raw)
  To: linux
  Cc: cebbert, dan.j.williams, jens.axboe, linux-ide, linux-kernel,
	linux-kernel, linux-raid, neilb

linux@horizon.com wrote:
> linux@horizon.com wrote:
>>> Anyway, what's annoying is that I can't figure out how to bring the
>>> drive back on line without resetting the box. It's in a hot-swap
>>> enclosure, but power cycling the drive doesn't seem to help. I thought
>>> libata hotplug was working? (SiI3132 card, using the sil24 driver.)
>
>> Yeah, it's working, but failing resets are considered highly dangerous
>> (in that the controller status is unknown and may cause something
>> dangerous like screaming interrupts) and the port is muted after that.
>> The plan is to handle this with polling hotplug, such that libata tries
>> to revive the port if a PHY status change is detected by polling.
>> Patches are available, but they need other things to be resolved to get
>> integrated. I think it'll happen before the summer.
>
>> Anyways, you can tell libata to retry the port by manually telling it to
>> rescan the port (echo - - - > /sys/class/scsi_host/hostX/scan).
>
> Ah, thank you! I have to admit, that is at least as mysterious as any
> Microsoft registry tweak.

Polling hotplug should fix this. I thought I would be able to merge it
much earlier. I apparently was way too optimistic. :-(

>>> (H'm... after rebooting, reallocated sectors jumped from 26 to 39.
>>> Something is up with that drive.)
>
>> Yeap, seems like a broken drive to me.
>
> Actually, after a few rounds, the reallocated sectors stabilized at 56
> and all is working well again. It's like there was a major problem with
> error handling.
>
> The problem is that I don't know where the blame lies.

I'm pretty sure it's the firmware's fault. It's not supposed to go out
for lunch like that even when an internal error occurs.

--
tejun
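[Editor's note: the reallocated-sector figures above come from SMART attribute 5 (Reallocated_Sector_Ct). A minimal sketch for watching that counter, assuming smartmontools is installed; the sample line stands in for real `smartctl -A` output, whose exact layout varies by drive vendor, and /dev/sde is the device from this thread.]

```shell
# Watch SMART attribute 5 (Reallocated_Sector_Ct) on a suspect drive.
# Live usage (assumes smartmontools):
#   smartctl -A /dev/sde | awk '/Reallocated_Sector_Ct/ { print $NF }'
# The sample line below stands in for real smartctl output.
sample='  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       56'

# The raw value is the last field of the attribute line.
count=$(printf '%s\n' "$sample" | awk '/Reallocated_Sector_Ct/ { print $NF }')
echo "$count"    # -> 56
```

A counter that keeps climbing between checks, as it did here (26 -> 39 -> 56), is a stronger warning sign than any single value.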
* Re: 2.6.20.3 AMD64 oops in CFQ code
  2007-04-03  5:49 ` Tejun Heo
  2007-04-03 13:03 ` linux
@ 2007-04-04 23:22 ` Bill Davidsen
  2007-04-05  4:13 ` Lee Revell
  1 sibling, 1 reply; 10+ messages in thread
From: Bill Davidsen @ 2007-04-04 23:22 UTC (permalink / raw)
  To: Tejun Heo
  Cc: linux, dan.j.williams, linux-ide, neilb, cebbert, jens.axboe,
	linux-kernel, linux-kernel, linux-raid

Tejun Heo wrote:
> [resending. my mail service was down for more than a week and this
> message didn't get delivered.]
>
> linux@horizon.com wrote:
>>> Anyway, what's annoying is that I can't figure out how to bring the
>>> drive back on line without resetting the box. It's in a hot-swap
>>> enclosure, but power cycling the drive doesn't seem to help. I thought
>>> libata hotplug was working? (SiI3132 card, using the sil24 driver.)
>
> Yeah, it's working, but failing resets are considered highly dangerous
> (in that the controller status is unknown and may cause something
> dangerous like screaming interrupts) and the port is muted after that.
> The plan is to handle this with polling hotplug, such that libata tries
> to revive the port if a PHY status change is detected by polling.
> Patches are available, but they need other things to be resolved to get
> integrated. I think it'll happen before the summer.
>
> Anyways, you can tell libata to retry the port by manually telling it to
> rescan the port (echo - - - > /sys/class/scsi_host/hostX/scan).

I won't say that's voodoo, but if I ever did it I'd wipe down my
keyboard with holy water afterward. ;-)

Well, I did save the message in my tricks file, but it sounds like a
last-ditch effort after something goes very wrong.

--
bill davidsen <davidsen@tmr.com>
  CTO TMR Associates, Inc
  Doing interesting things with small computers since 1979
* Re: 2.6.20.3 AMD64 oops in CFQ code
  2007-04-04 23:22 ` Bill Davidsen
@ 2007-04-05  4:13 ` Lee Revell
  2007-04-05  4:29 ` Tejun Heo
  0 siblings, 1 reply; 10+ messages in thread
From: Lee Revell @ 2007-04-05 4:13 UTC (permalink / raw)
  To: Bill Davidsen
  Cc: Tejun Heo, linux, dan.j.williams, linux-ide, neilb, cebbert,
	jens.axboe, linux-kernel, linux-kernel, linux-raid

On 4/4/07, Bill Davidsen <davidsen@tmr.com> wrote:
> I won't say that's voodoo, but if I ever did it I'd wipe down my
> keyboard with holy water afterward. ;-)
>
> Well, I did save the message in my tricks file, but it sounds like a
> last-ditch effort after something goes very wrong.

Would it really be an impediment to development if the kernel
maintainers simply refused to merge patches that add new sysfs entries
without corresponding documentation?

Lee
* Re: 2.6.20.3 AMD64 oops in CFQ code
  2007-04-05  4:13 ` Lee Revell
@ 2007-04-05  4:29 ` Tejun Heo
  0 siblings, 0 replies; 10+ messages in thread
From: Tejun Heo @ 2007-04-05 4:29 UTC (permalink / raw)
  To: Lee Revell
  Cc: Bill Davidsen, linux, dan.j.williams, linux-ide, neilb, cebbert,
	jens.axboe, linux-kernel, linux-kernel, linux-raid

Lee Revell wrote:
> On 4/4/07, Bill Davidsen <davidsen@tmr.com> wrote:
>> I won't say that's voodoo, but if I ever did it I'd wipe down my
>> keyboard with holy water afterward. ;-)
>>
>> Well, I did save the message in my tricks file, but it sounds like a
>> last-ditch effort after something goes very wrong.

Which actually is true. ATA ports failing to reset indicate something
is very wrong. Either the attached device or the controller is broken,
and libata shuts down the port to protect the rest of the system from
it. The manual scan request tells libata to give it one more shot, and
polling hotplug can do that automatically. Anyways, this shouldn't
happen unless you have a broken piece of hardware.

> Would it really be an impediment to development if the kernel
> maintainers simply refused to merge patches that add new sysfs entries
> without corresponding documentation?

SCSI host scan nodes have been there for a long time. I think it's
documented somewhere.

--
tejun
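[Editor's note: the trick discussed above can be wrapped so that a mistyped host number fails loudly instead of silently. This is only a convenience sketch around the `echo - - - > /sys/class/scsi_host/hostX/scan` interface Tejun describes; the host number is an assumption you must look up under /sys/class/scsi_host/.]

```shell
# Ask the SCSI midlayer to rescan a host, e.g. after libata has muted a port.
# "- - -" is the wildcard triple (channel target lun): scan everything.
rescan_host() {
    node="/sys/class/scsi_host/host$1/scan"
    if [ -w "$node" ]; then
        echo "- - -" > "$node"
    else
        echo "rescan_host: no writable scan node at $node" >&2
        return 1
    fi
}

# Example (host number 2 is hypothetical -- check /sys/class/scsi_host/):
# rescan_host 2
```

Running it as non-root, or against a host that does not exist, reports the missing/unwritable scan node rather than failing silently.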
Thread overview: 10+ messages
-- links below jump to the message on this page --
2007-03-22 12:38 2.6.20.3 AMD64 oops in CFQ code linux
2007-03-22 18:41 ` Jens Axboe
2007-03-22 18:43 ` Aristeu Sergio Rozanski Filho
[not found] <17923.8970.917375.917772@notabene.brown>
2007-03-23 17:46 ` linux
2007-04-03 5:49 ` Tejun Heo
2007-04-03 13:03 ` linux
2007-04-03 13:11 ` Tejun Heo
2007-04-04 23:22 ` Bill Davidsen
2007-04-05 4:13 ` Lee Revell
2007-04-05 4:29 ` Tejun Heo