* 2.6.20.3 AMD64 oops in CFQ code
From: linux @ 2007-03-22 12:38 UTC (permalink / raw)
To: linux-ide, linux-kernel; +Cc: axboe, linux
This is a uniprocessor AMD64 system running software RAID-5 and RAID-10
over multiple PCIe SiI3132 SATA controllers. The hardware has been very
stable for a long time, but has been acting up of late since I upgraded
to 2.6.20.3. ECC memory should preclude the possibility of bit-flip
errors.
Kernel 2.6.20.3 + linuxpps patches (confined to drivers/serial, and not
actually in use as I stole the serial port for a console).
It takes half a day to reproduce the problem, so bisecting would be painful.
BackupPC_dump mostly writes to a large (1.7 TB) ext3 RAID5 partition.
Here are two oopses, a few minutes (16:31, to be precise) apart.
Unusually, it oopsed twice *without* locking up the system. Usually,
I see this followed by an error from drivers/input/keyboard/atkbd.c:
printk(KERN_WARNING "atkbd.c: Spurious %s on %s. "
"Some program might be trying access hardware directly.\n",
emitted at 1 Hz with the keyboard LEDs flashing and the system
unresponsive to keyboard or pings.
(I think it was spurious ACK on serio/input0, but my memory may be faulty.)
If anyone has any suggestions, they'd be gratefully received.
Unable to handle kernel NULL pointer dereference at 0000000000000098 RIP:
[<ffffffff8031504a>] cfq_dispatch_insert+0x18/0x68
PGD 777e9067 PUD 78774067 PMD 0
Oops: 0000 [1]
CPU 0
Modules linked in: ecb
Pid: 2837, comm: BackupPC_dump Not tainted 2.6.20.3-g691f5333 #40
RIP: 0010:[<ffffffff8031504a>] [<ffffffff8031504a>] cfq_dispatch_insert+0x18/0x68
RSP: 0018:ffff8100770bbaf8 EFLAGS: 00010092
RAX: ffff81007fb36c80 RBX: 0000000000000000 RCX: 0000000000000001
RDX: 000000010003e4e7 RSI: 0000000000000000 RDI: 0000000000000000
RBP: ffff81007fb37a00 R08: 00000000ffffffff R09: ffff81005d390298
R10: ffff81007fcb4f80 R11: ffff81007fcb4f80 R12: ffff81007facd280
R13: 0000000000000004 R14: 0000000000000001 R15: 0000000000000000
FS: 00002b322d120d30(0000) GS:ffffffff805de000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000098 CR3: 000000007bcf0000 CR4: 00000000000006e0
Process BackupPC_dump (pid: 2837, threadinfo ffff8100770ba000, task ffff81007fc5d8e0)
Stack: 0000000000000000 ffff8100770f39f0 0000000000000000 0000000000000004
0000000000000001 ffffffff80315253 ffffffff803b2607 ffff81005da2bc40
ffff81007fac3800 ffff81007facd280 ffff81007facd280 ffff81005d390298
Call Trace:
[<ffffffff80315253>] cfq_dispatch_requests+0x152/0x512
[<ffffffff803b2607>] scsi_done+0x0/0x18
[<ffffffff8030d9f1>] elv_next_request+0x137/0x147
[<ffffffff803b7ce0>] scsi_request_fn+0x6a/0x33a
[<ffffffff8024d407>] generic_unplug_device+0xa/0xe
[<ffffffff80407ced>] unplug_slaves+0x5b/0x94
[<ffffffff80223d65>] sync_page+0x0/0x40
[<ffffffff80223d9b>] sync_page+0x36/0x40
[<ffffffff80256d45>] __wait_on_bit_lock+0x36/0x65
[<ffffffff80237496>] __lock_page+0x5e/0x64
[<ffffffff8028061d>] wake_bit_function+0x0/0x23
[<ffffffff802074de>] find_get_page+0xe/0x2d
[<ffffffff8020b38e>] do_generic_mapping_read+0x1c2/0x40d
[<ffffffff8020bd80>] file_read_actor+0x0/0x118
[<ffffffff8021422e>] generic_file_aio_read+0x15c/0x19e
[<ffffffff8020bafa>] do_sync_read+0xc9/0x10c
[<ffffffff80210342>] may_open+0x5b/0x1c6
[<ffffffff802805ef>] autoremove_wake_function+0x0/0x2e
[<ffffffff8020a857>] vfs_read+0xaa/0x152
[<ffffffff8020faf3>] sys_read+0x45/0x6e
[<ffffffff8025041e>] system_call+0x7e/0x83
Code: 4c 8b ae 98 00 00 00 4c 8b 70 08 e8 63 fe ff ff 8b 43 28 4c
RIP [<ffffffff8031504a>] cfq_dispatch_insert+0x18/0x68
RSP <ffff8100770bbaf8>
CR2: 0000000000000098
<1>Unable to handle kernel NULL pointer dereference at 0000000000000098 RIP:
[<ffffffff8031504a>] cfq_dispatch_insert+0x18/0x68
PGD 79bd2067 PUD 789f9067 PMD 0
Oops: 0000 [2]
CPU 0
Modules linked in: ecb
Pid: 2834, comm: BackupPC_dump Not tainted 2.6.20.3-g691f5333 #40
RIP: 0010:[<ffffffff8031504a>] [<ffffffff8031504a>] cfq_dispatch_insert+0x18/0x68
RSP: 0018:ffff8100789b5af8 EFLAGS: 00010092
RAX: ffff81007fb36c80 RBX: 0000000000000000 RCX: 0000000000000001
RDX: 000000010007ac16 RSI: 0000000000000000 RDI: 0000000000000000
RBP: ffff81007fb37a00 R08: 00000000ffffffff R09: ffff810064dd45e0
R10: ffff81007fcb4f80 R11: ffff81007fcb4f80 R12: ffff81007facd280
R13: 0000000000000004 R14: 0000000000000001 R15: 0000000000000000
FS: 00002b0a7c680d30(0000) GS:ffffffff805de000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000098 CR3: 0000000079d36000 CR4: 00000000000006e0
Process BackupPC_dump (pid: 2834, threadinfo ffff8100789b4000, task ffff81007a235140)
Stack: 0000000000000000 ffff81007b9ebbd0 0000000000000000 0000000000000004
0000000000000001 ffffffff80315253 ffffffff803b2607 ffff81000e67ba00
ffff81007fac3800 ffff81007facd280 ffff81007facd280 ffff810064dd45e0
Call Trace:
[<ffffffff80315253>] cfq_dispatch_requests+0x152/0x512
[<ffffffff803b2607>] scsi_done+0x0/0x18
[<ffffffff8030d9f1>] elv_next_request+0x137/0x147
[<ffffffff803b7ce0>] scsi_request_fn+0x6a/0x33a
[<ffffffff8024d407>] generic_unplug_device+0xa/0xe
[<ffffffff80407ced>] unplug_slaves+0x5b/0x94
[<ffffffff80223d65>] sync_page+0x0/0x40
[<ffffffff80223d9b>] sync_page+0x36/0x40
[<ffffffff80256d45>] __wait_on_bit_lock+0x36/0x65
[<ffffffff80237496>] __lock_page+0x5e/0x64
[<ffffffff8028061d>] wake_bit_function+0x0/0x23
[<ffffffff802074de>] find_get_page+0xe/0x2d
[<ffffffff8020b38e>] do_generic_mapping_read+0x1c2/0x40d
[<ffffffff8020bd80>] file_read_actor+0x0/0x118
[<ffffffff8021422e>] generic_file_aio_read+0x15c/0x19e
[<ffffffff8020bafa>] do_sync_read+0xc9/0x10c
[<ffffffff80210342>] may_open+0x5b/0x1c6
[<ffffffff802805ef>] autoremove_wake_function+0x0/0x2e
[<ffffffff8020a857>] vfs_read+0xaa/0x152
[<ffffffff8020faf3>] sys_read+0x45/0x6e
[<ffffffff8025041e>] system_call+0x7e/0x83
Code: 4c 8b ae 98 00 00 00 4c 8b 70 08 e8 63 fe ff ff 8b 43 28 4c
RIP [<ffffffff8031504a>] cfq_dispatch_insert+0x18/0x68
RSP <ffff8100789b5af8>
CR2: 0000000000000098
* Re: 2.6.20.3 AMD64 oops in CFQ code
From: Jens Axboe @ 2007-03-22 18:41 UTC (permalink / raw)
To: linux; +Cc: linux-ide, linux-kernel
On Thu, Mar 22 2007, linux@horizon.com wrote:
> This is a uniprocessor AMD64 system running software RAID-5 and RAID-10
> over multiple PCIe SiI3132 SATA controllers. The hardware has been very
> stable for a long time, but has been acting up of late since I upgraded
> to 2.6.20.3. ECC memory should preclude the possibility of bit-flip
> errors.
>
> Kernel 2.6.20.3 + linuxpps patches (confined to drivers/serial, and not
> actually in use as I stole the serial port for a console).
>
> It takes half a day to reproduce the problem, so bisecting would be painful.
>
> BackupPC_dump mostly writes to a large (1.7 TB) ext3 RAID5 partition.
>
>
> Here are two oopses, a few minutes (16:31, to be precise) apart.
> Unusually, it oopsed twice *without* locking up the system. Usually,
> I see this followed by an error from drivers/input/keyboard/atkbd.c:
> printk(KERN_WARNING "atkbd.c: Spurious %s on %s. "
> "Some program might be trying access hardware directly.\n",
> emitted at 1 Hz with the keyboard LEDs flashing and the system
> unresponsive to keyboard or pings.
> (I think it was spurious ACK on serio/input0, but my memory may be faulty.)
>
>
> If anyone has any suggestions, they'd be gratefully received.
>
>
> Unable to handle kernel NULL pointer dereference at 0000000000000098 RIP:
> [<ffffffff8031504a>] cfq_dispatch_insert+0x18/0x68
> PGD 777e9067 PUD 78774067 PMD 0
> Oops: 0000 [1]
> CPU 0
> Modules linked in: ecb
> Pid: 2837, comm: BackupPC_dump Not tainted 2.6.20.3-g691f5333 #40
> RIP: 0010:[<ffffffff8031504a>] [<ffffffff8031504a>] cfq_dispatch_insert+0x18/0x68
> RSP: 0018:ffff8100770bbaf8 EFLAGS: 00010092
> RAX: ffff81007fb36c80 RBX: 0000000000000000 RCX: 0000000000000001
> RDX: 000000010003e4e7 RSI: 0000000000000000 RDI: 0000000000000000
> RBP: ffff81007fb37a00 R08: 00000000ffffffff R09: ffff81005d390298
> R10: ffff81007fcb4f80 R11: ffff81007fcb4f80 R12: ffff81007facd280
> R13: 0000000000000004 R14: 0000000000000001 R15: 0000000000000000
> FS: 00002b322d120d30(0000) GS:ffffffff805de000(0000) knlGS:0000000000000000
> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 0000000000000098 CR3: 000000007bcf0000 CR4: 00000000000006e0
> Process BackupPC_dump (pid: 2837, threadinfo ffff8100770ba000, task ffff81007fc5d8e0)
> Stack: 0000000000000000 ffff8100770f39f0 0000000000000000 0000000000000004
> 0000000000000001 ffffffff80315253 ffffffff803b2607 ffff81005da2bc40
> ffff81007fac3800 ffff81007facd280 ffff81007facd280 ffff81005d390298
> Call Trace:
> [<ffffffff80315253>] cfq_dispatch_requests+0x152/0x512
> [<ffffffff803b2607>] scsi_done+0x0/0x18
> [<ffffffff8030d9f1>] elv_next_request+0x137/0x147
> [<ffffffff803b7ce0>] scsi_request_fn+0x6a/0x33a
> [<ffffffff8024d407>] generic_unplug_device+0xa/0xe
> [<ffffffff80407ced>] unplug_slaves+0x5b/0x94
> [<ffffffff80223d65>] sync_page+0x0/0x40
> [<ffffffff80223d9b>] sync_page+0x36/0x40
> [<ffffffff80256d45>] __wait_on_bit_lock+0x36/0x65
> [<ffffffff80237496>] __lock_page+0x5e/0x64
> [<ffffffff8028061d>] wake_bit_function+0x0/0x23
> [<ffffffff802074de>] find_get_page+0xe/0x2d
> [<ffffffff8020b38e>] do_generic_mapping_read+0x1c2/0x40d
> [<ffffffff8020bd80>] file_read_actor+0x0/0x118
> [<ffffffff8021422e>] generic_file_aio_read+0x15c/0x19e
> [<ffffffff8020bafa>] do_sync_read+0xc9/0x10c
> [<ffffffff80210342>] may_open+0x5b/0x1c6
> [<ffffffff802805ef>] autoremove_wake_function+0x0/0x2e
> [<ffffffff8020a857>] vfs_read+0xaa/0x152
> [<ffffffff8020faf3>] sys_read+0x45/0x6e
> [<ffffffff8025041e>] system_call+0x7e/0x83
3 (I think) separate instances of this, each involving raid5. Is your
array degraded or fully operational?
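(Something like the following would show the array state; the md device
name is illustrative:)

  cat /proc/mdstat
  # substitute your actual array device for /dev/md5
  mdadm --detail /dev/md5 | grep -E 'State|Devices'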
--
Jens Axboe
* Re: 2.6.20.3 AMD64 oops in CFQ code
From: Aristeu Sergio Rozanski Filho @ 2007-03-22 18:43 UTC (permalink / raw)
To: linux; +Cc: linux-ide, linux-kernel, axboe
> This is a uniprocessor AMD64 system running software RAID-5 and RAID-10
> over multiple PCIe SiI3132 SATA controllers. The hardware has been very
> stable for a long time, but has been acting up of late since I upgraded
> to 2.6.20.3. ECC memory should preclude the possibility of bit-flip
> errors.
Have you tried checking the memory with memtest86?
Do you have the k8_edac module loaded?
If not, I'd recommend using it to get reports of
recoverable/unrecoverable memory errors; check http://bluesmoke.sf.net/ for
the latest version.
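(Roughly, assuming the module builds against your kernel -- the exact sysfs
paths can vary between EDAC versions:)

  modprobe k8_edac
  # correctable / uncorrectable error counts accumulated since boot
  cat /sys/devices/system/edac/mc/mc0/ce_count
  cat /sys/devices/system/edac/mc/mc0/ue_count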
--
Aristeu
* Re: 2.6.20.3 AMD64 oops in CFQ code
From: linux @ 2007-03-23 17:46 UTC (permalink / raw)
To: dan.j.williams, linux-ide, neilb
Cc: cebbert, jens.axboe, linux-kernel, linux-kernel, linux-raid,
linux
As an additional data point, here's a libata problem I'm having trying to
rebuild the array.
I have six identical 400 GB drives (ST3400832AS), and one is giving
me hassles. I've run SMART short and long diagnostics, badblocks, and
Seagate's "seatools" diagnostic software, and none of them found any problems.
It is the only one of the six with a non-zero reallocated sector count
(it's 26).
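(The reallocated count is SMART attribute 5; something like this shows it,
assuming smartmontools and the right device name:)

  smartctl -A /dev/sde | grep -i Reallocated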
Anyway, the drive is partitioned into a 45G RAID-10 part and a 350G RAID-5
part. The RAID-10 part integrated successfully, but the RAID-5 got to
about 60% and then puked:
ata5.00: exception Emask 0x0 SAct 0x1ef SErr 0x0 action 0x2 frozen
ata5.00: cmd 61/c0:00:d2:d0:b9/00:00:1c:00:00/40 tag 0 cdb 0x0 data 98304 out
res 40/00:01:01:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
ata5.00: cmd 61/40:08:92:d1:b9/00:00:1c:00:00/40 tag 1 cdb 0x0 data 32768 out
res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
ata5.00: cmd 61/00:10:d2:d1:b9/01:00:1c:00:00/40 tag 2 cdb 0x0 data 131072 out
res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
ata5.00: cmd 61/00:18:d2:d2:b9/01:00:1c:00:00/40 tag 3 cdb 0x0 data 131072 out
res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
ata5.00: cmd 61/00:28:d2:d3:b9/01:00:1c:00:00/40 tag 5 cdb 0x0 data 131072 out
res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
ata5.00: cmd 61/00:30:d2:d4:b9/01:00:1c:00:00/40 tag 6 cdb 0x0 data 131072 out
res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
ata5.00: cmd 61/00:38:d2:d5:b9/01:00:1c:00:00/40 tag 7 cdb 0x0 data 131072 out
res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
ata5.00: cmd 61/00:40:d2:d6:b9/01:00:1c:00:00/40 tag 8 cdb 0x0 data 131072 out
res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
ata5: soft resetting port
ata5: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
ata5.00: configured for UDMA/100
ata5: EH complete
SCSI device sde: 781422768 512-byte hdwr sectors (400088 MB)
sde: Write Protect is off
SCSI device sde: write cache: enabled, read cache: enabled, doesn't support DPO or FUA
ata5.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
ata5.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0 cdb 0x0 data 0
res 40/00:01:01:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
ata5: soft resetting port
ata5: softreset failed (timeout)
ata5: softreset failed, retrying in 5 secs
ata5: hard resetting port
ata5: softreset failed (timeout)
ata5: follow-up softreset failed, retrying in 5 secs
ata5: hard resetting port
ata5: softreset failed (timeout)
ata5: reset failed, giving up
ata5.00: disabled
ata5: EH complete
sd 4:0:0:0: SCSI error: return code = 0x00040000
end_request: I/O error, dev sde, sector 91795259
md: super_written gets error=-5, uptodate=0
raid10: Disk failure on sde3, disabling device.
Operation continuing on 5 devices
sd 4:0:0:0: SCSI error: return code = 0x00040000
end_request: I/O error, dev sde, sector 481942994
raid5: Disk failure on sde4, disabling device. Operation continuing on 5 devices
sd 4:0:0:0: SCSI error: return code = 0x00040000
end_request: I/O error, dev sde, sector 481944018
md: md5: recovery done.
RAID10 conf printout:
--- wd:5 rd:6
disk 0, wo:0, o:1, dev:sdb3
disk 1, wo:0, o:1, dev:sdc3
disk 2, wo:0, o:1, dev:sdd3
disk 3, wo:1, o:0, dev:sde3
disk 4, wo:0, o:1, dev:sdf3
disk 5, wo:0, o:1, dev:sda3
RAID10 conf printout:
--- wd:5 rd:6
disk 0, wo:0, o:1, dev:sdb3
disk 1, wo:0, o:1, dev:sdc3
disk 2, wo:0, o:1, dev:sdd3
disk 4, wo:0, o:1, dev:sdf3
disk 5, wo:0, o:1, dev:sda3
RAID5 conf printout:
--- rd:6 wd:5
disk 0, o:1, dev:sda4
disk 1, o:1, dev:sdb4
disk 2, o:1, dev:sdc4
disk 3, o:1, dev:sdd4
disk 4, o:0, dev:sde4
disk 5, o:1, dev:sdf4
RAID5 conf printout:
--- rd:6 wd:5
disk 0, o:1, dev:sda4
disk 1, o:1, dev:sdb4
disk 2, o:1, dev:sdc4
disk 3, o:1, dev:sdd4
disk 5, o:1, dev:sdf4
The first error address is just barely inside the RAID-10 part (which ends at
sector 91,795,410), while the second and third errors (at 481,942,994 and
481,944,018) look like where the reconstruction was working.
Anyway, what's annoying is that I can't figure out how to bring the
drive back on line without resetting the box. It's in a hot-swap enclosure,
but power cycling the drive doesn't seem to help. I thought libata hotplug
was working? (SiI3132 card, using the sil24 driver.)
(H'm... after rebooting, reallocated sectors jumped from 26 to 39.
Something is up with that drive.)
* Re: 2.6.20.3 AMD64 oops in CFQ code
From: Tejun Heo @ 2007-04-03 5:49 UTC (permalink / raw)
To: linux
Cc: dan.j.williams, linux-ide, neilb, cebbert, jens.axboe,
linux-kernel, linux-kernel, linux-raid
[resending. my mail service was down for more than a week and this
message didn't get delivered.]
linux@horizon.com wrote:
> > Anyway, what's annoying is that I can't figure out how to bring the
> > drive back on line without resetting the box. It's in a hot-swap enclosure,
> > but power cycling the drive doesn't seem to help. I thought libata hotplug
> > was working? (SiI3132 card, using the sil24 driver.)
Yeah, it's working, but failing resets are considered highly dangerous
(in that the controller status is unknown and may cause something
dangerous like screaming interrupts), and the port is muted after that. The
plan is to handle this with polling hotplug, such that libata tries to
revive the port if a PHY status change is detected by polling. Patches
are available, but they need other things to be resolved before they can
be integrated. I think it'll happen before the summer.
Anyways, you can tell libata to retry the port by manually telling it to
rescan the port (echo - - - > /sys/class/scsi_host/hostX/scan).
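Roughly like this -- host4 below is just an example, pick whichever host
the port maps to:

  ls /sys/class/scsi_host/
  echo "- - -" > /sys/class/scsi_host/host4/scan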
> > (H'm... after rebooting, reallocated sectors jumped from 26 to 39.
> > Something is up with that drive.)
Yeap, seems like a broken drive to me.
Thanks.
--
tejun
* Re: 2.6.20.3 AMD64 oops in CFQ code
From: linux @ 2007-04-03 13:03 UTC (permalink / raw)
To: linux, teheo
Cc: cebbert, dan.j.williams, jens.axboe, linux-ide, linux-kernel,
linux-kernel, linux-raid, neilb
linux@horizon.com wrote:
>> Anyway, what's annoying is that I can't figure out how to bring the
>> drive back on line without resetting the box. It's in a hot-swap enclosure,
>> but power cycling the drive doesn't seem to help. I thought libata hotplug
>> was working? (SiI3132 card, using the sil24 driver.)
> Yeah, it's working but failing resets are considered highly dangerous
> (in that the controller status is unknown and may cause something
> dangerous like screaming interrupts) and port is muted after that. The
> plan is to handle this with polling hotplug such that libata tries to
> revive the port if PHY status change is detected by polling. Patches
> are available but they need other things to resolved to get integrated.
> I think it'll happen before the summer.
> Anyways, you can tell libata to retry the port by manually telling it to
> rescan the port (echo - - - > /sys/class/scsi_host/hostX/scan).
Ah, thank you! I have to admit, that is at least as mysterious as any
Microsoft registry tweak.
>> (H'm... after rebooting, reallocated sectors jumped from 26 to 39.
>> Something is up with that drive.)
> Yeap, seems like a broken drive to me.
Actually, after a few rounds, the reallocated sectors stabilized at 56
and all is working well again. It's like there was a major problem with
error handling.
The problem is that I don't know where the blame lies.
In case it helps, here is the syslog associated with the first failure:
07:17:56: ata5.00: exception Emask 0x0 SAct 0x7fffff SErr 0x0 action 0x2 frozen
07:20:15: ata5.00: cmd 61/30:00:a2:d8:b9/00:00:1c:00:00/40 tag 0 cdb 0x0 data 24576 out
07:20:15: res 40/00:01:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
07:20:15: ata5.00: cmd 60/08:08:9a:7c:b9/00:00:1c:00:00/40 tag 1 cdb 0x0 data 4096 in
07:20:15: res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
07:20:15: ata5.00: cmd 60/08:10:ca:7c:b9/00:00:1c:00:00/40 tag 2 cdb 0x0 data 4096 in
07:20:15: res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
07:20:15: ata5.00: cmd 60/20:18:b2:d6:b9/00:00:1c:00:00/40 tag 3 cdb 0x0 data 16384 in
07:20:15: res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
07:20:15: ata5.00: cmd 61/08:20:fa:a2:b6/00:00:1c:00:00/40 tag 4 cdb 0x0 data 4096 out
07:20:15: res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
07:20:15: ata5.00: cmd 60/40:28:1a:d7:b9/00:00:1c:00:00/40 tag 5 cdb 0x0 data 32768 in
07:20:16: res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
07:20:16: ata5.00: cmd 60/00:30:d2:11:c0/01:00:24:00:00/40 tag 6 cdb 0x0 data 131072 in
07:20:16: res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
07:20:16: ata5.00: cmd 60/28:38:2a:0b:b9/00:00:1c:00:00/40 tag 7 cdb 0x0 data 20480 in
07:20:16: res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
07:20:16: ata5.00: cmd 60/08:40:82:7c:b9/00:00:1c:00:00/40 tag 8 cdb 0x0 data 4096 in
07:20:16: res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
07:20:16: ata5.00: cmd 61/58:48:02:a3:b6/00:00:1c:00:00/40 tag 9 cdb 0x0 data 45056 out
07:20:16: res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
07:20:16: ata5.00: cmd 60/10:50:52:49:ba/00:00:1c:00:00/40 tag 10 cdb 0x0 data 8192 in
07:20:16: res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
07:20:16: ata5.00: cmd 60/08:58:6a:49:ba/00:00:1c:00:00/40 tag 11 cdb 0x0 data 4096 in
07:20:16: res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
07:20:16: ata5.00: cmd 60/08:60:7a:49:ba/00:00:1c:00:00/40 tag 12 cdb 0x0 data 4096 in
07:20:16: res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
07:20:16: ata5.00: cmd 60/10:68:8a:49:ba/00:00:1c:00:00/40 tag 13 cdb 0x0 data 8192 in
07:20:16: res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
07:20:16: ata5.00: cmd 60/08:70:a2:49:ba/00:00:1c:00:00/40 tag 14 cdb 0x0 data 4096 in
07:20:16: res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
07:20:16: ata5.00: cmd 60/08:78:ba:49:ba/00:00:1c:00:00/40 tag 15 cdb 0x0 data 4096 in
07:20:16: res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
07:20:16: ata5.00: cmd 60/18:80:0a:b0:b8/00:00:1c:00:00/40 tag 16 cdb 0x0 data 12288 in
07:20:16: res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
07:20:16: ata5.00: cmd 60/10:88:2a:b0:b8/00:00:1c:00:00/40 tag 17 cdb 0x0 data 8192 in
07:20:16: res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
07:20:16: ata5.00: cmd 60/08:90:52:0a:b9/00:00:1c:00:00/40 tag 18 cdb 0x0 data 4096 in
07:20:16: res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
07:20:16: ata5.00: cmd 61/40:98:d2:d6:b9/00:00:1c:00:00/40 tag 19 cdb 0x0 data 32768 out
07:20:16: res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
07:20:16: ata5.00: cmd 61/40:a0:92:d7:b9/00:00:1c:00:00/40 tag 20 cdb 0x0 data 32768 out
07:20:16: res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
07:20:16: ata5.00: cmd 61/50:a8:52:d8:b9/00:00:1c:00:00/40 tag 21 cdb 0x0 data 40960 out
07:20:16: res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
07:20:16: ata5.00: cmd 61/78:b0:da:d7:b9/00:00:1c:00:00/40 tag 22 cdb 0x0 data 61440 out
07:20:16: res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
07:20:16: ata5: soft resetting port
07:20:16: ata5: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
07:20:16: ata5.00: configured for UDMA/100
07:20:16: ata5: EH complete
07:20:16: SCSI device sde: 781422768 512-byte hdwr sectors (400088 MB)
07:20:16: sde: Write Protect is off
07:20:16: sde: Mode Sense: 00 3a 00 00
07:20:16: SCSI device sde: write cache: enabled, read cache: enabled, doesn't support DPO or FUA
07:20:16: ata5.00: exception Emask 0x0 SAct 0xb0000 SErr 0x0 action 0x2 frozen
07:20:16: ata5.00: cmd 60/00:80:d2:11:c0/01:00:24:00:00/40 tag 16 cdb 0x0 data 131072 in
07:20:16: res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
07:20:16: ata5.00: cmd 60/40:88:1a:d7:b9/00:00:1c:00:00/40 tag 17 cdb 0x0 data 32768 in
07:20:16: res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
07:20:16: ata5.00: cmd 60/20:98:b2:d6:b9/00:00:1c:00:00/40 tag 19 cdb 0x0 data 16384 in
07:20:16: res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
07:20:16: ata5: soft resetting port
07:20:16: ata5: softreset failed (timeout)
07:20:16: ata5: softreset failed, retrying in 5 secs
07:20:16: ata5: hard resetting port
07:20:16: ata5: softreset failed (timeout)
07:20:16: ata5: follow-up softreset failed, retrying in 5 secs
07:20:16: ata5: hard resetting port
07:20:16: ata5: softreset failed (timeout)
07:20:16: ata5: reset failed, giving up
07:20:16: ata5.00: disabled
07:20:16: ata5: EH complete
07:20:16: sd 4:0:0:0: SCSI error: return code = 0x00040000
07:20:16: end_request: I/O error, dev sde, sector 481941170
07:20:16: sd 4:0:0:0: SCSI error: return code = 0x00040000
07:20:16: end_request: I/O error, dev sde, sector 481941274
07:20:16: sd 4:0:0:0: SCSI error: return code = 0x00040000
07:20:16: end_request: I/O error, dev sde, sector 616567250
07:20:16: sd 4:0:0:0: SCSI error: return code = 0x00040000
07:20:16: end_request: I/O error, dev sde, sector 779457538
07:20:16: md: super_written gets error=-5, uptodate=0
07:20:16: raid5: Disk failure on sde4, disabling device. Operation continuing on 5 devices
07:20:16: sd 4:0:0:0: SCSI error: return code = 0x00040000
07:20:16: end_request: I/O error, dev sde, sector 91795259
07:20:16: md: super_written gets error=-5, uptodate=0
07:20:16: raid10: Disk failure on sde3, disabling device.
07:20:16: ^IOperation continuing on 5 devices
07:20:16: RAID5 conf printout:
07:20:16: --- rd:6 wd:5
07:20:16: disk 0, o:1, dev:sda4
07:20:16: disk 1, o:1, dev:sdb4
07:20:16: disk 2, o:1, dev:sdc4
07:20:16: disk 3, o:1, dev:sdd4
07:20:16: disk 4, o:0, dev:sde4
07:20:16: disk 5, o:1, dev:sdf4
07:20:16: RAID5 conf printout:
07:20:16: --- rd:6 wd:5
07:20:16: disk 0, o:1, dev:sda4
07:20:16: disk 1, o:1, dev:sdb4
07:20:16: disk 2, o:1, dev:sdc4
07:20:16: disk 3, o:1, dev:sdd4
07:20:16: disk 5, o:1, dev:sdf4
07:20:16: RAID10 conf printout:
07:20:16: --- wd:5 rd:6
07:20:16: disk 0, wo:0, o:1, dev:sdb3
07:20:16: disk 1, wo:0, o:1, dev:sdc3
07:20:16: disk 2, wo:0, o:1, dev:sdd3
07:20:16: disk 3, wo:1, o:0, dev:sde3
07:20:16: disk 4, wo:0, o:1, dev:sdf3
07:20:16: disk 5, wo:0, o:1, dev:sda3
07:20:16: RAID10 conf printout:
07:20:16: --- wd:5 rd:6
07:20:16: disk 0, wo:0, o:1, dev:sdb3
07:20:16: disk 1, wo:0, o:1, dev:sdc3
07:20:16: disk 2, wo:0, o:1, dev:sdd3
07:20:16: disk 4, wo:0, o:1, dev:sdf3
07:20:16: disk 5, wo:0, o:1, dev:sda3
And another one while recovering. I notice that the reset timeouts seem
to take much longer this time. Or could the above have had similar
timeouts, but syslog was blocked by the error and didn't receive and
timestamp the messages until recovery was complete?
14:49:55: RAID5 conf printout:
14:49:55: --- rd:6 wd:5
14:49:55: disk 0, o:1, dev:sda4
14:49:55: disk 1, o:1, dev:sdb4
14:49:55: disk 2, o:1, dev:sdc4
14:49:55: disk 3, o:1, dev:sdd4
14:49:55: disk 4, o:1, dev:sde4
14:49:55: disk 5, o:1, dev:sdf4
14:49:55: md: recovery of RAID array md5
14:49:55: md: minimum _guaranteed_ speed: 1000 KB/sec/disk.
14:49:55: md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for recovery.
14:49:55: md: using 128k window, over a total of 343831040 blocks.
14:56:13: ata5.00: exception Emask 0x0 SAct 0xff SErr 0x0 action 0x2 frozen
14:56:13: ata5.00: cmd 61/f8:00:da:d6:b9/00:00:1c:00:00/40 tag 0 cdb 0x0 data 126976 out
14:56:13: res 40/00:01:09:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
14:56:13: ata5.00: cmd 61/90:08:d2:2f:ba/00:00:1c:00:00/40 tag 1 cdb 0x0 data 73728 out
14:56:13: res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
14:56:13: ata5.00: cmd 61/70:10:62:30:ba/01:00:1c:00:00/40 tag 2 cdb 0x0 data 188416 out
14:56:13: res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
14:56:13: ata5.00: cmd 61/00:18:d2:31:ba/01:00:1c:00:00/40 tag 3 cdb 0x0 data 131072 out
14:56:13: res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
14:56:13: ata5.00: cmd 61/00:20:d2:32:ba/01:00:1c:00:00/40 tag 4 cdb 0x0 data 131072 out
14:56:13: res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
14:56:13: ata5.00: cmd 61/00:28:d2:33:ba/01:00:1c:00:00/40 tag 5 cdb 0x0 data 131072 out
14:56:13: res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
14:56:13: ata5.00: cmd 61/00:30:d2:34:ba/01:00:1c:00:00/40 tag 6 cdb 0x0 data 131072 out
14:56:13: res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
14:56:13: ata5.00: cmd 61/00:38:d2:35:ba/01:00:1c:00:00/40 tag 7 cdb 0x0 data 131072 out
14:56:13: res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
14:56:13: ata5: soft resetting port
14:56:43: ata5: softreset failed (timeout)
14:56:43: ata5: softreset failed, retrying in 5 secs
14:56:48: ata5: hard resetting port
14:57:20: ata5: softreset failed (timeout)
14:57:20: ata5: follow-up softreset failed, retrying in 5 secs
14:57:25: ata5: hard resetting port
14:57:58: ata5: softreset failed (timeout)
14:57:58: ata5: reset failed, giving up
14:57:58: ata5.00: disabled
14:57:58: ata5: EH complete
14:57:58: sd 4:0:0:0: SCSI error: return code = 0x00040000
14:57:58: end_request: I/O error, dev sde, sector 481965522
14:57:58: raid5: Disk failure on sde4, disabling device. Operation continuing on 5 devices
14:57:58: sd 4:0:0:0: SCSI error: return code = 0x00040000
14:57:58: end_request: I/O error, dev sde, sector 481965266
14:57:58: sd 4:0:0:0: SCSI error: return code = 0x00040000
14:57:58: end_request: I/O error, dev sde, sector 481965010
14:57:58: sd 4:0:0:0: SCSI error: return code = 0x00040000
14:57:58: end_request: I/O error, dev sde, sector 481964754
14:57:58: sd 4:0:0:0: SCSI error: return code = 0x00040000
14:57:58: end_request: I/O error, dev sde, sector 481964498
14:57:58: sd 4:0:0:0: SCSI error: return code = 0x00040000
14:57:58: end_request: I/O error, dev sde, sector 481964130
14:57:58: sd 4:0:0:0: SCSI error: return code = 0x00040000
14:57:58: end_request: I/O error, dev sde, sector 481963986
14:57:58: sd 4:0:0:0: SCSI error: return code = 0x00040000
14:57:58: end_request: I/O error, dev sde, sector 481941210
14:57:58: md: md5: recovery done.
14:57:58: RAID5 conf printout:
14:57:58: --- rd:6 wd:5
14:57:58: disk 0, o:1, dev:sda4
14:57:58: disk 1, o:1, dev:sdb4
14:57:58: disk 2, o:1, dev:sdc4
14:57:58: disk 3, o:1, dev:sdd4
14:57:58: disk 4, o:0, dev:sde4
14:57:58: disk 5, o:1, dev:sdf4
14:57:58: RAID5 conf printout:
14:57:58: --- rd:6 wd:5
14:57:58: disk 0, o:1, dev:sda4
14:57:58: disk 1, o:1, dev:sdb4
14:57:58: disk 2, o:1, dev:sdc4
14:57:58: disk 3, o:1, dev:sdd4
14:57:58: disk 5, o:1, dev:sdf4
* Re: 2.6.20.3 AMD64 oops in CFQ code
From: Tejun Heo @ 2007-04-03 13:11 UTC (permalink / raw)
To: linux
Cc: cebbert, dan.j.williams, jens.axboe, linux-ide, linux-kernel,
linux-kernel, linux-raid, neilb
linux@horizon.com wrote:
> linux@horizon.com wrote:
>>> Anyway, what's annoying is that I can't figure out how to bring the
>>> drive back on line without resetting the box. It's in a hot-swap enclosure,
>>> but power cycling the drive doesn't seem to help. I thought libata hotplug
>>> was working? (SiI3132 card, using the sil24 driver.)
>
>> Yeah, it's working but failing resets are considered highly dangerous
>> (in that the controller status is unknown and may cause something
>> dangerous like screaming interrupts) and port is muted after that. The
>> plan is to handle this with polling hotplug such that libata tries to
>> revive the port if PHY status change is detected by polling. Patches
>> are available but they need other things to resolved to get integrated.
>> I think it'll happen before the summer.
>
>> Anyways, you can tell libata to retry the port by manually telling it to
>> rescan the port (echo - - - > /sys/class/scsi_host/hostX/scan).
>
> Ah, thank you! I have to admit, that is at least as mysterious as any
> Microsoft registry tweak.
Polling hotplug should fix this. I thought I would be able to merge it
much earlier. I apparently was way too optimistic. :-(
>>> (H'm... after rebooting, reallocated sectors jumped from 26 to 39.
>>> Something is up with that drive.)
>
>> Yeap, seems like a broken drive to me.
>
> Actually, after a few rounds, the reallocated sectors stabilized at 56
> and all is working well again. It's like there was a major problem with
> error handling.
>
> The problem is that I don't know where the blame lies.
I'm pretty sure it's the firmware's fault. It's not supposed to go out
for lunch like that even when an internal error occurs.
--
tejun
* Re: 2.6.20.3 AMD64 oops in CFQ code
From: Bill Davidsen @ 2007-04-04 23:22 UTC (permalink / raw)
To: Tejun Heo
Cc: linux, dan.j.williams, linux-ide, neilb, cebbert, jens.axboe,
linux-kernel, linux-kernel, linux-raid
Tejun Heo wrote:
> [resending. my mail service was down for more than a week and this
> message didn't get delivered.]
>
> linux@horizon.com wrote:
>
>>> Anyway, what's annoying is that I can't figure out how to bring the
>>> drive back on line without resetting the box. It's in a hot-swap enclosure,
>>> but power cycling the drive doesn't seem to help. I thought libata hotplug
>>> was working? (SiI3132 card, using the sil24 driver.)
>
> Yeah, it's working but failing resets are considered highly dangerous
> (in that the controller status is unknown and may cause something
> dangerous like screaming interrupts) and port is muted after that. The
> plan is to handle this with polling hotplug such that libata tries to
> revive the port if PHY status change is detected by polling. Patches
> are available but they need other things to resolved to get integrated.
> I think it'll happen before the summer.
>
> Anyways, you can tell libata to retry the port by manually telling it to
> rescan the port (echo - - - > /sys/class/scsi_host/hostX/scan).
>
I won't say that's voodoo, but if I ever did it I'd wipe down my
keyboard with holy water afterward. ;-)
Well, I did save the message in my tricks file, but it sounds like a
last-ditch effort after something goes very wrong.
--
bill davidsen <davidsen@tmr.com>
CTO TMR Associates, Inc
Doing interesting things with small computers since 1979
* Re: 2.6.20.3 AMD64 oops in CFQ code
From: Lee Revell @ 2007-04-05 4:13 UTC (permalink / raw)
To: Bill Davidsen
Cc: Tejun Heo, linux, dan.j.williams, linux-ide, neilb, cebbert,
jens.axboe, linux-kernel, linux-kernel, linux-raid
On 4/4/07, Bill Davidsen <davidsen@tmr.com> wrote:
> I won't say that's voodoo, but if I ever did it I'd wipe down my
> keyboard with holy water afterward. ;-)
>
> Well, I did save the message in my tricks file, but it sounds like a
> last ditch effort after something get very wrong.
Would it really be an impediment to development if the kernel
maintainers simply refused to merge patches that add new sysfs entries
without corresponding documentation?
Lee
* Re: 2.6.20.3 AMD64 oops in CFQ code
From: Tejun Heo @ 2007-04-05 4:29 UTC (permalink / raw)
To: Lee Revell
Cc: Bill Davidsen, linux, dan.j.williams, linux-ide, neilb, cebbert,
jens.axboe, linux-kernel, linux-kernel, linux-raid
Lee Revell wrote:
> On 4/4/07, Bill Davidsen <davidsen@tmr.com> wrote:
>> I won't say that's voodoo, but if I ever did it I'd wipe down my
>> keyboard with holy water afterward. ;-)
>>
>> Well, I did save the message in my tricks file, but it sounds like a
>> last ditch effort after something get very wrong.
Which actually is true. ATA ports failing to reset indicate something
is very wrong. Either the attached device or the controller is broken
and libata shuts down the port to protect the rest of the system from
it. The manual scan requests tell libata to give it one more shot and
polling hotplug can do that automatically. Anyways, this shouldn't
happen unless you have a broken piece of hardware.
> Would it reallty be an impediment to development if the kernel
> maintainers simply refuse to merge patches that add new sysfs entries
> without corresponding documentation?
SCSI host scan nodes have been there for a long time. I think it's
documented somewhere.
--
tejun