* Controller failing, driver not behaving nicely
@ 2006-06-20 21:02 Christian Iversen
From: Christian Iversen @ 2006-06-20 21:02 UTC
To: linux-scsi
Hello all.
I have a 3Ware 5800 8-port ATA controller running on the 3w-xxxx driver, which
works nicely for the most part.
However, recently the controller has been flaky - it keeps losing all sync
with the machine, then coughs and dies. After a hard reset it works for some
time again.
I was hoping you could tell me if there is anything I should check. Also, I'm
wondering whether the driver is behaving correctly. Shouldn't it try to reset
the card?
Anyway, here are the details:
Kernel: anything from 2.6.10-custom to 2.6.15-debian-sarge-stock
Arch: tested on AMD x86 SMP and UP, with and without highmem
Here's the log output just before the thing goes "boink":
<LOG>
Jun 20 07:34:00 [kernel] 3w-xxxx: scsi2: WARNING: Unit #4: Command (0x28)
timed out, resetting card.
Jun 20 07:34:30 [kernel] 3w-xxxx: scsi2: AEN drain failed, retrying.
- Last output repeated 2 times -
Jun 20 07:35:30 [kernel] RAID5 conf printout:
- Last output repeated 7 times -
Jun 20 07:35:30 [kernel] Buffer I/O error on device md1, logical block
56274327
Jun 20 07:35:35 [kernel] Buffer I/O error on device md1, logical block
56859486
Jun 20 07:35:37 [kernel] Buffer I/O error on device md1, logical block
117796200
Jun 20 07:35:42 [kernel] printk: 1 messages suppressed.
Jun 20 07:35:46 [kernel] printk: 7 messages suppressed.
Jun 20 08:41:37 [kernel] ReiserFS: md1: warning: vs-13070:
reiserfs_read_locked_inode: i/o failure occurred trying to find stat data of
[2 84475 0x0 SD]
Jun 20 08:41:38 [kernel] ReiserFS: md1: warning: vs-13070:
reiserfs_read_locked_inode: i/o failure occurred trying to find stat data of
[25743 118119 0x0 SD]
[lots of reiserfs failures]
</LOG>
Does anybody know what command 0x28 is? Maybe it's one of the 8 drives that is
broken, and the controller is not telling me?
Here's /proc/scsi/scsi
Attached devices:
Host: scsi0 Channel: 00 Id: 03 Lun: 00
Vendor: SEAGATE Model: ST336607LW Rev: 0007
Type: Direct-Access ANSI SCSI revision: 03
Host: scsi0 Channel: 00 Id: 04 Lun: 00
Vendor: SEAGATE Model: ST336607LW Rev: 0007
Type: Direct-Access ANSI SCSI revision: 03
Host: scsi2 Channel: 00 Id: 00 Lun: 00
Vendor: 3ware Model: Logical Disk 0 Rev: 1.2
Type: Direct-Access ANSI SCSI revision: ffffffff
Host: scsi2 Channel: 00 Id: 01 Lun: 00
Vendor: 3ware Model: Logical Disk 1 Rev: 1.2
Type: Direct-Access ANSI SCSI revision: ffffffff
Host: scsi2 Channel: 00 Id: 02 Lun: 00
Vendor: 3ware Model: Logical Disk 2 Rev: 1.2
Type: Direct-Access ANSI SCSI revision: ffffffff
Host: scsi2 Channel: 00 Id: 03 Lun: 00
Vendor: 3ware Model: Logical Disk 3 Rev: 1.2
Type: Direct-Access ANSI SCSI revision: ffffffff
Host: scsi2 Channel: 00 Id: 04 Lun: 00
Vendor: 3ware Model: Logical Disk 4 Rev: 1.2
Type: Direct-Access ANSI SCSI revision: ffffffff
Host: scsi2 Channel: 00 Id: 05 Lun: 00
Vendor: 3ware Model: Logical Disk 5 Rev: 1.2
Type: Direct-Access ANSI SCSI revision: ffffffff
Host: scsi2 Channel: 00 Id: 06 Lun: 00
Vendor: 3ware Model: Logical Disk 6 Rev: 1.2
Type: Direct-Access ANSI SCSI revision: ffffffff
Host: scsi2 Channel: 00 Id: 07 Lun: 00
Vendor: 3ware Model: Logical Disk 7 Rev: 1.2
Type: Direct-Access ANSI SCSI revision: ffffffff
The first two entries are real SCSI disks. The last 8 are the 3ware-controlled
disks, of course. The SCSI subsystem still seems to think they're connected?
I've tried the rescan-scsi-bus.sh script, but it just agrees with
/proc/scsi/scsi, in that it thinks the drives are still connected - and so it
does nothing. Is there a utility that can kick && reconnect a SCSI device?
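For reference, 2.6 kernels built with /proc SCSI support accept
"scsi remove-single-device" and "scsi add-single-device" commands written to
/proc/scsi/scsi, each followed by host, channel, id and lun. The sketch below
only builds and prints those commands; the helper names and the dry_run flag
are made up for illustration. Actually writing them needs root, should not be
done to a disk that md is still using, and may well do nothing for a
controller that has stopped responding.

```python
def proc_scsi_command(action, host, channel, target, lun):
    """Build the text that /proc/scsi/scsi accepts for hot add/remove."""
    if action not in ("add", "remove"):
        raise ValueError("action must be 'add' or 'remove'")
    return "scsi %s-single-device %d %d %d %d" % (
        action, host, channel, target, lun)


def kick_device(host, channel, target, lun, dry_run=True):
    """Remove, then re-add, one device. Only prints unless dry_run is False."""
    commands = [proc_scsi_command(action, host, channel, target, lun)
                for action in ("remove", "add")]
    for command in commands:
        if dry_run:
            print("would write to /proc/scsi/scsi:", command)
        else:
            # Requires root; the kernel parses one command per write.
            with open("/proc/scsi/scsi", "w") as proc_file:
                proc_file.write(command + "\n")
    return commands


# Example: the unit from the log above (Host scsi2, Channel 00, Id 04, Lun 00).
kick_device(2, 0, 4, 0)
```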
I'd be really interested in _any_ comments. I've ordered a couple of cheap
2-port ATA133 controllers in the meantime, which I'm going to have to use in
master-slave configuration. Oh the horror :-/
I'd be willing to test almost anything that doesn't involve erasing data on
the drives.
P.S.: Recently, things often fail with this card, but not always with command
0x28:
zcat /var/log/kernel/* | grep 3w
Jun 8 08:53:42 [kernel] 3w-xxxx: scsi2: WARNING: Unit #2: Command (0x28)
timed out, resetting card.
Jun 8 08:54:12 [kernel] 3w-xxxx: scsi2: AEN drain failed, retrying.
Jun 8 08:55:42 [kernel] 3w-xxxx: scsi2: WARNING: Unit #7: Command (0x2a)
timed out, resetting card.
Jun 8 08:56:12 [kernel] 3w-xxxx: scsi2: AEN drain failed, retrying.
Jun 8 08:57:12 [kernel] 3w-xxxx: scsi2: Controller errors, card not
responding, check all cabling.
Jun 8 21:15:13 [kernel] 3w-xxxx: scsi2: WARNING: Unit #0: Command (0x12)
timed out, resetting card.
Jun 8 21:15:43 [kernel] 3w-xxxx: scsi2: AEN drain failed, retrying.
Jun 17 23:09:22 [kernel] 3w-xxxx: scsi2: AEN drain failed, retrying.
Jun 18 17:33:03 [kernel] 3w-xxxx: scsi2: AEN drain failed, retrying.
Jun 19 21:16:51 [kernel] 3w-xxxx: scsi2: AEN drain failed, retrying.
Jun 20 00:29:29 [kernel] 3w-xxxx: scsi2: AEN drain failed, retrying.
Jun 20 07:34:00 [kernel] 3w-xxxx: scsi2: WARNING: Unit #4: Command (0x28)
timed out, resetting card.
Jun 20 07:34:30 [kernel] 3w-xxxx: scsi2: AEN drain failed, retrying.
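To see at a glance which units and commands are involved, the timeout
warnings can be tallied. This is only a rough sketch: the regular expression
is written against the message format shown above (including the line wrap
after the opcode), and the function name is made up for illustration.

```python
import re
from collections import Counter

# Matches the driver's timeout warning. \s+ tolerates the line wrap between
# "(0x28)" and "timed out" seen in the pasted log.
TIMEOUT_RE = re.compile(
    r"3w-xxxx: scsi(\d+): WARNING: Unit #(\d+): "
    r"Command \((0x[0-9a-fA-F]+)\)\s+timed out"
)


def summarize_timeouts(log_text):
    """Return (per-unit counts, per-opcode counts) of timed-out commands."""
    units = Counter()
    opcodes = Counter()
    for _host, unit, opcode in TIMEOUT_RE.findall(log_text):
        units[int(unit)] += 1
        opcodes[opcode.lower()] += 1
    return units, opcodes


# Timeout entries copied from the grep output above.
SAMPLE = """\
Jun 8 08:53:42 [kernel] 3w-xxxx: scsi2: WARNING: Unit #2: Command (0x28)
timed out, resetting card.
Jun 8 08:54:12 [kernel] 3w-xxxx: scsi2: AEN drain failed, retrying.
Jun 8 08:55:42 [kernel] 3w-xxxx: scsi2: WARNING: Unit #7: Command (0x2a)
timed out, resetting card.
Jun 8 21:15:13 [kernel] 3w-xxxx: scsi2: WARNING: Unit #0: Command (0x12)
timed out, resetting card.
Jun 20 07:34:00 [kernel] 3w-xxxx: scsi2: WARNING: Unit #4: Command (0x28)
timed out, resetting card.
"""

units, opcodes = summarize_timeouts(SAMPLE)
print("timeouts per unit:", dict(units))
print("timeouts per opcode:", dict(opcodes))
```

In this excerpt the four timeouts hit four different units (0, 2, 4 and 7),
so they do not point at one particular drive.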
--
Regards,
Christian Iversen
* Re: Controller failing, driver not behaving nicely
@ 2006-06-20 22:47 ` adam radford
From: adam radford @ 2006-06-20 22:47 UTC
To: Christian Iversen; +Cc: linux-scsi
Christian,
0x28 is the scsi opcode for READ_10, which means the command that
failed was a read command.
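The other opcodes in the logs decode the same way: 0x2a is WRITE_10 and 0x12
is INQUIRY. A small lookup sketch follows. The table is a hand-picked subset
of the SCSI command set using the Linux kernel's constant names, and the
helper name is made up for illustration.

```python
# Hand-picked subset of SCSI operation codes; not a complete table.
SCSI_OPCODES = {
    0x00: "TEST_UNIT_READY",
    0x12: "INQUIRY",
    0x25: "READ_CAPACITY",
    0x28: "READ_10",
    0x2A: "WRITE_10",
}


def opcode_name(code):
    """Translate an opcode (int, or hex string such as '0x28') to a name."""
    if isinstance(code, str):
        code = int(code, 16)  # base 16 accepts an optional "0x" prefix
    return SCSI_OPCODES.get(code, "unknown (0x%02x)" % code)


# The three opcodes that appear in the logs in this thread.
for raw in ("0x28", "0x2a", "0x12"):
    print(raw, "=", opcode_name(raw))
```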
The driver is trying to reset the card, however it is failing the reset.
"AEN drain failed, retrying." means your card is not responding. I would
suggest reseating your card if you have moved it recently, or the card
could be dying in which case you should contact 3ware/AMCC support.
-Adam
* Re: Controller failing, driver not behaving nicely
@ 2006-06-21 1:08 ` Christian Iversen
From: Christian Iversen @ 2006-06-21 1:08 UTC
To: linux-scsi
On Wednesday 21 June 2006 00:47, adam radford wrote:
> Christian,
>
> 0x28 is the scsi opcode for READ_10, which means the command that
> failed was a read command.
Makes sense - that just happened to be what the kernel requested at the time.
> The driver is trying to reset the card, however it is failing the reset.
>
> "AEN drain failed, retrying." means your card is not responding.
I assume it's trying to flush all pending commands or somesuch? What is AEN?
> I would suggest reseating your card if you have moved it recently,
I have - but only because I had this problem before, and I tried to solve it
that way. So that's not going to work, I'm afraid.
> or the card could be dying in which case you should contact 3ware/AMCC
> support.
Yeah, I feared that. Too bad, it was a good card :-(
(It was bought on eBay, so I don't think there's any hope for a
replacement.)
Thank you very much for the help!
Wait - how is it that the driver can't reset the card, but a hardware reset
always can? Is there some more fundamental type of reset possible than what
the driver currently tries? Or would that require resetting everything?
It would be ultra-neat if the driver could just stall for a few seconds (or
even a minute). That'd be so much more bearable, compared to a terabyte-class
RAID5 array just suddenly failing.
Or at least, maybe the driver could try to reset the card a few more times?
Sometimes it seems to work after a couple of attempts.
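To make the idea concrete, here is roughly the behaviour I have in mind,
sketched in Python rather than kernel C. It is not how the 3w-xxxx driver
works; reset_card is a hypothetical callable that returns True when the card
answers, and the attempt count and delay are arbitrary.

```python
import time


def reset_with_retries(reset_card, attempts=5, delay_seconds=30.0,
                       sleep=time.sleep):
    """Call reset_card() up to `attempts` times, pausing between failures.

    Returns the number of the attempt that succeeded, or None if none did.
    """
    for attempt in range(1, attempts + 1):
        if reset_card():
            return attempt
        if attempt < attempts:
            # Stall instead of failing the array straight away.
            sleep(delay_seconds)
    return None


# Simulated card that only responds to the third reset; the sleep is stubbed
# out so the example runs instantly.
outcomes = iter([False, False, True])
print(reset_with_retries(lambda: next(outcomes), sleep=lambda seconds: None))
```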
--
Regards,
Christian Iversen