From: Scott Ransom <ransom@cfa.harvard.edu>
To: Adam Radford <aradford@3ware.com>
Cc: linux-kernel@vger.kernel.org, Scott Ransom <ransom@cfa.harvard.edu>
Subject: Re: 3ware Escalade problems
Date: Wed, 01 Aug 2001 14:51:22 -0400
Message-ID: <3B684FAA.DB50B3FA@cfa.harvard.edu>
In-Reply-To: <53B208BD9A7FD311881A009027B6BBFB9EADC7@siamese>
Hi Adam,
The drives I am using are Maxtor 81.9G drives (model 98196H8).
I refuse to believe that 3 different disks could fail during the span of
3 days without _something_ causing it -- especially since things have
been working great since February or so. And if I hadn't heard at least
one of the drives scream in agony, I wouldn't have believed that any of
them were really failing... Is it possible that a bad drive could
affect other drives in some way?
Here is the first failure:
Jul 27 23:24:53 munin kernel: 3w-xxxx: tw_interrupt(): Bad response,
status = 0xc7, flags = 0x1b, unit = 0x3.
Jul 27 23:24:53 munin last message repeated 6 times
Jul 27 23:25:08 munin kernel: 3w-xxxx: tw_poll_status(): Flag 0x40000
not found.
Jul 27 23:25:08 munin kernel: 3w-xxxx: tw_aen_drain_queue(): No
attention interrupt for card 1
Jul 27 23:25:08 munin kernel: 3w-xxxx: tw_reset_sequence(): No attention
interrupt for card 1.
Jul 27 23:25:24 munin kernel: 3w-xxxx: tw_poll_status(): Flag 0x40000
not found.
Jul 27 23:25:24 munin kernel: 3w-xxxx: tw_aen_drain_queue(): No
attention interrupt for card 1
Jul 27 23:25:24 munin kernel: 3w-xxxx: tw_reset_sequence(): No attention
interrupt for card 1.
Jul 27 23:25:37 munin kernel: 3w-xxxx: tw_scsi_eh_reset(): Reset
succeeded for card 1.
Jul 27 23:25:47 munin kernel: 3w-xxxx: tw_interrupt(): Bad response,
status = 0xcf, flags = 0x0, unit = 0x3.
Jul 27 23:25:47 munin kernel: scsi: device set offline - not ready or
command retry failed after host reset: host 1 channel 0 id 3 lun 0
Jul 27 23:25:47 munin kernel: 3w-xxxx: tw_interrupt(): Bad response,
status = 0xcf, flags = 0x0, unit = 0x3.
Jul 27 23:25:47 munin kernel: scsi: device set offline - not ready or
command retry failed after host reset: host 1 channel 0 id 3 lun 0
Jul 27 23:25:47 munin kernel: 3w-xxxx: tw_interrupt(): Bad response,
status = 0xcf, flags = 0x0, unit = 0x3.
Jul 27 23:25:47 munin kernel: scsi: device set offline - not ready or
command retry failed after host reset: host 1 channel 0 id 3 lun 0
Jul 27 23:25:47 munin kernel: 3w-xxxx: tw_interrupt(): Bad response,
status = 0xcf, flags = 0x0, unit = 0x3.
Jul 27 23:25:47 munin kernel: scsi: device set offline - not ready or
command retry failed after host reset: host 1 channel 0 id 3 lun 0
Jul 27 23:25:47 munin kernel: 3w-xxxx: tw_interrupt(): Bad response,
status = 0xcf, flags = 0x0, unit = 0x3.
Jul 27 23:25:47 munin kernel: scsi: device set offline - not ready or
command retry failed after host reset: host 1 channel 0 id 3 lun 0
Jul 27 23:25:47 munin kernel: 3w-xxxx: tw_interrupt(): Bad response,
status = 0xcf, flags = 0x0, unit = 0x3.
Jul 27 23:25:47 munin kernel: scsi: device set offline - not ready or
command retry failed after host reset: host 1 channel 0 id 3 lun 0
Jul 27 23:25:47 munin kernel: 3w-xxxx: tw_interrupt(): Bad response,
status = 0xcf, flags = 0x0, unit = 0x3.
Jul 27 23:25:47 munin kernel: scsi: device set offline - not ready or
command retry failed after host reset: host 1 channel 0 id 3 lun 0
followed by a bunch of garbage like the following (I don't know whether
this came from the RAID code, the 3ware code, or something else):
Jul 27 23:25:47 munin kernel: : D:0 D:0 :0 D: D:0 D:0 D:0 D: D:0 D:0
D:0 T:1:00> .01c967 65WD C 0sdIS,3DID5)SK6, K>: :0 0,<40:,S[d0)K<00
v N: N: N: N: N N N N DN:0 DN: N: N:****: el<4> drrrc>
Jul 27 23:25:47 munin kernel: **MP****da1>ck0ea
Jul 27 23:25:47 munin kernel: L5 S853 0: 1:6 2:1 3: DISK<N:6> :6
:6> 6: DI: 7:: 8: ::411: DISK<N:0:412:4>
Jul 27 23:25:47 munin kernel: <13:414:415:4>
Jul 27 23:25:47 munin kernel: 16:417:4>
Jul 27 23:25:47 munin kernel: 1:4>
Jul 27 23:25:47 munin kernel: <20:421:42:423:42:4>25:26:4IS>
Jul 27 23:25:47 munin kernel: 7 :a
Jul 27 23:25:47 munin kernel: 6
Jul 27 23:25:47 munin kernel: <d
Then a different disk "failure" a couple days later...
Jul 31 19:21:16 munin kernel: 3w-xxxx: tw_interrupt(): Bad response,
status = 0xc7, flags = 0x51, unit = 0x4.
Jul 31 19:21:19 munin kernel: 3w-xxxx: tw_scsi_eh_reset(): Reset
succeeded for card 1.
Jul 31 19:21:33 munin kernel: 3w-xxxx: tw_interrupt(): Bad response,
status = 0xc7, flags = 0x51, unit = 0x4.
Jul 31 19:21:33 munin kernel: scsi: device set offline - not ready or
command retry failed after host reset: host 1 channel 0 id 4 lun 0
Jul 31 19:21:33 munin kernel: SCSI disk error : host 1 channel 0 id 4
lun 0 return code = 80000
Jul 31 19:21:33 munin kernel: I/O error: dev 08:41, sector 2362112
And finally a third "failure" today...
Aug 1 12:54:29 munin kernel: 3w-xxxx: tw_interrupt(): Bad response,
status = 0xc7, flags = 0x51, unit = 0x1.
Aug 1 12:54:32 munin kernel: 3w-xxxx: tw_scsi_eh_reset(): Reset
succeeded for card 1.
Aug 1 12:54:45 munin kernel: 3w-xxxx: tw_interrupt(): Bad response,
status = 0xc7, flags = 0x51, unit = 0x1.
Aug 1 12:54:45 munin kernel: scsi: device set offline - not ready or
command retry failed after host reset: host 1 channel 0 id 1 lun 0
Aug 1 12:54:45 munin kernel: SCSI disk error : host 1 channel 0 id 1
lun 0 return code = 80000
Aug 1 12:54:45 munin kernel: I/O error: dev 08:11, sector 158441712
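For reference, the "dev" values in the I/O error lines can be mapped back to
device names. This is only a minimal sketch in Python, and it assumes three
things: that the kernel prints the pair as hexadecimal major:minor, that major
8 is the SCSI disk driver, and that each disk owns 16 minors (the whole disk
plus 15 partitions). The function name decode_scsi_dev is made up for
illustration.

```python
def decode_scsi_dev(dev: str) -> str:
    """Map a 'MAJOR:MINOR' hex pair such as '08:41' to an sdXN name.

    Assumptions (not stated in the log itself): both fields are hex,
    major 8 is the SCSI disk driver, and each disk has 16 minors.
    """
    major_hex, minor_hex = dev.strip().split(":")
    major, minor = int(major_hex, 16), int(minor_hex, 16)
    if major != 8:
        raise ValueError(f"major {major} is not the assumed SCSI disk major (8)")
    # Minor 0-15 belong to the first disk, 16-31 to the second, and so on;
    # the remainder is the partition number (0 means the whole disk).
    disk_index, partition = divmod(minor, 16)
    name = "sd" + chr(ord("a") + disk_index)
    return name if partition == 0 else f"{name}{partition}"


if __name__ == "__main__":
    for dev in ("08:41", "08:11"):
        print(dev, "->", decode_scsi_dev(dev))
```

Under those assumptions 08:41 comes out as sde1 and 08:11 as sdb1, which is
consistent with ids 4 and 1 in the messages above if the units were enumerated
in order starting from sda.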
Scott
> Adam Radford wrote:
>
> Scott,
>
> Several of the 'problems' users are seeing are due to bad IBM 75 Gig drives
> that had contamination during the manufacturing process. Lots of them have
> been recalled but some are still in use. Unfortunately, these drives give
> lots of ECC errors.
>
> The status=c7, flags=51, unit=0x1 means that the drive on unit 1 (which is
> port 1 since you are using software raid) is showing ECC errors during reads.
>
> You didn't mention what kind of drives you have, but in either case, you
> need to replace that drive, IBM or not.
>
> --
> Adam Radford
> Software Engineer
> 3ware, Inc.
>
> -----Original Message-----
> From: Scott Ransom [mailto:ransom@cfa.harvard.edu]
> Sent: Wednesday, August 01, 2001 11:15 AM
> To: linux-kernel@vger.kernel.org; Scott Ransom
> Subject: 3ware Escalade problems
>
> Hello,
>
> After months of running a fileserver with an 8 port 3ware escalade card
> (kernels 2.4.[3457] using reiserfs and software RAID5) I started getting
> problems this weekend.
>
> Over the last three days, when I try to access the drives, after a
> couple minutes I get a drive failure (I even heard a "yelp" from the
> drive during one of them...). But the "failure" has happened to 3 of
> the 8 drives over 3 days -- so unless there is a hardware problem that
> is killing my drives I find it hard to believe that 3 drives really and
> truly failed....
>
> Here is a sample from my syslog of a failure:
>
> 3w-xxxx: tw_interrupt(): Bad response, status = 0xc7, flags = 0x51, unit = 0x1.
> 3w-xxxx: tw_scsi_eh_reset(): Reset succeeded for card 1.
> 3w-xxxx: tw_interrupt(): Bad response, status = 0xc7, flags = 0x51, unit = 0x1.
> scsi: device set offline - not ready or command retry failed after host reset: host 1 channel 0 id 1 lun 0
> SCSI disk error : host 1 channel 0 id 1 lun 0 return code = 80000
> I/O error: dev 08:11, sector 158441712
>
> I've noticed several "issues" with the 3ware cards in the archives. Has
> anyone seen something like this?
>
> Scott
>
> PS: I'm currently running 2.4.7 with the lm-sensors/i2c patches.
>
> --
> Scott M. Ransom Address: Harvard-Smithsonian CfA
> Phone: (617) 496-7908 60 Garden St. MS 10
> email: ransom@cfa.harvard.edu Cambridge, MA 02138
> GPG Fingerprint: 06A9 9553 78BE 16DB 407B FFCA 9BFA B6FF FFD3 2989
--
Scott M. Ransom Address: Harvard-Smithsonian CfA
Phone: (617) 496-7908 60 Garden St. MS 10
email: ransom@cfa.harvard.edu Cambridge, MA 02138
GPG Fingerprint: 06A9 9553 78BE 16DB 407B FFCA 9BFA B6FF FFD3 2989
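PS: To see at a glance which units are reporting errors, the tw_interrupt()
lines can be tallied per unit. The following is a rough sketch in Python, not
anything from the 3ware driver; the helper name tally_bad_responses and the
embedded SAMPLE text are illustrative only. It assumes unquoted syslog text
(no "> " prefixes) and tolerates the line wrapping seen in this mail. It does
not account for "last message repeated N times" lines, so the counts are a
lower bound.

```python
import re
from collections import Counter

# Matches the driver message even when the mailer wrapped it after "Bad response,".
BAD_RESPONSE = re.compile(
    r"tw_interrupt\(\): Bad response,\s+"
    r"status = (0x[0-9a-f]+), flags = (0x[0-9a-f]+), unit = (0x[0-9a-f]+)"
)


def tally_bad_responses(log_text: str) -> Counter:
    """Count 'Bad response' messages keyed by (unit, status, flags)."""
    counts = Counter()
    for status, flags, unit in BAD_RESPONSE.findall(log_text):
        counts[(int(unit, 16), status, flags)] += 1
    return counts


# A short excerpt copied from the log above, wrapped as it appears in the mail.
SAMPLE = """\
Jul 31 19:21:16 munin kernel: 3w-xxxx: tw_interrupt(): Bad response,
status = 0xc7, flags = 0x51, unit = 0x4.
Jul 31 19:21:19 munin kernel: 3w-xxxx: tw_scsi_eh_reset(): Reset
succeeded for card 1.
Jul 31 19:21:33 munin kernel: 3w-xxxx: tw_interrupt(): Bad response,
status = 0xc7, flags = 0x51, unit = 0x4.
Aug 1 12:54:29 munin kernel: 3w-xxxx: tw_interrupt(): Bad response,
status = 0xc7, flags = 0x51, unit = 0x1.
Aug 1 12:54:45 munin kernel: 3w-xxxx: tw_interrupt(): Bad response,
status = 0xc7, flags = 0x51, unit = 0x1.
"""

if __name__ == "__main__":
    for (unit, status, flags), n in sorted(tally_bad_responses(SAMPLE).items()):
        print(f"unit {unit}: status={status} flags={flags} seen {n} time(s)")
```

Keying on the (unit, status, flags) triple keeps the 0xc7/0x51 pattern that
Adam describes separate from the 0xc7/0x1b and 0xcf/0x0 responses seen on
unit 3.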