Re: 3ware Escalade problems

From: Scott Ransom <ransom@cfa.harvard.edu>
To: Adam Radford <aradford@3ware.com>
Cc: linux-kernel@vger.kernel.org, Scott Ransom <ransom@cfa.harvard.edu>
Subject: Re: 3ware Escalade problems
Date: Wed, 01 Aug 2001 14:51:22 -0400
Message-ID: <3B684FAA.DB50B3FA@cfa.harvard.edu>
In-Reply-To: <53B208BD9A7FD311881A009027B6BBFB9EADC7@siamese>

Hi Adam,

The drives I am using are Maxtor 81.9G drives (model 98196H8).

I refuse to believe that 3 different disks could fail during the span of
3 days without _something_ causing it -- especially since things have
been working great since February or so.  And if I hadn't heard at least
one of the drives scream in agony, I wouldn't have believed that any of
them were really failing...  Is it possible that a bad drive could
affect other drives in some way?

Here is the first failure:

Jul 27 23:24:53 munin kernel: 3w-xxxx: tw_interrupt(): Bad response,
status = 0xc7, flags = 0x1b, unit = 0x3.
Jul 27 23:24:53 munin last message repeated 6 times
Jul 27 23:25:08 munin kernel: 3w-xxxx: tw_poll_status(): Flag 0x40000
not found.
Jul 27 23:25:08 munin kernel: 3w-xxxx: tw_aen_drain_queue(): No
attention interrupt for card 1
Jul 27 23:25:08 munin kernel: 3w-xxxx: tw_reset_sequence(): No attention
interrupt for card 1.
Jul 27 23:25:24 munin kernel: 3w-xxxx: tw_poll_status(): Flag 0x40000
not found.
Jul 27 23:25:24 munin kernel: 3w-xxxx: tw_aen_drain_queue(): No
attention interrupt for card 1
Jul 27 23:25:24 munin kernel: 3w-xxxx: tw_reset_sequence(): No attention
interrupt for card 1.
Jul 27 23:25:37 munin kernel: 3w-xxxx: tw_scsi_eh_reset(): Reset
succeeded for card 1.
Jul 27 23:25:47 munin kernel: 3w-xxxx: tw_interrupt(): Bad response,
status = 0xcf, flags = 0x0, unit = 0x3.
Jul 27 23:25:47 munin kernel: scsi: device set offline - not ready or
command retry failed after host reset: host 1 channel 0 id 3 lun 0
Jul 27 23:25:47 munin kernel: 3w-xxxx: tw_interrupt(): Bad response,
status = 0xcf, flags = 0x0, unit = 0x3.
Jul 27 23:25:47 munin kernel: scsi: device set offline - not ready or
command retry failed after host reset: host 1 channel 0 id 3 lun 0
Jul 27 23:25:47 munin kernel: 3w-xxxx: tw_interrupt(): Bad response,
status = 0xcf, flags = 0x0, unit = 0x3.
Jul 27 23:25:47 munin kernel: scsi: device set offline - not ready or
command retry failed after host reset: host 1 channel 0 id 3 lun 0
Jul 27 23:25:47 munin kernel: 3w-xxxx: tw_interrupt(): Bad response,
status = 0xcf, flags = 0x0, unit = 0x3.
Jul 27 23:25:47 munin kernel: scsi: device set offline - not ready or
command retry failed after host reset: host 1 channel 0 id 3 lun 0
Jul 27 23:25:47 munin kernel: 3w-xxxx: tw_interrupt(): Bad response,
status = 0xcf, flags = 0x0, unit = 0x3.
Jul 27 23:25:47 munin kernel: scsi: device set offline - not ready or
command retry failed after host reset: host 1 channel 0 id 3 lun 0
Jul 27 23:25:47 munin kernel: 3w-xxxx: tw_interrupt(): Bad response,
status = 0xcf, flags = 0x0, unit = 0x3.
Jul 27 23:25:47 munin kernel: scsi: device set offline - not ready or
command retry failed after host reset: host 1 channel 0 id 3 lun 0
Jul 27 23:25:47 munin kernel: 3w-xxxx: tw_interrupt(): Bad response,
status = 0xcf, flags = 0x0, unit = 0x3.
Jul 27 23:25:47 munin kernel: scsi: device set offline - not ready or
command retry failed after host reset: host 1 channel 0 id 3 lun 0

followed by a bunch of garbage looking like the following (I don't know
if this came from the RAID code, the 3ware code, or something else):

Jul 27 23:25:47 munin kernel:  : D:0 D:0 :0 D: D:0 D:0 D:0 D: D:0 D:0
D:0 T:1:00> .01c967 65WD C 0sdIS,3DID5)SK6,  K>: :0  0,<40:,S[d0)K<00
v   N:  N: N: N: N  N  N  N  DN:0  DN:   N: N:****: el<4> drrrc>
Jul 27 23:25:47 munin kernel: **MP****da1>ck0ea
Jul 27 23:25:47 munin kernel:      L5 S853 0: 1:6 2:1 3:  DISK<N:6> :6
:6> 6:  DI: 7:: 8: ::411:  DISK<N:0:412:4>
Jul 27 23:25:47 munin kernel: <13:414:415:4>
Jul 27 23:25:47 munin kernel:      16:417:4>
Jul 27 23:25:47 munin kernel:   1:4>
Jul 27 23:25:47 munin kernel: <20:421:42:423:42:4>25:26:4IS>
Jul 27 23:25:47 munin kernel: 7 :a
Jul 27 23:25:47 munin kernel: 6
Jul 27 23:25:47 munin kernel: <d

Then a different disk "failure" a couple days later...

Jul 31 19:21:16 munin kernel: 3w-xxxx: tw_interrupt(): Bad response,
status = 0xc7, flags = 0x51, unit = 0x4.
Jul 31 19:21:19 munin kernel: 3w-xxxx: tw_scsi_eh_reset(): Reset
succeeded for card 1.
Jul 31 19:21:33 munin kernel: 3w-xxxx: tw_interrupt(): Bad response,
status = 0xc7, flags = 0x51, unit = 0x4.
Jul 31 19:21:33 munin kernel: scsi: device set offline - not ready or
command retry failed after host reset: host 1 channel 0 id 4 lun 0
Jul 31 19:21:33 munin kernel: SCSI disk error : host 1 channel 0 id 4
lun 0 return code = 80000
Jul 31 19:21:33 munin kernel:  I/O error: dev 08:41, sector 2362112
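
For reference, the "dev 08:41" in the I/O error line is a major:minor
pair printed in hex.  Below is a minimal sketch of decoding it, assuming
the conventional SCSI disk numbering (major 8, 16 minors per disk).  The
function name decode_scsi_dev is hypothetical, and the mapping from the
resulting device name back to a 3ware unit depends on how the disks were
enumerated on this particular machine.

```python
def decode_scsi_dev(dev: str) -> str:
    """Turn a hex 'major:minor' string such as '08:41' into a name like 'sde1'.

    Assumes the classic SCSI disk layout: major 8, 16 minors per disk,
    minor % 16 == 0 meaning the whole disk. Only sda..sdp are covered.
    """
    major_hex, minor_hex = dev.split(":")
    major, minor = int(major_hex, 16), int(minor_hex, 16)
    if major != 8:
        raise ValueError(f"not a first-range SCSI disk major: {major}")
    disk, part = divmod(minor, 16)
    name = "sd" + chr(ord("a") + disk)
    return name if part == 0 else f"{name}{part}"


if __name__ == "__main__":
    print(decode_scsi_dev("08:41"))  # sde1
```

Under those assumptions 08:41 is the first partition on the fifth SCSI
disk, which is at least consistent with "id 4" in the lines above.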

And finally a third "failure" today... 

Aug  1 12:54:29 munin kernel: 3w-xxxx: tw_interrupt(): Bad response,
status = 0xc7, flags = 0x51, unit = 0x1.
Aug  1 12:54:32 munin kernel: 3w-xxxx: tw_scsi_eh_reset(): Reset
succeeded for card 1.
Aug  1 12:54:45 munin kernel: 3w-xxxx: tw_interrupt(): Bad response,
status = 0xc7, flags = 0x51, unit = 0x1.
Aug  1 12:54:45 munin kernel: scsi: device set offline - not ready or
command retry failed after host reset: host 1 channel 0 id 1 lun 0
Aug  1 12:54:45 munin kernel: SCSI disk error : host 1 channel 0 id 1
lun 0 return code = 80000
Aug  1 12:54:45 munin kernel:  I/O error: dev 08:11, sector 158441712
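
To compare the three incidents side by side, the "Bad response" lines
can be tallied per unit.  The following is only a rough sketch: the
regular expression is written against the message format shown in the
excerpts above (not taken from the driver source), the function name
tally_bad_responses is hypothetical, and syslog's "last message repeated
N times" lines are not expanded.

```python
import re
from collections import Counter

# Matches the tw_interrupt() message as it appears in the excerpts above.
# \s+ also matches newlines, so lines wrapped by a mail client still match.
BAD_RESPONSE = re.compile(
    r"tw_interrupt\(\):\s+Bad\s+response,\s+"
    r"status\s*=\s*0x([0-9a-fA-F]+),\s*"
    r"flags\s*=\s*0x([0-9a-fA-F]+),\s*"
    r"unit\s*=\s*0x([0-9a-fA-F]+)"
)


def tally_bad_responses(log_text: str) -> Counter:
    """Count bad responses keyed by (unit, status, flags), all as integers."""
    counts = Counter()
    for status, flags, unit in BAD_RESPONSE.findall(log_text):
        counts[(int(unit, 16), int(status, 16), int(flags, 16))] += 1
    return counts


if __name__ == "__main__":
    sample = """\
Aug  1 12:54:29 munin kernel: 3w-xxxx: tw_interrupt(): Bad response,
status = 0xc7, flags = 0x51, unit = 0x1.
Aug  1 12:54:32 munin kernel: 3w-xxxx: tw_scsi_eh_reset(): Reset
succeeded for card 1.
Aug  1 12:54:45 munin kernel: 3w-xxxx: tw_interrupt(): Bad response,
status = 0xc7, flags = 0x51, unit = 0x1.
"""
    for (unit, status, flags), n in sorted(tally_bad_responses(sample).items()):
        print(f"unit {unit}: status=0x{status:02x} flags=0x{flags:02x} x{n}")
```

Run over the full syslog, this would show at a glance whether units 3,
4 and 1 all report the same status/flags pair or whether the first
failure (flags = 0x1b, then status = 0xcf) is a different kind of event.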


Scott


> Adam Radford wrote:
> 
> Scott,
> 
> Several of the 'problems' users are seeing are due to bad IBM 75 Gig
> drives that had contamination during the manufacturing process.  Lots
> of them have been recalled but some are still in use.  Unfortunately,
> these drives give lots of ECC errors.
> 
> The status=c7, flags=51, unit=0x1 means that the drive on unit 1
> (which is port 1 since you are using software raid) is showing ECC
> errors during reads.
> 
> You didn't mention what kind of drives you have, but in either case,
> you need to replace that drive, IBM or not.
> 
> --
> Adam Radford
> Software Engineer
> 3ware, Inc.
> 
> -----Original Message-----
> From: Scott Ransom [mailto:ransom@cfa.harvard.edu]
> Sent: Wednesday, August 01, 2001 11:15 AM
> To: linux-kernel@vger.kernel.org; Scott Ransom
> Subject: 3ware Escalade problems
> 
> Hello,
> 
> After months of running a fileserver with an 8 port 3ware escalade
> card (kernels 2.4.[3457] using reiserfs and software RAID5) I started
> getting problems this weekend.
> 
> Over the last three days, when I try to access the drives, after a
> couple minutes I get a drive failure (I even heard a "yelp" from the
> drive during one of them...).  But the "failure" has happened to 3 of
> the 8 drives over 3 days -- so unless there is a hardware problem that
> is killing my drives I find it hard to believe that 3 drives really
> and truly failed....
> 
> Here is a sample from my syslog of a failure:
> 
> 3w-xxxx: tw_interrupt(): Bad response, status = 0xc7, flags = 0x51, unit = 0x1.
> 3w-xxxx: tw_scsi_eh_reset(): Reset succeeded for card 1.
> 3w-xxxx: tw_interrupt(): Bad response, status = 0xc7, flags = 0x51, unit = 0x1.
> scsi: device set offline - not ready or command retry failed after host reset: host 1 channel 0 id 1 lun 0
> SCSI disk error : host 1 channel 0 id 1 lun 0 return code = 80000
> I/O error: dev 08:11, sector 158441712
> 
> I've noticed several "issues" with the 3ware cards in the archives.
> Has anyone seen something like this?
> 
> Scott
> 
> PS:  I'm currently running 2.4.7 with the lm-sensors/i2c patches.
> 
> --
> Scott M. Ransom                   Address:  Harvard-Smithsonian CfA
> Phone:  (617) 496-7908                      60 Garden St.  MS 10
> email:  ransom@cfa.harvard.edu              Cambridge, MA  02138
> GPG Fingerprint: 06A9 9553 78BE 16DB 407B  FFCA 9BFA B6FF FFD3 2989

-- 
Scott M. Ransom                   Address:  Harvard-Smithsonian CfA
Phone:  (617) 496-7908                      60 Garden St.  MS 10 
email:  ransom@cfa.harvard.edu              Cambridge, MA  02138
GPG Fingerprint: 06A9 9553 78BE 16DB 407B  FFCA 9BFA B6FF FFD3 2989
