From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
 id ; Wed, 1 Aug 2001 14:51:45 -0400
Received: (majordomo@vger.kernel.org) by vger.kernel.org
 id ; Wed, 1 Aug 2001 14:51:26 -0400
Received: from presto.harvard.edu ([131.142.9.143]:63382 "HELO presto.harvard.edu")
 by vger.kernel.org with SMTP id ; Wed, 1 Aug 2001 14:51:14 -0400
Message-ID: <3B684FAA.DB50B3FA@cfa.harvard.edu>
Date: Wed, 01 Aug 2001 14:51:22 -0400
From: Scott Ransom
X-Mailer: Mozilla 4.77 [en] (X11; U; Linux 2.4.7 i686)
X-Accept-Language: en
MIME-Version: 1.0
To: Adam Radford
CC: linux-kernel@vger.kernel.org, Scott Ransom
Subject: Re: 3ware Escalade problems
In-Reply-To: <53B208BD9A7FD311881A009027B6BBFB9EADC7@siamese>
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Sender: linux-kernel-owner@vger.kernel.org
X-Mailing-List: linux-kernel@vger.kernel.org

Hi Adam,

The drives I am using are Maxtor 81.9G drives (model 98196H8).

I refuse to believe that 3 different disks could fail during the span
of 3 days without _something_ causing it -- especially since things
have been working great since February or so. And if I hadn't heard at
least one of the drives scream in agony, I wouldn't have believed that
any of them were really failing...

Is it possible that a bad drive could affect other drives in some way?

Here is the first failure:

Jul 27 23:24:53 munin kernel: 3w-xxxx: tw_interrupt(): Bad response, status = 0xc7, flags = 0x1b, unit = 0x3.
Jul 27 23:24:53 munin last message repeated 6 times
Jul 27 23:25:08 munin kernel: 3w-xxxx: tw_poll_status(): Flag 0x40000 not found.
Jul 27 23:25:08 munin kernel: 3w-xxxx: tw_aen_drain_queue(): No attention interrupt for card 1
Jul 27 23:25:08 munin kernel: 3w-xxxx: tw_reset_sequence(): No attention interrupt for card 1.
Jul 27 23:25:24 munin kernel: 3w-xxxx: tw_poll_status(): Flag 0x40000 not found.
Jul 27 23:25:24 munin kernel: 3w-xxxx: tw_aen_drain_queue(): No attention interrupt for card 1
Jul 27 23:25:24 munin kernel: 3w-xxxx: tw_reset_sequence(): No attention interrupt for card 1.
Jul 27 23:25:37 munin kernel: 3w-xxxx: tw_scsi_eh_reset(): Reset succeeded for card 1.
Jul 27 23:25:47 munin kernel: 3w-xxxx: tw_interrupt(): Bad response, status = 0xcf, flags = 0x0, unit = 0x3.
Jul 27 23:25:47 munin kernel: scsi: device set offline - not ready or command retry failed after host reset: host 1 channel 0 id 3 lun 0
Jul 27 23:25:47 munin kernel: 3w-xxxx: tw_interrupt(): Bad response, status = 0xcf, flags = 0x0, unit = 0x3.
Jul 27 23:25:47 munin kernel: scsi: device set offline - not ready or command retry failed after host reset: host 1 channel 0 id 3 lun 0
Jul 27 23:25:47 munin kernel: 3w-xxxx: tw_interrupt(): Bad response, status = 0xcf, flags = 0x0, unit = 0x3.
Jul 27 23:25:47 munin kernel: scsi: device set offline - not ready or command retry failed after host reset: host 1 channel 0 id 3 lun 0
Jul 27 23:25:47 munin kernel: 3w-xxxx: tw_interrupt(): Bad response, status = 0xcf, flags = 0x0, unit = 0x3.
Jul 27 23:25:47 munin kernel: scsi: device set offline - not ready or command retry failed after host reset: host 1 channel 0 id 3 lun 0
Jul 27 23:25:47 munin kernel: 3w-xxxx: tw_interrupt(): Bad response, status = 0xcf, flags = 0x0, unit = 0x3.
Jul 27 23:25:47 munin kernel: scsi: device set offline - not ready or command retry failed after host reset: host 1 channel 0 id 3 lun 0
Jul 27 23:25:47 munin kernel: 3w-xxxx: tw_interrupt(): Bad response, status = 0xcf, flags = 0x0, unit = 0x3.
Jul 27 23:25:47 munin kernel: scsi: device set offline - not ready or command retry failed after host reset: host 1 channel 0 id 3 lun 0
Jul 27 23:25:47 munin kernel: 3w-xxxx: tw_interrupt(): Bad response, status = 0xcf, flags = 0x0, unit = 0x3.
Jul 27 23:25:47 munin kernel: scsi: device set offline - not ready or command retry failed after host reset: host 1 channel 0 id 3 lun 0

followed by a bunch of garbage looking like the following (don't know
if this came from the RAID code or the 3ware code or something else...):

Jul 27 23:25:47 munin kernel: : D:0 D:0 :0 D: D:0 D:0 D:0 D: D:0 D:0 D:0 T:1:00> .01c967 65WD C 0sdIS,3DID5)SK6, K>: :0 0,<40:,S[d0)K<00 v N: N: N: N: N N N N DN:0 DN: N: N:****: el<4> drrrc>
Jul 27 23:25:47 munin kernel: **MP****da1>ck0ea
Jul 27 23:25:47 munin kernel: L5 S853 0: 1:6 2:1 3: DISK :6 :6> 6: DI: 7:: 8: ::411: DISK
Jul 27 23:25:47 munin kernel: <13:414:415:4>
Jul 27 23:25:47 munin kernel: 16:417:4>
Jul 27 23:25:47 munin kernel: 1:4>
Jul 27 23:25:47 munin kernel: <20:421:42:423:42:4>25:26:4IS>
Jul 27 23:25:47 munin kernel: 7 :a
Jul 27 23:25:47 munin kernel: 6
Jul 27 23:25:47 munin kernel:

Adam Radford wrote:
>
> Scott,
>
> Several of the 'problems' users are seeing are due to bad IBM 75 Gig
> drives that had contamination during the manufacturing process. Lots
> of them have been recalled but some are still in use. Unfortunately,
> these drives give lots of ECC errors.
>
> The status=c7, flags=51, unit=0x1 means that the drive on unit 1
> (which is port 1 since you are using software raid) is showing ECC
> errors during reads.
>
> You didn't mention what kind of drives you have, but in either case,
> you need to replace that drive, IBM or not.
>
> --
> Adam Radford
> Software Engineer
> 3ware, Inc.
>
> -----Original Message-----
> From: Scott Ransom [mailto:ransom@cfa.harvard.edu]
> Sent: Wednesday, August 01, 2001 11:15 AM
> To: linux-kernel@vger.kernel.org; Scott Ransom
> Subject: 3ware Escalade problems
>
> Hello,
>
> After months of running a fileserver with an 8 port 3ware escalade card
> (kernels 2.4.[3457] using reiserfs and software RAID5) I started getting
> problems this weekend.
>
> Over the last three days, when I try to access the drives, after a
> couple minutes I get a drive failure (I even heard a "yelp" from the
> drive during one of them...). But the "failure" has happened to 3 of
> the 8 drives over 3 days -- so unless there is a hardware problem that
> is killing my drives I find it hard to believe that 3 drives really
> and truly failed....
>
> Here is a sample from my syslog of a failure:
>
> 3w-xxxx: tw_interrupt(): Bad response, status = 0xc7, flags = 0x51, unit = 0x1.
> 3w-xxxx: tw_scsi_eh_reset(): Reset succeeded for card 1.
> 3w-xxxx: tw_interrupt(): Bad response, status = 0xc7, flags = 0x51, unit = 0x1.
> scsi: device set offline - not ready or command retry failed after host reset: host 1 channel 0 id 1 lun 0
> SCSI disk error : host 1 channel 0 id 1 lun 0 return code = 80000
> I/O error: dev 08:11, sector 158441712
>
> I've noticed several "issues" with the 3ware cards in the archives.
> Has anyone seen something like this?
>
> Scott
>
> PS: I'm currently running 2.4.7 with the lm-sensors/i2c patches.

--
Scott M. Ransom                   Address:  Harvard-Smithsonian CfA
Phone:  (617) 496-7908                      60 Garden St.  MS 10
email:  ransom@cfa.harvard.edu              Cambridge, MA  02138
GPG Fingerprint: 06A9 9553 78BE 16DB 407B  FFCA 9BFA B6FF FFD3 2989
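
For anyone comparing failures across ports in a syslog like the one above, here is a minimal Python sketch that tallies the driver's "Bad response" lines by unit, status and flags. The function name tally_bad_responses and the sample list are hypothetical, chosen only for illustration; the regular expression is written against the message format shown in this mail and is assumed, not confirmed, to match other versions of the 3w-xxxx driver. It does not expand syslog's "last message repeated N times" lines, so the counts are a lower bound.

```python
import re
from collections import Counter

# Matches the "Bad response" message exactly as it appears in the syslog
# excerpts in this mail; other driver versions may word it differently.
BAD_RESPONSE = re.compile(
    r"3w-xxxx: tw_interrupt\(\): Bad response, "
    r"status = (0x[0-9a-fA-F]+), "
    r"flags = (0x[0-9a-fA-F]+), "
    r"unit = (0x[0-9a-fA-F]+)"
)


def tally_bad_responses(lines):
    """Count "Bad response" lines, keyed by (unit, status, flags) as integers."""
    counts = Counter()
    for line in lines:
        match = BAD_RESPONSE.search(line)
        if match:
            status, flags, unit = (int(group, 16) for group in match.groups())
            counts[(unit, status, flags)] += 1
    return counts


# A few lines copied from the log above (hypothetical selection).
sample = [
    "Jul 27 23:24:53 munin kernel: 3w-xxxx: tw_interrupt(): Bad response, status = 0xc7, flags = 0x1b, unit = 0x3.",
    "Jul 27 23:25:37 munin kernel: 3w-xxxx: tw_scsi_eh_reset(): Reset succeeded for card 1.",
    "Jul 27 23:25:47 munin kernel: 3w-xxxx: tw_interrupt(): Bad response, status = 0xcf, flags = 0x0, unit = 0x3.",
    "Jul 27 23:25:47 munin kernel: 3w-xxxx: tw_interrupt(): Bad response, status = 0xcf, flags = 0x0, unit = 0x3.",
]

for (unit, status, flags), n in sorted(tally_bad_responses(sample).items()):
    print(f"unit {unit}: status={status:#x} flags={flags:#x} seen {n} time(s)")
```

Feeding it a whole log file (for example, the lines of an open file object) would show at a glance whether the errors cluster on one unit or spread across several, which is the question raised in this thread.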