From mboxrd@z Thu Jan 1 00:00:00 1970
From: Michael Tokarev
Subject: Re: aic7xxx or disk weirdness
Date: Thu, 09 Dec 2004 14:23:32 +0300
Message-ID: <41B835B4.8030202@tls.msk.ru>
References: <20041209030609.GA762@cm.nu>
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Return-path:
Received: from hobbit.corpit.ru ([81.13.94.6]:29537 "EHLO hobbit.corpit.ru") by vger.kernel.org with ESMTP id S261509AbULILXe (ORCPT ); Thu, 9 Dec 2004 06:23:34 -0500
In-Reply-To: <20041209030609.GA762@cm.nu>
Sender: linux-scsi-owner@vger.kernel.org
List-Id: linux-scsi@vger.kernel.org
To: Shane
Cc: linux-scsi@vger.kernel.org

Shane wrote:
> Hello list,
>
> A server I administer crashed today with some nasty logs
> being left behind from the SCSI subsystem. I restarted the
> system and it came up fine with all disks checking out ok
> as far as smart extended tests and read tests are
> concerned. However, I am not sure how to be certain the
> disk is actually ok before re-adding it to the raid array
> so any pointers would be appreciated.
>
> System info
>
> Dual p4 Xeon 2.6ghz
> 2gb ECC RDRAM installed
> 4x Seagate 15k rpm hard drives in a software raid5
> configuration
> Small swap partitions are located on each drive.
> (I think that was a mistake in this case)
> Kernel 2.4.27-debian smp
>
> 14:54:35 continuum kernel: SCSI disk error : host 0 channel 0 id 1 lun 0 return code = 10000
> 14:54:35 continuum kernel: I/O error: dev 08:11, sector 4895576
> 14:54:35 continuum kernel: SCSI disk error : host 0 channel 0 id 1 lun 0 return code = 10000
> 14:54:35 continuum kernel: I/O error: dev 08:11, sector 4895912
> 14:54:36 continuum kernel: SCSI disk error : host 0 channel 0 id 1 lun 0 return code = 10000
> 14:54:36 continuum kernel: I/O error: dev 08:11, sector 22635296
> 14:54:36 continuum kernel: SCSI disk error : host 0 channel 0 id 1 lun 0 return code = 10000
> 14:54:36 continuum kernel: I/O error: dev 08:11, sector 22635304 []
> 14:55:23 continuum kernel: scsi0:0:1:0: Attempting to queue an ABORT message
> 14:55:23 continuum kernel: CDB: 0x2a 0x0 0x0 0x67 0x8e 0x1 0x0 0x0 0x28 0x0 []
> 14:55:23 continuum kernel: (scsi0:A:1:0): Device is disconnected, re-queuing SCB []
> 14:55:23 continuum kernel: scsi0:0:1:0: Attempting to queue a TARGET RESET message
> 14:55:23 continuum kernel: CDB: 0x2a 0x0 0x0 0x67 0x8e 0x1 0x0 0x0 0x28 0x0
> 14:55:23 continuum kernel: scsi0:0:1:0: Is not an active device
> 14:55:23 continuum kernel: aic7xxx_dev_reset returns 0x2002
> 14:55:23 continuum kernel: (scsi0:A:1): 160.000MB/s transfers (80.000MHz DT, offset 63, 16bit)

Shane, the situation you're seeing is very similar to the one
we're seeing here, also with Seagate drives -- ST336607LW
rev. 00007 in our case.

The symptoms are: occasionally, the drive stops responding,
just as if there were no drive there. After several attempts
to reset it and the bus, the SCSI subsystem gives up (like
"scsi0:0:1:0: Is not an active device"), and switches to
160 MB/s transfers (from 320 MB/s) for the rest of the devices
on the bus. After a reboot (sometimes a soft reboot is enough,
sometimes only a hard reboot, i.e. power off/on), the "missing"
drive is there again. When the SCSI bus is in 160 or lower
mode, the problem does not happen.
I was able to find how to trigger the condition here: just run
sformat -verify, and voila, the drive is "missing".

To me, the whole problem looks like a bug in the disk firmware,
but all my attempts to contact Seagate failed: "typical
mishandling" is their answer, which means that the disk has
been dropped on the floor, static electricity, whatever...
I'm certain there were NO such cases here, and again, when you
can reproduce a problem on 100+ different drives (of the same
model), that's.. something...

For now, I don't have any solution for this problem, except
replacing the drives with ones from a different vendor...

/mjt
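
A side note for anyone reading the quoted log: the CDB bytes and the "dev 08:11" field can be decoded by hand, which shows which command and which partition were involved. The sketch below is only an illustration, not something from the original report. It assumes the standard 10-byte READ(10)/WRITE(10) layout (opcode in byte 0, big-endian LBA in bytes 2-5, transfer length in bytes 7-8) and assumes the 2.4 kernel prints the device as hexadecimal major:minor, with major 8 being the first block of SCSI disks at 16 minors per disk. The function names are hypothetical.

```python
def decode_rw10_cdb(text):
    """Decode a 10-byte READ(10)/WRITE(10) CDB given as hex byte strings."""
    cdb = [int(tok, 16) for tok in text.split()]
    if len(cdb) != 10:
        raise ValueError("expected a 10-byte CDB")
    opcodes = {0x28: "READ(10)", 0x2A: "WRITE(10)"}
    return {
        # Unknown opcodes are reported as their hex value.
        "opcode": opcodes.get(cdb[0], hex(cdb[0])),
        # Bytes 2..5: logical block address, big-endian.
        "lba": int.from_bytes(bytes(cdb[2:6]), "big"),
        # Bytes 7..8: transfer length in blocks, big-endian.
        "blocks": int.from_bytes(bytes(cdb[7:9]), "big"),
    }


def decode_dev(text):
    """Map a hex 'major:minor' pair to an sdXN name (assumes major 8 only)."""
    major, minor = (int(part, 16) for part in text.split(":"))
    if major != 8:
        raise ValueError("only SCSI disk major 8 is handled in this sketch")
    disk = "sd" + chr(ord("a") + minor // 16)  # 16 minors per disk
    part = minor % 16                          # 0 means the whole disk
    return disk + (str(part) if part else "")


print(decode_rw10_cdb("0x2a 0x0 0x0 0x67 0x8e 0x1 0x0 0x0 0x28 0x0"))
print(decode_dev("08:11"))
```

Under those assumptions, the CDB in the log is a WRITE(10) of 40 blocks starting at LBA 6786561, and "dev 08:11" is the first partition of the second SCSI disk (sdb1), which matches the failing target at id 1.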