* arcmsr: cascading failures
@ 2012-06-12 15:22 David Liontooth
From: David Liontooth @ 2012-06-12 15:22 UTC
  To: LKML, Nick Cheng


On a system with Areca 1680 cards, the failure of a single drive leads 
to a cascading failure of the other good drives:

The trouble starts with an arcmsr abort of the bad drive:

Jun 11 07:20:42 cartago kernel: arcmsr7: abort device command of scsi id 
= 1 lun = 3

Jun 11 07:20:42 cartago kernel: arcmsr: executing bus reset 
eh.....num_resets = 1, num_aborts = 16
Jun 11 07:20:42 cartago kernel: arcmsr7: executing hw bus reset .....
Jun 11 07:20:55 cartago kernel: arcmsr7: waiting for hw bus reset 
return, retry=0
Jun 11 07:23:05 cartago kernel: arcmsr7: waiting for hw bus reset 
return, retry=13
Jun 11 07:23:05 cartago kernel: arcmsr7: waiting for hw bus reset 
return, RETRY TERMINATED!!

Jun 11 07:23:05 cartago kernel: sd 7:0:1:3: Device offlined - not ready 
after error recovery
Jun 11 07:23:05 cartago kernel: sd 7:0:1:3: [sds] Unhandled error code

Jun 11 07:23:05 cartago kernel: sd 7:0:1:3: [sds]  Result: 
hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT
Jun 11 07:23:05 cartago kernel: sd 7:0:1:3: [sds] CDB: Write(10): 2a 00 
29 1d a5 8f 00 00 10 00
Jun 11 07:23:05 cartago kernel: end_request: I/O error, dev sds, sector 
689808783
Jun 11 07:23:05 cartago kernel: sd 7:0:1:3: rejecting I/O to offline device

Jun 11 07:23:05 cartago kernel: XFS (sds1): xfs_do_force_shutdown(0x2) 
called from line 891 of file fs/xfs/xfs_log.c.
Jun 11 07:23:05 cartago kernel: XFS (sds1): Log I/O Error Detected.  
Shutting down filesystem
Jun 11 07:23:05 cartago kernel: XFS (sds1): Please umount the filesystem 
and rectify the problem(s)
Jun 11 07:23:35 cartago kernel: XFS (sds1): xfs_log_force: error 5 returned.

So far, so good -- the drive has failed and been taken offline. The 
kernel handles this fine, though it would be nice to have a way to tell 
XFS to stop trying to shut down the file system -- it's a lost cause.

Now the trouble starts: arcmsr starts aborting device commands to other 
drives:

Jun 11 07:23:36 cartago kernel: arcmsr7: abort device command of scsi id 
= 3 lun = 1
Jun 11 07:23:40 cartago kernel: arcmsr7: abort device command of scsi id 
= 3 lun = 1
Jun 11 07:23:44 cartago kernel: arcmsr7: abort device command of scsi id 
= 1 lun = 5
Jun 11 07:23:48 cartago kernel: arcmsr: executing bus reset 
eh.....num_resets = 2, num_aborts = 19
Jun 11 07:23:48 cartago kernel: arcmsr: there is an  bus reset eh 
proceeding.......

Udev notices a drive is missing and starts removing disk references -- 
this is a good working drive:

Jun 11 07:27:29 cartago udevd[17693]: removing watch on '/dev/sde1'
Jun 11 07:27:29 cartago udevd[17693]: no reference left, remove 
'/dev/disk/by-id/scsi-2001b4d2070476002-part1'
Jun 11 07:27:29 cartago udevd[17693]: no reference left, remove 
'/dev/disk/by-label/2007_05'
Jun 11 07:27:29 cartago udevd[17693]: no reference left, remove 
'/dev/disk/by-path/pci-0000:06:00.0-scsi-0:0:1:2-part1'
Jun 11 07:27:29 cartago udevd[17693]: no reference left, remove 
'/dev/disk/by-uuid/34c6f15b-427a-4414-8840-7e227efae0e3'
Jun 11 07:27:29 cartago udevd[17693]: device node '/dev/sde1' has sticky 
bit set, skip removal
Jun 11 07:27:29 cartago udevd[17693]: passed -1 bytes to netlink monitor 
0x2611130
Jun 11 07:27:29 cartago udevd[17693]: seq 2126 processed with 0
Jun 11 07:27:29 cartago udevd[1427]: seq 2126 done with 0

XFS gets busy trying to shut down a good drive:

Jun 11 07:27:29 cartago kernel: XFS (sde1): xfs_do_force_shutdown(0x1) 
called from line 1052 of file fs/xfs/linux-2.6/xfs_b$
Jun 11 07:27:29 cartago kernel: XFS (sde1): I/O Error Detected. Shutting 
down filesystem
Jun 11 07:27:29 cartago kernel: XFS (sde1): Please umount the filesystem 
and rectify the problem(s)

arcmsr keeps trying to "recover" and spreads mayhem to all the other 
drives -- there are around 20 of them:

Jun 11 07:28:12 cartago kernel: arcmsr7: abort device command of scsi id 
= 3 lun = 1
Jun 11 07:28:16 cartago kernel: arcmsr: executing bus reset 
eh.....num_resets = 4, num_aborts = 21
Jun 11 07:28:26 cartago kernel: sd 7:0:1:5: Device offlined - not ready 
after error recovery
Jun 11 07:28:26 cartago kernel: sd 7:0:3:1: Device offlined - not ready 
after error recovery
Jun 11 07:28:26 cartago kernel: sd 7:0:3:1: Device offlined - not ready 
after error recovery
Jun 11 07:28:26 cartago kernel: sd 7:0:1:5: [sdg] Unhandled error code
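
To see how far the aborts spread, the log excerpts above can be tallied 
per controller, SCSI id, and lun. This is only a rough sketch: the 
regular expressions are written against the message formats quoted in 
this mail, and the function name summarize is made up for illustration. 
Whitespace matching is kept loose so that lines wrapped by the mail 
client still match.

```python
import re
from collections import Counter

# Patterns modeled on the log lines quoted in this mail (an assumption:
# other arcmsr versions may word these messages differently).
ABORT_RE = re.compile(
    r"(arcmsr\d+): abort device command of scsi id\s*=\s*(\d+)\s+lun\s*=\s*(\d+)"
)
OFFLINE_RE = re.compile(r"sd (\d+:\d+:\d+:\d+): Device offlined")


def summarize(log_text):
    """Count aborts per (controller, scsi id, lun) and list offlined devices."""
    aborts = Counter()
    for host, scsi_id, lun in ABORT_RE.findall(log_text):
        aborts[(host, int(scsi_id), int(lun))] += 1
    # A device may be reported offline more than once; keep each address once.
    offlined = sorted(set(OFFLINE_RE.findall(log_text)))
    return aborts, offlined
```

Feeding it the contents of the syslog file gives a quick picture of which 
targets were hit after the first drive went offline.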

Gory details available on request. The machine will now not halt and 
requires a very expensive hard reset.

Is there some way to keep this from happening? A failed drive should be 
isolated and dropped; this is losing the whole army to save one dead man.
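
For anyone trying to reproduce this, the state the SCSI layer has 
assigned to each disk can be read from sysfs while the cascade is in 
progress. The sketch below assumes the usual 
/sys/block/<dev>/device/state attribute is present; the function name 
and the sysfs_root parameter are illustrative only. It only reads state 
and does not attempt to isolate anything.

```python
from pathlib import Path


def scsi_device_states(sysfs_root="/sys"):
    """Map block device name -> contents of its SCSI 'state' attribute.

    sysfs_root is a parameter only so the function can be pointed at a
    test directory; on a live system the default is what you want.
    """
    states = {}
    for state_file in sorted(Path(sysfs_root).glob("block/*/device/state")):
        name = state_file.parent.parent.name  # e.g. "sds"
        try:
            states[name] = state_file.read_text().strip()
        except OSError:
            # The device may vanish or refuse reads during error recovery.
            states[name] = "unreadable"
    return states


if __name__ == "__main__":
    for dev, state in scsi_device_states().items():
        print(f"{dev}: {state}")
```

Running it in a loop during a failure would show whether the good drives 
are marked offline one by one or all at once.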

Cheers,
David

# modinfo arcmsr
filename:       /lib/modules/3.0.26/kernel/drivers/scsi/arcmsr/arcmsr.ko
version:        Driver Version 1.20.00.15 2010/08/05
license:        Dual BSD/GPL
description:    ARECA (ARC11xx/12xx/16xx/1880) SATA/SAS RAID Host Bus 
Adapter
author:         Nick Cheng <support@areca.com.tw>
srcversion:     F9D834593FE182DC013DBD0

# uname -a
Linux cartago 3.0.26 #1 SMP Sat Apr 14 15:04:35 PDT 2012 x86_64 GNU/Linux


