public inbox for linux-kernel@vger.kernel.org

From: David Liontooth <lionteeth@cogweb.net>
To: LKML <linux-kernel@vger.kernel.org>, Nick Cheng <support@areca.com.tw>
Subject: arcmsr: cascading failures
Date: Tue, 12 Jun 2012 08:22:33 -0700
Message-ID: <4FD75EB9.4090503@cogweb.net>

On a system with Areca 1680 cards, the failure of a single drive leads 
to a cascading failure of the other, good drives.

The trouble starts with an arcmsr abort of the bad drive:

Jun 11 07:20:42 cartago kernel: arcmsr7: abort device command of scsi id 
= 1 lun = 3

Jun 11 07:20:42 cartago kernel: arcmsr: executing bus reset 
eh.....num_resets = 1, num_aborts = 16
Jun 11 07:20:42 cartago kernel: arcmsr7: executing hw bus reset .....
Jun 11 07:20:55 cartago kernel: arcmsr7: waiting for hw bus reset 
return, retry=0
Jun 11 07:23:05 cartago kernel: arcmsr7: waiting for hw bus reset 
return, retry=13
Jun 11 07:23:05 cartago kernel: arcmsr7: waiting for hw bus reset 
return, RETRY TERMINATED!!

Jun 11 07:23:05 cartago kernel: sd 7:0:1:3: Device offlined - not ready 
after error recovery
Jun 11 07:23:05 cartago kernel: sd 7:0:1:3: [sds] Unhandled error code

Jun 11 07:23:05 cartago kernel: sd 7:0:1:3: [sds]  Result: 
hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT
Jun 11 07:23:05 cartago kernel: sd 7:0:1:3: [sds] CDB: Write(10): 2a 00 
29 1d a5 8f 00 00 10 00
Jun 11 07:23:05 cartago kernel: end_request: I/O error, dev sds, sector 
689808783
Jun 11 07:23:05 cartago kernel: sd 7:0:1:3: rejecting I/O to offline device
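
As a cross-check on the excerpt above, the failed command can be decoded 
from the CDB bytes in the log. The sketch below assumes the standard 
Write(10) layout (opcode 0x2A, a 4-byte big-endian LBA in bytes 2-5, a 
2-byte transfer length in bytes 7-8); decode_write10 is a hypothetical 
helper name, not part of any driver or tool. The decoded LBA matches the 
sector reported by end_request.

```python
# Decode the Write(10) CDB printed in the log excerpt.
# Assumed layout: byte 0 opcode (0x2A), byte 1 flags,
# bytes 2-5 LBA (big-endian), byte 6 group number,
# bytes 7-8 transfer length in blocks (big-endian), byte 9 control.

def decode_write10(cdb_hex: str) -> dict:
    """Return the LBA and block count of a Write(10) CDB given as hex text."""
    cdb = bytes.fromhex(cdb_hex)  # spaces between byte pairs are accepted
    if len(cdb) != 10 or cdb[0] != 0x2A:
        raise ValueError("not a Write(10) CDB")
    return {
        "lba": int.from_bytes(cdb[2:6], "big"),
        "blocks": int.from_bytes(cdb[7:9], "big"),
    }

# CDB bytes copied from the "[sds] CDB: Write(10)" log line.
fields = decode_write10("2a 00 29 1d a5 8f 00 00 10 00")
print(fields)  # {'lba': 689808783, 'blocks': 16}
```

So the command that timed out was a 16-block write starting at sector 
689808783 on sds.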

Jun 11 07:23:05 cartago kernel: XFS (sds1): xfs_do_force_shutdown(0x2) 
called from line 891 of file fs/xfs/xfs_log.c.
Jun 11 07:23:05 cartago kernel: XFS (sds1): Log I/O Error Detected.  
Shutting down filesystem
Jun 11 07:23:05 cartago kernel: XFS (sds1): Please umount the filesystem 
and rectify the problem(s)
Jun 11 07:23:35 cartago kernel: XFS (sds1): xfs_log_force: error 5 returned.

So far, so good -- the drive has failed and been taken offline. The 
kernel handles this fine, though it would be nice to have a way to tell 
XFS to stop trying to shut down the file system -- it's a lost cause.

Now the trouble starts: arcmsr starts aborting device commands to other 
drives:

Jun 11 07:23:36 cartago kernel: arcmsr7: abort device command of scsi id 
= 3 lun = 1
Jun 11 07:23:40 cartago kernel: arcmsr7: abort device command of scsi id 
= 3 lun = 1
Jun 11 07:23:44 cartago kernel: arcmsr7: abort device command of scsi id 
= 1 lun = 5
Jun 11 07:23:48 cartago kernel: arcmsr: executing bus reset 
eh.....num_resets = 2, num_aborts = 19
Jun 11 07:23:48 cartago kernel: arcmsr: there is an  bus reset eh 
proceeding.......

Udev notices a drive is missing and starts removing disk references -- 
this is a good working drive:

Jun 11 07:27:29 cartago udevd[17693]: removing watch on '/dev/sde1'
Jun 11 07:27:29 cartago udevd[17693]: no reference left, remove 
'/dev/disk/by-id/scsi-2001b4d2070476002-part1'
Jun 11 07:27:29 cartago udevd[17693]: no reference left, remove 
'/dev/disk/by-label/2007_05'
Jun 11 07:27:29 cartago udevd[17693]: no reference left, remove 
'/dev/disk/by-path/pci-0000:06:00.0-scsi-0:0:1:2-part1'
Jun 11 07:27:29 cartago udevd[17693]: no reference left, remove 
'/dev/disk/by-uuid/34c6f15b-427a-4414-8840-7e227efae0e3'
Jun 11 07:27:29 cartago udevd[17693]: device node '/dev/sde1' has sticky 
bit set, skip removal
Jun 11 07:27:29 cartago udevd[17693]: passed -1 bytes to netlink monitor 
0x2611130
Jun 11 07:27:29 cartago udevd[17693]: seq 2126 processed with 0
Jun 11 07:27:29 cartago udevd[1427]: seq 2126 done with 0

XFS gets busy trying to shut down a good drive:

Jun 11 07:27:29 cartago kernel: XFS (sde1): xfs_do_force_shutdown(0x1) 
called from line 1052 of file fs/xfs/linux-2.6/xfs_b$
Jun 11 07:27:29 cartago kernel: XFS (sde1): I/O Error Detected. Shutting 
down filesystem
Jun 11 07:27:29 cartago kernel: XFS (sde1): Please umount the filesystem 
and rectify the problem(s)

arcmsr keeps trying to "recover" and spreads mayhem to all the other 
drives -- there are around 20 of them:

Jun 11 07:28:12 cartago kernel: arcmsr7: abort device command of scsi id 
= 3 lun = 1
Jun 11 07:28:16 cartago kernel: arcmsr: executing bus reset 
eh.....num_resets = 4, num_aborts = 21
Jun 11 07:28:26 cartago kernel: sd 7:0:1:5: Device offlined - not ready 
after error recovery
Jun 11 07:28:26 cartago kernel: sd 7:0:3:1: Device offlined - not ready 
after error recovery
Jun 11 07:28:26 cartago kernel: sd 7:0:3:1: Device offlined - not ready 
after error recovery
Jun 11 07:28:26 cartago kernel: sd 7:0:1:5: [sdg] Unhandled error code
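
To see how far the failure spread, the "Device offlined" messages can be 
tallied per SCSI address. This is a minimal sketch, assuming unwrapped 
syslog lines in the format shown above; offlined_devices is a 
hypothetical helper name, and the sample lines are copied from the 
excerpts in this message with the mail line-wrapping undone.

```python
import re
from collections import Counter

# Matches e.g. "sd 7:0:3:1: Device offlined - not ready after error recovery"
# and captures host, channel, target (scsi id) and lun.
OFFLINED = re.compile(r"sd (\d+):(\d+):(\d+):(\d+): Device offlined")

def offlined_devices(lines):
    """Count 'Device offlined' messages per (host, channel, id, lun)."""
    counts = Counter()
    for line in lines:
        match = OFFLINED.search(line)
        if match:
            counts[tuple(int(x) for x in match.groups())] += 1
    return counts

# Sample input: log lines from this message, unwrapped.
sample = [
    "Jun 11 07:23:05 cartago kernel: sd 7:0:1:3: Device offlined - not ready after error recovery",
    "Jun 11 07:28:26 cartago kernel: sd 7:0:1:5: Device offlined - not ready after error recovery",
    "Jun 11 07:28:26 cartago kernel: sd 7:0:3:1: Device offlined - not ready after error recovery",
    "Jun 11 07:28:26 cartago kernel: sd 7:0:3:1: Device offlined - not ready after error recovery",
    "Jun 11 07:28:26 cartago kernel: sd 7:0:1:5: [sdg] Unhandled error code",
]

for address, count in sorted(offlined_devices(sample).items()):
    print("%d:%d:%d:%d offlined x%d" % (*address, count))
```

On the excerpts quoted here this reports three distinct devices on host 
7, only one of which (7:0:1:3) is the drive that actually failed.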

Gory details available on request. The machine will now not halt and 
requires a very expensive hard reset.

Is there some way to keep this from happening? A failed drive should be 
isolated and dropped; this is losing the whole army to save one dead man.

Cheers,
David

# modinfo arcmsr
filename:       /lib/modules/3.0.26/kernel/drivers/scsi/arcmsr/arcmsr.ko
version:        Driver Version 1.20.00.15 2010/08/05
license:        Dual BSD/GPL
description:    ARECA (ARC11xx/12xx/16xx/1880) SATA/SAS RAID Host Bus 
Adapter
author:         Nick Cheng <support@areca.com.tw>
srcversion:     F9D834593FE182DC013DBD0

# uname -a
Linux cartago 3.0.26 #1 SMP Sat Apr 14 15:04:35 PDT 2012 x86_64 GNU/Linux
