From: David Liontooth <lionteeth@cogweb.net>
To: LKML <linux-kernel@vger.kernel.org>, Nick Cheng <support@areca.com.tw>
Subject: arcmsr: cascading failures
Date: Tue, 12 Jun 2012 08:22:33 -0700
Message-ID: <4FD75EB9.4090503@cogweb.net>
On a system with Areca 1680 cards, the failure of a single drive leads
to a cascading failure of the other good drives:
The trouble starts with an arcmsr abort of the bad drive:
Jun 11 07:20:42 cartago kernel: arcmsr7: abort device command of scsi id
= 1 lun = 3
Jun 11 07:20:42 cartago kernel: arcmsr: executing bus reset
eh.....num_resets = 1, num_aborts = 16
Jun 11 07:20:42 cartago kernel: arcmsr7: executing hw bus reset .....
Jun 11 07:20:55 cartago kernel: arcmsr7: waiting for hw bus reset
return, retry=0
Jun 11 07:23:05 cartago kernel: arcmsr7: waiting for hw bus reset
return, retry=13
Jun 11 07:23:05 cartago kernel: arcmsr7: waiting for hw bus reset
return, RETRY TERMINATED!!
Jun 11 07:23:05 cartago kernel: sd 7:0:1:3: Device offlined - not ready
after error recovery
Jun 11 07:23:05 cartago kernel: sd 7:0:1:3: [sds] Unhandled error code
Jun 11 07:23:05 cartago kernel: sd 7:0:1:3: [sds] Result:
hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT
Jun 11 07:23:05 cartago kernel: sd 7:0:1:3: [sds] CDB: Write(10): 2a 00
29 1d a5 8f 00 00 10 00
Jun 11 07:23:05 cartago kernel: end_request: I/O error, dev sds, sector
689808783
Jun 11 07:23:05 cartago kernel: sd 7:0:1:3: rejecting I/O to offline device
Jun 11 07:23:05 cartago kernel: XFS (sds1): xfs_do_force_shutdown(0x2)
called from line 891 of file fs/xfs/xfs_log.c.
Jun 11 07:23:05 cartago kernel: XFS (sds1): Log I/O Error Detected.
Shutting down filesystem
Jun 11 07:23:05 cartago kernel: XFS (sds1): Please umount the filesystem
and rectify the problem(s)
Jun 11 07:23:35 cartago kernel: XFS (sds1): xfs_log_force: error 5 returned.
So far, so good -- the drive has failed and been taken offline. The
kernel handles this fine, though it would be nice to have a way to tell
XFS to stop trying to shut down the file system -- it's a lost cause.
Now the trouble starts: arcmsr starts aborting device commands to other
drives:
Jun 11 07:23:36 cartago kernel: arcmsr7: abort device command of scsi id
= 3 lun = 1
Jun 11 07:23:40 cartago kernel: arcmsr7: abort device command of scsi id
= 3 lun = 1
Jun 11 07:23:44 cartago kernel: arcmsr7: abort device command of scsi id
= 1 lun = 5
Jun 11 07:23:48 cartago kernel: arcmsr: executing bus reset
eh.....num_resets = 2, num_aborts = 19
Jun 11 07:23:48 cartago kernel: arcmsr: there is an bus reset eh
proceeding.......
Udev notices a drive is missing and starts removing disk references --
this is a good working drive:
Jun 11 07:27:29 cartago udevd[17693]: removing watch on '/dev/sde1'
Jun 11 07:27:29 cartago udevd[17693]: no reference left, remove
'/dev/disk/by-id/scsi-2001b4d2070476002-part1'
Jun 11 07:27:29 cartago udevd[17693]: no reference left, remove
'/dev/disk/by-label/2007_05'
Jun 11 07:27:29 cartago udevd[17693]: no reference left, remove
'/dev/disk/by-path/pci-0000:06:00.0-scsi-0:0:1:2-part1'
Jun 11 07:27:29 cartago udevd[17693]: no reference left, remove
'/dev/disk/by-uuid/34c6f15b-427a-4414-8840-7e227efae0e3'
Jun 11 07:27:29 cartago udevd[17693]: device node '/dev/sde1' has sticky
bit set, skip removal
Jun 11 07:27:29 cartago udevd[17693]: passed -1 bytes to netlink monitor
0x2611130
Jun 11 07:27:29 cartago udevd[17693]: seq 2126 processed with 0
Jun 11 07:27:29 cartago udevd[1427]: seq 2126 done with 0
XFS gets busy trying to shut down a good drive:
Jun 11 07:27:29 cartago kernel: XFS (sde1): xfs_do_force_shutdown(0x1)
called from line 1052 of file fs/xfs/linux-2.6/xfs_b$
Jun 11 07:27:29 cartago kernel: XFS (sde1): I/O Error Detected. Shutting
down filesystem
Jun 11 07:27:29 cartago kernel: XFS (sde1): Please umount the filesystem
and rectify the problem(s)
arcmsr keeps trying to "recover" and spreads mayhem to all the other
drives -- there are around 20 of them:
Jun 11 07:28:12 cartago kernel: arcmsr7: abort device command of scsi id
= 3 lun = 1
Jun 11 07:28:16 cartago kernel: arcmsr: executing bus reset
eh.....num_resets = 4, num_aborts = 21
Jun 11 07:28:26 cartago kernel: sd 7:0:1:5: Device offlined - not ready
after error recovery
Jun 11 07:28:26 cartago kernel: sd 7:0:3:1: Device offlined - not ready
after error recovery
Jun 11 07:28:26 cartago kernel: sd 7:0:3:1: Device offlined - not ready
after error recovery
Jun 11 07:28:26 cartago kernel: sd 7:0:1:5: [sdg] Unhandled error code
Gory details are available on request. The machine now will not halt and
requires a very expensive hard reset.
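For anyone who wants to tally these events from a saved syslog, here is a
minimal sketch. It assumes the message formats match the excerpts quoted in
this mail; the names unwrap() and summarize() are hypothetical helpers, not
part of arcmsr or any kernel tool. It rejoins lines that a mail client
wrapped, then counts arcmsr aborts per (scsi id, lun) and collects the
targets that were offlined.

```python
import re
from collections import defaultdict

# Syslog records start with a timestamp such as "Jun 11 07:20:42 ".
TIMESTAMP = re.compile(r"^[A-Z][a-z]{2} [ \d]\d \d{2}:\d{2}:\d{2} ")
# Message formats assumed from the log excerpts quoted in this mail.
ABORT = re.compile(r"arcmsr\d+: abort device command of scsi id = (\d+) lun = (\d+)")
OFFLINE = re.compile(r"sd (\d+):(\d+):(\d+):(\d+): Device offlined")


def unwrap(text):
    """Rejoin wrapped lines; a new record begins with a timestamp."""
    records = []
    for line in text.splitlines():
        if TIMESTAMP.match(line) or not records:
            records.append(line.strip())
        else:
            records[-1] += " " + line.strip()
    return records


def summarize(text):
    """Return (aborts per (scsi id, lun), set of offlined (host, channel, id, lun))."""
    aborts = defaultdict(int)
    offlined = set()
    for record in unwrap(text):
        m = ABORT.search(record)
        if m:
            aborts[(int(m.group(1)), int(m.group(2)))] += 1
        m = OFFLINE.search(record)
        if m:
            offlined.add(tuple(int(g) for g in m.groups()))
    return dict(aborts), offlined
```

Calling summarize(open("/var/log/kern.log").read()) (the path is an
assumption and varies by distribution) gives a quick view of which targets
were aborted and which ended up offline.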
Is there some way to keep this from happening? A failed drive should be
isolated and dropped; this is losing the whole army to save one dead man.
Cheers,
David
# modinfo arcmsr
filename: /lib/modules/3.0.26/kernel/drivers/scsi/arcmsr/arcmsr.ko
version: Driver Version 1.20.00.15 2010/08/05
license: Dual BSD/GPL
description: ARECA (ARC11xx/12xx/16xx/1880) SATA/SAS RAID Host Bus
Adapter
author: Nick Cheng <support@areca.com.tw>
srcversion: F9D834593FE182DC013DBD0
# uname -a
Linux cartago 3.0.26 #1 SMP Sat Apr 14 15:04:35 PDT 2012 x86_64 GNU/Linux