From: "Andrew Kinney" <andykinney@advantagecom.net>
To: linux-scsi@vger.kernel.org
Subject: aacraid patch & driver version question
Date: Mon, 27 Sep 2004 22:39:26 -0700
Message-ID: <4158969E.30368.39E09981@localhost>
Hello,
I'm new to this list, so if I'm asking in the wrong place or this is
answered in a FAQ somewhere, please kindly steer me in the right
direction.
I'm actively working through the infamous "aacraid: Host adapter
reset request. SCSI hang?" issue with two of our systems that have
the PERC3/DI controller (2.8.0[6092] firmware and 1.1-4[2323] driver)
and I found a relevant post on the aacraid-devel list (quoted below).
Since that list has been merged into this list, I am asking the
question here.
The symptoms of our issue stem from a single drive in our RAID 5
array not responding to a device reset command issued by the
controller after about 30 timed-out read/write SCSI commands.
The drive activity light for that drive stays on solid throughout the
process, even long after all activity should have been completed.
Apparently, while the controller waits for the drive to respond and
tries to reset it, the controller's command queue fills up and the OS
cannot get any response from the controller. This is compounded by
the time it takes the controller to check for a hot spare (which we
don't have).
When the OS can't get a response out of the controller after a
reasonable time (30 or 60 seconds - I don't remember offhand - but our
controller took more than 78 seconds to finish marking the drive dead
according to the controller logs), it marks the controller dead and
this makes the entire array unusable, thus causing a crash since this
is our only controller and only container. It could be a bad drive,
but I don't think it is because we have a second identically
configured system that has trouble with the exact same drive ID in
its array and shows the same exact problem. In both systems, the
drive tests good after a reboot (hard reboot since the OS is
unresponsive).
Regardless of what the root trigger of the problem is (bad disk, bad
backplane, bad power supply, whatever - we're working with Dell on
that aspect), taking a RAID 5 container offline due to one disk
failing (and the 78 seconds it took the controller to determine that)
is a horribly ungraceful and counterintuitive thing to do, especially
for a system with only one controller and one container. I used RAID
5 so that a single drive failure doesn't crash the system. That's
what "degraded" mode and online rebuild are for. ;-)
At any rate, the email message below seems to describe our situation
and suggests that changes to the driver were going to be made to fix
the issue. Were the changes mentioned below ever implemented in the
aacraid driver? If so, what is the earliest aacraid driver version
that contains these changes?
Please reply to the list since I have that whitelisted in my server
side spam filtering. Messages to me directly may or may not make it
through the spam filters, though they usually do.
Thanks in advance.
Sincerely,
Andrew Kinney
President and
Chief Technology Officer
Advantagecom Networks, Inc.
http://www.advantagecom.net
On Wed, 27 Aug 2003, on aacraid-devel@dell.com Salyzyn, Mark wrote:
> I may have a root cause on this issue, even though I have not been able to
> duplicate it yet.
>
> There is code that does the following in the driver:
>
>     scsicmd->result = DID_OK << 16 | COMMAND_COMPLETE << 8 |
>                       SAM_STAT_TASK_SET_FULL;
>     aac_io_done(scsicmd);
>     return -1;
>
> This is *wrong*, because the nonzero return causes the system to hold the
> command in the queue (due to the use of the new error handler), yet we have
> also completed the command as `BUSY'. In addition, because aac_io_done
> relocks on io_request_lock, the caller had to unlock first, leaving a hole
> that SMP machines fill. By dropping the result and done calls in these
> situations, and holding the locks in the caller of such routines, I believe
> we will close this hole.
>
> Thanks, in part, to Josef M?rs for pointing out this locking problem
> under SMP, serendipitously a day after I had noticed the other problem with
> the inaccurate busy return sequences in the code and started making the
> changes to investigate. Kill two birds with one stone.
>
> I will report back on my tests of these changes, but will need a volunteer
> with kernel compile experience to report on the success in resolving this
> issue in the field *please*.
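
To make the quoted description concrete, here is a minimal user-space
sketch of the two patterns. It is not driver code: struct mock_cmd,
mock_io_done, mock_dispatch, queue_buggy and queue_fixed are all
hypothetical names invented for illustration, the constants are local
stand-ins for the kernel's definitions, and the io_request_lock
handling mentioned in the quote is not modelled at all. The only
assumption carried over from the quoted message is that a nonzero
return makes the mid-layer keep the command queued.

```c
#include <assert.h>
#include <stddef.h>

/* Local stand-ins; the real definitions live in the kernel SCSI headers. */
#define DID_OK                 0x00
#define COMMAND_COMPLETE       0x00
#define SAM_STAT_TASK_SET_FULL 0x28

/* Hypothetical, simplified command; not the kernel's SCSI command struct. */
struct mock_cmd {
    int result;      /* status word reported on completion */
    int completions; /* how many times the command was completed */
    int requeues;    /* how many times the mid-layer kept it queued */
};

/* Stand-in for aac_io_done(): only records that completion happened. */
static void mock_io_done(struct mock_cmd *cmd)
{
    cmd->completions++;
}

/* Pattern from the quoted message: complete the command AND return nonzero. */
static int queue_buggy(struct mock_cmd *cmd)
{
    cmd->result = DID_OK << 16 | COMMAND_COMPLETE << 8 |
                  SAM_STAT_TASK_SET_FULL;
    mock_io_done(cmd);
    return -1;
}

/* Proposed direction: drop the result and done calls, just return nonzero. */
static int queue_fixed(struct mock_cmd *cmd)
{
    (void)cmd; /* command is left untouched for the mid-layer to retry */
    return -1;
}

/* Hypothetical mid-layer: a nonzero return means "keep the command queued". */
static void mock_dispatch(struct mock_cmd *cmd,
                          int (*queue)(struct mock_cmd *))
{
    assert(cmd != NULL);
    if (queue(cmd) != 0)
        cmd->requeues++;
}
```

Dispatching one command through queue_buggy leaves it both completed
and requeued, which is the double handling the quoted message calls
wrong. Dispatching through queue_fixed leaves it requeued only.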