aacraid patch & driver version question

From: "Andrew Kinney" <andykinney@advantagecom.net>
To: linux-scsi@vger.kernel.org
Subject: aacraid patch & driver version question
Date: Mon, 27 Sep 2004 22:39:26 -0700
Message-ID: <4158969E.30368.39E09981@localhost>

Hello,

I'm new to this list, so if I'm asking in the wrong place or this is 
answered in a FAQ somewhere, please kindly steer me in the right 
direction.

I'm actively working through the infamous "aacraid: Host adapter 
reset request. SCSI hang?" issue with two of our systems that have 
the PERC3/DI controller (2.8.0[6092] firmware and 1.1-4[2323] driver) 
and I found a post on the aacraid-devel list.  Since that list has 
been merged into this one, I have to ask the question here.

The symptoms of our issue stem from a single drive in our RAID 5 
array not responding to a device reset command issued by the 
controller after about 30 timed-out read/write SCSI commands. 
The drive activity light for that drive stays on solid throughout the 
process, even long after all activity should have been completed.  
Apparently, the whole time the controller is waiting for the drive to 
respond and trying to reset the drive, the command queue of the 
controller fills up and the OS is unable to get any response out of 
the controller during this period.  This is compounded by the time it 
takes the controller to check for a hot spare (which we don't have).  
When the OS can't get a response out of the controller after a 
reasonable time (30 or 60 seconds - I don't remember offhand - but our 
controller took more than 78 seconds to finish marking the drive dead 
according to the controller logs), it marks the controller dead and 
this makes the entire array unusable, thus causing a crash since this 
is our only controller and only container.  It could be a bad drive, 
but I don't think it is because we have a second identically 
configured system that has trouble with the exact same drive ID in 
its array and shows the same exact problem.  In both systems, the 
drive tests good after a reboot (hard reboot since the OS is 
unresponsive).

Regardless of what the root trigger of the problem is (bad disk, bad 
backplane, bad power supply, whatever - we're working with Dell on 
that aspect), taking a RAID 5 container offline due to one disk 
failing (and the 78 seconds it took the controller to determine that) 
is a horribly ungraceful and counterintuitive thing to do, especially 
for a system with only one controller and one container.  I used RAID 
5 so that a single drive failure doesn't crash the system.  That's 
what "degraded" mode and online rebuild are for. ;-)

At any rate, the email message below seems to describe our situation 
and suggests that changes to the driver were going to be made to fix 
the issue.  Were the changes mentioned below ever implemented in the 
aacraid driver?  If so, what is the earliest aacraid driver version 
that contains these changes?

Please reply to the list since I have that whitelisted in my 
server-side spam filtering. Messages to me directly may or may not make it 
through the spam filters, though they usually do.

Thanks in advance.

Sincerely,
Andrew Kinney
President and
Chief Technology Officer
Advantagecom Networks, Inc.
http://www.advantagecom.net

On Wed, 27 Aug 2003, Salyzyn, Mark wrote on aacraid-devel@dell.com:

> I may have a root cause on this issue, even though I have not been able to
> duplicate it yet.
>
> There is code that does the following in the driver:
>
> 	scsicmd->result = DID_OK << 16 | COMMAND_COMPLETE << 8 |
> 			  SAM_STAT_TASK_SET_FULL;
> 	aac_io_done(scsicmd);
> 	return -1;
>
> This is *wrong*, because the nonzero return causes the system to hold the
> command in the queue (due to the use of the new error handler), yet we have
> also completed the command as `BUSY'. *And*, because the aac_io_done call
> relocks (on io_request_lock), the caller had to unlock first, leaving a hole
> that SMP machines fill. By dropping the result and done calls in these
> situations, and holding the locks in the caller of such routines, I believe
> we will close this hole.
>
> Thanks, in part, to Josef M?rs for pointing out this locking problem
> under SMP, serendipitously a day after I had noticed the other problem with
> the inaccurate busy return sequences in the code and started making the
> changes to investigate. Kill two birds with one stone.
>
> I will report back on my tests of these changes, but will need a volunteer
> with kernel compile experience to report on the success in resolving this
> issue in the field *please*.
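
The pattern described in the quoted message can be illustrated with a
small user-space C model. This is only a sketch under stated assumptions:
toy_cmd, toy_done, toy_queue_buggy, toy_queue_fixed, toy_midlayer_submit
and toy_conflicting are hypothetical names invented for illustration, not
code from the aacraid driver or the SCSI midlayer. It models only the
"completed and also held for retry" conflict; the SMP locking hole around
io_request_lock is not modeled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a SCSI command; NOT the kernel's structure. */
struct toy_cmd {
    int completions; /* times the command was handed back as finished */
    int requeues;    /* times the caller held it in the queue for retry */
};

/* Toy completion callback, playing the role of aac_io_done(). */
static void toy_done(struct toy_cmd *cmd)
{
    cmd->completions++;
}

/* Pattern from the quoted message: complete the command AND return nonzero. */
static int toy_queue_buggy(struct toy_cmd *cmd, bool adapter_busy)
{
    assert(cmd != NULL);
    if (adapter_busy) {
        toy_done(cmd); /* command is reported as finished ... */
        return -1;     /* ... yet the caller is told to keep it queued */
    }
    toy_done(cmd);
    return 0;
}

/* Shape of the proposed change: when busy, return nonzero and do nothing else. */
static int toy_queue_fixed(struct toy_cmd *cmd, bool adapter_busy)
{
    assert(cmd != NULL);
    if (adapter_busy)
        return -1;     /* caller keeps ownership and retries later */
    toy_done(cmd);
    return 0;
}

/* Toy caller (assumption): a nonzero return means "hold the command and retry". */
static void toy_midlayer_submit(struct toy_cmd *cmd,
                                int (*queue)(struct toy_cmd *, bool),
                                bool adapter_busy)
{
    if (queue(cmd, adapter_busy) != 0)
        cmd->requeues++;
}

/* True if a command was both completed and held for retry. */
static bool toy_conflicting(const struct toy_cmd *cmd)
{
    return cmd->completions > 0 && cmd->requeues > 0;
}
```

Submitting a zero-initialized toy_cmd through toy_queue_buggy with
adapter_busy set leaves it both completed and requeued, which is the
double handling the message calls wrong. The same submission through
toy_queue_fixed leaves it requeued only.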
