From: Bernd Schubert <bs@q-leap.de>
To: James Bottomley <James.Bottomley@hansenpartnership.com>
Cc: Matthew Wilcox <matthew@wil.cx>,
linux-scsi@vger.kernel.org, "Moore, Eric" <Eric.Moore@lsi.com>
Subject: Re: [PATCH] scsi device recovery
Date: Fri, 14 Dec 2007 16:26:59 +0100
Message-ID: <200712141627.00024.bs@q-leap.de>
In-Reply-To: <1197642901.3154.79.camel@localhost.localdomain>
On Friday 14 December 2007 15:35:01 James Bottomley wrote:
> > > This is some type of ioc internal error. What we do on DID_SOFT_ERROR
> > > is retry for the usual number of times up to the timeout limit.
> > > Unfortunately, the retries are fixed at SD_MAX_RETRIES in sd.c.
> > > Without diagnosing what's going wrong in the fusion, it's impossible to
> > > say if this is reasonable, but your fusion is signalling ioc errors
> > > (firmware errors).
> >
> > Besides this seeming to be a fusion driver or firmware problem, I still
> > think eh is not activated for this error. I'm not absolutely sure, but I
> > think with my patch deh and, later on, eh would be triggered, wouldn't it?
>
> The full eh machinery, by design, isn't activated for a simple retry.
> If you look in scsi_lib.c:scsi_softirq_done() you'll see the processing
> of the outcome of scsi_decide_disposition() (DID_SOFT_ERROR comes out of
> here with NEEDS_RETRY, providing there are retries left). Right at the
> moment, this means that the retry is absolutely immediate, so you
> probably run through all of the retries before firmware recovery even
> has time to activate. I'd be amenable to giving it an ADD_TO_MLQUEUE
> type return (provided it still increments retries) which will cause a
> pause in the resubmission (until either a command returns or io pressure
> builds up in the block layer).
Isn't there always i/o pressure if the scsi bus is saturated? Can we activate
the eh machinery once the retries are exhausted?
Index: linux-2.6.22/drivers/scsi/scsi_error.c
===================================================================
--- linux-2.6.22.orig/drivers/scsi/scsi_error.c 2007-12-14 15:53:48.000000000 +0100
+++ linux-2.6.22/drivers/scsi/scsi_error.c 2007-12-14 15:58:27.000000000 +0100
@@ -1235,7 +1235,7 @@ int scsi_decide_disposition(struct scsi_
 		 * and not get stuck in a loop.
 		 */
 	case DID_SOFT_ERROR:
-		goto maybe_retry;
+		goto maybe_requeue;
 	case DID_IMM_RETRY:
 		return NEEDS_RETRY;
@@ -1342,6 +1342,24 @@ int scsi_decide_disposition(struct scsi_
 		 */
 		return SUCCESS;
 	}
+
+ maybe_requeue:
+
+	/* we requeue for retry because the error was retryable, and
+	 * the request was not marked fast fail. Note that above,
+	 * even if the request is marked fast fail, we still requeue
+	 * for queue congestion conditions (QUEUE_FULL or BUSY) */
+	if ((++scmd->retries) <= scmd->allowed
+	    && !blk_noretry_request(scmd->request)) {
+		return ADD_TO_MLQUEUE;
+	} else {
+		/*
+		 * no more retries - report this one back to upper level.
+		 *
+		 * TODO: initiate full error recovery now?
+		 */
+		return SUCCESS;
+	}
 }
 /**
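
For illustration only, the intent of the maybe_requeue hunk can be modelled
in user space. This is a minimal sketch, not kernel code: struct cmd_model,
its fast_fail field (standing in for blk_noretry_request()) and
count_requeues() are made-up names, and the enum values are arbitrary.

```c
#include <stdbool.h>

/* Stand-ins for the mid-layer disposition codes; values are arbitrary here. */
enum disposition { SUCCESS, NEEDS_RETRY, ADD_TO_MLQUEUE };

/* Hypothetical, reduced model of struct scsi_cmnd: only the retry state. */
struct cmd_model {
	int retries;     /* retries consumed so far */
	int allowed;     /* retry budget (sd uses SD_MAX_RETRIES) */
	bool fast_fail;  /* models blk_noretry_request() returning true */
};

/* Mirrors the maybe_requeue label: consume one retry, then requeue or give up. */
static enum disposition soft_error_disposition(struct cmd_model *cmd)
{
	if (++cmd->retries <= cmd->allowed && !cmd->fast_fail)
		return ADD_TO_MLQUEUE;  /* delayed resubmission */
	return SUCCESS;                 /* budget exhausted: report upward */
}

/* Count how often a persistently failing command would be requeued. */
static int count_requeues(struct cmd_model *cmd)
{
	int requeues = 0;

	while (soft_error_disposition(cmd) == ADD_TO_MLQUEUE)
		requeues++;
	return requeues;
}
```

In this model the retry counter still advances on every requeue, so a
command that keeps failing is requeued at most `allowed` times before it is
reported to the upper level, and a fast-fail request is never requeued.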
>
> > > > Full log attached.
> > > >
> > > > > immediate error with no eh intervention because it means that the
> > > > > target went away. Handling this as a retryable error isn't an
> > > > > option because it will interfere with hotplug.
> > > >
> > > > Then we need a sysfs flag one can set to manually enable eh for these
> > > > devices on DID_NO_CONNECT.
> > >
> > > No, because that will seriously damage a lot of other systems.
> >
> > How would it, if we create a device-specific sysfs parameter defaulting
> > to off? If you think users could activate it by accident, we could also
> > print a big warning when the parameter is read from userspace.
> > Furthermore, as far as I understood you, DID_NO_CONNECT is only
> > required for hotplugging. But real scsi doesn't do automatic hotplugging,
> > does it?
>
> Yes, it does. Most modern busses are hot plug aware and use
> DID_NO_CONNECT to signal target went away. Even some SPI frames are
> quasi hotplug aware.
>
> > One
> > always needs to do it manually, e.g. with scsiadd or similar tools. So is
> > DID_NO_CONNECT really required for native scsi? If not, we could also
> > have the scsi drivers set a flag to activate eh on DID_NO_CONNECT.
>
> Just grep through the mid layer ... you'll see we use DID_NO_CONNECT on
> a host of other error conditions to force an immediate error as well.
I will do that later on. I will also write a patch allowing error recovery for
manually overridden devices.
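
A minimal user-space sketch of what such a per-device override could look
like. Everything here is hypothetical and only models the decision: the
eh_on_no_connect flag, struct dev_model and the two enums do not exist in
the kernel, and the flag defaults to off so that current behaviour is kept.

```c
#include <stdbool.h>

/* Only the host byte results needed for this example; values are arbitrary. */
enum host_byte_model { HOST_OK, HOST_NO_CONNECT };

/* What the mid layer would do with the completed command. */
enum eh_action {
	FINISH_NOW,  /* complete the command immediately, no eh involved */
	START_EH     /* hand the command over to the error handler */
};

/* Hypothetical per-device state; the flag would be set manually via sysfs. */
struct dev_model {
	bool eh_on_no_connect;  /* false by default, i.e. today's behaviour */
};

/* Start eh on "no connect" only for devices that were overridden manually. */
static enum eh_action no_connect_action(const struct dev_model *dev,
					enum host_byte_model host)
{
	if (host == HOST_NO_CONNECT && dev->eh_on_no_connect)
		return START_EH;
	return FINISH_NOW;
}
```

With the flag left at its default, a "no connect" result still completes
immediately, so hotplug handling would not be affected for other devices.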
[...]
> > > This looks like a genuine bug. I missed the thread, since my email
> > > system went off line while I was on holiday for two weeks. The
> > > symptoms look to be lost commands, but I can't see why from the traces.
> > > There's a known bug where we can hang in domain validation because of
> > > a resource starvation issue, but I know of none where everything hangs
> > > just after error recovery completes.
> >
> > Since not much has happened yet to solve this bug, shall I create a
> > bugzilla entry?
>
> Sure ... on further analysis, it is the fusion DV resource starvation
> issue. The email thread is here:
>
> http://marc.info/?t=118039577800004
Interesting thread. I don't understand the details yet, but I'm really curious
whether this could also explain the *almost deadlock* we are seeing when we
do an md resync at maximum device speed.
Thanks a lot for your help,
Bernd
--
Bernd Schubert
Q-Leap Networks GmbH