Re: [PATCH] scsi device recovery

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Bernd Schubert <bs@q-leap.de>
To: James Bottomley <James.Bottomley@hansenpartnership.com>
Cc: Matthew Wilcox <matthew@wil.cx>,
	linux-scsi@vger.kernel.org, "Moore, Eric" <Eric.Moore@lsi.com>
Subject: Re: [PATCH] scsi device recovery
Date: Fri, 14 Dec 2007 16:26:59 +0100	[thread overview]
Message-ID: <200712141627.00024.bs@q-leap.de> (raw)
In-Reply-To: <1197642901.3154.79.camel@localhost.localdomain>

On Friday 14 December 2007 15:35:01 James Bottomley wrote:
> > > This is some type of ioc internal error.  What we do on DID_SOFT_ERROR
> > > is retry for the usual number of times up to the timeout limit.
> > > Unfortunately, the retries are fixed at SD_MAX_RETRIES in sd.c. 
> > > Without diagnosing what's going wrong in the fusion, it's impossible to
> > > say if this is reasonable, but your fusion is signalling ioc errors
> > > (firmware errors).
> >
> > besides this seems to be a fusion driver or firmware problem, I still
> > think eh is not activated for this error. I'm not absulutely sure, but I
> > think with my patch deh and later on eh would be triggered, wouldn't it?
>
> the full eh machinery, by design, isn't activated for a simple retry.
> If you look in scsi_lib.c:scsi_softirq_done() you'll see the processing
> of the outcome of scsi_decide_disposision() (DID_SOFT_ERROR comes out of
> here with NEEDS_RETRY, providing there are retries left).  Right at the
> moment, this means that the retry is absolutely immediate, so you
> probably run through all of the retries before firmware recovery even
> has time to activate.  I'd be amenable to giving it an ADD_TO_MLQUEUE
> type return (provided it still increments retries) which will cause a
> pause in the resubmission (until either a command returns or io pressure
> builds up in the block layer).

Isn't there always i/o pressure if the scsi bus is satturated? Can we activate 
eh machinery when retries is exceeded? 


Index: linux-2.6.22/drivers/scsi/scsi_error.c
===================================================================
--- linux-2.6.22.orig/drivers/scsi/scsi_error.c	2007-12-14 15:53:48.000000000 
+0100
+++ linux-2.6.22/drivers/scsi/scsi_error.c	2007-12-14 15:58:27.000000000 +0100
@@ -1235,7 +1235,7 @@ int scsi_decide_disposition(struct scsi_
 		 * and not get stuck in a loop.
 		 */
 	case DID_SOFT_ERROR:
-		goto maybe_retry;
+		goto maybe_requeue;
 	case DID_IMM_RETRY:
 		return NEEDS_RETRY;
 
@@ -1342,6 +1342,24 @@ int scsi_decide_disposition(struct scsi_
 		 */
 		return SUCCESS;
 	}
+
+      maybe_requeue:
+
+	/* we requeue for retry because the error was retryable, and
+	 * the request was not marked fast fail.  Note that above,
+	 * even if the request is marked fast fail, we still requeue
+	 * for queue congestion conditions (QUEUE_FULL or BUSY) */
+	if ((++scmd->retries) <= scmd->allowed
+	    && !blk_noretry_request(scmd->request)) {
+		return ADD_TO_MLQUEUE;
+	} else {
+		/*
+		 * no more retries - report this one back to upper level.
+		 *
+		 * TODO: initiate full error recovery now?
+		 */
+		return SUCCESS;
+	}
 }
 
 /**


>
> > > > Full log attached.
> > > >
> > > > > immediate error with no eh intervention because it means that the
> > > > > target went away.  Handling this as a retryable error isn't an
> > > > > option because it will interfere with hotplug.
> > > >
> > > > Then we need a sysfs flag one can set to manually enable eh for these
> > > > devices on DID_NO_CONNECT.
> > >
> > > No, because that will seriously damage a lot of other systems.
> >
> > How would it, if we create a device specific sysfs parameter defaulting
> > to off? If you think users could activate it by accident, we could also
> > print a big warning when the paramter is read from userspace.
> > Furthermore, as far as I did understand you, DID_NO_CONNECT is only
> > required for hotplugging. But real scsi doesn't do automatic hotplugging,
> > does it?
>
> Yes, it does.  Most modern busses are hot plug aware and use
> DID_NO_CONNECT to signal target went away.  Even some SPI frames are
> quasi hotplug aware.
>
> > One
> > always needs to do it manually, e.g. with scsiadd or similar tools. So is
> > DID_NO_CONNECT really required for native scsi? If not, we also could
> > make the scsi-drivers to set a flag to activate eh on DID_NO_CONNECT.
>
> Just grep through the mid layer ... you'll see we use DID_NO_CONNECT on
> a host of other error conditions to force an immediate error as well.

I will do later on. I will also write a patch allowing error recovery for 
manually overridden devices.

[...]

> > > This looks like a genuine bug.  I missed the thread, since my email
> > > system went off line while I was on holiday for two weeks.  The
> > > symptoms look to be lost commands, but I can't see why from the traces.
> > >  There's a known bug where we can hang in domain validation because of
> > > a resource starvation issue, but I know of none where everything hangs
> > > just after error recovery completes.
> >
> > Since still not much happend to solve this bug, shall I create a bugzilla
> > entry?
>
> Sure ... on further analysis, it is the fusion DV resource starvation
> issue.  The email thread is here:
>
> http://marc.info/?t=118039577800004


Interesting thread, I don't understand the details yet, but I'm really curious 
if this can somehow also explain the *almost deadlock* we are seeing when we 
do md-resync at maximum device speed.


Thanks a lot for your help,
Bernd


-- 
Bernd Schubert
Q-Leap Networks GmbH

     prev parent reply	other threads:[~2007-12-14 15:27 UTC|newest]

Thread overview: 12+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2007-12-12 12:54 [PATCH] scsi device recovery Bernd Schubert
2007-12-12 13:39 ` Matthew Wilcox
2007-12-12 14:36   ` Bernd Schubert
2007-12-12 15:59     ` James Bottomley
2007-12-12 17:54       ` Bernd Schubert
2007-12-13 14:18         ` James Bottomley
2007-12-14 11:26           ` fusion problem (was Re: [PATCH] scsi device recovery) Bernd Schubert
2007-12-14 12:04           ` [PATCH] scsi device recovery Bernd Schubert
2007-12-14 12:22             ` Matthew Wilcox
2007-12-14 12:28               ` Bernd Schubert
2007-12-14 14:35             ` James Bottomley
2007-12-14 15:26               ` Bernd Schubert [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=200712141627.00024.bs@q-leap.de \
    --to=bs@q-leap.de \
    --cc=Eric.Moore@lsi.com \
    --cc=James.Bottomley@hansenpartnership.com \
    --cc=linux-scsi@vger.kernel.org \
    --cc=matthew@wil.cx \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.