All of lore.kernel.org
 help / color / mirror / Atom feed
From: Bernd Schubert <bs@q-leap.de>
To: James Bottomley <James.Bottomley@hansenpartnership.com>
Cc: Matthew Wilcox <matthew@wil.cx>, linux-scsi@vger.kernel.org
Subject: Re: [PATCH] scsi device recovery
Date: Wed, 12 Dec 2007 18:54:42 +0100	[thread overview]
Message-ID: <200712121854.42669.bs@q-leap.de> (raw)
In-Reply-To: <1197475177.4203.29.camel@localhost.localdomain>

[Hmm, resending since mail after more than 30min still not on the ML, maybe 
the attachment was too large? I have uploaded the log to 
http://www.pci.uni-heidelberg.de/tc/usr/bernd/downloads/scsi/kern.log.1]

On Wednesday 12 December 2007 16:59:36 James Bottomley wrote:
> On Wed, 2007-12-12 at 15:36 +0100, Bernd Schubert wrote:
> > On Wednesday 12 December 2007 14:39:27 Matthew Wilcox wrote:
> > > On Wed, Dec 12, 2007 at 01:54:14PM +0100, Bernd Schubert wrote:
> > > > below is a patch introducing device recovery, trying to prevent i/o
> > > > errors when a DID_NO_CONNECT or SOFT_ERROR does happen.
> > >
> > > Why doesn't the regular scsi_eh do what you need?
> >
> > First of all, it is presently simply not called when the two errors above
> > do happen. This could be changed, of course.
>
> Erm, I think you'll find the error handler does activate on
> DID_SOFT_ERROR.  It causes a retry via the eh.  DID_NO_CONNECT is an

Dec  7 23:48:45 beo-96 kernel: [94605.297924] sd 2:0:5:0: [sdd] Result: 
hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK,SUGGEST_OK
Dec  7 23:48:45 beo-96 kernel: [94605.297932] end_request: I/O error, dev sdd, 
sector 7706802052
Dec  7 23:48:45 beo-96 kernel: [94605.297937] raid5:md5: read error not 
correctable (sector 871932472 on sdd3).

Full log attached.

> immediate error with no eh intervention because it means that the target
> went away.  Handling this as a retryable error isn't an option because
> it will interfere with hotplug.

Then we need a sysfs flag one can set to manually enable eh for these devices
on DID_NO_CONNECT. 

>
> > Secondly, I think scsi_eh is in most cases doing too much. We are
> > fighting with flaky Infortrend boxes here, and scsi_eh sometimes manages
> > to crash their scsi channels. In most cases it is sufficient to stall any
> > io to the device and then to resume.
>
> But that's basically the default behaviour of the error handler (stall
> then resume).
>
> > For most scsi devices one probably doesn't need a suspend time or it can
> > be very small, this still needs to become configurable via sysfs.
>
> You mean a wait time beyond what the error handler currently does
> (basically it waits for the quiesce, begins error handling and then
> sends a test unit ready when it finishes before restarting).

In deh just waits on the first error and then only does a DV. For 
these infortrend devices, thats mostly sufficient.

>
> > Thirdly, scsi_eh doesn't give up, in most cases, when the scsi channel of
> > a Infortrend box crashed, it tried forever to recover.
> > To improve this is still on my todo list.
>
> Could you send traces for this.  I thought the error handler had been
> fixed over the last few years always to terminate.  If there's a case
> where it doesn't, this needs fixing.

I'm attaching the syslog, this is 2.6.22 + additional printks, dump_stack()'s
and msleep()'s.
At 03:59:36 the system finally went into wait_for_completion(), similar
to the "everything in wait_for_completion, what is my system doing?" thread.


Thanks,
Bernd

-- 
Bernd Schubert
Q-Leap Networks GmbH

  reply	other threads:[~2007-12-12 17:54 UTC|newest]

Thread overview: 12+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2007-12-12 12:54 [PATCH] scsi device recovery Bernd Schubert
2007-12-12 13:39 ` Matthew Wilcox
2007-12-12 14:36   ` Bernd Schubert
2007-12-12 15:59     ` James Bottomley
2007-12-12 17:54       ` Bernd Schubert [this message]
2007-12-13 14:18         ` James Bottomley
2007-12-14 11:26           ` fusion problem (was Re: [PATCH] scsi device recovery) Bernd Schubert
2007-12-14 12:04           ` [PATCH] scsi device recovery Bernd Schubert
2007-12-14 12:22             ` Matthew Wilcox
2007-12-14 12:28               ` Bernd Schubert
2007-12-14 14:35             ` James Bottomley
2007-12-14 15:26               ` Bernd Schubert

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=200712121854.42669.bs@q-leap.de \
    --to=bs@q-leap.de \
    --cc=James.Bottomley@hansenpartnership.com \
    --cc=linux-scsi@vger.kernel.org \
    --cc=matthew@wil.cx \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.